How could a compiler break the lockless hashing method?

syzygy · Post by **syzygy** » Fri Dec 13, 2013 9:38 pm

wgarvin wrote:
bob wrote:BTW HOW can a 16 byte store be guaranteed to be done atomically when the bus only transfers 8 bytes at a time? I'm guessing there could be some trickery in the cache controllers, but I have not read anything about that happening.
If I ever knew the answer to this question, I have forgotten it. I'm kind of curious about that myself though... no promises but I might dig around in some docs and try to figure it out. If you find the answer somewhere, please do post it.

I may be proven wrong, but I'm reasonably sure that Intel does not guarantee 16-byte writes into the same cacheline to be atomic. See this old post, for which I probably consulted the Intel manual.

It might be that a particular x86 cpu does in fact atomically execute such instructions, but relying on it is equivalent to relying on a particular implementation of UB

You actually confirmed this:

Storing 16 bytes with two or more instructions is never atomic, you have no control over when the writes reach the cache or main memory or even other cores. Cache misses, paging or thread switching can make the gap between two reads or two writes arbitrarily long.

Even if its a single 16-byte write done with an SSE instruction, even aligned on a 16-byte boundary, it might not be atomic. Reads of 16 bytes with two or more instructions are obviously also not atomic, and even a 16-byte read done with a single SSE instruction, aligned on a 16-byte boundary, might not be atomic. L1 cache and write-combining buffers do not magically make wide accesses atomic, as Ronald (syzygy) has repeatedly pointed out, some other core could RFO the cacheline at any time.

With lock cmpxchg16b it is possible to implement atomic 16-byte read and writes.

Daniel Shawul · Post by **Daniel Shawul** » Fri Dec 13, 2013 9:39 pm

I copied the quote from this stackoverflow thread,

Why not go to stackOVERFLOW instead of copying and referencing many of its contents?

bob · Post by **bob** » Fri Dec 13, 2013 9:42 pm

wgarvin wrote:
bob wrote:If you talk about SSE I would have to somehow return to 2009 to understand my concern. Perhaps that SSE is not exactly standard across all platforms, not being on some, being on others. And it would also not solve my problem because I have a pawn hash table that is much longer than 16 bytes. BSF/BSR are significant speed gains. storing 16 bytes vs the xor trick is not a speed gain that is measurable, which also would make me consider such carefully. I'm willing to do what is needed for speed. the XOR trick provides safety with no measurable speed cost, therefore it was a natural decision.

BTW I might actually reconsider that decision today, because today I only run on X86 anyway, nothing else, unlike a few years ago.
I'm just guessing, but I think maybe your objection to it was that there was no convenient/portable way to access it. The necessary intrinsics are better-supported by compilers now, so maybe its worth trying. But the XOR trick works well enough in practice, so perhaps there's no point anyway.

bob wrote: BTW HOW can a 16 byte store be guaranteed to be done atomically when the bus only transfers 8 bytes at a time? I'm guessing there could be some trickery in the cache controllers, but I have not read anything about that happening.
If I ever knew the answer to this question, I have forgotten it. I'm kind of curious about that myself though... no promises but I might dig around in some docs and try to figure it out. If you find the answer somewhere, please do post it.

I could sort of see this if there was a guarantee that the 16 byte chunk was on a 16 byte boundary. Since that now lands in one cache block and this is going to be a cache-implemented deal anyway. Both thread have to "read for ownership/exclusivity/etc" so that one cache ends up with the working copy, then some sort of mechanism to go "splat-splat" and write out 16 bytes before that block gets forwarded elsewhere and then invalidated. But I have not seen any such cache implementation mentioned regarding X86..

I'm also interested, and will look around. Only issue might be the compiler. Can I safely try to grab 16 bytes in one gulp without the compiler making some silly assumption?

Not so sure any more.

OK, I am now pretty convinced this won't work. You can do a google on 16 byte atomic writes x86 and find several threads with sample tests and such. The docs I have quote 8 bytes max assuming 8 byte alignment. That's what others are quoting in their docs as well. So perhaps that won't work. It certainly fails on many different processors, one of the thread has some code posted that is pretty simple to follow... and it breaks. Or not. Depends on the timing...

wgarvin · Post by **wgarvin** » Fri Dec 13, 2013 9:42 pm

Daniel Shawul wrote:
I copied the quote from this stackoverflow thread,
Why not go to stackOVERFLOW instead of copying and referencing many of its contents?

Because the people I want to talk to are here. Why not get yourself banned again, so the rest of us can have our discussions in peace?

wgarvin · Post by **wgarvin** » Fri Dec 13, 2013 9:45 pm

syzygy wrote:I may be proven wrong, but I'm reasonably sure that Intel does not guarantee 16-byte writes into the same cacheline to be atomic. See this old post, for which I probably consulted the Intel manual.

It might be that a particular x86 cpu does in fact atomically execute such instructions, but relying on it is equivalent to relying on a particular implementation of UB

Yup, I think it is NOT actually guaranteed to be atomic, and Intel docs warn about it, see my previous post which crossed yours

I must have been imagining things back in that 2009 thread.

Daniel Shawul · Post by **Daniel Shawul** » Fri Dec 13, 2013 9:51 pm

wgarvin wrote:
Daniel Shawul wrote:
I copied the quote from this stackoverflow thread,
Why not go to stackOVERFLOW instead of copying and referencing many of its contents?
Because the people I want to talk to are here. Why not get yourself banned again, so the rest of us can have our discussions in peace?

You came to my stackOVERFLOW thread and suggested I should be banned for it. And then 2 seconds later, here you are refering stackOVERFLOW. So what does that make you?

wgarvin · Post by **wgarvin** » Fri Dec 13, 2013 9:55 pm

Daniel Shawul wrote:
wgarvin wrote:
Daniel Shawul wrote:
I copied the quote from this stackoverflow thread,
Why not go to stackOVERFLOW instead of copying and referencing many of its contents?
Because the people I want to talk to are here. Why not get yourself banned again, so the rest of us can have our discussions in peace?
You came to my stackOVERFLOW thread and suggested I should be banned for it. And then 2 seconds later, here you are refering stackOVERFLOW. So what does that make you?

Ah, sorry, I did not mean to suggest in that post that you should be banned for it. I was pointing out the fact that you already had been banned for it.

After your ban expired you posted it yet again so it seemed highly likely to me that they would ban you for it again, for the same reasons as the first time. Anyway, you didn't answer my question over there yet. I'll read your answer if you do, but if you aren't interested in answering it no worries, thats fine.

Daniel Shawul · Post by **Daniel Shawul** » Fri Dec 13, 2013 9:58 pm

wgarvin wrote:
Daniel Shawul wrote:
wgarvin wrote:
Daniel Shawul wrote:
I copied the quote from this stackoverflow thread,
Why not go to stackOVERFLOW instead of copying and referencing many of its contents?
Because the people I want to talk to are here. Why not get yourself banned again, so the rest of us can have our discussions in peace?
You came to my stackOVERFLOW thread and suggested I should be banned for it. And then 2 seconds later, here you are refering stackOVERFLOW. So what does that make you?
Ah, sorry, I did not mean to suggest in that post that you should be banned for it. I was pointing out the fact that you already had been banned for it.

After your ban expired you posted it yet again so it seemed highly likely to me that they would ban you for it again, for the same reasons as the first time. Anyway, you didn't answer my question over there yet. I'll read your answer if you do, but if you aren't interested in answering it no worries, thats fine.

Seems you never get tired of being owned

rbarreira · Post by **rbarreira** » Sat Dec 14, 2013 12:03 am

bob wrote:I suppose that next you are going to tell me you want some of the "mistake-prone" things removed from assembly language so that you can't write code there with overflow, etc?

No, I am quite consistent in that aspect - I want both C and x86 Assembly to remain backwards compatible. What made you think otherwise? Was it the fact that I, unlike you, do not mind C compilers getting better at optimizing as long as they don't violate the standard?

Your attempt to pretend it was me ever me asking for "mistake-prone things to be removed" is pathetic.

bob · Post by **bob** » Sat Dec 14, 2013 3:45 am

syzygy wrote:
wgarvin wrote:
bob wrote:BTW HOW can a 16 byte store be guaranteed to be done atomically when the bus only transfers 8 bytes at a time? I'm guessing there could be some trickery in the cache controllers, but I have not read anything about that happening.
If I ever knew the answer to this question, I have forgotten it. I'm kind of curious about that myself though... no promises but I might dig around in some docs and try to figure it out. If you find the answer somewhere, please do post it.
I may be proven wrong, but I'm reasonably sure that Intel does not guarantee 16-byte writes into the same cacheline to be atomic. See this old post, for which I probably consulted the Intel manual.

It might be that a particular x86 cpu does in fact atomically execute such instructions, but relying on it is equivalent to relying on a particular implementation of UB

You actually confirmed this:
Storing 16 bytes with two or more instructions is never atomic, you have no control over when the writes reach the cache or main memory or even other cores. Cache misses, paging or thread switching can make the gap between two reads or two writes arbitrarily long.

Even if its a single 16-byte write done with an SSE instruction, even aligned on a 16-byte boundary, it might not be atomic. Reads of 16 bytes with two or more instructions are obviously also not atomic, and even a 16-byte read done with a single SSE instruction, aligned on a 16-byte boundary, might not be atomic. L1 cache and write-combining buffers do not magically make wide accesses atomic, as Ronald (syzygy) has repeatedly pointed out, some other core could RFO the cacheline at any time.
With lock cmpxchg16b it is possible to implement atomic 16-byte read and writes.

I'm not even sure that will work. I think that in searching thru posts today after Wylie's post, there were citations from the intel docs about that not being guaranteed. It might have been related to "if it overlaps two cache blocks it won't work or something similar." But I studied the most recent Intel stuff on their web site and to say it is somewhat contradictory in that regard is an understatement.

I'm just having a hard time thinking about how the CPU can get the cache to accept two consecutive writes without responding to any other read-for-ownership type requests on other caches. No one posted a code that would work for all tests, however. I think one example of a 16 byte store worked on cores within one chip, but not across chips...

How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?