How could a compiler break the lockless hashing method?

wgarvin · Post by **wgarvin** » Sat Dec 14, 2013 4:02 am

Ah... Ronald said "wth lock cmpxchg16b", just in case you were reading it as "with cmpxchg16b". Unless it was a typo, but I think it isn't. So we're still talking about locking the bus for the entire op and that would probably not be a good thing.

I think the XOR trick, theoretical race condition and all,

:lol: is probably going to remain the best option, at least for the forseeable future!

syzygy · Post by **syzygy** » Sat Dec 14, 2013 2:45 pm

wgarvin wrote:Ah... Ronald said "wth lock cmpxchg16b", just in case you were reading it as "with cmpxchg16b". Unless it was a typo, but I think it isn't. So we're still talking about locking the bus for the entire op and that would probably not be a good thing.

The lock wasn't a typo, but with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but also that has been improved. It still has a cost, but the cost is small in the uncontended case.

SSE instructions that write 16 bytes do not accept a lock prefix.

Older AMD64 processors don't support cmpxchg16b, but that did not stop Microsoft from using it in (the 64-bit version of) Windows 8.1.

I think the XOR trick, theoretical race condition and all, :lol: is probably going to remain the best option, at least for the forseeable future!

I'll just continue ignoring possible smp corruption (and the inherent UB).

wgarvin · Post by **wgarvin** » Sat Dec 14, 2013 3:55 pm

syzygy wrote:[...] with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but also that has been improved. It still has a cost, but the cost is small in the uncontended case.

Thanks, I just learned something. Its been a while since I wrote any asm with a lock prefix on it. Back when most machines were still single core, I saw a neat trick: a multi-threaded program that detected at startup that it was running on a single-core machine and patched most of its lock instructions with NOPs. It did run noticably faster.

I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)

bob · Post by **bob** » Sat Dec 14, 2013 5:22 pm

wgarvin wrote:Ah... Ronald said "wth lock cmpxchg16b", just in case you were reading it as "with cmpxchg16b". Unless it was a typo, but I think it isn't. So we're still talking about locking the bus for the entire op and that would probably not be a good thing.

I think the XOR trick, theoretical race condition and all, :lol: is probably going to remain the best option, at least for the forseeable future!

No, the lock prefix was discussed in one of the posts. It appeared from what was quoted via Intel that this did not exactly guarantee the 16 byte atomic operation.

The lock prefix enables a clever cache trick where a cache says "I want this block, and I am going to modify it. Which appears to give that cache a "locked access" but for a single 8 byte access, where the cpu says "give me the old value, here is the new value". If it is for more than 8 bytes, it appeared, at least based on the discussion I was reading, that it was not guaranteed to be atomic.

I'm going to read the Intel docs more carefully once I finish grading. Grades are due Monday so that has to come first...

bob · Post by **bob** » Sat Dec 14, 2013 5:23 pm

syzygy wrote:
wgarvin wrote:Ah... Ronald said "wth lock cmpxchg16b", just in case you were reading it as "with cmpxchg16b". Unless it was a typo, but I think it isn't. So we're still talking about locking the bus for the entire op and that would probably not be a good thing.
The lock wasn't a typo, but with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but also that has been improved. It still has a cost, but the cost is small in the uncontended case.

SSE instructions that write 16 bytes do not accept a lock prefix.

Older AMD64 processors don't support cmpxchg16b, but that did not stop Microsoft from using it in (the 64-bit version of) Windows 8.1.

I think the XOR trick, theoretical race condition and all, :lol: is probably going to remain the best option, at least for the forseeable future!
I'll just continue ignoring possible smp corruption (and the inherent UB).

The UB is STILL there, apparently.

bob · Post by **bob** » Sat Dec 14, 2013 5:25 pm

wgarvin wrote:
syzygy wrote:[...] with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but also that has been improved. It still has a cost, but the cost is small in the uncontended case.
Thanks, I just learned something. Its been a while since I wrote any asm with a lock prefix on it. Back when most machines were still single core, I saw a neat trick: a multi-threaded program that detected at startup that it was running on a single-core machine and patched most of its lock instructions with NOPs. It did run noticably faster.

I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)

Certainly the xchg reg,men is the simplest way to implement an atomic lock so long as one does not spin on the xchg instruction (see Crafty Lock() macro for details, an approach called a "shadow lock".)

syzygy · Post by **syzygy** » Sat Dec 14, 2013 6:30 pm

bob wrote:
syzygy wrote:
wgarvin wrote:Ah... Ronald said "wth lock cmpxchg16b", just in case you were reading it as "with cmpxchg16b". Unless it was a typo, but I think it isn't. So we're still talking about locking the bus for the entire op and that would probably not be a good thing.
The lock wasn't a typo, but with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but also that has been improved. It still has a cost, but the cost is small in the uncontended case.

SSE instructions that write 16 bytes do not accept a lock prefix.

Older AMD64 processors don't support cmpxchg16b, but that did not stop Microsoft from using it in (the 64-bit version of) Windows 8.1.

I think the XOR trick, theoretical race condition and all, :lol: is probably going to remain the best option, at least for the forseeable future!
I'll just continue ignoring possible smp corruption (and the inherent UB).
The UB is STILL there, apparently.

In the sense that C99 doesn't guarantee the behaviour that I expect, yes. How could it, C99 does not know about threads.

More serious is that even the pthreads standard, which I can rely on because my development and runtime environment supports it, does not guarantee the behaviour that I expect.

I accept the risks involved in this.

syzygy · Post by **syzygy** » Sat Dec 14, 2013 6:33 pm

bob wrote:
wgarvin wrote:
syzygy wrote:[...] with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but also that has been improved. It still has a cost, but the cost is small in the uncontended case.
Thanks, I just learned something. Its been a while since I wrote any asm with a lock prefix on it. Back when most machines were still single core, I saw a neat trick: a multi-threaded program that detected at startup that it was running on a single-core machine and patched most of its lock instructions with NOPs. It did run noticably faster.

I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
Certainly the xchg reg,men is the simplest way to implement an atomic lock so long as one does not spin on the xchg instruction (see Crafty Lock() macro for details, an approach called a "shadow lock".)

Any (correct) lock implementation on x86 relies on some form of "locked" instruction. It could be xcgh (with implicit lock prefix) or "lock cmpxchg" or "lock inc", but something like this is required.

syzygy · Post by **syzygy** » Sat Dec 14, 2013 6:38 pm

wgarvin wrote:I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)

I use such instructions very heavily in my (multi-threaded) tablebase generator. Without them the correctness of the generated table would be dependent on luck.

I think on my Sandybridge cpu the overhead for single-threaded generation was just 1 or 2 percent.

In the uncontested case, the overhead of the lock prefix does not increase with the number of threads.

wgarvin · Post by **wgarvin** » Sat Dec 14, 2013 6:47 pm

bob wrote: Certainly the xchg reg,men is the simplest way to implement an atomic lock so long as one does not spin on the xchg instruction (see Crafty Lock() macro for details, an approach called a "shadow lock".)

Yeah, I was reading one of the old threads about it, recently. Spinning on a regular read is much cheaper, but I've started to wonder if at least in C11, a C implementation of that, is UB under the same rules as the XOR trick! I mean, its a data race, same as the XOR case is. The confirmation with a locked atomic op makes it safe in practice, but I'm really starting to wish those standards would declare cases like that to be undefined values instead of undefined behavior, because a value can be overwritten with something else in a robust way, but UB at least theoretically allows the nasal demons, so unless our compiler makes a stronger guarantee, we always have to worry.

Maybe part of the problem is that not all binary bit-patterns are safely useable for some C types, e.g. floats, or even signed ints. Loading unsafe values into a register is usually not going to trap, but any manipulation of them might trap on some hardware, or even put it into an unreliable state (jf it was really badly designed...)

So a float read data race (on some weird platform other than x86) might result in any bit pattern being read including ones that the hardware might trap or fault on... Bit patterns the compiled code might normally not expose it to.

Obviously on x86 there aren't problems of that sort, other than FP exceptions and signalling NaNs. Certainly any integer value is OK. But because it was standardized as UB, now compilers are theoretically allowed to "do evil" with it, which they might not do deliberately but could--perhaps--someday happen while they're trying to do some sort of optimization I am not smart enough to dream up.

The least they ought to have done, was define some semantics for it with unsigned integer types, providing that it might return ANY bit pattern, but would not trap or otherwise blow up. I have a hard time imagining a C-capable platform that could not meet that requirement.

How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?