
wgarvin wrote:
    Ah... Ronald said "wth lock cmpxchg16b", just in case you were reading it as "with cmpxchg16b". Unless it was a typo, but I think it isn't. So we're still talking about locking the bus for the entire op and that would probably not be a good thing.
The lock wasn't a typo, but with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but that has also been improved. It still has a cost, but the cost is small in the uncontended case.
wgarvin wrote:
    I think the XOR trick, theoretical race condition and all, :lol: is probably going to remain the best option, at least for the foreseeable future!
I'll just continue ignoring possible SMP corruption (and the inherent UB).

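
For readers who have not seen it, the XOR trick referred to above is commonly described for lockless transposition tables roughly as follows. This is only a minimal sketch, not code from any particular engine: the `HashEntry` layout and the names `tt_store` and `tt_probe` are assumptions made up for illustration. The plain, unsynchronized accesses are deliberate, since they are exactly the data race (and hence the UB) under discussion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-word hash entry; the layout is illustrative only. */
typedef struct {
    uint64_t key_xor_data; /* holds key ^ data, never the raw key */
    uint64_t data;         /* packed move/score/depth, opaque here */
} HashEntry;

/* Store without any lock. Another thread may interleave its own store
 * between the two writes, leaving a "torn" entry behind. */
static void tt_store(HashEntry *e, uint64_t key, uint64_t data) {
    e->key_xor_data = key ^ data;
    e->data = data;
}

/* Probe without any lock. A torn entry fails the XOR check (unless the
 * mixed halves happen to match) and is treated as a miss. These plain
 * reads race with concurrent stores, which is UB by the letter of C11. */
static bool tt_probe(const HashEntry *e, uint64_t key, uint64_t *data_out) {
    uint64_t data = e->data;
    uint64_t check = e->key_xor_data;
    if ((check ^ data) != key)
        return false;
    *data_out = data;
    return true;
}
```

The point of the design is that a half-written entry is rejected by the check instead of being prevented by a lock, so the common path costs two ordinary loads and one XOR.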
syzygy wrote:
    [...] with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but that has also been improved. It still has a cost, but the cost is small in the uncontended case.
Thanks, I just learned something. It's been a while since I wrote any asm with a lock prefix on it. Back when most machines were still single core, I saw a neat trick: a multi-threaded program that detected at startup that it was running on a single-core machine and patched most of its lock instructions with NOPs. It did run noticeably faster.

wgarvin wrote:
    Ah... Ronald said "wth lock cmpxchg16b", just in case you were reading it as "with cmpxchg16b". Unless it was a typo, but I think it isn't. So we're still talking about locking the bus for the entire op and that would probably not be a good thing.
    I think the XOR trick, theoretical race condition and all, :lol: is probably going to remain the best option, at least for the foreseeable future!
No, the lock prefix was discussed in one of the posts. It appeared from what was quoted via Intel that this did not exactly guarantee the 16 byte atomic operation.

syzygy wrote:
    wgarvin wrote:
        Ah... Ronald said "wth lock cmpxchg16b", just in case you were reading it as "with cmpxchg16b". Unless it was a typo, but I think it isn't. So we're still talking about locking the bus for the entire op and that would probably not be a good thing.
    The lock wasn't a typo, but with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but that has also been improved. It still has a cost, but the cost is small in the uncontended case.
    SSE instructions that write 16 bytes do not accept a lock prefix.
    Older AMD64 processors don't support cmpxchg16b, but that did not stop Microsoft from using it in (the 64-bit version of) Windows 8.1.
    wgarvin wrote:
        I think the XOR trick, theoretical race condition and all, :lol: is probably going to remain the best option, at least for the foreseeable future!
    I'll just continue ignoring possible SMP corruption (and the inherent UB).
The UB is STILL there, apparently.

wgarvin wrote:
    syzygy wrote:
        [...] with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but that has also been improved. It still has a cost, but the cost is small in the uncontended case.
    Thanks, I just learned something. It's been a while since I wrote any asm with a lock prefix on it. Back when most machines were still single core, I saw a neat trick: a multi-threaded program that detected at startup that it was running on a single-core machine and patched most of its lock instructions with NOPs. It did run noticeably faster.
    I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
Certainly the xchg reg,mem is the simplest way to implement an atomic lock so long as one does not spin on the xchg instruction (see Crafty Lock() macro for details, an approach called a "shadow lock").

bob wrote:
    syzygy wrote:
        wgarvin wrote:
            Ah... Ronald said "wth lock cmpxchg16b", just in case you were reading it as "with cmpxchg16b". Unless it was a typo, but I think it isn't. So we're still talking about locking the bus for the entire op and that would probably not be a good thing.
        The lock wasn't a typo, but with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but that has also been improved. It still has a cost, but the cost is small in the uncontended case.
        SSE instructions that write 16 bytes do not accept a lock prefix.
        Older AMD64 processors don't support cmpxchg16b, but that did not stop Microsoft from using it in (the 64-bit version of) Windows 8.1.
        wgarvin wrote:
            I think the XOR trick, theoretical race condition and all, :lol: is probably going to remain the best option, at least for the foreseeable future!
        I'll just continue ignoring possible SMP corruption (and the inherent UB).
    The UB is STILL there, apparently.
In the sense that C99 doesn't guarantee the behaviour that I expect, yes. How could it? C99 does not know about threads.

bob wrote:
    wgarvin wrote:
        syzygy wrote:
            [...] with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but that has also been improved. It still has a cost, but the cost is small in the uncontended case.
        Thanks, I just learned something. It's been a while since I wrote any asm with a lock prefix on it. Back when most machines were still single core, I saw a neat trick: a multi-threaded program that detected at startup that it was running on a single-core machine and patched most of its lock instructions with NOPs. It did run noticeably faster.
        I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
    Certainly the xchg reg,mem is the simplest way to implement an atomic lock so long as one does not spin on the xchg instruction (see Crafty Lock() macro for details, an approach called a "shadow lock").
Any (correct) lock implementation on x86 relies on some form of "locked" instruction. It could be xchg (with implicit lock prefix) or "lock cmpxchg" or "lock inc", but something like this is required.
wgarvin wrote:
    I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
I use such instructions very heavily in my (multi-threaded) tablebase generator. Without them the correctness of the generated table would be dependent on luck.

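
To make the "locked instruction" point concrete, here is a small C11 sketch of the kind of operations being talked about. It is not taken from any tablebase generator; `positions_done`, `count_position` and `claim_cell` are hypothetical names invented for the example. On x86, compilers such as GCC and Clang typically turn these calls into lock-prefixed instructions (for instance `lock xadd` and `lock cmpxchg`), but that mapping is up to the compiler.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared progress counter, zero-initialized at startup. */
static atomic_uint_fast64_t positions_done;

/* Atomically add one and return the new total. With a plain ++ two
 * threads could both read the same old value and lose an increment. */
static uint64_t count_position(void) {
    return (uint64_t)atomic_fetch_add(&positions_done, 1) + 1;
}

/* Atomically change *cell from expected to desired. Returns false, and
 * leaves the cell untouched, if another thread changed it first. */
static bool claim_cell(_Atomic uint8_t *cell, uint8_t expected, uint8_t desired) {
    return atomic_compare_exchange_strong(cell, &expected, desired);
}
```

The compare-and-exchange form is the useful one when several threads may try to update the same table entry: exactly one of them wins, and the others can see that they lost.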
bob wrote:
    Certainly the xchg reg,mem is the simplest way to implement an atomic lock so long as one does not spin on the xchg instruction (see Crafty Lock() macro for details, an approach called a "shadow lock").
Yeah, I was reading one of the old threads about it recently. Spinning on a regular read is much cheaper, but I've started to wonder whether, at least in C11, a C implementation of that is UB under the same rules as the XOR trick! I mean, it's a data race, same as the XOR case is. The confirmation with a locked atomic op makes it safe in practice, but I'm really starting to wish those standards would declare cases like that to be undefined values instead of undefined behavior, because a value can be overwritten with something else in a robust way, but UB at least theoretically allows the nasal demons, so unless our compiler makes a stronger guarantee, we always have to worry.
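
For what it's worth, the spin-on-a-read idea can be written in C11 without a data race by making the read a relaxed atomic load instead of a plain one. The sketch below is an assumption-laden illustration, not the Crafty Lock() macro: the `SpinLock` type and the `spin_*` names are made up here. On x86, GCC and Clang typically compile the exchange to `xchg` with a memory operand and the relaxed load to an ordinary `mov`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 means free, 1 means held. */
typedef struct {
    atomic_int locked;
} SpinLock;

static void spin_init(SpinLock *l) {
    atomic_init(&l->locked, 0);
}

/* One locked exchange; succeeds only if the lock was free. */
static bool spin_try_lock(SpinLock *l) {
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

static void spin_lock(SpinLock *l) {
    for (;;) {
        if (spin_try_lock(l))
            return;
        /* Wait on a cheap read until the lock looks free, then go back
         * and confirm with the exchange. Because the load is atomic,
         * this is not a data race in the C11 sense. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0) {
        }
    }
}

static void spin_unlock(SpinLock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Waiting on the load keeps the cache line in a shared state while the lock is held, so the expensive exchange is only attempted when it has a chance of succeeding.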