hgm wrote:
bob wrote: Has no chance of being correct. Writes are done out of order. Reads and writes are done out of order. You can read the counter, write it back, read the value, and get the "value" before you have actually written the counter back to memory. Unless you use the "lock" prefix. And then you incur the usual atomicity overhead you are trying to avoid by not using a real lock in the first place.
There are also race conditions.
You read at some point a value V1 from the counter.
A second thread prepares to modify the value so it increments the counter and modifies the value.
Unfortunately, you read the counter before it is modified, and then don't read the new value, thinking it is still the unmodified value, because your counter read happened before the counter was stored.
This is the way to a lifelong debugging assignment.
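A minimal sketch of the lost-update race described above, next to a lock-prefixed increment that avoids it (GCC __atomic builtins and pthreads; thread and iteration counts are arbitrary):
Code:
#include <pthread.h>
#include <stdio.h>

#define THREADS 4
#define ITERS   1000000

static volatile long racy_counter   = 0;   /* plain read-modify-write: updates can be lost */
static long          atomic_counter = 0;   /* updated with a lock-prefixed instruction     */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        racy_counter = racy_counter + 1;         /* load, add, store: not atomic            */
        __atomic_fetch_add(&atomic_counter, 1,   /* compiles to "lock add" on x86           */
                           __ATOMIC_SEQ_CST);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);

    /* racy_counter usually comes out short of THREADS*ITERS; atomic_counter never does */
    printf("racy:   %ld\natomic: %ld (expected %d)\n",
           racy_counter, atomic_counter, THREADS * ITERS);
    return 0;
}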
If out-of-order execution worked as you describe it, it would never be possible for any program to use the memory: there would never be a guarantee that you read back what you write, because the write might not have been executed yet when you read.
I am not talking about "out of order execution". I am talking about "out of order writes". Just because you first store in A, and then in B, does _not_ mean that the memory value for A gets updated first. That's the famous x86 out-of-order write issue. The issue you raise is a trivial one handled by the cache. The case I am talking about is much more devious in how it wrecks programs. All cache systems on x86 recognize a read after a write and return the new value, since they don't need to go to memory to get it.
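For what it is worth, the reordering the x86 manuals do allow between ordinary accesses is a later load passing an earlier store that is still sitting in the store buffer. A minimal litmus test along those lines (the two-thread setup and names are only illustrative; the both-zero outcome may need many runs to show up):
Code:
#include <pthread.h>
#include <stdio.h>

static volatile int X, Y;      /* shared flags                               */
static volatile int r1, r2;    /* what each thread read of the other's flag  */

static void *t1(void *a) { (void)a; X = 1; r1 = Y; return NULL; }
static void *t2(void *a) { (void)a; Y = 1; r2 = X; return NULL; }

int main(void)
{
    int both_zero = 0;
    for (int i = 0; i < 100000; i++) {
        pthread_t a, b;
        X = Y = 0;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Each store can sit in its CPU's store buffer while the following
         * load executes, so r1 == 0 && r2 == 0 is possible, although no
         * interleaving of the two threads' program order allows it. */
        if (r1 == 0 && r2 == 0)
            both_zero++;
    }
    printf("r1 == r2 == 0 seen %d times\n", both_zero);
    return 0;
}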
Fortunately CPU designers have made out-of-order execution completely transparent, and have a lot of synchronization hardware in the load/store pipes to prevent effects like you describe. Reads and writes do not overtake each other this way.
Sorry, but they do. You are simply looking at the wrong reason for it happening. If you store at memory locations A, B, C and D in that order, the order in which the writes reach memory could be any permutation of that order.
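Whichever way that argument goes, the portable way to make the order of such stores something another CPU can rely on is to say so explicitly. A sketch with GCC's __atomic builtins (function and variable names are made up):
Code:
#include <stdio.h>

static int payload_a, payload_b;   /* data being published            */
static int ready;                  /* flag the reader waits on        */

/* Writer side: the release store keeps the payload stores from being
 * reordered (by compiler or CPU) to after the flag becomes visible. */
static void publish(int a, int b)
{
    payload_a = a;
    payload_b = b;
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
}

/* Reader side: the acquire load pairs with the release store above, so once
 * it sees ready == 1 it is guaranteed to see the payload stored before it. */
static int try_consume(int *a, int *b)
{
    if (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
        return 0;
    *a = payload_a;
    *b = payload_b;
    return 1;
}

int main(void)
{
    int a, b;
    publish(3, 4);                 /* in real use these run on different threads */
    if (try_consume(&a, &b))
        printf("consumed %d %d\n", a, b);
    return 0;
}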
With SMP there is the MESI protocol implemented in hardware to guarantee cache coherency. After one thread starts writing, the cache line is immediately invalidated in all other CPU caches, and there is no way any of them could start reading from it before it has reloaded that cache line, which cannot happen before the other CPU is done writing it, and ready to flush it to memory. You have to make sure that the counter is in the same cache line as the shared variable, of course.
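If the scheme does depend on the counter and the variable travelling together, their placement can be forced rather than hoped for. A sketch using GCC's aligned attribute (the 64-byte line size and the field names are assumptions):
Code:
#include <stdio.h>

/* Force the counter and the value it versions into one 64-byte cache line,
 * so a core that gains the line in Modified state owns both fields at once. */
struct shared_pair {
    volatile unsigned long counter;   /* bumped around every update of value */
    volatile unsigned long value;     /* the shared variable being protected */
} __attribute__((aligned(64)));

static struct shared_pair shared;

_Static_assert(sizeof(struct shared_pair) <= 64,
               "counter and value must fit in a single cache line");

int main(void)
{
    printf("struct at %p, size %zu\n", (void *)&shared, sizeof shared);
    return 0;
}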
Just listen to someone with years of experience on this topic. The CPU can't do multiple stores with one instruction. So yes, the cache is clever enough to provide a form of data integrity, but not in the way you are thinking. CPU A can store word 1 in a cache block, then CPU B can store word 1 and 2, then CPU A can store word 2. And now you have word 1 from CPU B and word 2 from CPU A. Nothing makes the instructions execute atomically as a group.
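A small sketch of that interleaving: two threads each store a two-word pair with no lock, and a reader can catch word 1 from one writer next to word 2 from the other (pthreads; names and counts invented):
Code:
#include <pthread.h>
#include <stdio.h>

static volatile long word1, word2;    /* two adjacent words, no lock around them */
static volatile int  running = 1;

static void *writer(void *arg)
{
    long id = (long)arg;
    while (running) {
        word1 = id;        /* two separate store instructions:        */
        word2 = id;        /* nothing keeps them together as a pair   */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    long torn = 0;
    pthread_create(&a, NULL, writer, (void *)1L);
    pthread_create(&b, NULL, writer, (void *)2L);

    /* Sample the pair while both writers run: a mismatch means we caught
     * word 1 from one CPU next to word 2 from the other. */
    for (long i = 0; i < 10000000; i++) {
        long w1 = word1, w2 = word2;
        if (w1 != w2)
            torn++;
    }
    running = 0;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("observed %ld torn pairs\n", torn);
    return 0;
}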
There might indeed be a problem with out-of-order reads, although it is unlikely that they would be scheduled in reverse order if neither of them has to be delayed because of a data dependency (in the address calculation). To make sure the read of the variable is done before that of the counter even on an out-of-order machine, you can give the address calculation of the counter access a (false) dependency on the result of the load, e.g.
Code:
myVariable = sharedVariable;                       /* load the protected value first          */
myCounter = *(&sharedCounter + (0 & myVariable));  /* counter address (falsely) depends on it */
(making sure the optimizer does not try to outsmart you, by handcoding this in assembler).
This guarantees the loading of the counter cannot even start before the copy of the variable is in a CPU register. (At the cost of only a modest delay, and without stalling the CPU pipe.)
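If the only goal is to keep the optimizer from folding the 0 & myVariable away, one option short of rewriting the whole thing in assembler is to hide the zero behind an empty asm statement, so the compiler has to keep the dependency. A sketch for GCC on x86-64, reusing the variable names from the snippet above (the helper function is hypothetical):
Code:
extern volatile unsigned long sharedVariable;   /* names as in the snippet above */
extern volatile unsigned long sharedCounter;

unsigned long myVariable, myCounter;

void read_counter_after_variable(void)
{
    unsigned long mask = 0;

    myVariable = sharedVariable;

    /* The empty asm makes "mask" opaque to the compiler, so it cannot prove
     * that (mask & myVariable) is zero and must really compute the counter's
     * address from the value just loaded, keeping the false dependency. */
    __asm__ volatile("" : "+r"(mask));

    myCounter = *(&sharedCounter + (mask & myVariable));
}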
All I can say is "dream on". But it doesn't work as you think.