bob wrote:Has no chance of being correct. Writes are done out of order. Reads and writes are done out of order. You can read the counter, write it back, read the value, and get the "value" before you have actually written the counter back to memory. Unless you use the "lock" prefix. And then you incur the usual atomicity overhead you are trying to avoid by not using a real lock in the first place.
There are also race conditions.
You read at some point a value V1 from the counter.
A second thread prepares to modify the value so it increments the counter and modifies the value.
Unfortunately, you read the counter before it is modified, and then don'tread the new value thinking it is unmodified value because your counter read happened before the counter was stored.
This is the way to a lifelong debugging assignment.
If out-of-order execution woulld work as you describe it, it would never be possible for any program to use the memory: there is never a guarantee that you read back what you write, because the write might not have been executed yet when you read.
Fortunately CPU designers have made out-of-order execution completely transparent, and have a lot of synchronization hardware in the load/store pipes to prevent effects like you describe. Reads and writes do not overtake each other this way.
With SMP there is the MESI protocol implemented in hardware to guarantee cache coherency. After one thread starts writing, the cache line is immediately invalidated in all other CPU caches, and there is no way any of them could start reading from it before it has reloaded that cache line, which cannot happen before the other CPU is done writing it, and ready to flush it to memory. You have to make sure that the counter is in the same cache line as the shared variable, of course.
There might indeed be a problem with out-of-order reads, although it is unlikely that they would be scheduled for reverse ordering if neither of them has to be delayed because of a data dependency (in the addres calculation). To make sure the read of the variable is done before that of the counter even in an out-of-order machine you can give the address calculation of the counter access a (false) dependency on the result of the load, e.g.
Code: Select all
myVariable = sharedVariable;
myCounter = *(&sharedCounter + (0 & myVariable));
(making sure the optimizer does not try to outsmart you, by handcoding this in assembler).
This guarantees the loading of the counter cannot evens start before the copy of the variable is in a CPU register. (But with a modest delay, and without stalling the CPU pipe.)