hgm wrote: The MESI protocol has taken care of this for more than a decade now. It doesn't matter at all in which order items are flushed from cache to memory. A reader will just not be allowed to access stale memory before the cache line is completely written. Snooping hardware will see to it that being in cache is, for all intents and purposes, equivalent to being in memory.
If there is a problem, it is certainly not in that area.
There are several problems. Let me give you a very simple case to think about.
We have 3 words of memory; call 'em A, B and C. I claim that no matter what order you write them in, they can show up in memory in a _different_ order. Here is one way; there are others:
First I store A, which is the first word in cache line X. Then I store B, which is the first word in cache line Y. Then I store C, which is the second word in cache line X.
Both lines eventually get evicted because they are replaced by something else. Which one gets evicted first? Who knows. But that one gets written back first. If cache line X is written back first, then A and C get updated in memory before B. If line Y is written back first, then B gets written before A and C.
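To make the layout concrete, here is a minimal C sketch of that scenario (the 64-byte line size, the gcc attribute syntax, and all names are my assumptions for illustration):

```c
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size */

/* A and C share cache line X; B sits alone on cache line Y. */
struct line_x {
    uint64_t A;                                   /* first word of X  */
    uint64_t C;                                   /* second word of X */
    char pad[CACHE_LINE - 2 * sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE)));

struct line_y {
    uint64_t B;                                   /* first word of Y  */
    char pad[CACHE_LINE - sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE)));

struct line_x x;
struct line_y y;

void writer(void)
{
    x.A = 1;   /* store 1: dirties line X */
    y.B = 2;   /* store 2: dirties line Y */
    x.C = 3;   /* store 3: dirties line X again */
    /* Program order is A, B, C, but write-back happens per line:
       memory sees either {A,C} then {B}, or {B} then {A,C}. */
}
```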
Look up "write-combining buffers" for lots of examples where this can be a problem. not if you use atomic locks, but fences don't solve the problem. A fence just forces all writes to be done before you proceed beyond the fence. But that is "all writes" from the processor's perspective, if you haven't yet done a write because the optimizer moved it around, then you will see perhaps wrong results.
Bottom line is quite simple. If you want to have readers and writers, and make certain that a reader can't access the data while a writer is modifying it, you need to use the classic mutex/atomic-lock critical-section approach. Any attempt to bypass the overhead of the lock is going to lead to many very subtle (and perhaps rare, but never zero) bugs requiring significant debugging time. The right way is to simply prevent the problem from happening in a safe and reliable way: either eliminate the need for the critical section (the lockless hashing idea is one example of this; see the sketch below), or else make the program immune to failure if the writes are not all completed before a reader starts accessing the data.
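For reference, here is a minimal sketch of that lockless hashing idea (the XOR trick Hyatt and Mann described for transposition tables; the struct and function names are mine):

```c
#include <stdint.h>

/* Each entry is two 64-bit words written with two independent,
   non-atomic stores.  Instead of locking, the writer stores
   key ^ data in the key slot.  A reader that catches a torn entry
   (one word old, one word new) fails the XOR check and simply
   treats the probe as a miss -- a corrupted entry is never used. */
typedef struct {
    volatile uint64_t key;    /* actually holds key ^ data */
    volatile uint64_t data;
} HashEntry;

void store_entry(HashEntry *e, uint64_t key, uint64_t data)
{
    e->key  = key ^ data;     /* two plain stores, in whatever  */
    e->data = data;           /* order the hardware feels like  */
}

int probe_entry(const HashEntry *e, uint64_t key, uint64_t *out)
{
    uint64_t d = e->data;
    uint64_t k = e->key;
    if ((k ^ d) != key)       /* torn or unrelated entry: miss */
        return 0;
    *out = d;
    return 1;
}
```

This is the "immune to failure" approach: the race still happens, but a half-written entry can never be mistaken for a valid one.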
To tell anyone to do this any other way is simply doing them a disservice and will lead to much unnecessary debugging. And indeed, sometimes these race windows are so narrow they almost never occur. But "almost never" is not "never", and leaving such a window of opportunity for failure is not the way to write reliable software.
To understand this problem, you have to step back one or two steps further than you currently are. An individual CPU already executes things out of order, so out-of-order stores are common to begin with. Then you have multiple instruction streams on multiple cores or packages, further adding to this. And then the cache introduces its own quirks by trying to optimize memory writes with the write-combining stuff. So I can't safely read something and assume that if it is unchanged, the other "associated data" is also unchanged. There are too many possible pathways that violate this.
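A tiny sketch of the pattern that breaks, with hypothetical names (the reader checks one word and infers the state of its associated data):

```c
#include <stdint.h>

uint64_t signature;   /* "if this hasn't changed..."           */
uint64_t payload;     /* "...then this hasn't either" -- wrong */

int unsafe_read(uint64_t expected, uint64_t *out)
{
    /* These two plain loads (and the writer's two plain stores) can
       each be reordered by the compiler, the CPU, or the cache, so a
       current signature can show up next to a stale payload. */
    if (signature == expected) {
        *out = payload;
        return 1;
    }
    return 0;
}
```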
There's some good info in the Linux kernel documentation (see Documentation/memory-barriers.txt), where they talk about this issue specifically and at length, and complain bitterly about how lax the x86 memory write policies really are...
As far as your MESI comment goes (really MESIF now for Intel, or MOESI for AMD), it does address some things. But the hole remains: you can't know what order things are physically executed in, nor how the cache modifies the write order of that already rearranged stream.