hgm wrote:
bob wrote: Never said nor implied that. I gave the simple example of 3 values. One is your "counter" value. The other two are the data values. There is _no_ way you can guarantee the order those values get written back to memory. That's all I have said. It has nothing to do with whether I understand cache coherency or not (I do); it has _everything_ to do with whether you understand the overall flow of data through the CPUs, caches, and back to memory or not (you don't).
What a coincidence, I was thinking exactly the same. And what you say proves it: you keep flaunting your ignorance by focusing on the order things are written to memory, while that is totally irrelevant. So they are written to memory in an unpredictable order; who cares?
The only thing that matters is: if A is written before B by CPU #1, is it possible that CPU #2 sees B change before A? Cache coherency means it cannot. Whether A or B is written first to memory, or a year later, or not at all, is totally irrelevant. That is what Intel and AMD guarantee. This is what the MESI protocol is for. That is what "globally visible" means in the description of the sfence instruction: the writing of B (following the sfence) is delayed until the writing of A (before the sfence) has become globally visible. That means that if a _global_ reader (i.e. any other CPU in the system) sees B changing before he sees A changing (i.e. before the change of A became globally visible), he actually sees B changing _before_ it is written, violating causality.
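To make that concrete, here is a minimal sketch of the scenario being argued about, assuming x86 and a GCC-style compiler; the names A, B, writer and reader are purely illustrative and not taken from anyone's program.

Code:

#include <stdint.h>

volatile uint32_t A = 0;
volatile uint32_t B = 0;

void writer(void)                 /* runs on CPU #1 */
{
    A = 1;                        /* store A first */
    __asm__ __volatile__("sfence" ::: "memory");
    B = 1;                        /* B is stored only after the store to A
                                     has become globally visible */
}

int reader(void)                  /* runs on CPU #2 (or any other CPU) */
{
    if (B == 1) {
        /* The claim: if the new B is visible here, the new A must be
           visible too, no matter when (or whether) either value has
           actually been written back to DRAM. */
        return (A == 1);          /* expected to always be true */
    }
    return -1;                    /* new B not observed yet */
}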
This is just another one of your superstitions about x86...
OK, you are not going to see what you don't want to see. In a C program it is not possible to control the order things are written without a lot of intrusive, architecture-specific code that kills any hope of portability, or of the program working on different processor generations without changes.
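Just to show what "intrusive and architecture-specific" means in practice, here is a sketch of the kind of barrier macro you end up maintaining, assuming GCC; the macro and variable names are made up for the example, and every new compiler or CPU family gets its own branch.

Code:

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#  define WRITE_BARRIER() __asm__ __volatile__("sfence" ::: "memory")
#elif defined(__GNUC__)
#  define WRITE_BARRIER() __sync_synchronize()   /* full-barrier fallback */
#else
#  error "no write barrier defined for this compiler/architecture"
#endif

volatile int data = 0;
volatile int flag = 0;

void publish(int v)
{
    data = v;          /* the value itself */
    WRITE_BARRIER();   /* keep the two stores ordered */
    flag = 1;          /* "data is ready" marker */
}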
The Peterson algorithm is a horrible solution. Look at what you have to do if you want to run it on 8 cores. The cache-forwarding overhead is tremendous, and is actually worse than simply using an atomic lock. It has the potential to self-block and wait when no one is actually in the critical section. It is about as ugly an idea as one could use.
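For reference, here is a sketch of the N-thread generalization of Peterson's algorithm (the "filter" lock), assuming NTHREADS cooperating threads with ids 0..NTHREADS-1; the names are illustrative, and the algorithm additionally assumes sequentially consistent loads and stores, so real x86 code would need fences on top of it. Every thread writes to and spins on the shared level[] and victim[] arrays, which is exactly the cache-line forwarding traffic described above.

Code:

#define NTHREADS 8

volatile int level[NTHREADS];    /* highest level each thread has reached */
volatile int victim[NTHREADS];   /* which thread entered each level last  */

void filter_lock(int me)
{
    for (int l = 1; l < NTHREADS; l++) {
        level[me] = l;
        victim[l] = me;
        /* spin while some other thread is at this level or higher
           and we are still the designated victim at this level */
        for (;;) {
            int other_ahead = 0;
            for (int k = 0; k < NTHREADS; k++)
                if (k != me && level[k] >= l) { other_ahead = 1; break; }
            if (!other_ahead || victim[l] != me)
                break;
        }
    }
}

void filter_unlock(int me)
{
    level[me] = 0;
}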
The idea here is efficiency. Gerd was trying to get around the overhead of a lock, and you want to add far more overhead by having all the processors set a group of values that are necessarily shared and therefore forwarded everywhere. If you are happy with such a solution, which is worse than the original problem, fine. I'm not. The entire purpose of parallel search is to improve performance, not to introduce sequential bottlenecks.
But feel free to carry on proposing lousy solutions because you don't take the time to look at them carefully and see how they actually work within the architecture. The correct solution on x86 is to use a lock, or else to work on an approach that doesn't need one. Come back after you have actually written a parallel search and understand the overhead issues; then you will be prepared to discuss them in a context that is meaningful.
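For comparison, a minimal sketch of the lock-based alternative, using GCC's atomic builtins (which compile down to a locked exchange on x86); the names are illustrative and not taken from any particular engine.

Code:

typedef volatile int spinlock_t;

static void spin_lock(spinlock_t *l)
{
    while (__sync_lock_test_and_set(l, 1))    /* atomic exchange, acquire */
        while (*l)                            /* spin on plain reads until
                                                 the lock looks free      */
            ;
}

static void spin_unlock(spinlock_t *l)
{
    __sync_lock_release(l);                   /* store 0 with release */
}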
As far as the ordering goes, until you "get it" you are not going to get it. If you write everything in assembly language and carefully protect things with fences, you can make _some_ progress. But you still have no control over how two or more instruction streams interleave their instructions and screw up the ordering. I can always read an old value before you store the new one, and you can't do a thing to guarantee that you store both values before I read either one, unless you use a lock. Again, this is _not_ a cache issue. Never has been.
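To illustrate that last point: even if fences keep the two stores in order, a reader can still run in between them and combine a new value with an old one; only holding the same lock around both the update and the read makes the pair behave as a unit. A minimal sketch using a pthread mutex (a spinlock like the one above would serve equally well); the field and function names are invented for the example.

Code:

#include <pthread.h>

static pthread_mutex_t entry_lock = PTHREAD_MUTEX_INITIALIZER;

static volatile int best_move  = 0;
static volatile int best_bound = 0;

void update_entry(int move, int bound)
{
    pthread_mutex_lock(&entry_lock);
    best_move  = move;               /* both stores happen while no    */
    best_bound = bound;              /* reader can get in between them */
    pthread_mutex_unlock(&entry_lock);
}

void read_entry(int *move, int *bound)
{
    pthread_mutex_lock(&entry_lock); /* without this, *move and *bound */
    *move  = best_move;              /* could come from two different  */
    *bound = best_bound;             /* updates                        */
    pthread_mutex_unlock(&entry_lock);
}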