Sync question

hgm · Post by **hgm** » Sat Jul 25, 2009 10:33 am

bob wrote:I understand those. Just not on X86.

That much is clear. It seems the entire concept of "cache coherency" escapes you. You seem to think that CPU B would read stale results from memory as long as CPU A, holding an altered value in its cache, has not yet flushed it...

But the issue here is "is there a way to prevent readers from reading while a writer is writing, or prevent a writer from writing, while readers are reading, without a lock?" On X86, the answer is "no". That was the original question, and the correct answer to that question.

That is not a correct answer at all. Peterson's algorithm would work on x86, provided you take proper care of forcing stores and loads (to cache!) to occur in program order (using sfence etc.). And this involves no bus lock.

bob · Post by **bob** » Sat Jul 25, 2009 4:55 pm

hgm wrote:
bob wrote:I understand those. Just not on X86.
That much is clear. It seems the entire concept of "cache coherency" escapes you. You seem to think that CPU B would read stale results from memory as long as CPU A, holding an altered value in its cache, has not yet flushed it...

Never said nor implied that. I gave the simple example of 3 values. One is your "counter" value. The other two are the data values. There is _no_ way you can guarantee the order those values get written back to memory. That's all I have said. It has nothing to do with whether I understand cache coherency or not (I do) it has _everything_ to do with whether you understand the overall flow of data thru the CPUs, caches, and back to memory or not (you don't.).

If you think a lockless lock will work on X86, go right ahead and use it all you want. The rest of us that actually understand the issues will continue to understand that you either use atomic locks on X86, or you deal with race conditions that will cause problems. There is no middle-ground.

But the issue here is "is there a way to prevent readers from reading while a writer is writing, or prevent a writer from writing, while readers are reading, without a lock?" On X86, the answer is "no". That was the original question, and the correct answer to that question.
That is not a correct answer at all. Peterson's algorithm would work on x86, provided you take proper care of forcing stores and loads (to cache!) to occur in program order (using sfence etc.). And this involves no bus lock.

I suggest you re-read the peterson algorithm. It is not quite as "functional" as you think and his comments reflect this.

But in any case, I see no point in arguing further. The X86 is an out-of-order memory write box. Always has been. Fortunately it has the lock prefix to allow correctly implemented critical sections, if you can stand the overhead. Trying to do this without a lock is simply going to waste a lot of time on debugging. And many want to write programs that will run correctly regardless of the architecture, which makes this discussion even more pointless overall.

hgm · Post by **hgm** » Sat Jul 25, 2009 7:11 pm

bob wrote:Never said nor implied that. I gave the simple example of 3 values. One is your "counter" value. The other two are the data values. There is _no_ way you can guarantee the order those values get written back to memory. That's all I have said. It has nothing to do with whether I understand cache coherency or not (I do) it has _everything_ to do with whether you understand the overall flow of data thru the CPUs, caches, and back to memory or not (you don't.).

What a coincidence, I was thinking exactly the same. And what you say proves it: You keep flaunting your ignorance by focusing on the order things are written in memory, while that is totally irrelevant. So they are written in an unpredictable order to memory, who cares?

The only thing that matters is if A is written before B by CPU #1, is it possible that CPU #2 sees B change before A. Cache coherency means it cannot. Wheter A or B is written first to memory, or a year later, or not at all is totally irrelevant. That is what Intel and AMD guarantee. This is what the MESI protocol is for. That is what "globally visible" means in the description of the sfence instruction. The writing of B (following the sfence) is delayed until the writing of A (before the sfence) has become globally visible. That means that if a _global_ reader (i.e. any other CPU in the system) sees B changing before he sees A changing, (i.e. before the change of A became globally visible), he actually sees B changing _before_ it is written, violating causality.

This is just another one of your superstitions about x86...

bob · Post by **bob** » Sat Jul 25, 2009 8:29 pm

hgm wrote:
bob wrote:Never said nor implied that. I gave the simple example of 3 values. One is your "counter" value. The other two are the data values. There is _no_ way you can guarantee the order those values get written back to memory. That's all I have said. It has nothing to do with whether I understand cache coherency or not (I do) it has _everything_ to do with whether you understand the overall flow of data thru the CPUs, caches, and back to memory or not (you don't.).
What a coincidence, I was thinking exactly the same. And what you say proves it: You keep flaunting your ignorance by focusing on the order things are written in memory, while that is totally irrelevant. So they are written in an unpredictable order to memory, who cares?

The only thing that matters is if A is written before B by CPU #1, is it possible that CPU #2 sees B change before A. Cache coherency means it cannot. Wheter A or B is written first to memory, or a year later, or not at all is totally irrelevant. That is what Intel and AMD guarantee. This is what the MESI protocol is for. That is what "globally visible" means in the description of the sfence instruction. The writing of B (following the sfence) is delayed until the writing of A (before the sfence) has become globally visible. That means that if a _global_ reader (i.e. any other CPU in the system) sees B changing before he sees A changing, (i.e. before the change of A became globally visible), he actually sees B changing _before_ it is written, violating causality.

This is just another one of your superstitions about x86...

OK, you are not going to see what you don't want to see. In a C program it is not possible to control the order things are written, without a lot of intrusive and architecturally-specific code that kills any hope at portability or that the program will work on different processor generations with no changes.

The peterson algorithm is a horrible solution. Look at what you have to do if you want to run on 8 cores. The cache-forwarding overhead is tremendous, and is actually worse than simply using an atomic lock. It has a potential for self-blocking and waiting when no one is actually in the critical section. It is about as ugly an idea as one could use,

The idea here is efficiency, and Gerd was trying to get around the overhead of a lock, and you want to add far more overhead by having all the processors set a group of values that are necessarily shared and therefore forwarded everywhere. If you are happy with such a solution, which is worse than the original problem, fine. I'm not. The entire purpose of parallel search is to improve performance, not introduce sequential bottlenecks.

But feel free to carry on by proposing solutions that are lousy, because you don't take the time to look at them carefully to see how they actually work within the architecture. The correct solution on X86 is to use a lock, or else work on an approach that doesn't need one. Come back after you have actually written a parallel search and understand the overhead issues, then you will be prepared to discuss the issues in a context that is meaningful.

As far as the ordering goes, until you "get it" you are not going to get it. If you write everything in assembly language, and carefully protect things with fences, you can make _some_ progress. But you still have no control over how the two or more instruction streams interleave instructions and screw up ordering. I can always read an old value before you store the new one, and you can't do a thing to guarantee that you store both values before I read either one, unless you use a lock. Again, this is _not_ a cache issue. Never has been.

bob · Post by **bob** » Sat Jul 25, 2009 8:37 pm

bob wrote:
hgm wrote:
bob wrote:Never said nor implied that. I gave the simple example of 3 values. One is your "counter" value. The other two are the data values. There is _no_ way you can guarantee the order those values get written back to memory. That's all I have said. It has nothing to do with whether I understand cache coherency or not (I do) it has _everything_ to do with whether you understand the overall flow of data thru the CPUs, caches, and back to memory or not (you don't.).
What a coincidence, I was thinking exactly the same. And what you say proves it: You keep flaunting your ignorance by focusing on the order things are written in memory, while that is totally irrelevant. So they are written in an unpredictable order to memory, who cares?

The only thing that matters is if A is written before B by CPU #1, is it possible that CPU #2 sees B change before A. Cache coherency means it cannot. Wheter A or B is written first to memory, or a year later, or not at all is totally irrelevant. That is what Intel and AMD guarantee. This is what the MESI protocol is for. That is what "globally visible" means in the description of the sfence instruction. The writing of B (following the sfence) is delayed until the writing of A (before the sfence) has become globally visible. That means that if a _global_ reader (i.e. any other CPU in the system) sees B changing before he sees A changing, (i.e. before the change of A became globally visible), he actually sees B changing _before_ it is written, violating causality.

This is just another one of your superstitions about x86...
OK, you are not going to see what you don't want to see. In a C program it is not possible to control the order things are written, without a lot of intrusive and architecturally-specific code that kills any hope at portability or that the program will work on different processor generations with no changes.

The peterson algorithm is a horrible solution. Look at what you have to do if you want to run on 8 cores. The cache-forwarding overhead is tremendous, and is actually worse than simply using an atomic lock. It has a potential for self-blocking and waiting when no one is actually in the critical section. It is about as ugly an idea as one could use,

The idea here is efficiency, and Gerd was trying to get around the overhead of a lock, and you want to add far more overhead by having all the processors set a group of values that are necessarily shared and therefore forwarded everywhere. If you are happy with such a solution, which is worse than the original problem, fine. I'm not. The entire purpose of parallel search is to improve performance, not introduce sequential bottlenecks.

But feel free to carry on by proposing solutions that are lousy, because you don't take the time to look at them carefully to see how they actually work within the architecture. The correct solution on X86 is to use a lock, or else work on an approach that doesn't need one. Come back after you have actually written a parallel search and understand the overhead issues, then you will be prepared to discuss the issues in a context that is meaningful.

As far as the ordering goes, until you "get it" you are not going to get it. If you write everything in assembly language, and carefully protect things with fences, you can make _some_ progress. But you still have no control over how the two or more instruction streams interleave instructions and screw up ordering. I can always read an old value before you store the new one, and you can't do a thing to guarantee that you store both values before I read either one, unless you use a lock. Again, this is _not_ a cache issue. Never has been.

Don't know why you won't go back and find Zach's test. I reported on this issue with hashing a couple of months ago. Someone raised the same point and said "this can't possibly happen." But it not only can, it does. I don't remember the frequency of different orders of writes when Zach ran his test, I only know that I had to fix the problem because it was making my program crash.

P1 can write A1 and B1 to variables A and B. P2 can write A2 and B2 to A and B. And what you end up with is one of four possibilities. Most commonly you will get A1/B1 or A2/B2, but you will also see some instances of A1/B2 and A2/B1. You can't force two stores to be done in two successive cycles. And since you can't, you can't control what happens between the two stores with other threads running their own code...

Again, the right solution is to use a lock. Or design around the need so that interleaved data won't hurt anything. Or avoid parallel programming completely.

hgm · Post by **hgm** » Sat Jul 25, 2009 10:24 pm

So you don't even understand what the problem is...

The question (and the solution I gave for it) was _not_ about two writers writing the same variable and messing up each others data. The fact that writes are strictly performed and globally visible in program order has no impact on that at all, as it is determined by the timing of writes on _different_ CPUs.

The question was about keeping _readers_ out when someone is writing.

So of course Zach finds all combinations of corrupted data. Now tell us something we did not already know long before anyone tried it, and that has something to do with the problem at hand.

Furthermore, I think you should clarify what you mean by "lock". Normally Peterson's algorithm is considered a lock, but you seem to deny that this would work on an x86. Did Zach try the two-writer's problem with Peterson locking?

bob · Post by **bob** » Sat Jul 25, 2009 11:54 pm

hgm wrote:So you don't even understand what the problem is...

The question (and the solution I gave for it) was _not_ about two writers writing the same variable and messing up each others data. The fact that writes are strictly performed and globally visible in program order has no impact on that at all, as it is determined by the timing of writes on _different_ CPUs.

The question was about keeping _readers_ out when someone is writing.

And what inane idea did you suggest? Peterson's algorithm? Which does a _bunch_ of writes itself???

The way to keep readers out is with a simple lock. As I said. The other clumsy attempts either have race conditions _or_ move overhead than the lock itself.

So of course Zach finds all combinations of corrupted data. Now tell us something we did not already know long before anyone tried it, and that has something to do with the problem at hand.

Furthermore, I think you should clarify what you mean by "lock". Normally Peterson's algorithm is considered a lock, but you seem to deny that this would work on an x86. Did Zach try the two-writer's problem with Peterson locking?

I did not deny that Peterson's algorithm will work. I pointed out that it does have a small flaw that is well known, where you can end up waiting when no one is actually in the critical section. And secondly that it is hugely complex for > 2 threads and requires a ton of writes from each thread which certainly ramps up the cache forwarding traffic level to the point that it is, for some small number of threads and beyond, physically worse performance-wise than a simple atomic lock.

What I mean by "lock" (more specifically "atomic lock") is exactly what everyone else means by the term. A variable that is usually zero but which can be atomically set to one with the guarantee that multiple threads can try to acquire the lock at the same time and only one will gain control.

hgm · Post by **hgm** » Sun Jul 26, 2009 9:44 am

bob wrote:And what inane idea did you suggest? Peterson's algorithm? Which does a _bunch_ of writes itself???

So you did not read that either? You are just disagreeing with it out of habit, shouting that it will not work without even having looked at it?

The way to keep readers out is with a simple lock. As I said. The other clumsy attempts either have race conditions _or_ move overhead than the lock itself.

Simple but expensive. And you must speak for your own attempts, because mine seem to work fine...

What I mean by "lock" (more specifically "atomic lock") is exactly what everyone else means by the term. A variable that is usually zero but which can be atomically set to one with the guarantee that multiple threads can try to acquire the lock at the same time and only one will gain control.

Ah, so now you suddenly make up the qualifier "atomic" to add to it. I suspected such. What everyone else means by "lock" is of course not limited to atomic locks. A lock in general is simply a device to keep tresspassers out of the critical section, by whatever means.

OK, as by now you must have succeeded chasing away anyone that could even be remotely interested, there is no point in continuing this discussion. So I am out of it. I don't run a treatment center for dyslectics, so seek help elsewhere...

bob · Post by **bob** » Sun Jul 26, 2009 4:21 pm

hgm wrote:
bob wrote:And what inane idea did you suggest? Peterson's algorithm? Which does a _bunch_ of writes itself???
So you did not read that either? You are just disagreeing with it out of habit, shouting that it will not work without even having looked at it?

First, I did not say "it will not work". Once you learn to read, you will see that. Second, the "not even having looked at it" is funny. I teach parallel programming and this is covered in most books. So it does work, it does _not_ work very well if you already have a real atomic lock instruction. Peterson's algorithm was explicitly developed for use on machines that do _not_ have a more efficient atomic lock mechanism. You did know that, right?

The way to keep readers out is with a simple lock. As I said. The other clumsy attempts either have race conditions _or_ move overhead than the lock itself.
Simple but expensive. And you must speak for your own attempts, because mine seem to work fine...

What I mean by "lock" (more specifically "atomic lock") is exactly what everyone else means by the term. A variable that is usually zero but which can be atomically set to one with the guarantee that multiple threads can try to acquire the lock at the same time and only one will gain control.
Ah, so now you suddenly make up the qualifier "atomic" to add to it. I suspected such. What everyone else means by "lock" is of course not limited to atomic locks. A lock in general is simply a device to keep tresspassers out of the critical section, by whatever means.

OK, as by now you must have succeeded chasing away anyone that could even be remotely interested, there is no point in continuing this discussion. So I am out of it. I don't run a treatment center for dyslectics, so seek help elsewhere...

hgm · Post by **hgm** » Sun Jul 26, 2009 6:22 pm

If you want to continue playing sillybuggers, you can do it on your own. To all other people reading this thread (in the unlikely case there are any), it is already obvious that I was not referring to Peterson's algorithm.

Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question