Sync question

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Sync question

Post by bob »

hgm wrote:If you want to continue playing silly buggers, you can do it on your own. To all other people reading this thread (in the unlikely case there are any), it is already obvious that I was not referring to Peterson's algorithm.
You certainly posted a link to it. But whatever you think, feel free to continue. The atomic lock is the correct way to solve this problem on X86; no other solution is faster, easier, or less error-prone. Those of us who actually develop parallel algorithms understand this. Those who wish they did will not follow. My intent is to show people the _right_ way to accomplish a task they are interested in doing. And that's exactly what I did. What you did was simply a disservice to a beginner.
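For reference, the kind of atomic lock meant here is a plain xchg-based spinlock. A minimal sketch, assuming GCC-style inline assembly (the names spin_lock/spin_unlock are illustrative, not taken from any particular engine):

Code: Select all

static volatile int lock_word = 0;  /* the shared lock variable */

static void spin_lock(volatile int *l)
{
  int old = 1;
  do {  /* xchg with a memory operand is implicitly locked on x86 */
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(old), "+m"(*l) : : "memory");
  } while (old != 0);  /* spin until the old value shows the lock was free */
}

static void spin_unlock(volatile int *l)
{
  __asm__ __volatile__("" : : : "memory");  /* compiler barrier */
  *l = 0;  /* a plain store releases the lock on x86 */
}

The xchg both writes the 1 and returns the previous value in one atomic step, which is why no explicit lock prefix is needed.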
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Sync question

Post by bob »

Bo Persson wrote:
Gerd Isenberg wrote:
Bo Persson wrote: This might work with later versions of MSVC (with an extension to the volatile semantics) on x86 hardware. On most everything else, you have race conditions in the while loops, as well as no guarantees about updates happening in order on other CPU chips.
What is the MSVC extension to the volatile semantics? Is it related to read- and write barriers to ensure the order of instructions?
Later versions (from VC2005 or so) have a guarantee that not only are the volatile objects read and written correctly, but the optimizer will also NOT move other non-volatile reads and writes past the volatile ones. Otherwise, the read of the data you are trying to protect might be hoisted out of the loop. :-(

http://msdn.microsoft.com/en-us/library/12a04hfd.aspx

Again, this might improve correctness, but it also limits what the optimizer can do to the code!
GCC/ICC (Intel) also use the "volatile" keyword on inline functions, including those with asm statements. The compiler cannot move stuff that happens before the call to a volatile function so that it follows the function call, which makes locks safe and optimizer-proof as well.
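As a minimal sketch of the hoisting problem described above, assuming GCC (the variable names are made up): without the volatile qualifier, the optimizer may read the flag once and then spin on a stale register copy forever.

Code: Select all

volatile int ready = 0;  /* volatile forces a fresh read on every pass */
int shared_data;

void consumer(void)
{
  while (!ready)
    ;  /* without volatile, this read could be hoisted out of the loop */
  /* compiler barrier: no instruction is emitted, but GCC may not
     move memory accesses across it */
  __asm__ __volatile__("" : : : "memory");
  int v = shared_data;  /* guaranteed to be read after the loop exits */
  (void)v;
}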
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Sync question

Post by hgm »

bob wrote: What you did was simply a disservice to a beginner.
I would say that what you do is a disservice to everyone, with your continuous spreading of falsehoods about x86 architecture. For starters, I was not addressing a beginner; I was answering a specific question from an _expert_ (Gerd), who knows everything about atomic locks and machine architecture, and who was looking for an innovative and faster method in a certain limiting case (many reads, virtually no writes).

So I offered one.

That you don't like it, don't feel the need to read the pseudo-code, or do not understand it for lack of x86 knowledge is your problem.

Do not annoy others with it.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Sync question

Post by Gerd Isenberg »

hgm wrote:
bob wrote: What you did was simply a disservice to a beginner.
I would say that what you do is a disservice to everyone, with your continuous spreading of falsehoods about x86 architecture. For starters, I was not addressing a beginner; I was answering a specific question from an _expert_ (Gerd), who knows everything about atomic locks and machine architecture, and who was looking for an innovative and faster method in a certain limiting case (many reads, virtually no writes).

So I offered one.

That you don't like it, don't feel the need to read the pseudo-code, or do not understand it for lack of x86 knowledge is your problem.

Do not annoy others with it.
I am not really an expert on multi-threading or multi-processing topics. I don't have the competence to say which of you is (partly) right or wrong, and I tend to agree with Bob on that matter, also for simplicity's sake and because I don't understand all the traps and race conditions in multi-threaded code. My initial question was "academic", without practical relevance for me. So I will study your proposal and try to understand it later, but your quarrels made me lose some interest in the issue. Actually, for the hash table I rely on 128-bit moves to probe and write entries. For the pawn hash and other tables I tend to use local ones for each thread.
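For the curious, a sketch of what such a 128-bit probe and store could look like with SSE2 intrinsics; the HashEntry layout and function names are made up for illustration, and note that x86 does not architecturally guarantee that 16-byte SSE accesses are atomic, even though aligned ones usually behave that way in practice.

Code: Select all

#include <emmintrin.h>

typedef struct {
  unsigned long long key;
  unsigned long long data;
} __attribute__((aligned(16))) HashEntry;  /* 16 bytes, 16-byte aligned */

/* read a whole entry with a single 128-bit move */
static HashEntry hash_probe(const HashEntry *slot)
{
  HashEntry e;
  _mm_store_si128((__m128i *)&e, _mm_load_si128((const __m128i *)slot));
  return e;
}

/* write a whole entry with a single 128-bit move */
static void hash_store(HashEntry *slot, const HashEntry *e)
{
  _mm_store_si128((__m128i *)slot, _mm_load_si128((const __m128i *)e));
}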
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Sync question

Post by bob »

hgm wrote:
bob wrote: What you did was simply a disservice to a beginner.
I would say that what you do is a disservice to everyone, with your continuous spreading of falsehoods about x86 architecture. For starters, I was not addressing a beginner; I was answering a specific question from an _expert_ (Gerd), who knows everything about atomic locks and machine architecture, and who was looking for an innovative and faster method in a certain limiting case (many reads, virtually no writes).

So I offered one.

That you don't like it, don't feel the need to read the pseudo-code, or do not understand it for lack of x86 knowledge is your problem.

Do not annoy others with it.
Please show me _any_ "falsehood" I have stated.

As far as what you "offered" goes, it is worse, as I said, from a performance perspective. And if one is not very careful, it won't even work correctly.

carry on with the nonsense.

You seem unable to get past the cache issues, when I have said from the get-go that this is _not_ a cache issue (it is out-of-order writes). It is simply an issue one has to account for, period.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Sync question

Post by hgm »

bob wrote:Please show me _any_ "falsehood" I have stated.
That readers in an x86 SMP environment can see variables change in a different order than the one they are written in (as forced by sfence), of course.
As far as what you "offered" goes, it is worse, as I said, from a performance perspective. And if one is not very careful, it won't even work correctly.
And this is yet another falsehood. Two unlocked in-order reads of an item in a shared cache line have a latency of only some 7 cycles, for a throughput of 3 instructions, without pipeline stalls; i.e., one to two orders of magnitude _faster_ than an atomic lock.
carry on with the nonsense.
That seems indeed to be your attitude. I would say: "stop with the nonsense". But it falls on deaf ears... As your statement about the performance above shows, you keep it coming...
You seem unable to get past the cache issues, when I have said from the get-go that this is _not_ a cache issue (it is out-of-order writes). It is simply an issue one has to account for, period.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Sync question

Post by Gian-Carlo Pascutto »

Code: Select all

write()
{
  lock();
  counter++;             // bump the counter first, so a racing reader
                         // can detect that the data is being changed
  sfence;                // not needed on x86 and amd64 (but see note)
  // do write here
  unlock();
}

read()
{
  static int x = -1;     // counter value seen at the last valid read

  // do read here (optimistic, without taking the lock)
  lfence;                // not needed on x86 and amd64 (but see note)
  if (counter != x)      // counter changed; read might be corrupted
  {
    lock();
    // do read here (again, this time safely under the lock)
    x = counter;
    unlock();
  }
}
So, the code needs both an sfence and an lfence on weakly ordered CPUs (and will hence be slow there), but it will work correctly and fast on x86 and derivatives, due to the store/store and load/load ordering that is enforced.

See:

http://www.amd.com/us-en/assets/content ... /24593.pdf, page 164

And note that write() writes A, then B, whereas read() reads B, then A.

Note that the *COMPILER* is still allowed to reorder the loads and stores, so even on x86/amd64 you still actually need the barriers, even though they can be compiler barriers (volatile) rather than actual hardware serialization.
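As a sketch of how those compiler barriers might look in C on x86, assuming GCC and a pthread mutex standing in for lock()/unlock(); counter is the name from the pseudo-code above, the rest is made up:

Code: Select all

#include <pthread.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static volatile int counter;
static volatile int shared_data;  /* stand-in for the real payload */

/* compiler-only barrier: no instruction is emitted on x86, but the
   compiler may not reorder memory accesses across it */
#define barrier() __asm__ __volatile__("" : : : "memory")

void write_entry(int v)
{
  pthread_mutex_lock(&mtx);
  counter++;
  barrier();        /* compiler-level stand-in for the sfence above */
  shared_data = v;
  pthread_mutex_unlock(&mtx);
}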
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Sync question

Post by hgm »

Gian-Carlo Pascutto wrote:So, the code needs both an sfence and an lfence on weakly ordered CPUs (and will hence be slow there), but it will work correctly and fast on x86 and derivatives, due to the store/store and load/load ordering that is enforced.
Agreed. But an lfence need not be very expensive on any machine, I think. If it were unduly slow, the "home-made" lfence I proposed, through

Code: Select all

movl memA, %ecx         # load A into %ecx
movl %ecx, %eax         # hand the loaded value to the caller in %eax
andl $0, %ecx           # zero %ecx, keeping a register dependency on the load of A
movl memB(%ecx), %ebx   # load B; its address now depends on the completed load of A
is an alternative that forces the read of B to start only after the read of memA is complete, and it is hardly more expensive than a normal load.

The original problem was formulated in such a way (reads more frequent than writes by at least a factor of 10,000) that only the efficiency of reading matters. So it doesn't matter how expensive sfence is. (Although it is hard to imagine that sfence is more expensive than a mutex based on xchg, since there the CPU also has to wait until the written data is "globally visible" before the read data can be made available to the CPU, or two CPUs could both grab the lock.)

Under these conditions, the locations to be read are likely to be almost always in cache for all CPUs, in the Shared state. So the normal loads would have quite low latency. Only on the occasional write would all these entries be invalidated, and the readers would have to acquire the line again from memory, or directly from their sibling CPU. (Assuming they do not communicate through L2, as on Core 2 Duo, or through L3, as on Core i7.)
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Sync question

Post by Gian-Carlo Pascutto »

hgm wrote:
Gian-Carlo Pascutto wrote:So, the code needs both an sfence and an lfence on weakly ordered CPUs (and will hence be slow there), but it will work correctly and fast on x86 and derivatives, due to the store/store and load/load ordering that is enforced.
Agreed. But an lfence need not be very expensive on any machine, I think. If it were unduly slow, the "home-made" lfence I proposed, through

Code: Select all

movl memA, %ecx         # load A into %ecx
movl %ecx, %eax         # hand the loaded value to the caller in %eax
andl $0, %ecx           # zero %ecx, keeping a register dependency on the load of A
movl memB(%ecx), %ebx   # load B; its address now depends on the completed load of A
is an alternative that forces the read of B to start only after the read of memA is complete, and it is hardly more expensive than a normal load.
I'm not sure what this is supposed to accomplish. Either the fences are needed, and this code is buggy, or the fences aren't needed, and this code is pointless.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Sync question

Post by bob »

hgm wrote:
bob wrote:Please show me _any_ "falsehood" I have stated.
That readers in an x86 SMP environment can see variables change in a different order than the one they are written in (as forced by sfence), of course.
OK, some examples.

For each of the following examples, I want to execute these two instructions:

mov A, eax
mov B, ebx

They appear in exactly that order in my program.

The _only_ way you can _guarantee_ that the two stores are done _exactly_ in the order written is as follows:

mov A, eax
sfence
mov B, ebx
sfence

Which incurs far more overhead than the original atomic lock I mentioned.

You can't get away with this:

mov A, eax
mov B, ebx
sfence

Because all the sfence instruction guarantees is that, when you execute it, the processor stalls until both of the preceding mov instructions have completed. But it does _not_ guarantee that B won't be written first. It just guarantees that, before you proceed, _both_ will have been written, in unknown order.

I didn't notice you mentioning this, because I don't believe you understand what is going on at this level. Which is OK, but it is not OK to just say "sfence" solves this when it most certainly does not.

So, if you want to guarantee the order of two stores, you have to put an sfence after each. And that completely stalls the CPU while it is waiting on the write; it won't execute any other instructions out of order to fill in the idle cycles in the various pipelines. A lousy solution.

Otherwise you can use a lock where you write and a lock where you read, so that a reader can't access _either_ value until the lock is released. You still need an sfence (store fence) prior to clearing the lock, or you can still get zapped, since the instruction to clear the lock is a simple mov lock, 0 and the preceding stores might not both be completed prior to it.
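A sketch of that release sequence, assuming GCC-style inline assembly (unlock_with_fence is a made-up name):

Code: Select all

/* release the lock only after all stores in the critical section
   are globally visible */
static void unlock_with_fence(volatile int *lock_word)
{
  __asm__ __volatile__("sfence" : : : "memory");
  *lock_word = 0;  /* the simple "mov lock, 0" mentioned above */
}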

If you don't program in assembly language, you don't get access to the sfence stuff anyway, and unless you know you need it, you have subtle timing issues that are a royal pain to deal with, because they are very hard to detect.

Out-of-order writes are an x86 characteristic. There is no way around them unless you want to intersperse sfence instructions throughout your code and slow it down by at least a couple of orders of magnitude.

That is exactly what I have been saying all along. Peterson's code has a problem on X86 without sfence. In fact, any program that depends on A being written before B (in the above example) has a problem, unless you are willing to slow things to a crawl with sfence.
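For readers who have not seen it, the Peterson entry and exit code under discussion looks roughly like this; as a conservative sketch, GCC's __sync_synchronize() full barrier stands in for the fence:

Code: Select all

volatile int flag[2], turn;  /* two-thread Peterson */

void peterson_lock(int me)
{
  int other = 1 - me;
  flag[me] = 1;            /* announce intent to enter */
  turn = other;            /* give the other thread the tie-break */
  __sync_synchronize();    /* full barrier before reading the other flag */
  while (flag[other] && turn == other)
    ;                      /* spin until it is our turn */
}

void peterson_unlock(int me)
{
  __sync_synchronize();
  flag[me] = 0;            /* leave the critical section */
}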

That is a simple explanation of the problem I have been hitting on from day one. I have _repeatedly_ said "this is not a cache issue" because it has _nothing_ to do with cache. If cache never sees the mov A, eax write, it can't do a thing to help us. And it won't always see that write before the write to B. Hence there is no ordering guarantee, and cache can't do a thing about it.

Note also that it does not matter at all what the state of A and B is with respect to other caches, or even the cache on the current processor. Neither might be cached anywhere. Both might be cached as shared everywhere. The problem still persists in exactly the same way, with no help from cache to solve it.

As far as what you "offered" goes, it is worse, as I said, from a performance perspective. And if one is not very careful, it won't even work correctly.
And this is yet another falsehood. Two unlocked in-order reads of an item in a shared cache line have a latency of only some 7 cycles, for a throughput of 3 instructions, without pipeline stalls; i.e., one to two orders of magnitude _faster_ than an atomic lock.
Again, you have the out-of-order problem unless you sfence _each_ write on the writer's end. And that is worse still, as I said. Locks don't stall the CPU, while sfence instructions halt everything.

carry on with the nonsense.
That seems indeed to be your attitude. I would say: "stop with the nonsense". But it falls on deaf ears... As your statement about the performance above shows, you keep it coming...
The only thing I see is that you really don't understand the issues involved, since you haven't written this type of code before. You'll "get it" eventually, or else never write a successful parallel algorithm of any kind.
You seem unable to get past the cache issues, when I have said from the get-go that this is _not_ a cache issue (it is out-of-order writes). It is simply an issue one has to account for, period.