hgm wrote: If you want to continue playing sillybuggers, you can do it on your own. To all other people reading this thread (in the unlikely case there are any), it is already obvious that I was not referring to Peterson's algorithm.

You certainly posted a link to it. But whatever you think, feel free to continue. The atomic lock is the correct way to solve this problem on x86; there is no other solution that is faster, easier, or less error-prone. Those of us who actually develop parallel algorithms understand this. Those who wish they did will not follow. My intent is to show people the _right_ way to accomplish a task they are interested in doing. And that's exactly what I did. What you did was simply a disservice to a beginner.
Sync question
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Sync question
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Sync question
Bo Persson wrote: This might work with later versions of MSVC (with an extension to the volatile semantics) on x86 hardware. On most everything else, you have race conditions in the while loops, as well as no guarantees about updates happening in order on other CPU chips.

Gerd Isenberg wrote: What is the MSVC extension to the volatile semantics? Is it related to read- and write barriers to ensure the order of instructions?

Bo Persson wrote: Later versions (from VC2005 or so) have a guarantee that not only are the volatile objects read and written correctly, but the optimizer will also NOT move other non-volatile reads and writes past the volatile ones. Otherwise, reading the data you are trying to protect might be hoisted out of the loop.

For GCC/ICC (Intel) they also use the "volatile" keyword on inline functions, including those with asm statements. The compiler cannot move stuff that happens before the call to a volatile function so that it follows the function call, which makes locks safe and optimizer-proof as well.
http://msdn.microsoft.com/en-us/library/12a04hfd.aspx
Again, this might improve correctness but also limits what the optimizer will do to the code!
- Posts: 27790
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Sync question
bob wrote: What you did was simply a disservice to a beginner.

I would say that what you did is a disservice to everyone, by your continuous spreading of falsehoods about the x86 architecture. For starters, I was not addressing a beginner; I was answering a specific question from an _expert_ (Gerd), who knows everything about atomic locks and machine architecture, and was looking for an innovative, faster method in a certain limiting case (many reads, virtually no writes).
So I offered one.
That you don't like it, don't feel the need to read the pseudo-code, or do not understand it for lack of x86 knowledge is your problem.
Do not annoy others with it.
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: Sync question
hgm wrote:
bob wrote: What you did was simply a disservice to a beginner.
I would say that what you did is a disservice to everyone, by your continuous spreading of falsehoods about the x86 architecture. For starters, I was not addressing a beginner; I was answering a specific question from an _expert_ (Gerd), who knows everything about atomic locks and machine architecture, and was looking for an innovative, faster method in a certain limiting case (many reads, virtually no writes).
So I offered one.
That you don't like it, don't feel the need to read the pseudo-code, or do not understand it for lack of x86 knowledge is your problem.
Do not annoy others with it.

I am not really an expert on multi-threading or multiprocessing topics. I do not have the competence to say which of you is (partly) wrong or right, and I tend to agree with Bob on that matter, also for simplicity and because I don't understand all the traps and race conditions in multi-threaded code. My initial question was "academic", without practical relevance for me. So I will study your proposal and try to understand it later, but your quarrels made me lose some interest in the issue. For the hash table, I rely on 128-bit moves to probe and write entries. For the pawn hash and other tables I tend to use local ones for each thread.
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Sync question
hgm wrote:
bob wrote: What you did was simply a disservice to a beginner.
I would say that what you did is a disservice to everyone, by your continuous spreading of falsehoods about the x86 architecture. For starters, I was not addressing a beginner; I was answering a specific question from an _expert_ (Gerd), who knows everything about atomic locks and machine architecture, and was looking for an innovative, faster method in a certain limiting case (many reads, virtually no writes).
So I offered one.
That you don't like it, don't feel the need to read the pseudo-code, or do not understand it for lack of x86 knowledge is your problem.
Do not annoy others with it.

Please show me _any_ "falsehood" I have stated.

As far as what you "offered": it is worse, as I said, from a performance perspective. And if one is not very careful, it won't even work correctly.

Carry on with the nonsense.

You seem unable to get past the cache issues, when I have said from the get-go that this is _not_ a cache issue (it is out-of-order writes). It is simply an issue one has to account for, period.
- Posts: 27790
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Sync question
bob wrote: Please show me _any_ "falsehood" I have stated.

That readers in an x86 SMP environment can see variables change in another order than they are written (as forced by sfence), of course.

bob wrote: As far as what you "offered": it is worse, as I said, from a performance perspective. And if one is not very careful, it won't even work correctly.

And this is yet another falsehood. Two unlocked in-order reads on an item in a shared cache line have a latency of only some 7 cycles, for a throughput of 3 instructions, without pipeline stalls, i.e. one to two orders of magnitude _faster_ than an atomic lock.

bob wrote: Carry on with the nonsense.

That seems indeed to be your attitude. I would say: "stop with the nonsense". But it falls on deaf ears... As your statement about performance above shows, you keep it coming...

bob wrote: You seem unable to get past the cache issues, when I have said from the get-go that this is _not_ a cache issue (out-of-order writes). It is simply an issue one has to account for, period.
- Posts: 1243
- Joined: Sat Dec 13, 2008 7:00 pm
Re: Sync question
Code: Select all
write()
{
    lock();
    counter++;
    sfence;              // not needed on x86 and amd64 (but see note)
    // do write here
    unlock();
}

read()
{
    static int x = -1;
    // do read here
    lfence;              // not needed on x86 and amd64 (but see note)
    if (counter != x)    // changed; read might be corrupted
    {
        lock();
        // do read here
        x = counter;
        unlock();
    }
}
See:
http://www.amd.com/us-en/assets/content ... /24593.pdf, page 164
And note that write writes A, then B, whereas read reads B, then A.
Note that the *COMPILER* is still allowed to reorder the loads and stores, so even on x86/amd64, you still actually need the barriers, even though they can be compiler barriers (volatile) and not actual hardware serialization.
- Posts: 27790
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Sync question
Gian-Carlo Pascutto wrote: So, the code needs both an sfence and an lfence on weakly ordered CPUs (and will hence be slow), but it will work correctly and fast on x86 and derivatives, due to the store/store and load/load ordering that is enforced.

Agreed. But an lfence need not be very expensive on any machine, I think. If it were unduly slow, the "home-made" lfence I proposed, through
Code: Select all
movl memA, %ecx
movl %ecx, %eax
andl $0, %ecx
movl memB(%ecx), %ebx
is an alternative to force reading of B to start only after the read for memA is complete, and is hardly more expensive than any normal load.
The original problem was formulated in such a way (reads more frequent than writes by at least a factor 10,000) that only the efficiency of reading matters. So it doesn't matter how expensive sfence is. (Although it is hard to imagine sfence is more expensive than a mutex based on xchg, as there the CPU also has to wait until the written data is "globally visible" before the read data can be made available to the CPU, or two CPUs could both grab the lock.)
Under these conditions, the locations to be read are likely to be almost always in cache for all CPUs, in the Shared state. So the normal loads would have quite low latency. Only on the occasional write would all these entries be invalidated, and the readers would have to acquire the line again from memory, or directly from their sibling CPU. (Assuming they do not communicate through L2, as on Core 2 Duo, or through L3, as on Core i7.)
- Posts: 1243
- Joined: Sat Dec 13, 2008 7:00 pm
Re: Sync question
hgm wrote:
Gian-Carlo Pascutto wrote: So, the code needs both an sfence and an lfence on weakly ordered CPUs (and will hence be slow), but it will work correctly and fast on x86 and derivatives, due to the store/store and load/load ordering that is enforced.
Agreed. But an lfence need not be very expensive on any machine, I think. If it were unduly slow, the "home-made" lfence I proposed, through

Code: Select all
movl memA, %ecx
movl %ecx, %eax
andl $0, %ecx
movl memB(%ecx), %ebx

is an alternative to force reading of B to start only after the read for memA is complete, and is hardly more expensive than any normal load.

I'm not sure what this is supposed to accomplish. Either the fences are needed, and this code is buggy, or the fences aren't needed, and this code is pointless.
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Sync question
hgm wrote:
bob wrote: Please show me _any_ "falsehood" I have stated.
That readers in an x86 SMP environment can see variables change in another order than they are written (as forced by sfence), of course.

OK, some examples.
For each of the following, I want to execute the two following instructions:
mov A, eax
mov B, ebx
They appear exactly in the order in my program.
The _only_ way you can _guarantee_ that the two stores are done _exactly_ in the order written is as follows:
mov A, eax
sfence
mov B, ebx
sfence
Which incurs far more overhead than the original atomic lock I mentioned.
You can't get away with this:
mov A, eax
mov B, ebx
sfence
Because all the sfence instruction guarantees is that when you execute it, the processor stalls until both of the preceding mov instructions have completed. But it does _not_ guarantee that B won't be written first. It just guarantees that before you proceed, _both_ will be written, in unknown order.
I didn't notice you mentioning this, because I don't believe you understand what is going on at this level. Which is OK, but it is not OK to just say "sfence" solves this when it most certainly does not.
So, if you want to guarantee the order of two stores, you have to put an sfence after each. And you completely stall the CPU while it is waiting on the write, and won't execute any other instructions out of order to fill in the idle cycles in the various pipelines. Lousy solution.
Otherwise you can use a lock where you write and a lock where you read, so that a reader can't access _either_ value until the lock is released. You still need an sfence (store fence) prior to clearing the lock, or you can still get zapped, since the instruction to clear the lock is a simple mov lock, 0 and the preceding stores might not both be completed prior to that.
If you don't program in assembly language, you don't get access to the sfence stuff anyway, and unless you know you need it, you have subtle timing issues that are a royal pain to deal with, because they are very hard to detect.
Out-of-order writes are an x86 characteristic. There is no way around them unless you want to intersperse sfence instructions throughout your code and slow it down by a couple of orders of magnitude or more.
That is exactly what I have been saying all along. Peterson's code has a problem on X86 without using sfence. In fact, any program that depends on A being written before B (in the above example) has a problem unless you are willing to slow things to a crawl with sfence.
That is a simple explanation of the problem I have been hitting on from day one. I have _repeatedly_ said "this is not a cache issue" because it has _nothing_ to do with cache. If cache never sees the mov A, eax write, it can't do a thing to help us. And it won't always see that write before the write to B. Hence no order guarantee and cache can't do a thing about it.
Note also that it does not matter at all what the state of A and B is with respect to other caches, or even the cache on the current processor. Neither might be cached anywhere. Both might be cached as shared everywhere. The problem still persists in exactly the same way, with no help from cache to solve it at all.
bob wrote: As far as what you "offered": it is worse, as I said, from a performance perspective. And if one is not very careful, it won't even work correctly.
hgm wrote: And this is yet another falsehood. Two unlocked in-order reads on an item in a shared cache line have a latency of only some 7 cycles, for a throughput of 3 instructions, without pipeline stalls, i.e. one to two orders of magnitude _faster_ than an atomic lock.

Again, you have the out-of-order problem unless you sfence _each_ write on the writer end. And that is worse still, as I said. Locks don't stall the CPU, while sfence instructions halt everything.
hgm wrote: That seems indeed to be your attitude. I would say: "stop with the nonsense". But it falls on deaf ears... As your statement about performance above shows, you keep it coming...

The only thing I see is that you really don't understand the issues involved, since you haven't written this type of code before. You'll "get it" eventually, or else never write a successful parallel algorithm of any kind.

You seem unable to get past the cache issues, when I have said from the get-go that this is _not_ a cache issue (out-of-order writes). It is simply an issue one has to account for, period.