hgm wrote:
bob wrote: Has no chance of being correct. Writes are done out of order. Reads and writes are done out of order. You can read the counter, write it back, read the value, and get the "value" before you have actually written the counter back to memory. Unless you use the "lock" prefix. And then you incur the usual atomicity overhead you are trying to avoid by not using a real lock in the first place.
There are also race conditions.
You read at some point a value V1 from the counter.
A second thread prepares to modify the value so it increments the counter and modifies the value.
Unfortunately, you read the counter before it is modified, and then don't read the new value, thinking it is still the unmodified value, because your counter read happened before the counter was stored.
This is the way to a lifelong debugging assignment.
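A minimal sketch of the lost-update race described above, next to a lock-prefixed increment that avoids it (GCC __atomic builtins and pthreads; thread and iteration counts are arbitrary):
Code:
#include <pthread.h>
#include <stdio.h>

#define THREADS 4
#define ITERS   1000000

static volatile long racy_counter   = 0;   /* plain read-modify-write: updates can be lost */
static long          atomic_counter = 0;   /* updated with a lock-prefixed instruction     */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        racy_counter = racy_counter + 1;         /* load, add, store: not atomic            */
        __atomic_fetch_add(&atomic_counter, 1,   /* compiles to "lock add" on x86           */
                           __ATOMIC_SEQ_CST);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);

    /* racy_counter usually comes out short of THREADS*ITERS; atomic_counter never does */
    printf("racy:   %ld\natomic: %ld (expected %d)\n",
           racy_counter, atomic_counter, THREADS * ITERS);
    return 0;
}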
If out-of-order execution worked as you describe it, it would never be possible for any program to use the memory: there would never be a guarantee that you read back what you write, because the write might not have been executed yet when you read.
I am not talking about "out of order execution". I am talking about "out of order writes". Just because you first store in A, and then in B, does _not_ mean that the memory value for A gets updated first. That's the famous x86 out-of-order write issue. The issue you raise is a trivial one handled by the cache. The case I am talking about is much more devious in how it wrecks programs. All cache systems on x86 recognize a read after a write and return the new value, since they don't need to go to memory to get it.
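For what it is worth, the reordering the x86 manuals do allow between ordinary accesses is a later load passing an earlier store that is still sitting in the store buffer. A minimal litmus test along those lines (the two-thread setup and names are only illustrative; the both-zero outcome may need many runs to show up):
Code:
#include <pthread.h>
#include <stdio.h>

static volatile int X, Y;      /* shared flags                               */
static volatile int r1, r2;    /* what each thread read of the other's flag  */

static void *t1(void *a) { (void)a; X = 1; r1 = Y; return NULL; }
static void *t2(void *a) { (void)a; Y = 1; r2 = X; return NULL; }

int main(void)
{
    int both_zero = 0;
    for (int i = 0; i < 100000; i++) {
        pthread_t a, b;
        X = Y = 0;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Each store can sit in its CPU's store buffer while the following
         * load executes, so r1 == 0 && r2 == 0 is possible, although no
         * interleaving of the two threads' program order allows it. */
        if (r1 == 0 && r2 == 0)
            both_zero++;
    }
    printf("r1 == r2 == 0 seen %d times\n", both_zero);
    return 0;
}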
Fortunately CPU designers have made out-of-order execution completely transparent, and have a lot of synchronization hardware in the load/store pipes to prevent effects like you describe. Reads and writes do not overtake each other this way.
Sorry, but they do. You are simply looking at the wrong reason for it happening. If you store at memory locations A, B, C and D in that order, the order in which the writes reach memory could be any permutation of that order.
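Whichever way that argument goes, the portable way to make the order of such stores something another CPU can rely on is to say so explicitly. A sketch with GCC's __atomic builtins (function and variable names are made up):
Code:
#include <stdio.h>

static int payload_a, payload_b;   /* data being published            */
static int ready;                  /* flag the reader waits on        */

/* Writer side: the release store keeps the payload stores from being
 * reordered (by compiler or CPU) to after the flag becomes visible. */
static void publish(int a, int b)
{
    payload_a = a;
    payload_b = b;
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
}

/* Reader side: the acquire load pairs with the release store above, so once
 * it sees ready == 1 it is guaranteed to see the payload stored before it. */
static int try_consume(int *a, int *b)
{
    if (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
        return 0;
    *a = payload_a;
    *b = payload_b;
    return 1;
}

int main(void)
{
    int a, b;
    publish(3, 4);                 /* in real use these run on different threads */
    if (try_consume(&a, &b))
        printf("consumed %d %d\n", a, b);
    return 0;
}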
With SMP there is the MESI protocol implemented in hardware to guarantee cache coherency. After one thread starts writing, the cache line is immediately invalidated in all other CPU caches, and there is no way any of them could start reading from it before it has reloaded that cache line, which cannot happen before the other CPU is done writing it, and ready to flush it to memory. You have to make sure that the counter is in the same cache line as the shared variable, of course.
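If the scheme does depend on the counter and the variable travelling together, their placement can be forced rather than hoped for. A sketch using GCC's aligned attribute (the 64-byte line size and the field names are assumptions):
Code:
#include <stdio.h>

/* Force the counter and the value it versions into one 64-byte cache line,
 * so a core that gains the line in Modified state owns both fields at once. */
struct shared_pair {
    volatile unsigned long counter;   /* bumped around every update of value */
    volatile unsigned long value;     /* the shared variable being protected */
} __attribute__((aligned(64)));

static struct shared_pair shared;

_Static_assert(sizeof(struct shared_pair) <= 64,
               "counter and value must fit in a single cache line");

int main(void)
{
    printf("struct at %p, size %zu\n", (void *)&shared, sizeof shared);
    return 0;
}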
Just listen to someone with years of experience on this topic. The CPU can't do multiple stores with one instruction. So yes, the cache is clever enough to provide a form of data integrity, but not in the way you are thinking. CPU A can store word 1 in a cache block, then CPU B can store word 1 and 2, then CPU A can store word 2. And now you have word 1 from CPU B and word 2 from CPU A. Nothing makes the instructions execute atomically as a group.
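A small sketch of that interleaving: two threads each store a two-word pair with no lock, and a reader can catch word 1 from one writer next to word 2 from the other (pthreads; names and counts invented):
Code:
#include <pthread.h>
#include <stdio.h>

static volatile long word1, word2;    /* two adjacent words, no lock around them */
static volatile int  running = 1;

static void *writer(void *arg)
{
    long id = (long)arg;
    while (running) {
        word1 = id;        /* two separate store instructions:        */
        word2 = id;        /* nothing keeps them together as a pair   */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    long torn = 0;
    pthread_create(&a, NULL, writer, (void *)1L);
    pthread_create(&b, NULL, writer, (void *)2L);

    /* Sample the pair while both writers run: a mismatch means we caught
     * word 1 from one CPU next to word 2 from the other. */
    for (long i = 0; i < 10000000; i++) {
        long w1 = word1, w2 = word2;
        if (w1 != w2)
            torn++;
    }
    running = 0;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("observed %ld torn pairs\n", torn);
    return 0;
}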
There might indeed be a problem with out-of-order reads, although it is unlikely that they would be scheduled in reverse order if neither of them has to be delayed because of a data dependency (in the address calculation). To make sure the read of the variable is done before that of the counter even on an out-of-order machine, you can give the address calculation of the counter access a (false) dependency on the result of the load, e.g.
Code:
myVariable = sharedVariable;                       /* load the protected value first          */
myCounter = *(&sharedCounter + (0 & myVariable));  /* counter address (falsely) depends on it */
(making sure the optimizer does not try to outsmart you, by handcoding this in assembler).
This guarantees the loading of the counter cannot even start before the copy of the variable is in a CPU register. (At the cost of only a modest delay, and without stalling the CPU pipe.)
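If the only goal is to keep the optimizer from folding the 0 & myVariable away, one option short of rewriting the whole thing in assembler is to hide the zero behind an empty asm statement, so the compiler has to keep the dependency. A sketch for GCC on x86-64, reusing the variable names from the snippet above (the helper function is hypothetical):
Code:
extern volatile unsigned long sharedVariable;   /* names as in the snippet above */
extern volatile unsigned long sharedCounter;

unsigned long myVariable, myCounter;

void read_counter_after_variable(void)
{
    unsigned long mask = 0;

    myVariable = sharedVariable;

    /* The empty asm makes "mask" opaque to the compiler, so it cannot prove
     * that (mask & myVariable) is zero and must really compute the counter's
     * address from the value just loaded, keeping the false dependency. */
    __asm__ volatile("" : "+r"(mask));

    myCounter = *(&sharedCounter + (mask & myVariable));
}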
All I can say is "dream on". But it doesn't work as you think.