Sync question

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Sync question

Post by bob »

Gerd Isenberg wrote:
Zach Wegner wrote: 3. Do a complicated compare-and-swap on a read. Most of the time, the lock isn't held, so you can read freely. But you still have to increment a counter, so that we only write when there are no readers. I suppose it would look something like this:

Code: Select all

read()
{
  // r is how many readers are reading now, or -1 if someone is writing
  r = lock;
  while (r < 0 || !cmpxchg(lock,r,r+1))
    r = lock;
  
  // do read here
  
  ASSERT(lock>=1);
  r = lock;
  while (!cmpxchg(lock,r,r-1))
    r = lock;
}
write()
{
  // acquire lock
  r = lock;
  while (r != 0 || !cmpxchg(lock,0,-1))
    r = lock;

  // do write here

  ASSERT(lock==-1);
  lock = 0;
}
Ok, that is what I wanted. I thought about a distinct boolean as write indicator and a read counter. But one variable as combined read and write indicator is sufficient (like using MSB as boolean write indicator) and avoids headache ;-)
That probably _isn't_ what you want. That xchg operation will cause a ton of cache forwarding, exactly what you don't want as it kills performance.
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Sync question

Post by Zach Wegner »

bob wrote:That probably _isn't_ what you want. That xchg operation will cause a ton of cache forwarding, exactly what you don't want as it kills performance.
And how would you suggest doing it in a way that lets you copy more than a word without the data ever being corrupted? I think a "lockless" algorithm here is just about as good as it can get. You just have to accept the losses in keeping the data intact.

Of course, if there is an atomic decrement instruction available, you can use that at the end of the read operation, since you don't need to test the value.

I'm trying to think of a way to formulate this with a fetch-and-add, so that the readers only have to spin when there is a writer, not when there are simply a bunch of readers trying to increment the semaphore...
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Sync question

Post by Zach Wegner »

OK, I'm pretty sure this works. If it does, it's certainly bad ass! :D It needs the compare-and-swap and a fetch-and-add-2. I want to say I've seen the fetch-and-add-2 instruction, but I can't find it ATM.

This is sort of based on the lock here: http://en.wikipedia.org/wiki/Fetch-and-add

The lock has two counters, A and B. A is twice the number of readers that have started to read, and B is twice the number who have finished reading. Each counter is monotonically increasing (except on wraparound) and A is always >= B (again, except if A wraps around). Since our counts are multiplied by 2, we have the low bit free to act as a write lock.

Code: Select all

read()
{
  // get a read ticket
  r = fetchadd(lock.A, 2);
  
  // somebody's writing, wait for them to change the counter back.
  while (lock.A&1)
    ;

  // do read here

  // signal that we're done. if A==B now, there are no readers.
  atomicadd(lock.B, 2);
}
write()
{
  // acquire write lock. if A==B, then there are no readers.
  // if A is still B when we swap, then the low bit of A is set, so A=B+1, and we can write.
  // B will never have the low bit set, so no two writers can ever simultaneously write.
  // The readers can still try to read here, but each one gets an odd value. They just wait
  // until the A value is decremented, and then they go. This gives a higher priority to readers,
  // so that readers will only have to wait for at most one write.
  r = lock.B;
  while (!cmpxchg(lock.A, r, r+1))
    r = lock.B;

  // do write here

  // unset the low bit
  atomicadd(lock.A, -1);
}
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Sync question

Post by bob »

Zach Wegner wrote:
bob wrote:That probably _isn't_ what you want. That xchg operation will cause a ton of cache forwarding, exactly what you don't want as it kills performance.
And how would you suggest doing it in a way that lets you copy more than a word without the data ever being corrupted? I think a "lockless" algorithm here is just about as good as it can get. You just have to accept the losses in keeping the data intact.

Of course, if there is an atomic decrement instruction available, you can use that at the end of the read operation, since you don't need to test the value.

I'm trying to think of a way to formulate this with a fetch-and-add, so that the readers only have to spin when there is a writer, not when there are simply a bunch of readers trying to increment the semaphore...
Lockless is fine, but the xchg operation is not fine, since it is locked, and that involves the cache forwarding and a ton of cache traffic. You can't do fetch and add without a lock. You can use the inc instruction with a lock prefix, but you now have exactly the same overhead with cache forwarding. Much better to come up with ideas that do _not_ need "atomic operations", which for Intel means no lock prefix, explicit or implied...

Atomic instructions are just a bad idea overall, particularly if you are going to be doing them frequently as in incrementing/decrementing some sort of counter.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Sync question

Post by bob »

Zach Wegner wrote:OK, I'm pretty sure this works. If it does, it's certainly bad ass! :D It needs the compare-and-swap and a fetch-and-add-2. I want to say I've seen the fetch-and-add-2 instruction, but I can't find it ATM.

This is sort of based on the lock here: http://en.wikipedia.org/wiki/Fetch-and-add

The lock has two counters, A and B. A is twice the number of readers that have started to read, and B is twice the number who have finished reading. Each counter is monotonically increasing (except on wraparound) and A is always >= B (again, except if A wraps around). Since our counts are multiplied by 2, we have the low bit free to act as a write lock.

Code: Select all

read()
{
  // get a read ticket
  r = fetchadd(lock.A, 2);
  
  // somebody's writing, wait for them to change the counter back.
  while (lock.A&1)
    ;

  // do read here

  // signal that we're done. if A==B now, there are no readers.
  atomicadd(lock.B, 2);
}
write()
{
  // acquire write lock. if A==B, then there are no readers.
  // if A is still B when we swap, then the low bit of A is set, so A=B+1, and we can write.
  // B will never have the low bit set, so no two writers can ever simultaneously write.
  // The readers can still try to read here, but each one gets an odd value. They just wait
  // until the A value is decremented, and then they go. This gives a higher priority to readers,
  // so that readers will only have to wait for at most one write.
  r = lock.B;
  while (!cmpxchg(lock.A, r, r+1))
    r = lock.B;

  // do write here

  // unset the low bit
  atomicadd(lock.A, -1);
}
There are two angles here. "works" and "works fast". The above will work as far as I can tell with a quick glance. I assume the fetchandadd is going to involve an exchange instruction however, so performance is going to be bad due to the inter-cache traffic this will produce.

The idea is to avoid "atomic" operations as much as possible, those are the ones that kill. You may as well use a normal lock as the overhead will be the same.
User avatar
Bo Persson
Posts: 243
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: Sync question

Post by Bo Persson »

Gerd Isenberg wrote:
bob wrote:
Gerd Isenberg wrote:Let say you have some data, for instance a cacheline, which is frequently probed by several threads (or processes), but very rarely (let say less than 0.01% of the reads) written by the same threads (or processes).

I don't want multiple parallel reads (which might occur quite often) to be mutually exclusive and block each other, i.e. by entering a critical section, but only writes to be mutually exclusive versus reads - any read or write request has to wait until an ongoing write is finished, and any write request has to wait until all pending reads (or another write) are finished. Is there any standard pattern to implement such a kind of mutex or semaphore?

Thanks,
Gerd
No such animal. You have two choices:

(1) use mutex locks so that reads can't happen while writes are in progress, and vice-versa. If this is a common action, it will kill performance.

(2) use some other trick such as the "lockless hash" approach used in Crafty so that you can determine that the data has been broken without needing a lock.

You didn't mention how "big" the data item is. If it is a single word, you are in good shape to control things. If it is multiple words, then out-of-order writes can become a killer. You can always separate the things so that they don't fit in the same cache line. Or if you can modify your code so that the write won't wreck things if you read an old and a new value at the same time, you can ignore it. Or you can take action as given in (1) or (2).

You can't do a "write lock" because you can't allow writes to happen while reads are in progress, and you can't do read locks as they will interfere with each other for no good reason and kill performance.
I had something like this in mind, which might end in a deadlock:

Code: Select all

char CACHE_ALIGN globalData[64];

volatile int ReadCount = 0;
volatile bool isWritePending = false;

void read (void *pLocalBuffer) {
   ++ReadCount;
   while (isWritePending); // wait until write finished
   memcpy (pLocalBuffer, globalData, 64);
   --ReadCount;
}

void write (void *pLocalBuffer) {
    while (isWritePending || lockexchange (isWritePending, true));
    while ( ReadCount );  // wait for pending reads
    memcpy (globalData, pLocalBuffer, 64);
    isWritePending = false;
}
This might work with later versions of MSVC (with an extension to the volatile semantics) on x86 hardware. On most everything else, you have race conditions in the while loops, as well as no guarantees about updates happening in order on other CPU chips.

Isn't this the Double-Checked Locking problem?

http://www.cs.umd.edu/~pugh/java/memory ... cking.html
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Sync question

Post by Gerd Isenberg »

Bo Persson wrote: This might work with later versions of MSVC (with an extension to the volatile semantics) on x86 hardware. On most everything else, you have race conditions in the while loops, as well as no guarantees about updates happening in order on other CPU chips.
What is the MSVC extension to the volatile semantics? Is it related to read- and write barriers to ensure the order of instructions?
Bo Persson wrote:Isn't this the Double-Checked Locking problem?

http://www.cs.umd.edu/~pugh/java/memory ... cking.html
I don't know whether my initial question is exclusively related to the Double-Checked Locking problem (interesting link btw., but it takes some time to work through), or whether Double-Checked Locking refers to the conjunction of a normal read and an atomic exchange to enter a critical section.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Sync question

Post by bob »

Gerd Isenberg wrote:
Bo Persson wrote: This might work with later versions of MSVC (with an extension to the volatile semantics) on x86 hardware. On most everything else, you have race conditions in the while loops, as well as no guarantees about updates happening in order on other CPU chips.
What is the MSVC extension to the volatile semantics? Is it related to read- and write barriers to ensure the order of instructions?
Bo Persson wrote:Isn't this the Double-Checked Locking problem?

http://www.cs.umd.edu/~pugh/java/memory ... cking.html
I don't know whether my initial question is exclusively related to the Double-Checked Locking problem (interesting link btw., but it takes some time to work through), or whether Double-Checked Locking refers to the conjunction of a normal read and an atomic exchange to enter a critical section.
My advice, based on a _bunch_ of years doing this stuff is to faithfully adhere to the following:

(1) avoid locks when possible _AND_ when correctness is assured.

(2) avoid tricks that look reasonable, until you are absolutely certain that they are functionally correct. Race conditions are very difficult to recognize in many cases, and introducing this kind of a bug guarantees you many years of enjoyable debugging experience. :)

If you are worried about a structure that can be written to from several locations, use a lock. Or something equally safe (lockless hashing idea for example) so that the lack of a lock causes an error that does not hurt your program. Read locks and a write lock are simply a bad idea. You end up killing performance, and introducing some of the most subtle race/interleaved-update issues known to man.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Sync question

Post by Gerd Isenberg »

bob wrote:My advice, based on a _bunch_ of years doing this stuff is to faithfully adhere to the following:

(1) avoid locks when possible _AND_ when correctness is assured.

(2) avoid tricks that look reasonable, until you are absolutely certain that they are functionally correct. Race conditions are very difficult to recognize in many cases, and introducing this kind of a bug guarantees you many years of enjoyable debugging experience. :)

If you are worried about a structure that can be written to from several locations, use a lock. Or something equally safe (lockless hashing idea for example) so that the lack of a lock causes an error that does not hurt your program. Read locks and a write lock are simply a bad idea. You end up killing performance, and introducing some of the most subtle race/interleaved-update issues known to man.
Thanks, your advice is very much appreciated. I wondered whether this scenario (very frequent, often truly "parallel" multiple reads, but very rare writes) really needs so much overhead inside the frequent reads, and whether it could be implemented in favor of the reads by a well-tested, portable standard pattern.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Sync question

Post by bob »

Gerd Isenberg wrote:
bob wrote:My advice, based on a _bunch_ of years doing this stuff is to faithfully adhere to the following:

(1) avoid locks when possible _AND_ when correctness is assured.

(2) avoid tricks that look reasonable, until you are absolutely certain that they are functionally correct. Race conditions are very difficult to recognize in many cases, and introducing this kind of a bug guarantees you many years of enjoyable debugging experience. :)

If you are worried about a structure that can be written to from several locations, use a lock. Or something equally safe (lockless hashing idea for example) so that the lack of a lock causes an error that does not hurt your program. Read locks and a write lock are simply a bad idea. You end up killing performance, and introducing some of the most subtle race/interleaved-update issues known to man.
Thanks, your advice is very much appreciated. I wondered whether this scenario (very frequent, often truly "parallel" multiple reads, but very rare writes) really needs so much overhead inside the frequent reads, and whether it could be implemented in favor of the reads by a well-tested, portable standard pattern.
The problem is that reads don't prevent writes unless you use a lock. And that introduces a ton of overhead if this is common. For example, locks are nearly unusable on hash tables because of the frequency of accessing them...

The loose memory model used on X86 is another issue where writes are not guaranteed to occur in the order our programs specify, further confusing things.