Questions on volatile keyword and memory barriers


frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

Re: Questions on volatile keyword and memory barriers

Post by frankp »

bob wrote: volatile is a compiler directive, and has nothing to do with the architecture. It simply says "this value can change spontaneously". This instructs the compiler to re-read the value from memory each time it is accessed, rather than trying to optimize and keep the value in a register. This is an issue in two common places. One is where you have a threaded/parallel application and another thread can change a value you are occasionally checking. The compiler doesn't know about the concept of threading, so it looks at your source and sees a load here from X, then a little further it sees another load from X, and concludes "OK, X is not modified between the first and second load, so I can use the first value." Not good for parallel algorithms.
Hope that helps...
So globals like fAbort, which can be set by two or more threads, should be volatile, e.g.

if (fAbort)
  return (0);
value = -ABsearch();
if (fAbort)
  return (0);
if (potential pv node)
  value = -ABsearch(); // re-search
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Questions on volatile keyword and memory barriers

Post by wgarvin »

Gian-Carlo Pascutto wrote:
bob wrote: It is a volatile function which is a different issue that says "everything done prior to this function has to be completed before this function is executed."
Yes, so it's a barrier. Note that the volatile in there is purely a GCC convention and doesn't have anything to do with volatile variables.
I think Bob misspoke when he called it a "volatile function", but I think he's correct about the issue. The compiler doesn't understand what semantics your asm block might have, so it is constrained in how much optimization it can do to the surrounding code. In this case, it happens to make the compiler act like there is a barrier there. But you can't rely on that in general.

Barriers are non-portable. If you need one, then you have to put one explicitly. It might be a hardware instruction (telling the hardware to wait for the earlier instructions to complete before doing the later ones), a compiler intrinsic (telling the compiler not to rearrange the order of the loads or stores across the barrier) or both (i.e. it might be a compiler intrinsic that also emits an actual hardware barrier/fence instruction).

Also, whether or not you need a barrier at all, and exactly what type of barrier you need in a certain situation, is going to depend on what guarantees your hardware architecture offers about its memory ordering. In general, "volatile" is not enough to meet any of the ordering requirements for synchronization between two threads (unless your compiler adds extra semantics to "volatile" accesses; I think Microsoft does this, but some other x86 compilers do not).
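To make the distinction concrete, here is a rough GCC-flavored sketch of the different flavors (illustrative only; the exact spellings vary by compiler and target):

Code: Select all

/* Compiler-only barrier: forbids the compiler from moving loads/stores
   across this point, but emits no machine instruction at all. */
#define COMPILER_BARRIER()  asm volatile("" ::: "memory")

/* Hardware fence on x86/x86-64 (SSE2): also makes the CPU finish
   pending loads/stores before continuing past this point. */
#define HARDWARE_FENCE()    asm volatile("mfence" ::: "memory")

/* Intrinsic that is both at once: a compiler barrier that also emits
   whatever fence instruction the target architecture needs (gcc 4.1+). */
#define FULL_BARRIER()      __sync_synchronize()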
bob wrote: volatile is a compiler directive, and has nothing to do with the architecture. It simply says "this value can change spontaneously". This instructs the compiler to re-read the value from memory each time it is accessed, rather than trying to optimize and keep the value in a register. This is an issue in two common places. One is where you have a threaded/parallel application and another thread can change a value you are occasionally checking. The compiler doesn't know about the concept of threading, so it looks at your source and sees a load here from X, then a little further it sees another load from X, and concludes "OK, X is not modified between the first and second load, so I can use the first value." Not good for parallel algorithms.
Hope that helps...
You forgot to mention the other common place. ;) I guess that it's programming for embedded devices (or device drivers for hardware in general) where memory addresses might be mapped to a hardware port instead of RAM. That's actually what volatile was invented for in the first place -- C is a popular language for that kind of low-level programming.
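Something along these lines, with the register address obviously made up for the example:

Code: Select all

#include <stdint.h>

/* Hypothetical memory-mapped device status register. */
#define DEVICE_STATUS  ((volatile uint32_t *) 0x40001000u)
#define DEVICE_READY   0x1u

void wait_for_device(void) {
  /* Because the pointer is to volatile, the register is re-read on every
     iteration; without volatile the compiler could hoist the load and
     spin forever on a stale value. */
  while ((*DEVICE_STATUS & DEVICE_READY) == 0)
    ;
}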
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Questions on volatile keyword and memory barriers

Post by Gian-Carlo Pascutto »

wgarvin wrote: I think Bob misspoke when he called it a "volatile function", but I think he's correct about the issue. The compiler doesn't understand what semantics your asm block might have, so it is constrained in how much optimization it can do to the surrounding code. In this case, it happens to make the compiler act like there is a barrier there. But you can't rely on that in general.
For the third time, the code I pasted is guaranteed to act as a compiler barrier (and on x86, due to ordering guarantees, as a full one) and can be perfectly relied on to do this.

You're the second person who now argues "it's not a barrier" even though you yourself already admit that that is exactly what it is.

It effectively tells the compiler: I modified an undefined part of the system memory, all bets are off as to what is in RAM. MSVC has similar functions. This just happens to be the way you write one in GCC.
Barriers are non-portable.
This is true. However, POSIX locking functions are guaranteed to act as barriers. So, if you use those, you don't need volatile at all, because those barriers will force reads from memory.
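For example, something like this (a sketch using the abort-flag idea from earlier in the thread; the names are made up):

Code: Select all

#include <pthread.h>

static int fAbort;   /* note: not volatile */
static pthread_mutex_t abort_lock = PTHREAD_MUTEX_INITIALIZER;

void set_abort(void) {
  pthread_mutex_lock(&abort_lock);
  fAbort = 1;
  pthread_mutex_unlock(&abort_lock);   /* unlock makes the write visible */
}

int aborted(void) {
  int a;
  pthread_mutex_lock(&abort_lock);     /* lock forces a fresh read */
  a = fAbort;
  pthread_mutex_unlock(&abort_lock);
  return a;
}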
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on volatile keyword and memory barriers

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote: It is a volatile function which is a different issue that says "everything done prior to this function has to be completed before this function is executed."
Yes, so it's a barrier. Note that the volatile in there is purely a GCC convention and doesn't have anything to do with volatile variables.
Understood. But it is not portable. As I mentioned, I tried this on an alpha 21164 we are fixing to get rid of, and this did not work there. It does solve the optimization issue since it is gcc, but it doesn't emit any sort of barrier instruction to deal with the out-of-order writes, which breaks this.

If you had been on Windows, I'd have put _ReadBarrier and there would have been no volatile word in there at all.
All it does as far as I can tell is limit the optimizer such that nothing can be moved beyond the call to the volatile function.
Yes, which is why it's called a barrier.
I do not call that a "barrier". The concept of a barrier goes beyond the compiler optimizer and actually is implemented in the hardware for architectures that support out-of-order reads/writes. Something that only works on some architectures is not much of a solution, since it invites too many bugs to the party when moved to something different or the architecture changes.

Just look in the Linux kernel that you're running how barriers are implemented...
as you quoted. For X86. Which was my point. But not _all_ processors are x86 or x86-64.
Which is safe on X86. But if your architecture does out-of-order writes, you will be beyond screwed unless the compiler is smart enough to put a real barrier/fence instruction immediately preceding that function call. I tried this on an old alpha box and it didn't, but I am pretty sure this is not the latest gcc on that machine...
Newer gcc versions have __sync_synchronize() to emit a hardware-dependent barrier. I'm not aware of any portable way to emit one with older gccs. The code I posted is OK for x86, but you would want to add a "fence" or whatever the equivalent is to the asm block for other architectures.
For the code posted, that is not even needed; it just needs v to come from memory instead of a register.

But anyway, case proven: you can get the code you posted working perfectly without using volatile and with only barriers.
Again, "for some architectures". That is a big qualifier. Since X86 has no hardware barrier requirement, and since this is purely a compiler kludge to tell the peephole optimizer it can not look beyond this point, it does work. Until this is moved to an alpha or whatever...

I am _really_ trying to make everything I write work on any processor, by not depending on hardware in-order writes, etc. Too many do out-of-order writes, even including the old Crays.
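For what it's worth, with a recent gcc the architecture-independent way looks roughly like this (a sketch, not the actual code from Crafty; older gccs and other compilers need their own spelling):

Code: Select all

/* Release a spin lock portably: the full barrier makes every earlier
   store visible before the lock word is cleared (gcc 4.1 or later). */
static void UnlockPortable(volatile int *lock) {
  __sync_synchronize();   /* compiler barrier plus hardware fence if needed */
  *lock = 0;
}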
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on volatile keyword and memory barriers

Post by bob »

Gian-Carlo Pascutto wrote:
wgarvin wrote: I think Bob misspoke when he called it a "volatile function", but I think he's correct about the issue. The compiler doesn't understand what semantics your asm block might have, so it is constrained in how much optimization it can do to the surrounding code. In this case, it happens to make the compiler act like there is a barrier there. But you can't rely on that in general.
For the third time, the code I pasted is guaranteed to act as a compiler barrier (and on x86, due to ordering guarantees, as a full one) and can be perfectly relied on to do this.

You're the second person that now argues "it's not a barrier" even though you yourself already admit that that is exactly what it is.
What _both_ of us said is "this is not a barrier on all architectures... only on architectures where (a) memory writes are done in order and (b) the compiler uses the funky gcc interpretation of the volatile/__volatile__ modifier to the asm() directive." Here's my lock I have been using for years, even modified to work properly with hyperthreading:

Code: Select all

static void __inline__ LockX86(volatile int *lock) {
  int dummy;
  /* Classic exchange-based spin lock: atomically swap 1 into *lock; if the
     old value was 0 we now own the lock.  Otherwise spin (with pause, to
     behave well under hyperthreading) just reading the lock word until it
     looks free, then try the exchange again.  The "memory" clobber also
     makes the asm block act as a compiler barrier. */
  asm __volatile__(
      "1:          movl    $1, %0"   "\n\t"
      "            xchgl   (%1), %0" "\n\t"
      "            testl   %0, %0"   "\n\t"
      "            jz      3f"       "\n\t"
      "2:          pause"            "\n\t"
      "            movl    (%1), %0" "\n\t"
      "            testl   %0, %0"   "\n\t"
      "            jnz     2b"       "\n\t"
      "            jmp     1b"       "\n\t"
      "3:"                           "\n\t"
      :"=&q"(dummy)
      :"q"(lock)
      :"cc", "memory");
}
But what about compilers that don't understand that particular nuance? And more importantly, what about architectures where that is not enough to guarantee correctness, such as the alpha? That was the point I was trying to make.

Designing something to work on a specific architecture is OK if you are certain that is all you will use, and that the architecture will never change. One day X86 will likely do out-of-order writes, because a fence can be used to force order when needed and out-of-order offers more performance. And when that happens, this simple "volatile barrier" will fail to work.

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on volatile keyword and memory barriers

Post by bob »

frankp wrote:
bob wrote: volatile is a compiler directive, and has nothing to do with the architecture. It simply says "this value can change spontaneously". This instructs the compiler to re-read the value from memory each time it is accessed, rather than trying to optimize and keep the value in a register. This is an issue in two common places. One is where you have a threaded/parallel application and another thread can change a value you are occasionally checking. The compiler doesn't know about the concept of threading, so it looks at your source and sees a load here from X, then a little further it sees another load from X, and concludes "OK, X is not modified between the first and second load, so I can use the first value." Not good for parallel algorithms.
Hope that helps...
So globals like fAbort, which can be set by two or more threads, should be volatile, e.g.

if (fAbort)
  return (0);
value = -ABsearch();
if (fAbort)
  return (0);
if (potential pv node)
  value = -ABsearch(); // re-search
Yes. Here's the simple explanation:

(1) if a simple variable can be modified by two threads independently, it needs to be volatile. If it is only read by both, it is not needed. But if multiple threads will access the value, and at least one also writes the value, volatile is necessary.

(2) if a pointer to a value is used, and the value itself is modified by at least one thread and accessed by at least one other thread, then the thing pointed to needs to be declared volatile (the volatile goes in front of the pointed-to type).

(3) if a pointer is used, and the pointer itself will be used by more than one thread and potentially modified by at least one, then the pointer itself needs to be volatile.

(4) if a pointer is used, and the pointer itself is modified/used by more than one thread, and the thing it points to is also modified/used by more than one thread, then both need to be volatile. Examples:

(1) volatile int a;

(2) volatile int *b;

(3) int * volatile c;

(4) volatile int * volatile d;

Those cover the above four cases. If you get them wrong, you get to do some really difficult debugging.
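A concrete illustration of case (1), using the fAbort flag from earlier in the thread (just a sketch, not anyone's actual engine code):

Code: Select all

volatile int fAbort = 0;   /* case (1): written by one thread, read by others */

/* Searching thread: because fAbort is volatile it is re-read from memory on
   every iteration; without volatile the compiler could hoist the load out of
   the loop and spin on a stale copy held in a register. */
void SpinUntilAborted(void) {
  while (!fAbort)
    ;   /* keep searching... */
}

/* Controlling thread, e.g. when the time limit is hit: */
void StopSearch(void) {
  fAbort = 1;
}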
frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

Re: Questions on volatile keyword and memory barriers

Post by frankp »

Thanks! I pasted this as a comment into my code to remind me.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Questions on volatile keyword and memory barriers

Post by wgarvin »

Here's an article from 2005 that I found with a quick google search:
http://www.linuxjournal.com/article/8211

If you scroll about half way down there's an interesting chart showing what memory reordering effects you have to cope with for each of the major hardware architectures.

Notice the first row, Alpha, which has every kind of reordering including the reordering of dependent loads (!). This wide variety of semantics across different hardware is one of the reasons it's so hard to write correct, portable multithreaded code (especially if you make your own synchronization primitives rather than just using semaphore/mutex implementations supplied by the OS).

Also notice that their AMD64 row doesn't have a checkmark for "loads reordered after stores" but the x86 and x86 OOStore rows do have that checkmark. So you could easily end up with code that has a race condition between a load and a store, and if you do all your testing on an AMD chip the issue will never be exposed. But on Intel, you might get weird crashes or erroneous behaviour, and it might occur rarely (e.g. once in 10,000 games), making it very difficult to track down.
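The classic shape of that kind of load/store race looks something like this (a made-up two-flag example, not code from any engine):

Code: Select all

volatile int flag1 = 0, flag2 = 0;

/* Each thread stores to its own flag, then loads the other one.  If the
   hardware may reorder the load ahead of the store, both threads can read 0
   and both enter the critical section.  volatile only constrains the
   compiler here, not the CPU, so a StoreLoad fence is needed at the marked
   points on architectures with that reordering. */

void thread1_enter(void) {
  flag1 = 1;
  /* StoreLoad fence needed here */
  if (flag2 == 0) {
    /* critical section */
  }
}

void thread2_enter(void) {
  flag2 = 1;
  /* StoreLoad fence needed here */
  if (flag1 == 0) {
    /* critical section */
  }
}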

Writing "lock-free" primitives is a difficult task. You should arm yourself with as much knowledge as possible, design carefully, and test thoroughly.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on volatile keyword and memory barriers

Post by bob »

wgarvin wrote:Here's an article from 2005 that I found with a quick google search:
http://www.linuxjournal.com/article/8211

If you scroll about half way down there's an interesting chart showing what memory reordering effects you have to cope with for each of the major hardware architectures.

Notice the first row, Alpha, which has every kind of reordering including the reordering of dependent loads (!). This wide variety of semantics across different hardware is one of the reasons it's so hard to write correct, portable multithreaded code (especially if you make your own synchronization primitives rather than just using semaphore/mutex implementations supplied by the OS).

Also notice that their AMD64 row doesn't have a checkmark for "loads reordered after stores" but the x86 and x86 OOStore rows do have that checkmark. So you could easily end up with code that has a race condition between a load and a store, and if you do all your testing on an AMD chip the issue will never be exposed. But on Intel, you might get weird crashes or erroneous behaviour, and it might occur rarely (e.g. once in 10,000 games), making it very difficult to track down.

Writing "lock-free" primitives is a difficult task. You should arm yourself with as much knowledge as possible, design carefully, and test thoroughly.
This is why so many architectural discussions here break down into arguments. Many have severe tunnel vision and only consider X86-64. That is not the only architecture around for some of us, and writing parallel algorithms that work on _all_ architectures is a plus. And then there are some minor inconsistencies in the X86-64 family, as you pointed out.

Our alpha is now officially retired and on the way to wherever retired machines go around here. But the barrier discussion was simply one case in point where something works fine on x86/x86-64 but fails miserably on another architecture like the alpha, where a real fence instruction needs to be added to that barrier function; the volatile trick is not good enough there.

When Tim first got Crafty up on his dual-cpu alpha workstation, he pointed this out (out-of-order writes needed some help in the lock function, and that help was not, at the time, available as a built-in intrinsic in DEC's C compiler). I simply had alpha-specific lock.h stuff to solve it. It would be nice if the POSIX people had woken up from their stupor and realized that many of us don't want the inefficient MUTEX stuff in pthreads, that we prefer a real spin-lock instead. A POSIX spinlock would make most of this moot, since it could have a fence/barrier/whatever when required and not need any special attention from the programmers.
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Questions on volatile keyword and memory barriers

Post by Zach Wegner »

bob wrote:Our alpha is now officially retired and on the way to wherever retired machines go around here. But the barrier discussion was simply one case in point where something works fine on x86/x86-64 but fails miserably on another architecture like the alpha, where a real fence instruction needs to be added to that barrier function, the volatile trick is not good enough there.
So what's the problem? Just add the fence instruction. Then you don't need volatile anymore.