volatile?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: volatile?

Post by Rein Halbersma »

hgm wrote:Note that even a single INC mem instruction is not atomic on i386/x64. It still involves reading and then writing back the data in separate steps of the micro-architecture, and other cores could read or write that same memory location in between. Only with a LOCK prefix will access to that memory by other cores be blocked between the read and the write.

As to the #include of the code, this still puzzles me. I can of course see that this helps the compiler to see what the routines do, and thus which global variables run the risk of being changed, and which are safe. But when I #include a file that really defines a routine in more than one of my source files, I usually get a 'multiply-defined symbol' linker error. How is this prevented in this case?
Just put the "inline" keyword in front of a function definition inside a header.
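For example (a hypothetical helper; the name and body are just illustrative), a function defined in a header can be included from several source files once it is marked inline:

```cpp
// worker.h (hypothetical) -- safe to #include from several source files.
// "inline" allows the same definition to appear in every translation unit;
// the linker merges the copies instead of reporting a multiply-defined symbol.
#include <stdint.h>

inline int popcount16(uint16_t v) {
    int n = 0;
    while (v) { v &= v - 1; ++n; }  // clear the lowest set bit each round
    return n;
}
```

With the inline keyword, including this header from two .cc files links cleanly; without it, the second translation unit triggers the linker error hgm describes.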
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

lucasart wrote:OK, that means for built-in types read/write operations are atomic. This is because built-in types have a size that divides the cache line size. Hence, in the absence of unaligned memory access (which you would really have to provoke on purpose with some ugly C-style reinterpretation of pointers), you are guaranteed that they don't cross a cache line.
Yes, but remember that this is platform-dependent. It might be true for all platforms we know, but it is not mandated by C/C++ or POSIX (actually it is not true for all platforms we know: uint64 is a built-in type on 32-bit x86, but not atomic).
For example, I'm wondering if in this line of code:
https://github.com/lucasart/Sensei/blob ... i.cc#L5861
I can remove the lock protection.

If I can assume that 'p_workers++' is an atomic operation,
It is not atomic. Whether it is implemented as a single x86 increment instruction or as separate loads and stores, it is not atomic.

If you want atomic increment, you need to use inline assembly to generate a "lock inc" instruction, or gcc intrinsics (e.g. __atomic_fetch_add), or whatever the language provides (at least C++11 and I suppose C11 do).
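A sketch of the C++11 route, assuming a shared counter like the p_workers discussed here (the surrounding code is hypothetical):

```cpp
#include <atomic>

std::atomic<int> p_workers{0};  // the counter from the discussion, made atomic

int enter_worker() {
    // fetch_add is an atomic read-modify-write: no other thread can slip
    // a read or write of p_workers in between the load and the store.
    return p_workers.fetch_add(1) + 1;  // returns the post-increment value
}
```

With gcc on a plain int, __atomic_fetch_add(&p_workers, 1, __ATOMIC_SEQ_CST) has the same effect; both compile down to a locked add on x86.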
Is there anything in the C++ standard that forbids 2/, and guarantees that the incrementation will be atomic?
The C++ standard allows both and neither is atomic.
The C++ standard before C++11 does not know about multithreading, so it has no notion of atomicity. It guarantees nothing at all as far as multithreading is concerned.

I don't think POSIX/pthreads makes any atomicity guarantees, but I might be wrong. Certainly it does not guarantee that p_workers++ is atomic.

C++11 provides atomic primitives, but does not guarantee that p_workers++ is atomic.
Should I define the variable p_workers as std::atomic<int> in order to get this guarantee?
I'm afraid that would only guarantee that reads and writes are atomic, not that ++ is atomic.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

hgm wrote:As to the #include of the code, this still puzzles me. I can of course see that this helps the compiler to see what the routines do, and thus which global variables run the risk of being changed, and which are safe. But when I #include a file that really defines a routine in more than one of my source files, I usually get a 'multiply-defined symbol' linker error. How is this prevented in this case?
If you mean #include <pthread.h>, you do not have to worry what happens below the level of the source code you have typed. If your system complies with POSIX, and you stick to the rules (i.e. #include and link in the proper way and not copy & paste from the library source), then there is no need to make variables volatile in order to prevent optimisations from introducing bugs.

In the meantime I have understood better why "volatile" and concurrency are completely orthogonal concepts. "volatile" forces the compiler to reload values from memory, but gives no guarantee whatsoever (at the C/C++ standard level) that what you read is the value that has been written by another thread. It would be perfectly fine if the value returned is the local value in the processor's cache. volatile does not enforce cache coherency.

On a POSIX-compliant system, there is a guarantee that certain primitives synchronise memory across threads (or at least work "as if" memory is synchronised at these points).
Tom Likens
Posts: 303
Joined: Sat Apr 28, 2012 6:18 pm
Location: Austin, TX

Re: volatile?

Post by Tom Likens »

lucasart wrote: The reason I ask, is because I was wondering what this obscure "volatile" keyword really means. I read this short article:
https://www.kernel.org/doc/Documentatio ... armful.txt
Essentially they make the point that
In properly-written code, volatile can only serve to slow things down
They say that volatile has nothing to do with concurrency, it's almost never correct to use it, and the lock is enough.

On the other hand, Stockfish also declares all shared variables as volatile. And I know that Marco is much more knowledgeable than I am in C++, especially when it comes to multi-threading. So I can't help wondering if there isn't indeed a good reason for all this volatile stuff :?
It's for hardware access. When you have program memory mapped into hardware space (i.e. memory that can also be accessed by hardware outside the program), the volatile keyword lets the compiler know that the value of the register could change out from under you.

We use it all the time on PCB (printed circuit board) projects where we have external hardware that can write to shared hardware/software registers and memory locations. I've never used it in a software-only situation and see no real use for it if external hardware isn't involved.

regards,
--tom
User avatar
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: volatile?

Post by hgm »

syzygy wrote:"volatile" forces the compiler to reload values from memory, but gives no guarantee whatsoever (at the C/C++ standard level) that what you read is the value that has been written by another thread. It would be perfectly fine if the value returned is the local value in the processor's cache. volatile does not enforce cache coherency.
Hardware automatically forces cache coherency. There is nothing the compiler has to or can do about that. Returning the local value in the core's private cache is always fine. If that wasn't the currently valid value (because it was changed in DRAM, a shared higher cache level or some other core's private cache), it would be no longer in your cache.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

hgm wrote:
syzygy wrote:"volatile" forces the compiler to reload values from memory, but gives no guarantee whatsoever (at the C/C++ standard level) that what you read is the value that has been written by another thread. It would be perfectly fine if the value returned is the local value in the processor's cache. volatile does not enforce cache coherency.
Hardware automatically forces cache coherency. There is nothing the compiler has to or can do about that.
Maybe your hardware does that, but in general it does not. Certainly the various standards do not require any automatic enforcement of cache coherency.

I believe the x86 architecture has a memory model that is so software-friendly (and hardware-unfriendly) that the pthreads library does not have to do anything special. On other architectures this is certainly different, e.g. memory writes by one thread may be observed by other threads out of order. The pthreads primitives (or those of a comparable library) on those platforms will take care of this, and the programmer will not notice anything, provided he sticks to the rules.
Returning the local value in the core's private cache is always fine. If that wasn't the currently valid value (because it was changed in DRAM, a shared higher cache level or some other core's private cache), it would be no longer in your cache.
Already on x86 there is no general guarantee that the value read from cache is identical to the value stored in DRAM by another thread, especially if you consider multi-socket systems. What is guaranteed (on x86, not on other architectures) is that if CPU1 writes to A and then to B, and (some small time later) CPU2 reads B and then A, it will retrieve the new value from A if it retrieved the new value from B. But it is OK if it retrieves old (cached) values for both A and B or if it retrieves the old value for B and the new value for A.
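In C++11 terms, that x86 guarantee corresponds to release/acquire ordering. A minimal sketch (the names A and B are taken from the example above; this is an illustration of the ordering rule, not of any particular engine's code):

```cpp
#include <atomic>

std::atomic<int> A{0}, B{0};

// Writer thread: store A, then B. The release store to B orders the
// store to A before it, so no observer can see the new B but the old A.
void writer() {
    A.store(1, std::memory_order_relaxed);
    B.store(1, std::memory_order_release);
}

// Reader thread: if the acquire load sees the new B, the subsequent
// load of A is guaranteed to see the new value too.
int read_A_if_B_new() {
    if (B.load(std::memory_order_acquire) == 1)
        return A.load(std::memory_order_relaxed);  // must be 1 here
    return -1;  // B not visible yet; A may be old or new
}
```

Seeing old values for both A and B, or old B with new A, remains allowed; only "new B, old A" is ruled out.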
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

http://www.airs.com/blog/archives/154
For dealing with memory mapped hardware, volatile is exactly what you want. For most other types of code, including multi-threaded code, volatile does not help.

Using volatile does not mean that the variable is accessed atomically; no locks are used. Using volatile does not mean that other cores in a multi-core system will see the memory accesses; no cache flushes are used. While volatile writes are guaranteed to occur in the program order for the core which is executing them, there is no guarantee that any other core will see the writes in the same order. Using volatile does not imply any sort of memory barrier; the processor can and will rearrange volatile memory accesses (this will not happen for address ranges used for memory mapped hardware, but it will for ordinary memory).

Conversely, if you use the locking primitives which are part of any threading library, then you do not need to use volatile. The locking primitives will include the required memory barriers or cache flushes. They will include whatever special directives are needed to tell the compiler that memory must be stable.
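A sketch of that pattern with C++11's own primitives (std::mutex standing in for the pthreads calls; the counter name follows the thread):

```cpp
#include <mutex>
#include <thread>
#include <vector>

int p_workers = 0;   // plain int: no volatile, no std::atomic needed
std::mutex mtx;

// The lock/unlock pair supplies the required memory barriers and tells
// the compiler that p_workers may have changed, so the increment is safe.
void add_worker_n(int times) {
    for (int i = 0; i < times; ++i) {
        std::lock_guard<std::mutex> lk(mtx);
        ++p_workers;
    }
}

int run_workers(int nthreads, int times) {
    p_workers = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back(add_worker_n, times);
    for (auto& th : ts) th.join();
    return p_workers;
}
```

No increment is lost even though p_workers is an ordinary int, because every access happens under the mutex.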
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: volatile?

Post by lucasart »

syzygy wrote:
lucasart wrote:OK, that means for built-in types read/write operations are atomic. This is because built-in types have a size that divides the cache line size. Hence, in the absence of unaligned memory access (which you would really have to provoke on purpose with some ugly C-style reinterpretation of pointers), you are guaranteed that they don't cross a cache line.
Yes, but remember that this is platform-dependent. It might be true for all platforms we know, but it is not mandated by C/C++ or POSIX (actually it is not true for all platforms we know: uint64 is a built-in type on 32-bit x86, but not atomic).
For example, I'm wondering if in this line of code:
https://github.com/lucasart/Sensei/blob ... i.cc#L5861
I can remove the lock protection.

If I can assume that 'p_workers++' is an atomic operation,
It is not atomic. Whether it is implemented as a single x86 increment instruction or as separate loads and stores, it is not atomic.

If you want atomic increment, you need to use inline assembly to generate a "lock inc" instruction, or gcc intrinsics (e.g. __atomic_fetch_add), or whatever the language provides (at least C++11 and I suppose C11 do).
Is there anything in the C++ standard that forbids 2/, and guarantees that the incrementation will be atomic?
The C++ standard allows both and neither is atomic.
The C++ standard before C++11 does not know about multithreading, so it has no notion of atomicity. It guarantees nothing at all as far as multithreading is concerned.

I don't think POSIX/pthreads makes any atomicity guarantees, but I might be wrong. Certainly it does not guarantee that p_workers++ is atomic.

C++11 provides atomic primitives, but does not guarantee that p_workers++ is atomic.
Should I define the variable p_workers as std::atomic<int> in order to get this guarantee?
I'm afraid that would only guarantee that reads and writes are atomic, not that ++ is atomic.
Thank you for all the explanations. Very clear.

I've got some RTFM to do :wink:
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

lucasart wrote:Thank you for all the explanations. Very clear.

I've got some RTFM to do :wink:
You're welcome :-)

It seems atomic increment in C++11 can be done using std::atomic_fetch_add. As far as I understand, the variable has to be declared using std::atomic for it to work.

I'm still not sure whether ++ on a std::atomic type is executed atomically, but after googling a bit I think it is.
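It is: ++ on a std::atomic<int> is specified as a single atomic read-modify-write (equivalent to fetch_add(1) plus one). A quick sketch that would lose increments with a plain int but not with std::atomic:

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> counter{0};

// Several threads hammer the counter concurrently; because operator++
// on std::atomic is an atomic read-modify-write, no update is lost.
int hammer(int nthreads, int iters) {
    counter = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([iters] {
            for (int i = 0; i < iters; ++i)
                ++counter;              // atomic increment
        });
    for (auto& th : ts) th.join();
    return counter.load();
}
```

With a plain int counter the result would typically come up short of nthreads * iters; with std::atomic<int> it is exact every run.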
User avatar
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: volatile?

Post by hgm »

syzygy wrote:Already on x86 there is no general guarantee that the value read from cache is identical to the value stored in DRAM by another thread, especially if you consider multi-socket systems. What is guaranteed (on x86, not on other architectures) is that if CPU1 writes to A and then to B, and (some small time later) CPU2 reads B and then A, it will retrieve the new value from A if it retrieved the new value from B. But it is OK if it retrieves old (cached) values for both A and B or if it retrieves the old value for B and the new value for A.
That amounts to the same thing, does it not? It only means the CPU that reads can be too early. There is no way to compare the absolute time scale on different CPUs.

If one CPU writes to a location in its cache, it broadcasts the address on the bus, so other CPUs invalidate any copies they might be holding. That should also work for multi-socket.

Indeed I am talking only about Intel architecture.