volatile?

Rein Halbersma
Posts: 772
Joined: Tue May 22, 2007 11:13 am

Re: volatile?

Post by Rein Halbersma »

bob wrote:
syzygy wrote:
bob wrote:
Pretty funny discussion however, since the KERNEL uses volatile for its own spin locks among other things..
Your insight in these matters is just amazing.
Just factual. Want me to show you the volatile ints in the kernel source for spin locks?
Yes, please show us:

https://github.com/torvalds/linux/blob/ ... spinlock.h
https://github.com/torvalds/linux/blob/ ... spinlock.c
syzygy
Posts: 5978
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:I'd bet performance IS important for MOST of us in computer chess.
In so far as it was a factor in Lucas' question, the answer is that the use of volatiles in Senpai only slows down the engine. (I've tested and #define'ing volatile away gave a very slight speedup, just 0.5%.)
Hence my comment.
The one where you argued that the compiler won't reload non-volatiles after a pthread_mutex_lock()? Right...
Correct. If you include the pthread_mutex_lock source into the compiled program, it will optimize right across it, because it can see all the side effects (or the lack thereof). And it is not JUST pthread_mutex_lock() where this is a problem.
Yes, if you want to mess things up you can always mess things up.
You claim the compiler has specific knowledge about that procedure.
I never claimed that. I claimed: POSIX guarantees that it works. And indeed, it works.

That it works because pthread_mutex_lock() is a function call to code that the compiler knows nothing about is irrelevant. That is merely how the standard is implemented. Users of a POSIX-compliant system do not have to look under the hood of the system but can simply rely on the standard.

Again, to appreciate this requires an ability of abstraction that some are sorely lacking. This is the strcpy() story all over again.
Just factual. Want me to show you the volatile ints in the kernel source for spin locks?
Sure.

Maybe you can also explain why the kernel does not use volatile for shared variables? How come it does not need to do that to force reloading from memory etc.?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:I'd bet performance IS important for MOST of us in computer chess.
In so far as it was a factor in Lucas' question, the answer is that the use of volatiles in Senpai only slows down the engine. (I've tested and #define'ing volatile away gave a very slight speedup, just 0.5%.)
Hence my comment.
The one where you argued that the compiler won't reload non-volatiles after a pthread_mutex_lock()? Right...
Correct. If you include the pthread_mutex_lock source into the compiled program, it will optimize right across it, because it can see all the side effects (or the lack thereof). And it is not JUST pthread_mutex_lock() where this is a problem.
Yes, if you want to mess things up you can always mess things up.
You claim the compiler has specific knowledge about that procedure.
I never claimed that. I claimed: POSIX guarantees that it works. And indeed, it works.

That it works because pthread_mutex_lock() is a function call to code that the compiler knows nothing about is irrelevant. That is merely how the standard is implemented. Users of a POSIX-compliant system do not have to look under the hood of the system but can simply rely on the standard.

Again, to appreciate this requires an ability of abstraction that some are sorely lacking. This is the strcpy() story all over again.
Just factual. Want me to show you the volatile ints in the kernel source for spin locks?
Sure.

Maybe you can also explain why the kernel does not use volatile for shared variables? How come it does not need to do that to force reloading from memory etc.?
I specifically mentioned spin locks. As far as the kernel, why don't you log on to a linux box, cd to /usr/src/kernels/-whatever-, and start recursively using grep to search for volatile?

A sample from 3.13.6 on fedora 20:

include/linux/pagemap.h: volatile char c;
include/linux/pagemap.h: volatile char c;
include/linux/parport.h: volatile long int timeout;
include/linux/parport.h: volatile enum ieee1284_phase phase;
include/linux/spinlock_types_up.h: volatile unsigned int slock;
include/net/ip_vs.h: volatile __u32 flags; /* status flags */
include/net/ip_vs.h: volatile unsigned long timeout; /* timeout */
include/net/ip_vs.h: volatile __u16 state; /* state info */
include/net/ip_vs.h: volatile __u16 old_state; /* old state, to be used for
include/net/ip_vs.h: volatile unsigned int flags; /* dest status flags */
include/net/ip_vs.h: volatile int sync_state;
include/net/ip_vs.h: volatile int master_syncid;
include/net/ip_vs.h: volatile int backup_syncid;
include/net/sock.h: volatile unsigned char skc_state;
include/scsi/scsi_cmnd.h: volatile int Status;
include/scsi/scsi_cmnd.h: volatile int Message;
include/scsi/scsi_cmnd.h: volatile int have_data_in;
include/scsi/scsi_cmnd.h: volatile int sent_command;
include/scsi/scsi_cmnd.h: volatile int phase;
include/video/gbe.h: volatile uint32_t ctrlstat; /* general control */
include/video/gbe.h: volatile uint32_t dotclock; /* dot clock PLL control */
include/video/gbe.h: volatile uint32_t i2c; /* crt I2C control */
include/video/gbe.h: volatile uint32_t sysclk; /* system clock PLL control */
include/video/gbe.h: volatile uint32_t i2cfp; /* flat panel I2C control */
include/video/gbe.h: volatile uint32_t id; /* device id/chip revision */
include/video/gbe.h: volatile uint32_t config; /* power on configuration [1] */
include/video/gbe.h: volatile uint32_t bist; /* internal bist status [1] */
include/video/gbe.h: volatile uint32_t vt_xy; /* current dot coords */
include/video/gbe.h: volatile uint32_t vt_xymax; /* maximum dot coords */
include/video/gbe.h: volatile uint32_t vt_vsync; /* vsync on/off */
include/video/gbe.h: volatile uint32_t vt_hsync; /* hsync on/off */
include/video/gbe.h: volatile uint32_t vt_vblank; /* vblank on/off */
include/video/gbe.h: volatile uint32_t vt_hblank; /* hblank on/off */
include/video/gbe.h: volatile uint32_t vt_flags; /* polarity of vt signal
include/video/gbe.h: volatile uint32_t vt_intr01; /* intr 0,1 y coords */
include/video/gbe.h: volatile uint32_t vt_intr23; /* intr 2,3 y coords */
include/video/gbe.h: volatile uint32_t fp_hdrv; /* flat panel hdrv on/off */
include/video/gbe.h: volatile uint32_t fp_vdrv; /* flat panel vdrv on/off */
include/video/gbe.h: volatile uint32_t fp_de; /* flat panel de on/off */
include/video/gbe.h: volatile uint32_t vt_hpixen; /* intrnl horiz pixel on/off */
include/video/gbe.h: volatile uint32_t vt_vpixen; /* intrnl vert pixel on/off */
include/video/gbe.h: volatile uint32_t vt_hcmap; /* cmap write (horiz) */
include/video/gbe.h: volatile uint32_t vt_vcmap; /* cmap write (vert) */
include/video/gbe.h: volatile uint32_t did_start_xy; /* eol/f did/xy reset val */
include/video/gbe.h: volatile uint32_t crs_start_xy; /* eol/f crs/xy reset val */
include/video/gbe.h: volatile uint32_t vc_start_xy; /* eol/f vc/xy reset val */
include/video/gbe.h: volatile uint32_t ovr_width_tile;/*overlay plane ctrl 0 */
include/video/gbe.h: volatile uint32_t ovr_inhwctrl; /* overlay plane ctrl 1 */
include/video/gbe.h: volatile uint32_t ovr_control; /* overlay plane ctrl 1 */
include/video/gbe.h: volatile uint32_t frm_size_tile;/* normal plane ctrl 0 */
include/video/gbe.h: volatile uint32_t frm_size_pixel;/*normal plane ctrl 1 *
include/asm-generic/io.h:static inline unsigned long virt_to_phys(volatile void *address)
include/asm-generic/io.h:static inline unsigned long virt_to_bus(volatile void *address)
include/drm/drmP.h: __volatile__ int waiting; /**< On kernel DMA queue */
include/drm/drmP.h: __volatile__ int pending; /**< On hardware DMA queue */


That's just a few, selectively picked out because the lines were short.

No volatiles? I found almost a thousand with a quick search. I didn't check 10 levels deep either...

Other questions?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob »

Rein Halbersma wrote:
bob wrote:
syzygy wrote:
bob wrote:
Pretty funny discussion however, since the KERNEL uses volatile for its own spin locks among other things..
Your insight in these matters is just amazing.
Just factual. Want me to show you the volatile ints in the kernel source for spin locks?
Yes, please show us:

https://github.com/torvalds/linux/blob/ ... spinlock.h
https://github.com/torvalds/linux/blob/ ... spinlock.c
See my reply to Ronald right below this post. I found a thousand just going a few levels deep in the directory structure for 3.13.6, the current fedora 20 kernel.

Care to try again?

Or at least LOOK again?

It looks like the lock stuff was completely rewritten, eliminating the spin locks that burn cycles, adding a queue that everyone lines up in to gain access. Cute idea. STILL plenty of volatiles around. The lock used in Crafty came from older kernel sources. Still works flawlessly.
syzygy
Posts: 5978
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

Ok, I will not check whether they all fall under one of the exceptions listed in https://www.kernel.org/doc/Documentatio ... armful.txt (prima facie most seem to). And your "offer" related to spinlocks (and spinlock_types_up.h smells like uniprocessor), but never mind.

Fact is, most shared data in the kernel is not volatile.

The reason is pretty simple: most shared data is accessed under lock protection.
Those locks were not amateuristically written by Mr Hyatt, but include the necessary compiler and hardware memory barriers.

Code:

Consider a typical block of kernel code:

    spin_lock(&the_lock);
    do_something_on(&shared_data);
    do_something_else_with(&shared_data);
    spin_unlock(&the_lock);

If all the code follows the locking rules, the value of shared_data cannot
change unexpectedly while the_lock is held.  Any other code which might
want to play with that data will be waiting on the lock.  The spinlock
primitives act as memory barriers - they are explicitly written to do so -
meaning that data accesses will not be optimized across them.  So the
compiler might think it knows what will be in shared_data, but the
spin_lock() call, since it acts as a memory barrier, will force it to
forget anything it knows.  There will be no optimization problems with
accesses to that data.
Note: If all the code follows the locking rules
These "rules" do not allow you to roll your own amateuristic locks by randomly copying and pasting.

Senpai, which this thread is about, accesses all shared data under lock protection (well, not the tt table). Senpai declares this data as volatile (but not the tt table, I think, but I did not check).

Fact is, all these volatile declarations can be safely removed from Senpai. It would result in a small speedup.
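
A minimal sketch of the point made in the kernel text quoted above, assuming plain POSIX threads rather than Senpai's actual code: the shared counter is deliberately not volatile, and every access happens between pthread_mutex_lock() and pthread_mutex_unlock(). The names shared_counter, counter_lock, NUM_THREADS, INCREMENTS and run_counter_demo are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4       /* hypothetical demo values */
#define INCREMENTS  100000

static long shared_counter = 0;  /* shared between threads, NOT volatile */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker increments the counter only while holding the lock. */
static void *worker(void *arg) {
  (void)arg;
  for (int n = 0; n < INCREMENTS; n++) {
    pthread_mutex_lock(&counter_lock);
    shared_counter++;            /* read-modify-write under lock protection */
    pthread_mutex_unlock(&counter_lock);
  }
  return NULL;
}

/* Starts the workers, waits for them, and returns the final count. */
long run_counter_demo(void) {
  pthread_t threads[NUM_THREADS];
  shared_counter = 0;
  for (int t = 0; t < NUM_THREADS; t++) {
    int rc = pthread_create(&threads[t], NULL, worker, NULL);
    assert(rc == 0);
    (void)rc;
  }
  for (int t = 0; t < NUM_THREADS; t++)
    pthread_join(threads[t], NULL);
  return shared_counter;
}
```

Calling run_counter_demo() from main() and linking with -pthread should give NUM_THREADS * INCREMENTS, since POSIX requires the mutex calls to synchronize memory between threads.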

Fabien has explained why he uses the volatile keyword. It is clear that speed is not his main concern at this stage in the development of his new engine.

Maybe I should just ask this question: can you explain why Senpai does not need to use volatile for its shared variables?

Btw, in the post where I wrote
syzygy wrote:For multithreaded programs using volatile is not necessary if you are properly using the synchronisation primitives of a thread library, e.g. pthreads or C++11 threads.
you probably never reached the paragraph immediately following it:
syzygy wrote:If you don't want to be restricted by a thread library, but (in addition) want to rely on how the compiler will map your code to the machine, then volatile is useful. Your program will be full of undefined behavior, but if that is a conscious choice that is not necessarily bad. It will just not be "standard C/C++ plus pthreads" or "standard C11/C++11".
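
For comparison, a sketch of what the "standard C11" route could look like for a flag that one thread polls and another thread sets, without a lock and without volatile. This is an illustration only, not code from Senpai or Crafty; stop_flag, iterations, searcher and run_stop_demo are hypothetical names, and pthreads is used merely to start the thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int stop_flag;    /* hypothetical: set to 1 to stop the worker */
static atomic_long iterations;  /* loop iterations the worker completed */

/* Worker: keeps "searching" until another thread sets stop_flag. */
static void *searcher(void *arg) {
  (void)arg;
  /* atomic_load performs a real load on every pass through the loop */
  while (!atomic_load(&stop_flag))
    atomic_fetch_add(&iterations, 1);
  return NULL;
}

/* Starts the worker, lets it run briefly, stops it, returns its count. */
long run_stop_demo(void) {
  pthread_t t;
  atomic_store(&stop_flag, 0);
  atomic_store(&iterations, 0);
  int rc = pthread_create(&t, NULL, searcher, NULL);
  assert(rc == 0);
  (void)rc;
  while (atomic_load(&iterations) < 1000) {
    /* wait until the worker has made some progress */
  }
  atomic_store(&stop_flag, 1);  /* tell the worker to stop */
  pthread_join(t, NULL);
  return atomic_load(&iterations);
}
```

The default atomic_load/atomic_store are sequentially consistent; weaker memory orders are possible but are left out of this sketch.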
syzygy
Posts: 5978
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

bob wrote:The lock used in Crafty came from older kernel sources. Still works flawlessly.
You can safely remove the "volatile" from "volatile int *lock" in this code:

Code:

static void __inline__ LockX86(volatile int *lock) {
  int dummy;
  asm __volatile__(
      "1:          movl    $1, %0"   "\n\t"
      "            xchgl   (%1), %0" "\n\t"
      "            testl   %0, %0"   "\n\t"
      "            jz      3f"       "\n\t"
      "2:          pause"            "\n\t"
      "            movl    (%1), %0" "\n\t"
      "            testl   %0, %0"   "\n\t"
      "            jnz     2b"       "\n\t"
      "            jmp     1b"       "\n\t"
      "3:"                           "\n\t"
      :"=&q"(dummy)
      :"q"(lock)
      :"cc", "memory");
}
static void __inline__ UnlockX86(volatile int *lock) {
  int dummy;
  asm __volatile__(
      "movl    $0, (%1)"
      :"=&q"(dummy)
      :"q"(lock));
}
The __volatile__ in "asm __volatile__" is important, though.

Can you explain why the shared variable gets reloaded:

Code:

#define LOCK(lock) \
do { \
  int dummy; \
  asm __volatile__( \
      "1:          movl    $1, %0"   "\n\t" \
      "            xchgl   (%1), %0" "\n\t" \
      "            testl   %0, %0"   "\n\t" \
      "            jz      3f"       "\n\t" \
      "2:          pause"            "\n\t" \
      "            movl    (%1), %0" "\n\t" \
      "            testl   %0, %0"   "\n\t" \
      "            jnz     2b"       "\n\t" \
      "            jmp     1b"       "\n\t" \
      "3:"                           "\n\t" \
      :"=&q"(dummy) \
      :"q"(lock) \
      :"cc", "memory"); \
} while (0)
#define UNLOCK(lock) \
do { \
  int dummy; \
  asm __volatile__( \
      "movl    $0, (%1)" \
      :"=&q"(dummy) \
      :"q"(lock)); \
} while (0)

int variable; // shared variable with another thread
int smp_lock;

int i, j;

int funct(void) {

  i = variable;

  LOCK(smp_lock);

  j = variable;

  UNLOCK(smp_lock); 

  return 0;
}

Code:

funct:
.LFB0:
	.cfi_startproc
	movl	variable(%rip), %eax
	movl	%eax, i(%rip)
	movl	smp_lock(%rip), %eax
#APP
# 37 "bla.c" 1
	1:          movl    $1, %edx
	            xchgl   (%eax), %edx
	            testl   %edx, %edx
	            jz      3f
	2:          pause
	            movl    (%eax), %edx
	            testl   %edx, %edx
	            jnz     2b
	            jmp     1b
	3:
	
# 0 "" 2
#NO_APP
	movl	variable(%rip), %eax
	movl	%eax, j(%rip)
	movl	smp_lock(%rip), %eax
#APP
# 41 "bla.c" 1
	movl    $0, (%eax)
# 0 "" 2
#NO_APP
	xorl	%eax, %eax
	ret
	.cfi_endproc
Compiled with -O3.

edit: I think I messed up a bit in the above code. Obviously the address of smp_lock should be taken. Anyway, the point is that memory accesses are not reordered across LOCK(). How come...
syzygy
Posts: 5978
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

Instead of trying to get the above right, the same code with my spinlock implementation (yeah, I also think I know better than pthreads), leaving out UNLOCK for conciseness:

Code:

#define LOCK(x) \
do { \
  __asm__ volatile ( \
  "1:\n\t" \
  "lock decl %0\n\t" \
  "jne 2f\n\t" \
  ".subsection 1\n" \
  "2:\n\t" \
  "pause\n\t" \
  "cmpl $0, %0\n\t" \
  "jg 1b\n\t" \
  "jmp 2b\n\t" \
  ".subsection 0" \
  : "+m" (*(x)) : : "memory"); \
} while (0)

int variable; // shared variable with another thread
int smp_lock;

int i, j;

int funct(int v) {

  i = variable;

  LOCK(&smp_lock);

  j = variable;

  return 0;
}
With -O3:

Code:

funct:
.LFB0:
	.cfi_startproc
	movl	variable(%rip), %eax
	movl	%eax, i(%rip)
#APP
# 26 "bla.c" 1
	1:
	lock decl smp_lock(%rip)
	jne 2f
	.subsection 1
2:
	pause
	cmpl $0, smp_lock(%rip)
	jg 1b
	jmp 2b
	.subsection 0
# 0 "" 2
#NO_APP
	movl	variable(%rip), %eax
	movl	%eax, j(%rip)
	xorl	%eax, %eax
	ret
	.cfi_endproc
Again no reordering. No need to make "variable" volatile. How come?

Interestingly I also define the lock variables themselves as volatile, but that does not seem to be necessary (or do anything at all) if the variable is accessed only by asm __volatile__'s anyway. But maybe the compiler would otherwise be allowed to copy "smp_lock" to a local variable, apply the LOCK() to that, then copy the variable back later. That's pretty unlikely to happen though. Anyway, making the smp_lock volatile won't hurt. Or better, just use pthread_mutex_lock() and never screw up.

Btw, for inlining my spinlock seems better than crafty's. Crafty's was written not to be inlined.
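
As a portable counterpart to the hand-written x86 spinlocks in this post, here is a sketch of a test-and-set spinlock built on C11's atomic_flag. It is not syzygy's or Crafty's implementation; demo_spin_lock, demo_spin_unlock, protected_total and run_spin_demo are hypothetical names, and the thread and iteration counts are arbitrary.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define SPIN_THREADS 4      /* hypothetical demo values */
#define SPIN_ROUNDS  50000

static atomic_flag spin = ATOMIC_FLAG_INIT;
static long protected_total = 0;  /* plain long, not volatile */

/* Spin until the flag was previously clear; acquire ordering on success. */
static void demo_spin_lock(void) {
  while (atomic_flag_test_and_set_explicit(&spin, memory_order_acquire)) {
    /* busy-wait */
  }
}

/* Release ordering publishes the writes made inside the critical section. */
static void demo_spin_unlock(void) {
  atomic_flag_clear_explicit(&spin, memory_order_release);
}

static void *spin_worker(void *arg) {
  (void)arg;
  for (int n = 0; n < SPIN_ROUNDS; n++) {
    demo_spin_lock();
    protected_total++;
    demo_spin_unlock();
  }
  return NULL;
}

/* Runs the workers and returns the total accumulated under the spinlock. */
long run_spin_demo(void) {
  pthread_t threads[SPIN_THREADS];
  protected_total = 0;
  for (int t = 0; t < SPIN_THREADS; t++) {
    int rc = pthread_create(&threads[t], NULL, spin_worker, NULL);
    assert(rc == 0);
    (void)rc;
  }
  for (int t = 0; t < SPIN_THREADS; t++)
    pthread_join(threads[t], NULL);
  return protected_total;
}
```

Unlike the asm versions, this sketch has no pause instruction and spins on the atomic operation itself, so it says nothing about relative performance.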
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob »

syzygy wrote:Ok, I will not check whether they all fall under one of the exceptions listed in https://www.kernel.org/doc/Documentatio ... armful.txt (prima facie most seem to). And your "offer" related to spinlocks (and spinlock_types_up.h smells like uniprocessor), but never mind.

Fact is, most shared data in the kernel is not volatile.
Fact is, MOST shared data in Crafty is not volatile. Big surprise? Volatile has a specific purpose.

Here's a complete list of mine, with an explanation for each:
The following is in the tree structure, which is shared at a split point. ANY thread can set "stop" to true, telling whichever thread is currently using this split block to stop: we've found a fail high at this split point, so your search is pointless.

chess.h: volatile int stop;

The following is also in the tree structure and can change dynamically as threads complete their search. Other threads might still need the ids for any helpers.

chess.h: struct tree *volatile siblings[CPUS], *parent;

The following is simply the number of processors working at this split point. At times, there is a thread waiting for this to hit zero so it can exit ThreadWait() and return to the previous ply. Since this can change spontaneously, it is volatile.

chess.h: volatile int nprocs;

The following is a simple flag that says "this split block is in use". Only the owner will change it to zero, but the code to allocate a split block needs the correct value to avoid allocating a used block, or not allocating a free block.

chess.h: volatile int used;

The following is one per thread, used as a pointer to work it needs to do. Idle threads spin waiting on tree to become non-zero. Since they are in a tight spin loop, repeatedly reading the value from L1 cache, they need to know that they must access the value each "iteration" to notice they now have work to do.

chess.h: TREE * volatile tree;

Count of the number of idle threads. Changes as threads start to work, and finish work. Search() tests this (and split below) to see if it needs to do a split.

data.h:extern volatile int smp_idle;

This is an optimization so that once one cpu starts to split work, the others will not queue up waiting to acquire the lock to do a split, since there will be no cpus left to split with.

data.h:extern volatile int smp_split;

The following is used to guarantee that all threads have been started, and that they have initialized all of their local data (which on a numa box faults those pages into their local memory where we want them). Without this, threads could access each other's local memory and cause it to be placed on the wrong numa node.

data.h:extern volatile int initialized_threads;

You can look at chess.h and data.h to see how many shared values there actually are. Most are not made volatile because they don't get changed spontaneously by other threads.

The reason is pretty simple: most shared data is accessed under lock protection.
Sorry, but no. Just as in Crafty, LOTS of data is shared but not modified. You only need locks where there is potential contention/races that need to be avoided.

Those locks were not amateuristically written by Mr Hyatt, but include the necessary compiler and hardware memory barriers.

Your ignorance is showing. My lock was NOT "amateurishly" written by me. It was taken DIRECTLY from the kernel source when I started the parallel search code circa 1996. Wonder who "amateurishly" wrote that code? Initials are "LT" for the record. Since x86 is always an in-order store architecture, barriers are not needed. The fences are needed here and there for I/O, obviously.

Code:

Consider a typical block of kernel code:

    spin_lock(&the_lock);
    do_something_on(&shared_data);
    do_something_else_with(&shared_data);
    spin_unlock(&the_lock);

If all the code follows the locking rules, the value of shared_data cannot
change unexpectedly while the_lock is held.  Any other code which might
want to play with that data will be waiting on the lock.  The spinlock
primitives act as memory barriers - they are explicitly written to do so -
meaning that data accesses will not be optimized across them.  So the
compiler might think it knows what will be in shared_data, but the
spin_lock() call, since it acts as a memory barrier, will force it to
forget anything it knows.  There will be no optimization problems with
accesses to that data.
Note: If all the code follows the locking rules
These "rules" do not allow you to roll your own amateuristic locks by randomly copying and pasting.

Senpai, which this thread is about, accesses all shared data under lock protection (well, not the tt table). Senpai declares this data as volatile (but not the tt table, I think, but I did not check).

Fact is, all these volatile declarations can be safely removed from Senpai. It would result in a small speedup.

Fabien has explained why he uses the volatile keyword. It is clear that speed is not his main concern at this stage in the development of his new engine.

Maybe I should just ask this question: can you explain why Senpai does not need to use volatile for its shared variables?

If you use locks on ANY shared variable where writes are done, the problem goes away at a significant performance cost in some places. [Edit: I mean using locks everywhere a shared variable is accessed, IF and only if it is modified somewhere as well. I thought it was clear, but noticed it was potentially ambiguous.] For example, when I give a thread work to do, the thread is busy working within a few nanoseconds. Do that with a lock. First you smoke the cache bus with all the contention, plus you get to wait for all of the read-for-ownerships to be synchronized. I take nanoseconds every time when it works FLAWLESSLY. Just because someone doesn't like the volatile approach does NOT mean it does not work correctly AND efficiently. I do it where needed. I use locks where appropriate. All with the goal of maximizing performance.

Btw, in the post where I wrote
syzygy wrote:For multithreaded programs using volatile is not necessary if you are properly using the synchronisation primitives of a thread library, e.g. pthreads or C++11 threads.
you probably never reached the paragraph immediately following it:
syzygy wrote:If you don't want to be restricted by a thread library, but (in addition) want to rely on how the compiler will map your code to the machine, then volatile is useful. Your program will be full of undefined behavior, but if that is a conscious choice that is not necessarily bad. It will just not be "standard C/C++ plus pthreads" or "standard C11/C++11".
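
The lock-free handoff bob describes earlier in this post (an idle thread spinning until its work pointer becomes non-zero) can also be written with C11 atomics. The following is only a sketch under that assumption, not Crafty's code: work_item, pending_work, idle_thread and run_handoff_demo are hypothetical names, and the "work" is a placeholder computation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical unit of work handed to an idle thread. */
typedef struct {
  int depth;
  int result;
} work_item;

/* NULL while the thread is idle; non-NULL once work is available. */
static _Atomic(work_item *) pending_work = NULL;

/* Idle thread: spins until it sees a non-NULL pointer, then does the work. */
static void *idle_thread(void *arg) {
  (void)arg;
  work_item *w;
  /* acquire pairs with the release store in run_handoff_demo() */
  while ((w = atomic_load_explicit(&pending_work,
                                   memory_order_acquire)) == NULL) {
    /* busy-wait */
  }
  w->result = w->depth * 2;  /* placeholder for the real search */
  return NULL;
}

/* Starts an idle thread, hands it one work item, returns the result. */
int run_handoff_demo(void) {
  static work_item item;
  pthread_t t;
  atomic_store(&pending_work, (work_item *)NULL);
  int rc = pthread_create(&t, NULL, idle_thread, NULL);
  assert(rc == 0);
  (void)rc;
  item.depth = 21;
  item.result = 0;
  /* release: the writes to item become visible before the pointer does */
  atomic_store_explicit(&pending_work, &item, memory_order_release);
  pthread_join(t, NULL);
  return item.result;
}
```

On x86 the acquire load and release store typically compile to ordinary moves, so a handoff written this way should not need the bus-locking cost of a mutex; that is an expectation about common compilers, not a measurement.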
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob »

syzygy wrote:
bob wrote:The lock used in Crafty came from older kernel sources. Still works flawlessly.
You can safely remove the "volatile" from "volatile int *lock" in this code:

Code:

static void __inline__ LockX86(volatile int *lock) {
  int dummy;
  asm __volatile__(
      "1:          movl    $1, %0"   "\n\t"
      "            xchgl   (%1), %0" "\n\t"
      "            testl   %0, %0"   "\n\t"
      "            jz      3f"       "\n\t"
      "2:          pause"            "\n\t"
      "            movl    (%1), %0" "\n\t"
      "            testl   %0, %0"   "\n\t"
      "            jnz     2b"       "\n\t"
      "            jmp     1b"       "\n\t"
      "3:"                           "\n\t"
      :"=&q"(dummy)
      :"q"(lock)
      :"cc", "memory");
}
static void __inline__ UnlockX86(volatile int *lock) {
  int dummy;
  asm __volatile__(
      "movl    $0, (%1)"
      :"=&q"(dummy)
      :"q"(lock));
}
The __volatile__ in "asm __volatile__" is important, though.

Can you explain why the shared variable gets reloaded:
You DO realize this program runs on other than X86? The alpha compiler from DEC had a software lock function that did what mine does above, but in C, using the interlocked exchange() mechanism their compiler used. Remove the volatile and the alpha would hang there.

Obviously not needed for my inline asm, since I explicitly reload the value every time, but think outside the box, in a BIGGER box that includes alpha, Cray, Sun, IBM PPC, MIPS, etc...

Code:

#define LOCK(lock) \
do { \
  int dummy; \
  asm __volatile__( \
      "1:          movl    $1, %0"   "\n\t" \
      "            xchgl   (%1), %0" "\n\t" \
      "            testl   %0, %0"   "\n\t" \
      "            jz      3f"       "\n\t" \
      "2:          pause"            "\n\t" \
      "            movl    (%1), %0" "\n\t" \
      "            testl   %0, %0"   "\n\t" \
      "            jnz     2b"       "\n\t" \
      "            jmp     1b"       "\n\t" \
      "3:"                           "\n\t" \
      :"=&q"(dummy) \
      :"q"(lock) \
      :"cc", "memory"); \
} while (0)
#define UNLOCK(lock) \
do { \
  int dummy; \
  asm __volatile__( \
      "movl    $0, (%1)" \
      :"=&q"(dummy) \
      :"q"(lock)); \
} while (0)

int variable; // shared variable with another thread
int smp_lock;

int i, j;

int funct(void) {

  i = variable;

  LOCK(smp_lock);

  j = variable;

  UNLOCK(smp_lock); 

  return 0;
}

Code:

funct:
.LFB0:
	.cfi_startproc
	movl	variable(%rip), %eax
	movl	%eax, i(%rip)
	movl	smp_lock(%rip), %eax
#APP
# 37 "bla.c" 1
	1:          movl    $1, %edx
	            xchgl   (%eax), %edx
	            testl   %edx, %edx
	            jz      3f
	2:          pause
	            movl    (%eax), %edx
	            testl   %edx, %edx
	            jnz     2b
	            jmp     1b
	3:
	
# 0 "" 2
#NO_APP
	movl	variable(%rip), %eax
	movl	%eax, j(%rip)
	movl	smp_lock(%rip), %eax
#APP
# 41 "bla.c" 1
	movl    $0, (%eax)
# 0 "" 2
#NO_APP
	xorl	%eax, %eax
	ret
	.cfi_endproc
Compiled with -O3.

edit: I think I messed up a bit in the above code. Obviously the address of smp_lock should be taken. Anyway, the point is that memory accesses are not reordered across LOCK(). How come...
I didn't feel up to looking at what you are doing. But first things first. The volatile declaration on the inline asm is important. The compiler can't reorder across that, ever, nor can it change the location of the inline assembler by moving it up or down, before or after existing code.

If that is what you are talking about, that's the reason. Done quite intentionally, obviously.

BTW, volatile doesn't apply to asm code. There is no such thing as "volatile" there. With Crafty, ONLY for x86, remove it or keep it. I personally believe it makes the code more readable by being there when it is not needed, a "WARNING: this value can change spontaneously..."
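
For readers following the "How come..." question: with GCC-style extended asm, it is the "memory" entry in the clobber list of the LOCK macros quoted above that tells the compiler to discard values it has cached in registers. A minimal sketch, assuming GCC or Clang; COMPILER_BARRIER, shared_value, before, after and read_twice are hypothetical names.

```c
#include <assert.h>

/* GCC/Clang extension: an empty asm statement with a "memory" clobber.
   It emits no instructions but tells the compiler that memory may have
   changed, so values cached in registers must be reloaded. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

int shared_value;   /* plain int, not volatile */
int before, after;

/* Reads shared_value on both sides of the barrier. */
void read_twice(void) {
  before = shared_value;
  COMPILER_BARRIER();
  after = shared_value;  /* loaded again, not reused from a register */
}
```

This is a compiler barrier only. It does not emit a hardware fence, and by itself it does not make any access atomic.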
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob »

syzygy wrote:Instead of trying to get the above right, the same code with my spinlock implementation (yeah, I also think I know better than pthreads), leaving out UNLOCK for conciseness:

Code:

#define LOCK(x) \
do { \
  __asm__ volatile ( \
  "1:\n\t" \
  "lock decl %0\n\t" \
  "jne 2f\n\t" \
  ".subsection 1\n" \
  "2:\n\t" \
  "pause\n\t" \
  "cmpl $0, %0\n\t" \
  "jg 1b\n\t" \
  "jmp 2b\n\t" \
  ".subsection 0" \
  : "+m" (*(x)) : : "memory"); \
} while (0)

int variable; // shared variable with another thread
int smp_lock;

int i, j;

int funct(int v) {

  i = variable;

  LOCK(&smp_lock);

  j = variable;

  return 0;
}
With -O3:

Code:

funct:
.LFB0:
	.cfi_startproc
	movl	variable(%rip), %eax
	movl	%eax, i(%rip)
#APP
# 26 "bla.c" 1
	1:
	lock decl smp_lock(%rip)
	jne 2f
	.subsection 1
2:
	pause
	cmpl $0, smp_lock(%rip)
	jg 1b
	jmp 2b
	.subsection 0
# 0 "" 2
#NO_APP
	movl	variable(%rip), %eax
	movl	%eax, j(%rip)
	xorl	%eax, %eax
	ret
	.cfi_endproc
Again no reordering. No need to make "variable" volatile. How come?

Interestingly I also define the lock variables themselves as volatile, but that does not seem to be necessary (or do anything at all) if the variable is accessed only by asm __volatile__'s anyway. But maybe the compiler would otherwise be allowed to copy "smp_lock" to a local variable, apply the LOCK() to that, then copy the variable back later. That's pretty unlikely to happen though. Anyway, making the smp_lock volatile won't hurt. Or better, just use pthread_mutex_lock() and never screw up.

Btw, for inlining my spinlock seems better than crafty's. Crafty's was written not to be inlined.
Feel free to show me how to declare a variable as volatile in asm... But I have to think a bit "bigger" than you, because I also run my code on a half-dozen OTHER architectures, some that have a built-in locking mechanism that requires a volatile lock. Or actually requires a sort of lock_t declaration which turns into a "volatile int". For x86, since I reload the value each cycle through the spin loop, I am treating it as volatile automatically. There's no way to tell the assembler something is volatile because the assembler only takes the code you write and doesn't change anything.