Better NPS scaling for Stockfish

zullil · Post by **zullil** » Sun Mar 01, 2015 2:01 am

lucasart wrote: But what you forget to mention is that they are counter-productive in the case of HT.

Didn't forget to mention this. Actually knew nothing about it.

syzygy · Post by **syzygy** » Sun Mar 01, 2015 4:00 am

For those that might find this of interest, the C++11 code for the spinlock:

#include <atomic>

class Spinlock {
  std::atomic_int lock;
public:
  Spinlock() { lock = 1; } // Init here to workaround a bug with MSVC 2013
  void acquire() {
    while (lock.fetch_sub(1, std::memory_order_acquire) != 1)
        while (lock.load(std::memory_order_relaxed) <= 0) {}
  }
  void release() { lock.store(1, std::memory_order_release); }
};

int spin_lock(Spinlock s)
{
  s.acquire();
  return 0;
}

I've added the spin_lock() function to simulate pthread_spin_lock(). Let's see what various compilers make of this (removing alignment instructions).

Using gcc-4.7.3:

Code: Select all

_Z9spin_lockP8Spinlock:
.L2:
        movl    (%rdi), %eax
.L4:
        leal    -1(%rax), %edx
        movl    %eax, %ecx
        lock cmpxchgl   %edx, (%rdi)
        jne     .L4
        cmpl    $1, %ecx
        je      .L12
.L7:
        movl    (%rdi), %eax
        testl   %eax, %eax
        jg      .L2
        jmp     .L7
.L12:
        xorl    %eax, %eax
        ret

Both gcc-4.8.1 and gcc-4.9.0 give:

Code: Select all

_Z9spin_lockP8Spinlock:
.L2:
        movl    $-1, %eax
        lock xaddl      %eax, (%rdi)
        cmpl    $1, %eax
        je      .L7
.L5:
        movl    (%rdi), %eax
        testl   %eax, %eax
        jle     .L5
        jmp     .L2
.L7:
        xorb    %al, %al
        ret

The C++11 code is modelled after glibc's pthread_spin_lock() implementation:

Code: Select all

pthread_spin_lock:
        mov     4(%esp), %eax
1:      LOCK
        decl    0(%eax)
        jne     2f
        xor     %eax, %eax
        ret

        .align  16
2:      rep
        nop
        cmpl    $0, 0(%eax)
        jg      1b
        jmp     2b

The code produced by gcc-4.7.3 isn't very good as it uses a lock cmpxchg loop to implement the atomic subtraction of 1.

The code produced by gcc-4.8.1/4.9.0 uses lock xadd which is better. Strangely these compilers miss the fact that subtracting 1 / adding -1 can be implemented using a decrement operation. They also miss the fact that there is no need to compare the original lock value with 1; it is sufficient to test the decremented value for 0 (which does not nead the cmp). But this is a minor problem.

So how about the pause instruction. Glibc has it: pause has the same opcode as "rep nop". To add it to the C++11 implementation:

Code: Select all

#include <immintrin.h>
...

  void acquire() {
    while (lock.fetch_sub(1, std::memory_order_acquire) != 1)
        while (lock.load(std::memory_order_relaxed) <= 0) { _mm_pause(); }

Using gcc-4.8.1:

Code: Select all

_Z9spin_lockP8Spinlock:
.L2:
        movl    $-1, %eax
        lock xaddl      %eax, (%rdi)
        cmpl    $1, %eax
        je      .L7
.L5:
        movl    (%rdi), %eax
        testl   %eax, %eax
        jg      .L2
        rep nop
        jmp     .L5
.L7:
        xorb    %al, %al
        ret

Better NPS scaling for Stockfish

Re: Better NPS scaling for Stockfish

Re: Better NPS scaling for Stockfish