lockless hashing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: lockless hashing

Post by hgm »

bob wrote:Not sure where your 7 comes from.
4 bytes signature
2 bytes move + bound-type flags
2 bytes score
1 byte depth
___________ +
9 bytes

7 x 9 = 63

That leaves 1 unused byte in the cache line. This could be used to store aging flags for all 7 entries. As a move in Chess only needs 13 bits (two 6-bit square codes and a 1-bit flag to indicate 'special moves' like castlings or promotions, which are then further specified by a table index stored in the to-square), you would even have a spare bit in the move field as well (bound type only takes 2 bits). You could use that as an in-check bit. (Of course 16 bits for the score is pretty generous as well; Fairy-Max uses +/- 8000 for mate scores, which fits in 14 bits. So there would be some spare bits there as well. The depth would also never really run up to 256.)
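
For concreteness, here is a minimal sketch of how such a 63 + 1 byte bucket could be laid out. This is not code from any engine mentioned here; the struct and field names are hypothetical, only the byte budget follows the post:

Code: Select all

#include <cstdint>

// Hypothetical layout: names are invented, the bit/byte budget follows the post above.
#pragma pack(push, 1)
struct PackedTTEntry {
    uint32_t signature;     // 4 bytes: hash signature for verification
    uint16_t moveAndFlags;  // 13-bit move (from, to, special flag) + 2-bit bound type + 1 spare bit
    int16_t  score;         // 2 bytes: search score
    uint8_t  depth;         // 1 byte: remaining depth
};
#pragma pack(pop)

// Seven 9-byte entries fill 63 bytes of a 64-byte cache line;
// the one remaining byte holds aging flags for all 7 entries.
struct TTBucket {
    PackedTTEntry entry[7];
    uint8_t       age;
};

static_assert(sizeof(PackedTTEntry) == 9, "entry must stay 9 bytes");
static_assert(sizeof(TTBucket) == 64, "bucket must fit one cache line");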
flok

Re: lockless hashing

Post by flok »

bob wrote:There is no measurable lock overhead when only using one thread. The overhead comes from contention, which requires multiple threads.
I've measured it with 12 threads; same overhead.
petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: lockless hashing

Post by petero2 »

flok wrote:I don't like lockless hashing. It expects behaviour from the CPU/system that may be different in a new generation of that processor, and it is also not portable to other systems at all.
It is portable if you use C++11 atomics and memory_order_relaxed. You can look at the texel source code for an implementation.
flok

Re: lockless hashing

Post by flok »

petero2 wrote:
flok wrote:I don't like lockless hashing. It expects behaviour from the CPU/system that may be different in a new generation of that processor, and it is also not portable to other systems at all.
It is portable if you use C++11 atomics and memory_order_relaxed. You can look at the texel source code for an implementation.
Atomics use memory barriers as far as I know.
That's not free.
petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: lockless hashing

Post by petero2 »

flok wrote:
petero2 wrote:
flok wrote:I don't like lockless hashing. It expects behaviour from the CPU/system that may be different in a new generation of that processor, and it is also not portable to other systems at all.
It is portable if you use C++11 atomics and memory_order_relaxed. You can look at the texel source code for an implementation.
Atomics use memory barriers as far as I know.
That's not free.
They don't use memory barriers if you use memory_order_relaxed. On current cpu architectures they are therefore free. There is however no way to know if they will be free in all future cpu architectures.
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: lockless hashing

Post by mar »

petero2 wrote:On current cpu architectures they are therefore free.
I'm not quite sure about that. If you have many threads doing RMW operations on the same cache line, false sharing will kill performance.
Writing to a cache line that is shared among cores forces the other cores' copies of that line to be invalidated (at least on x86/x64).
In the spinlock scheme mentioned above [hash_index & mask], it would definitely be a good idea to make the locks cache-line-aligned.

When I tested atomic increment/decrement for refcounting some time ago, even with no contention atomics were about 9 times slower than plain inc/dec IIRC.
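
As an aside, here is a minimal sketch of that cache-line alignment idea (the names and the number of locks are made up, not taken from any engine in this thread):

Code: Select all

#include <atomic>
#include <cstddef>

// Pad each spinlock to its own 64-byte cache line so that threads spinning
// on different locks do not invalidate each other's lines (false sharing).
struct alignas(64) PaddedSpinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag.clear(std::memory_order_release); }
};

static_assert(sizeof(PaddedSpinlock) == 64, "one lock per cache line");

// One lock per group of hash entries, selected as locks[hash_index & LOCK_MASK].
constexpr std::size_t NUM_LOCKS = 256;          // arbitrary power of two
constexpr std::size_t LOCK_MASK = NUM_LOCKS - 1;
PaddedSpinlock locks[NUM_LOCKS];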
petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: lockless hashing

Post by petero2 »

mar wrote:
petero2 wrote:On current cpu architectures they are therefore free.
I'm not quite sure about that. If you have many threads that RMW the same cache line, false sharing will kill performance.
There are no RMW operations involved though; the hash table code only performs plain loads and stores with memory_order_relaxed. Here are the relevant parts of the source code I use:

Code: Select all

struct TTEntryStorage {
    std::atomic<U64> key;
    std::atomic<U64> data;
};

class TTEntry {
    U64 key;
    U64 data;
};

void
TranspositionTable::TTEntry::store(TTEntryStorage& ent) {
    ent.key.store(key ^ data, std::memory_order_relaxed);
    ent.data.store(data, std::memory_order_relaxed);
}

void
TranspositionTable::TTEntry::load(const TTEntryStorage& ent) {
    key = ent.key.load(std::memory_order_relaxed);
    data = ent.data.load(std::memory_order_relaxed);
    key ^= data;
}
When I compile this I get the following assembly code:

Code: Select all

store:
	movq	8(%rdi), %rax
	xorq	(%rdi), %rax
	movq	%rax, (%rsi)
	movq	8(%rdi), %rax
	movq	%rax, 8(%rsi)
	ret

load:
	movq	(%rsi), %rax
	movq	%rax, (%rdi)
	movq	8(%rsi), %rax
	xorq	%rax, (%rdi)
	movq	%rax, 8(%rdi)
	ret
There are no locked instructions and no memory barriers. I get almost the same assembly code if I change the TTEntryStorage members from std::atomic<U64> to plain U64:

Code: Select all

store:
	movq	8(%rdi), %rax
	movq	%rax, %rdx
	xorq	(%rdi), %rdx
	movq	%rax, 8(%rsi)
	movq	%rdx, (%rsi)
	ret

load:
	movq	(%rsi), %rax
	movq	8(%rsi), %rdx
	xorq	%rdx, %rax
	movq	%rdx, 8(%rdi)
	movq	%rax, (%rdi)
	ret
The only difference is that the compiler missed an optimization opportunity when using memory_order_relaxed, so that it performs an extra read (in the store case) or write (in the load case) of thread-local memory (which will be an L1 cache hit). I doubt this has a measurable speed impact, but it can be fixed at the cost of making the code longer:

Code: Select all

void
TranspositionTable::TTEntry::store(TTEntryStorage& ent) {
    U64 k = key;
    U64 d = data;
    ent.key.store(k ^ d, std::memory_order_relaxed);
    ent.data.store(d, std::memory_order_relaxed);
}

void
TranspositionTable::TTEntry::load(const TTEntryStorage& ent) {
    U64 k = ent.key.load(std::memory_order_relaxed);
    U64 d = ent.data.load(std::memory_order_relaxed);
    k ^= d;
    key = k;
    data = d;
}
With that source code I get the following assembly code:

Code: Select all

store:
	movq	8(%rdi), %rax
	movq	%rax, %rdx
	xorq	(%rdi), %rdx
	movq	%rdx, (%rsi)
	movq	%rax, 8(%rsi)
	ret

load:
	movq	(%rsi), %rax
	movq	8(%rsi), %rdx
	xorq	%rdx, %rax
	movq	%rdx, 8(%rdi)
	movq	%rax, (%rdi)
	ret
Now the only difference is that the compiler swapped the order of the last two move instructions in the store function.

In conclusion, using memory_order_relaxed for lockless hashing can be made to have zero overhead on the x64 architecture.
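
For completeness, this is roughly how a probe would use the XOR check. It is a hypothetical usage sketch building on the excerpt above, not texel code, and it assumes an accessor for the recovered key (called getKey() here) that the excerpt does not show:

Code: Select all

// If another thread overwrote one of the two words between our two relaxed
// loads, undoing the XOR yields a key that does not match the position hash,
// and the torn entry is simply treated as a miss.
bool probe(const TTEntryStorage& ent, U64 posHash, TTEntry& result) {
    result.load(ent);                   // two relaxed loads + XOR, as above
    return result.getKey() == posHash;  // mismatch => torn entry or different position
}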
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: lockless hashing

Post by bob »

flok wrote:
bob wrote:There is no measurable lock overhead when only using one thread. The overhead comes from contention, which requires multiple threads.
I've measured it with 12 threads; same overhead.
How is your parallel search scaling with 12 cores? If it does poorly, how do you know what lock overhead looks like? It is pretty difficult to measure it.
flok

Re: lockless hashing

Post by flok »

bob wrote:
flok wrote:
bob wrote:There is no measurable lock overhead when only using one thread. The overhead comes from contention, which requires multiple threads.
I've measured it with 12 threads; same overhead.
How is your parallel search scaling with 12 cores? If it does poorly, how do you know what lock overhead looks like? It is pretty difficult to measure it.
Well, to be honest, I'm not sure.
As mentioned in another thread: it uses almost all CPU time with hardly any NPS or time-to-depth gain.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: lockless hashing

Post by jdart »

I have 32-bit scores (I put this in to aid my auto-tuning code), and I store both the static score (if available) and the search score. So I have three 64-bit quantities per hash entry, or 24 bytes. There is a little wasted space, but not much.
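
One possible 24-byte layout matching that description (the field names and exact packing are guesses, not Arasan's actual struct):

Code: Select all

#include <cstdint>

// Hypothetical sketch: three 64-bit words per entry, with 32-bit scores
// for both the search score and the static eval.
struct WideTTEntry {
    uint64_t key;          // 64-bit hash signature
    int32_t  searchScore;  // 32-bit search score
    int32_t  staticScore;  // 32-bit static eval, if available
    uint16_t move;         // best move
    uint8_t  depth;        // remaining depth
    uint8_t  flags;        // bound type, age, "static score valid" bit, ...
    uint32_t padding;      // the little wasted space
};

static_assert(sizeof(WideTTEntry) == 24, "three 64-bit words per entry");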

--Jon