lockless hashing

Discussion of chess software programming and technical issues.

Moderators: hgm, Harvey Williamson, bob

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
User avatar
hgm
Posts: 23605
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: lockless hashing

Post by hgm » Fri May 13, 2016 3:18 pm

bob wrote:Not sure where your 7 comes from.
4 bytes signature
2 bytes move + bound-type flags
2 bytes score
1 byte depth
___________ +
9 bytes

7 x 9 = 63

That leaves 1 unused byte in the cache line. This could be used to store aging flags for all 7 entries. As a move in Chess only needs 13 bits (two 6-bit square codes and a 1-bit flag to indicate 'special moves' like castlings or promotions which are then further specified by a table index stored in the to-square), you would even have a spare bit in the move field as well (bound type only takes 2 bits). You could use that as an in-check bit. (Of course 16 bits for the score is pretty generous as well; Fairy-Max uses +/- 8000 for mate scores, which fits in 14 bits. So there would be some spare bits there as well. The depth would also never really run upto 256.)

flok

Re: lockless hashing

Post by flok » Fri May 13, 2016 4:42 pm

bob wrote:There is no measurable lock overhead when only using one thread. The overhead comes from contention, which requires multiple threads.
I've measured it with 12 threads; same overhead.

petero2
Posts: 581
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Re: lockless hashing

Post by petero2 » Fri May 13, 2016 5:14 pm

flok wrote:I don't like lockless hashing. It expects behaviour from the cpu/system that may be different for a new generation of that processor and it is also not portable to other systems at all.
It is portable if you use C++11 atomics and memory_order_relaxed. You can look at the texel source code for an implementation.

flok

Re: lockless hashing

Post by flok » Fri May 13, 2016 5:22 pm

petero2 wrote:
flok wrote:I don't like lockless hashing. It expects behaviour from the cpu/system that may be different for a new generation of that processor and it is also not portable to other systems at all.
It is portable if you use C++11 atomics and memory_order_relaxed. You can look at the texel source code for an implementation.
Atomics use memory barriers as far as I know.
That's not free.

petero2
Posts: 581
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Re: lockless hashing

Post by petero2 » Fri May 13, 2016 5:26 pm

flok wrote:
petero2 wrote:
flok wrote:I don't like lockless hashing. It expects behaviour from the cpu/system that may be different for a new generation of that processor and it is also not portable to other systems at all.
It is portable if you use C++11 atomics and memory_order_relaxed. You can look at the texel source code for an implementation.
Atomics use memory barriers as far as I know.
That's not free.
They don't use memory barriers if you use memory_order_relaxed. On current cpu architectures they are therefore free. There is however no way to know if they will be free in all future cpu architectures.

mar
Posts: 1992
Joined: Fri Nov 26, 2010 1:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: lockless hashing

Post by mar » Fri May 13, 2016 6:17 pm

petero2 wrote:On current cpu architectures they are therefore free.
I'm not quite sure about that. If you have many threads that RMW the same cache line, false sharing will kill performance.
Writing into a cacheline that's shared among cores will force them to be invalidated (at least on x86/x64).
In the spinlock sheme mentioned above [hash_index & mask] it would be definitely a good idea to make them cacheline-aligned.

When I tested atomic increment/decrement for refcounting some time ago, even with no contention atomics were about 9 times slower than plain inc/dec IIRC.

petero2
Posts: 581
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Re: lockless hashing

Post by petero2 » Fri May 13, 2016 7:13 pm

mar wrote:
petero2 wrote:On current cpu architectures they are therefore free.
I'm not quite sure about that. If you have many threads that RMW the same cache line, false sharing will kill performance.
There are no RMW operations involved though when using memory_order_relaxed. Here is the relevant parts of the source code I use:

Code: Select all

struct TTEntryStorage {
    std&#58;&#58;atomic<U64> key;
    std&#58;&#58;atomic<U64> data;
&#125;;

class TTEntry &#123;
    U64 key;
    U64 data;
&#125;;

void
TranspositionTable&#58;&#58;TTEntry&#58;&#58;store&#40;TTEntryStorage& ent&#41; &#123;
    ent.key.store&#40;key ^ data, std&#58;&#58;memory_order_relaxed&#41;;
    ent.data.store&#40;data, std&#58;&#58;memory_order_relaxed&#41;;
&#125;

void
TranspositionTable&#58;&#58;TTEntry&#58;&#58;load&#40;const TTEntryStorage& ent&#41; &#123;
    key = ent.key.load&#40;std&#58;&#58;memory_order_relaxed&#41;;
    data = ent.data.load&#40;std&#58;&#58;memory_order_relaxed&#41;;
    key ^= data;
&#125;
When I compile this I get the following assembly code:

Code: Select all

store&#58;
	movq	8&#40;%rdi&#41;, %rax
	xorq	(%rdi&#41;, %rax
	movq	%rax, (%rsi&#41;
	movq	8&#40;%rdi&#41;, %rax
	movq	%rax, 8&#40;%rsi&#41;
	ret

load&#58;
	movq	(%rsi&#41;, %rax
	movq	%rax, (%rdi&#41;
	movq	8&#40;%rsi&#41;, %rax
	xorq	%rax, (%rdi&#41;
	movq	%rax, 8&#40;%rdi&#41;
	ret
There are no locked instructions and no memory barriers. I get almost the same assembly code if I change the TTEntryStorage to U64:

Code: Select all

store&#58;
	movq	8&#40;%rdi&#41;, %rax
	movq	%rax, %rdx
	xorq	(%rdi&#41;, %rdx
	movq	%rax, 8&#40;%rsi&#41;
	movq	%rdx, (%rsi&#41;
	ret

load&#58;
	movq	(%rsi&#41;, %rax
	movq	8&#40;%rsi&#41;, %rdx
	xorq	%rdx, %rax
	movq	%rdx, 8&#40;%rdi&#41;
	movq	%rax, (%rdi&#41;
	ret
The only difference is that the compiler missed an optimization opportunity when using memory_order_relaxed, so that it performs an extra read (in the store case) or write (in the load case) from thread-local memory (which will be an L1 cache hit). I doubt this has a measurable speed impact, but it can be fixed at the cost of making the code longer:

Code: Select all

void
TranspositionTable&#58;&#58;TTEntry&#58;&#58;store&#40;TTEntryStorage& ent&#41; &#123;
    U64 k = key;
    U64 d = data;
    ent.key.store&#40;k ^ d, std&#58;&#58;memory_order_relaxed&#41;;
    ent.data.store&#40;d, std&#58;&#58;memory_order_relaxed&#41;;
&#125;

void
TranspositionTable&#58;&#58;TTEntry&#58;&#58;load&#40;const TTEntryStorage& ent&#41; &#123;
    U64 k = ent.key.load&#40;std&#58;&#58;memory_order_relaxed&#41;;
    U64 d = ent.data.load&#40;std&#58;&#58;memory_order_relaxed&#41;;
    k ^= d;
    key = k;
    data = d;
&#125;
With that source code I get the following assembly code:

Code: Select all

store&#58;
	movq	8&#40;%rdi&#41;, %rax
	movq	%rax, %rdx
	xorq	(%rdi&#41;, %rdx
	movq	%rdx, (%rsi&#41;
	movq	%rax, 8&#40;%rsi&#41;
	ret

load&#58;
	movq	(%rsi&#41;, %rax
	movq	8&#40;%rsi&#41;, %rdx
	xorq	%rdx, %rax
	movq	%rdx, 8&#40;%rdi&#41;
	movq	%rax, (%rdi&#41;
	ret
Now the only difference is that the compiler swapped the order of the last two move instructions in the store function.

In conclusion, using memory_order_relaxed for lockless hashing can be made to have zero overhead on the x64 architecture.

bob
Posts: 20471
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: lockless hashing

Post by bob » Fri May 13, 2016 10:32 pm

flok wrote:
bob wrote:There is no measurable lock overhead when only using one thread. The overhead comes from contention, which requires multiple threads.
I've measured it with 12 threads; same overhead.
How is your parallel search scaling with 12 cores? If it does poorly, how do you know what lock overhead looks like? It is pretty difficult to measure it.

flok

Re: lockless hashing

Post by flok » Fri May 13, 2016 10:39 pm

bob wrote:
flok wrote:
bob wrote:There is no measurable lock overhead when only using one thread. The overhead comes from contention, which requires multiple threads.
I've measured it with 12 threads; same overhead.
How is your parallel search scaling with 12 cores? If it does poorly, how do you know what lock overhead looks like? It is pretty difficult to measure it.
Well, to be honest, I'm not sure.
As mentioned in an other thread: it uses almost all cpu time with hardly any nps or time-to-depth gain.

jdart
Posts: 3814
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: lockless hashing

Post by jdart » Fri May 13, 2016 10:47 pm

I have 32-bit scores (I put this in to aid the auto-tuning stuff I have). And I store both the static score if available and the search score. So I have 3 64-bit quantities per hash entry, or 24 bytes. There is a little wasted space but not much.

--Jon

Post Reply