lockless hashing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: lockless hashing

Post by syzygy »

flok wrote:
bob wrote:
flok wrote:
bob wrote:There is no measurable lock overhead when only using one thread. The overhead comes from contention, which requires multiple threads.
I've measured it with 12 threads; same overhead.
How is your parallel search scaling with 12 cores? If it does poorly, how do you know what lock overhead looks like? It is pretty difficult to measure it.
Well, to be honest, I'm not sure.
As mentioned in another thread: it uses almost all CPU time with hardly any NPS or time-to-depth gain.
So, that sounds like gigantic locking overhead.

Just to be sure, do you have a CPU with 12 (hyper)threads?
What kind of locks are you using?
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: lockless hashing

Post by syzygy »

jdart wrote:I have 32-bit scores (I put this in to aid the auto-tuning stuff I have). And I store both the static score if available and the search score. So I have 3 64-bit quantities per hash entry, or 24 bytes. There is a little wasted space but not much.
Would it not be sufficient to use 32-bit scores "internally" (within the evaluation) and 16-bit scores "externally" (in hash and search)?
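For illustration, a minimal sketch of such narrowing (hypothetical helpers, not from either engine; a real engine would also adjust mate scores by ply before storing):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch: evaluate with 32-bit scores internally, but
// clamp into int16_t when storing in the hash table, and sign-extend
// back on probe. Typical engine scores fit comfortably in 16 bits.
inline int16_t score_to_tt(int32_t s) {
    return static_cast<int16_t>(std::clamp(s, -32767, 32767));
}

inline int32_t score_from_tt(int16_t s) {
    return s;  // implicit sign extension back to 32 bits
}
```

Round-tripping is lossless for any score already in the 16-bit range, which is why the split between "internal" and "external" resolution costs nothing in practice.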
flok

Re: lockless hashing

Post by flok »

syzygy wrote: Well, to be honest, I'm not sure.
As mentioned in another thread: it uses almost all CPU time with hardly any NPS or time-to-depth gain.
So, that sounds like gigantic locking overhead.

That is also the case without locking.
Just to be sure, do you have a CPU with 12 (hyper)threads?
What kind of locks are you using?
With HT indeed.
The locks are r/w locks, so a TT lookup takes a read (shared) lock.
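Folkert doesn't show his code, but a probe guarded by a reader/writer lock might look roughly like this sketch using C++17 std::shared_mutex (all names hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical sketch of a rwlock-protected transposition table:
// probes take the lock shared (many concurrent readers), stores take
// it exclusive. This illustrates the contention under discussion; it
// is not a recommendation over the lockless xor-trick.
struct Entry { uint64_t key = 0; uint64_t data = 0; };

class LockedTT {
    std::vector<Entry> table;
    mutable std::shared_mutex mtx;
public:
    explicit LockedTT(std::size_t n) : table(n) {}

    bool probe(uint64_t key, uint64_t& data) const {
        std::shared_lock<std::shared_mutex> lk(mtx);  // read/shared lock
        const Entry& e = table[key % table.size()];
        if (e.key != key) return false;
        data = e.data;
        return true;
    }

    void store(uint64_t key, uint64_t data) {
        std::unique_lock<std::shared_mutex> lk(mtx);  // write/exclusive lock
        Entry& e = table[key % table.size()];
        e.key = key;
        e.data = data;
    }
};
```

Even uncontended, every probe pays an atomic operation on the lock word, and the lock's cache line bounces between cores as threads acquire it, which is exactly the kind of overhead the xor-trick avoids.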
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: lockless hashing

Post by bob »

petero2 wrote:
mar wrote:
petero2 wrote:On current cpu architectures they are therefore free.
I'm not quite sure about that. If you have many threads that RMW the same cache line, false sharing will kill performance.
There are no RMW operations involved though when using memory_order_relaxed. Here are the relevant parts of the source code I use:

Code: Select all

struct TTEntryStorage {
    std::atomic<U64> key;
    std::atomic<U64> data;
};

class TTEntry {
    U64 key;
    U64 data;
};

void
TranspositionTable::TTEntry::store(TTEntryStorage& ent) {
    ent.key.store(key ^ data, std::memory_order_relaxed);
    ent.data.store(data, std::memory_order_relaxed);
}

void
TranspositionTable::TTEntry::load(const TTEntryStorage& ent) {
    key = ent.key.load(std::memory_order_relaxed);
    data = ent.data.load(std::memory_order_relaxed);
    key ^= data;
}
When I compile this I get the following assembly code:

Code: Select all

store:
	movq	8(%rdi), %rax
	xorq	(%rdi), %rax
	movq	%rax, (%rsi)
	movq	8(%rdi), %rax
	movq	%rax, 8(%rsi)
	ret

load:
	movq	(%rsi), %rax
	movq	%rax, (%rdi)
	movq	8(%rsi), %rax
	xorq	%rax, (%rdi)
	movq	%rax, 8(%rdi)
	ret
There are no locked instructions and no memory barriers. I get almost the same assembly code if I change the TTEntryStorage members to plain (non-atomic) U64:

Code: Select all

store:
	movq	8(%rdi), %rax
	movq	%rax, %rdx
	xorq	(%rdi), %rdx
	movq	%rax, 8(%rsi)
	movq	%rdx, (%rsi)
	ret

load:
	movq	(%rsi), %rax
	movq	8(%rsi), %rdx
	xorq	%rdx, %rax
	movq	%rdx, 8(%rdi)
	movq	%rax, (%rdi)
	ret
The only difference is that the compiler missed an optimization opportunity when using memory_order_relaxed, so that it performs an extra read (in the store case) or write (in the load case) from thread-local memory (which will be an L1 cache hit). I doubt this has a measurable speed impact, but it can be fixed at the cost of making the code longer:

Code: Select all

void
TranspositionTable&#58;&#58;TTEntry&#58;&#58;store&#40;TTEntryStorage& ent&#41; &#123;
    U64 k = key;
    U64 d = data;
    ent.key.store&#40;k ^ d, std&#58;&#58;memory_order_relaxed&#41;;
    ent.data.store&#40;d, std&#58;&#58;memory_order_relaxed&#41;;
&#125;

void
TranspositionTable&#58;&#58;TTEntry&#58;&#58;load&#40;const TTEntryStorage& ent&#41; &#123;
    U64 k = ent.key.load&#40;std&#58;&#58;memory_order_relaxed&#41;;
    U64 d = ent.data.load&#40;std&#58;&#58;memory_order_relaxed&#41;;
    k ^= d;
    key = k;
    data = d;
&#125;
With that source code I get the following assembly code:

Code: Select all

store:
	movq	8(%rdi), %rax
	movq	%rax, %rdx
	xorq	(%rdi), %rdx
	movq	%rdx, (%rsi)
	movq	%rax, 8(%rsi)
	ret

load:
	movq	(%rsi), %rax
	movq	8(%rsi), %rdx
	xorq	%rdx, %rax
	movq	%rdx, 8(%rdi)
	movq	%rax, (%rdi)
	ret
Now the only difference is that the compiler swapped the order of the last two move instructions in the store function.

In conclusion, using memory_order_relaxed for lockless hashing can be made to have zero overhead on the x64 architecture.
That's not the problem he is describing. If you have multiple entries in a cache block, and different threads modify different entries within the same cache block, there are LOTS of invalidates and forwarding (assuming modern MOESI/MESIF-type cache coherency protocols). The classic example of "false sharing" is two counters at adjacent memory addresses, where each thread ONLY modifies its own counter. But since the counters share a cache block, performance goes to hell due to all the invalidate/forwarding traffic...
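Bob's two-counter example can be sketched concretely; assuming 64-byte cache lines, alignas(64) is one way to give each counter its own line (a hypothetical illustration, not code from any engine in this thread):

```cpp
#include <atomic>
#include <cstdint>

// False-sharing scenario: two logically independent per-thread
// counters that land in the same 64-byte cache line. Thread 1
// increments a, thread 2 increments b, yet each write invalidates
// the other core's copy of the line.
struct SharingCounters {
    std::atomic<uint64_t> a{0};
    std::atomic<uint64_t> b{0};
};

// Padded layout: each counter gets its own cache line, so the two
// threads no longer generate coherency traffic against each other.
struct PaddedCounters {
    alignas(64) std::atomic<uint64_t> a{0};
    alignas(64) std::atomic<uint64_t> b{0};
};

static_assert(sizeof(SharingCounters) <= 64,
              "both counters fit in one cache line (the problem)");
static_assert(sizeof(PaddedCounters) >= 128,
              "one cache line per counter (the fix)");
```

The padding costs memory, which is Bob's point below about why compilers don't do this for every variable automatically.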
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: lockless hashing

Post by bob »

flok wrote:
bob wrote:
flok wrote:
bob wrote:There is no measurable lock overhead when only using one thread. The overhead comes from contention, which requires multiple threads.
I've measured it with 12 threads; same overhead.
How is your parallel search scaling with 12 cores? If it does poorly, how do you know what lock overhead looks like? It is pretty difficult to measure it.
Well, to be honest, I'm not sure.
As mentioned in another thread: it uses almost all CPU time with hardly any NPS or time-to-depth gain.
Then one might argue you have 100% overhead... Need more data but until you get a speedup of some sort, you won't be able to measure the effect of locking and how it hurts speedup.
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: lockless hashing

Post by jdart »

syzygy wrote: Would it not be sufficient to use 32-bit scores "internally" (within the evaluation) and 16-bit scores "externally" (in hash and search)?
Yeah. Or since I build a separate tuning executable, the resolution could be set differently for that.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: lockless hashing

Post by syzygy »

bob wrote:That's not the problem he is describing.
The point is that that problem does not exist in what Peter was describing, namely the use of C++11/C11 atomics to implement the xor-trick correctly.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: lockless hashing

Post by bob »

syzygy wrote:
bob wrote:That's not the problem he is describing.
The point is that that problem does not exist in what Peter was describing, namely the use of C++11/C11 atomics to implement the xor-trick correctly.
It doesn't stop "false sharing". Nobody is going to write a compiler where each separate variable is put into a separate cache block by padding to 64-byte alignment. That would bloat code size and kill performance.

One has to take particular care when sharing data across threads so that two threads are not updating the same cache block frequently.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: lockless hashing

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:That's not the problem he is describing.
The point is that that problem does not exist in what Peter was describing, namely the use of C++11/C11 atomics to implement the xor-trick correctly.
It doesn't stop "false sharing".
Correctly implementing the xor-trick using C++11/C11 atomics has nothing to do with false sharing, so I think you are talking about something entirely different.

We're in this subthread: http://talkchess.com/forum/viewtopic.ph ... 73&t=60122
See, it's just about "lockless hashing" which is really the xor-trick. Nothing to do with Folkert's alternative approach of using mutexes which is discussed in a different subthread.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: lockless hashing

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:That's not the problem he is describing.
The point is that that problem does not exist in what Peter was describing, namely the use of C++11/C11 atomics to implement the xor-trick correctly.
It doesn't stop "false sharing".
Correctly implementing the xor-trick using C++11/C11 atomics has nothing to do with false sharing, so I think you are talking about something entirely different.

We're in this subthread: http://talkchess.com/forum/viewtopic.ph ... 73&t=60122
See, it's just about "lockless hashing" which is really the xor-trick. Nothing to do with Folkert's alternative approach of using mutexes which is discussed in a different subthread.
If you back up about 6 posts in this sub-thread, Martin was talking explicitly about false-sharing, which is what I commented on...