I implemented Zobrist hashing by using rotated random numbers rather than arrays of random numbers, i.e.
hash = hash ^ _rotl64(zob[piecetype], from) ^ _rotl64(zob[piecetype], to)
instead of
hash = hash ^ zob[piecetype][from] ^ zob[piecetype][to]
This means my zobrist table has 16 qwords rather than over 1000 qwords.
I let it run for several days and got these statistics with 256M hash entries and a 64 bit key:
I feel that 1 false hit every few days is acceptable, and saving 8K bytes of frequently referenced data should reduce cache misses.
BTW, I would appreciate it if someone who remembers their statistics better than I do could calculate confidence intervals for false hits per hash probe.
That is about 1 false hit in 2^36 probes on a set of 2^28 entries, which is exactly what would be expected for a 64-bit key, since 36 + 28 = 64.
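Written out, the arithmetic is (taking a false hit to be a chance match of a probed 64-bit key against one of the stored entries):

```latex
P(\text{false hit per probe}) \approx \frac{N}{2^{64}}
  = \frac{2^{28}}{2^{64}} = 2^{-36}
```

i.e. with roughly $N = 2^{28}$ (256M) occupied entries, about one probe in $2^{36}$ collides by chance.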
How do you detect false hits? And how did you generate the random piece keys?
In HaChu I use pieceKey[pieceType]*squareKey[squareNumber], rather than rotations, because I have to deal with more than 64 squares (sometimes even an order of magnitude more). But the keys can be 32-bit in this case.
jwes wrote:I implemented Zobrist hashing by using rotated random numbers rather than arrays of random numbers, i.e.
hash = hash ^ _rotl64(zob[piecetype], from) ^ _rotl64(zob[piecetype], to)
instead of
hash = hash ^ zob[piecetype][from] ^ zob[piecetype][to]
This means my zobrist table has 16 qwords rather than over 1000 qwords.
I let it run for several days and got these statistics with 256M hash entries and a 64 bit key:
I feel that 1 false hit every few days is acceptable, and saving 8K bytes of frequently referenced data should reduce cache misses.
BTW, I would appreciate it if someone who remembers their statistics better than I do could calculate confidence intervals for false hits per hash probe.
8KBytes is tiny relative to the megabytes of L3 cache we now have on modern CPUs, and even an L3 hit (L1 and L2 miss) should still be much faster than doing a rotation.
Have you benchmarked this? My feeling is using a larger table would still be faster.
In a chess engine, when the hash table is much bigger than L3, my feeling is that most of the cache is wasted anyway. The engine will almost never access another entry in the same cache line before it is flushed out.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
matthewlai wrote:8KBytes is tiny relative to the megabytes of L3 cache we now have on modern CPUs, and even an L3 hit (L1 and L2 miss) should still be much faster than doing a rotation.
Rotation is a single machine instruction, right? With a latency similar to an L1 access.
matthewlai wrote:8KBytes is tiny relative to the megabytes of L3 cache we now have on modern CPUs, and even an L3 hit (L1 and L2 miss) should still be much faster than doing a rotation.
Rotation is a single machine instruction, right? With a latency similar to an L1 access.
Is it? I didn't know that.
If that's the case then this is definitely a good idea.
I assumed it would be at least a few instructions.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
jwes wrote:I implemented Zobrist hashing by using rotated random numbers rather than arrays of random numbers, i.e.
hash = hash ^ _rotl64(zob[piecetype], from) ^ _rotl64(zob[piecetype], to)
instead of
hash = hash ^ zob[piecetype][from] ^ zob[piecetype][to]
This means my zobrist table has 16 qwords rather than over 1000 qwords.
I let it run for several days and got these statistics with 256M hash entries and a 64 bit key:
I feel that 1 false hit every few days is acceptable, and saving 8K bytes of frequently referenced data should reduce cache misses.
BTW, I would appreciate it if someone who remembers their statistics better than I do could calculate confidence intervals for false hits per hash probe.
8KBytes is tiny relative to the megabytes of L3 cache we now have on modern CPUs, and even an L3 hit (L1 and L2 miss) should still be much faster than doing a rotation.
Have you benchmarked this? My feeling is using a larger table would still be faster.
In a chess engine, when the hash table is much bigger than L3, my feeling is that most of the cache is wasted anyway. The engine will almost never access another entry in the same cache line before it is flushed out.
The Intel Optimization Manual lists ROL with a latency of 1 cycle and a throughput of 1.5 cycles. It would be hard to benchmark now, since the result depends on cache pressure and my program is not stable.