crafty new hash scheme question

bob · Post by **bob** » Mon Dec 28, 2009 5:50 pm

In the hash code, I had a few of these:

entry = xxx & hash_mask;

I replaced each one by this:

entry = xxx % (hash_mask + 1);

just for a quick test. There was about a 5% slowdown by doing this, which shows that the cost of the divide was a bit more than I had originally guessed...

jwes · Post by **jwes** » Mon Dec 28, 2009 11:51 pm

bob wrote:
jwes wrote:
bob wrote:
jwes wrote:
hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:

index = 7*(hashKey & ((1<<60)-1)) >> N;

That requires only an AND, multiply and shift, which is usually faster than modulo.

I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
There is an even simpler way that works for any hash size. If n is the number of entries,

index = ((unsigned int32) hashKey * n) >> 32;

the compiler should be able to reduce this to just a multiply (on x86).
How is that going to work? N can be greater than 2^32 in fact on today's boxes. This has to work for all machines to be useful...
You're right. My code only works up to 4G hash entries or 64GB hash memory, but will be faster on 32 bit OSes. It should be

index = ((unsigned int64) hashKey * n) >> 64;
This is not legal. On a Cray shifting 64 bits is really a 0 bit shift as the shift count is only 6 bits wide. Intel won't allow 64 bit shifts. You can only shift 63 bits on X86_64. There is no 128 bit shifts using the normal registers.

Right again. The 32 bit code with the appropriate casts would work. For 64 bit, you need assembly code or a library function to get the high 64 bits of the result of a 64*64 bit multiply. I believe that the Microsoft CRT has such a function, but I can't find it.

hgm · Post by **hgm** » Mon Dec 28, 2009 11:58 pm

bob wrote:There was about a 5% slowdown by doing this, which shows that the cost of the divide was a bit more than I had originally guessed...

What makes it extra painful is that the hash probe is already the longest-latency operation in your code, and the calculation of the address is in series with it on the critical path. So there is no way for the optimizer or out-of-order execution to hide the latency. It would not be so bad if you were doing the division to decide which element in the cache line you are fetching you should access, as then the operations could be performed in parallel...

Aleks Peshkov · Post by **Aleks Peshkov** » Tue Dec 29, 2009 6:57 am

hgm wrote:So there is no way for the optimizer or out-of-order execution to hide the latency. It would not be so bad if you were doing the division to decide which element in the cache line you are fetching you should access, as then the operations could be performed in parallel...

It may payoff to calculate and store TT key together with Zobrist key when Zobrist is calculated.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Dec 29, 2009 9:26 am

jwes wrote: Right again. The 32 bit code with the appropriate casts would work. For 64 bit, you need assembly code or a library function to get the high 64 bits of the result of a 64*64 bit multiply. I believe that the Microsoft CRT has such a function, but I can't find it.

__UMULH, __umulh or _umul128

Code: Select all

unsigned __int64 __umulh( 
   unsigned __int64 a, 
   unsigned __int64 b 
);

hgm · Post by **hgm** » Tue Dec 29, 2009 10:13 am

Aleks Peshkov wrote:It may payoff to calculate and store TT key together with Zobrist key when Zobrist is calculated.

Not sure what you mean here. Is TT key the same thing as hash-table index? Why would you ever want to store that? You only need it once, not? The critcal path is deciding upon a move to search, update the key accordingly, derive the index, fetch the cache line, compare the signature. In parallel you can peform the move on the board and in the piece list, check for repetitions (after you have the key), pre-emptively call Search() for the next level.

Zach Wegner · Post by **Zach Wegner** » Tue Dec 29, 2009 7:01 pm

hgm wrote:
Aleks Peshkov wrote:It may payoff to calculate and store TT key together with Zobrist key when Zobrist is calculated.
Not sure what you mean here. Is TT key the same thing as hash-table index? Why would you ever want to store that? You only need it once, not? The critcal path is deciding upon a move to search, update the key accordingly, derive the index, fetch the cache line, compare the signature. In parallel you can peform the move on the board and in the piece list, check for repetitions (after you have the key), pre-emptively call Search() for the next level.

It's actually not such a bad idea on first thought. You're usually going to calculate the hash table index anyways, and doing it early can allow OOE to absorb the cost of the div. If you're already prefetching though, this is pretty much the same idea. Keeping the index around could be slightly useful for the hash store at the end of search, though.

crafty new hash scheme question

Re: crafty new hash scheme question - % test results

Re: crafty new hash scheme question

Re: crafty new hash scheme question - % test results

Re: crafty new hash scheme question - % test results

Re: crafty new hash scheme question

Re: crafty new hash scheme question - % test results

Re: crafty new hash scheme question - % test results