In the hash code, I had a few of these:
entry = xxx & hash_mask;
I replaced each one by this:
entry = xxx % (hash_mask + 1);
just for a quick test. There was about a 5% slowdown by doing this, which shows that the cost of the divide was a bit more than I had originally guessed...
crafty new hash scheme question
Moderator: Ras
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
-
- Posts: 778
- Joined: Sat Jul 01, 2006 7:11 am
Re: crafty new hash scheme question
Right again. The 32 bit code with the appropriate casts would work. For 64 bit, you need assembly code or a library function to get the high 64 bits of the result of a 64*64 bit multiply. I believe that the Microsoft CRT has such a function, but I can't find it.bob wrote:This is not legal. On a Cray shifting 64 bits is really a 0 bit shift as the shift count is only 6 bits wide. Intel won't allow 64 bit shifts. You can only shift 63 bits on X86_64. There is no 128 bit shifts using the normal registers.jwes wrote:You're right. My code only works up to 4G hash entries or 64GB hash memory, but will be faster on 32 bit OSes. It should bebob wrote:How is that going to work? N can be greater than 2^32 in fact on today's boxes. This has to work for all machines to be useful...jwes wrote:There is an even simpler way that works for any hash size. If n is the number of entries,hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:
index = 7*(hashKey & ((1<<60)-1)) >> N;
That requires only an AND, multiply and shift, which is usually faster than modulo.
I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
index = ((unsigned int32) hashKey * n) >> 32;
the compiler should be able to reduce this to just a multiply (on x86).
index = ((unsigned int64) hashKey * n) >> 64;
-
- Posts: 28387
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: crafty new hash scheme question - % test results
What makes it extra painful is that the hash probe is already the longest-latency operation in your code, and the calculation of the address is in series with it on the critical path. So there is no way for the optimizer or out-of-order execution to hide the latency. It would not be so bad if you were doing the division to decide which element in the cache line you are fetching you should access, as then the operations could be performed in parallel...bob wrote:There was about a 5% slowdown by doing this, which shows that the cost of the divide was a bit more than I had originally guessed...
-
- Posts: 922
- Joined: Sun Nov 19, 2006 9:16 pm
- Location: Russia
- Full name: Aleks Peshkov
Re: crafty new hash scheme question - % test results
It may payoff to calculate and store TT key together with Zobrist key when Zobrist is calculated.hgm wrote:So there is no way for the optimizer or out-of-order execution to hide the latency. It would not be so bad if you were doing the division to decide which element in the cache line you are fetching you should access, as then the operations could be performed in parallel...
-
- Posts: 2251
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: crafty new hash scheme question
__UMULH, __umulh or _umul128jwes wrote: Right again. The 32 bit code with the appropriate casts would work. For 64 bit, you need assembly code or a library function to get the high 64 bits of the result of a 64*64 bit multiply. I believe that the Microsoft CRT has such a function, but I can't find it.
Code: Select all
unsigned __int64 __umulh(
unsigned __int64 a,
unsigned __int64 b
);
-
- Posts: 28387
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: crafty new hash scheme question - % test results
Not sure what you mean here. Is TT key the same thing as hash-table index? Why would you ever want to store that? You only need it once, not? The critcal path is deciding upon a move to search, update the key accordingly, derive the index, fetch the cache line, compare the signature. In parallel you can peform the move on the board and in the piece list, check for repetitions (after you have the key), pre-emptively call Search() for the next level.Aleks Peshkov wrote:It may payoff to calculate and store TT key together with Zobrist key when Zobrist is calculated.
-
- Posts: 1922
- Joined: Thu Mar 09, 2006 12:51 am
- Location: Earth
Re: crafty new hash scheme question - % test results
It's actually not such a bad idea on first thought. You're usually going to calculate the hash table index anyways, and doing it early can allow OOE to absorb the cost of the div. If you're already prefetching though, this is pretty much the same idea. Keeping the index around could be slightly useful for the hash store at the end of search, though.hgm wrote:Not sure what you mean here. Is TT key the same thing as hash-table index? Why would you ever want to store that? You only need it once, not? The critcal path is deciding upon a move to search, update the key accordingly, derive the index, fetch the cache line, compare the signature. In parallel you can peform the move on the board and in the piece list, check for repetitions (after you have the key), pre-emptively call Search() for the next level.Aleks Peshkov wrote:It may payoff to calculate and store TT key together with Zobrist key when Zobrist is calculated.