crafty new hash scheme question

Cardoso · Post by **Cardoso** » Sat Dec 26, 2009 3:34 pm

Hi Bob,
I was wandering what made you switch to the new hash scheme in crafty 23.1.

Was it a 'makes sense and intuitive" decision or do you have some testing data that made you choose the new aproach?

I would apreciate very much your comments.
Best regards,
Alvaro

bob · Post by **bob** » Sat Dec 26, 2009 9:25 pm

Cardoso wrote:Hi Bob,
I was wandering what made you switch to the new hash scheme in crafty 23.1.

Was it a 'makes sense and intuitive" decision or do you have some testing data that made you choose the new aproach?

I would apreciate very much your comments.
Best regards,
Alvaro

I was trying to address a cache issue. Previously I had three 16-byte entries in a "bucket" where the first entry was the "depth-preferred" entry and the other two were the "always-replace" entries (only one could be replaced, using an extra bit from the hash signature to select it). That gives 48 bytes, with cache lines typically being either 32 or 64 (yes, some 128's on PIVs and there are even longer lines on other processors besides intel).

If you think about what happens on a TT probe, the 3 entries you want will lie completely in a cache line (assume 64 bytes here) half the time, and the other half of the time they will be split across two cache lines, since the set of three is only 48 bytes.

I decided to go to 64 byte buckets and solve this completely, since I now address a 64 byte "bucket" (4 entries) that will always lie exactly within one cache line (I force the address of the first hash entry to lie on a 64 bit boundary with the obvious trick). The net effect is that each hash probe either is a cache hit, or results in a single cache line fill, where before I got about the same hit rate, but 1/2 of the cache misses had to bring in two lines which was slower.

The current replacement scheme is still an "always" store, based on depth-preferred replacement. It is a little faster, performance-wise. I didn't notice anything significantly different with respect to normal hash issues. Just fewer cache line fills which translates to a NPS improvement.

There is a small down-side, in that now the hash table is a perfect power of 2 size, which means you can never use more than 1/2 of total memory. I am going to play with using the % operator to eliminate this restriction as I suspect it won't hurt a thing. With the old version, hash could use 3/4 of memory, leaving room for code and other stuff. Now it can only use 1/2 of memory which can leave a lot of unused memory on big machines.

hgm · Post by **hgm** » Sat Dec 26, 2009 11:57 pm

A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:

index = 7*(hashKey & ((1<<60)-1)) >> N;

That requires only an AND, multiply and shift, which is usually faster than modulo.

I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.

jwes · Post by **jwes** » Sun Dec 27, 2009 6:42 am

hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:

index = 7*(hashKey & ((1<<60)-1)) >> N;

That requires only an AND, multiply and shift, which is usually faster than modulo.

I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.

There is an even simpler way that works for any hash size. If n is the number of entries,

index = ((unsigned int32) hashKey * n) >> 32;

the compiler should be able to reduce this to just a multiply (on x86).

bob · Post by **bob** » Sun Dec 27, 2009 7:03 pm

hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:

index = 7*(hashKey & ((1<<60)-1)) >> N;

That requires only an AND, multiply and shift, which is usually faster than modulo.

I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.

There are other tricks as well, using LEA. I am not sure that the divide overhead is significant. I have added a divide in places in the past and could not see the penalty. I will try to run this test in a few minutes to see what happens if I replace the AND with a %, without changing anything else, to see what the cost is.

bob · Post by **bob** » Sun Dec 27, 2009 7:08 pm

jwes wrote:
hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:

index = 7*(hashKey & ((1<<60)-1)) >> N;

That requires only an AND, multiply and shift, which is usually faster than modulo.

I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
There is an even simpler way that works for any hash size. If n is the number of entries,

index = ((unsigned int32) hashKey * n) >> 32;

the compiler should be able to reduce this to just a multiply (on x86).

How is that going to work? N can be greater than 2^32 in fact on today's boxes. This has to work for all machines to be useful...

Zach Wegner · Post by **Zach Wegner** » Sun Dec 27, 2009 7:17 pm

bob wrote:
jwes wrote:
hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:

index = 7*(hashKey & ((1<<60)-1)) >> N;

That requires only an AND, multiply and shift, which is usually faster than modulo.

I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
There is an even simpler way that works for any hash size. If n is the number of entries,

index = ((unsigned int32) hashKey * n) >> 32;

the compiler should be able to reduce this to just a multiply (on x86).
How is that going to work? N can be greater than 2^32 in fact on today's boxes. This has to work for all machines to be useful...

At the very least, to work on 32-bit machines, the cast should be to a 64-bit value. It can work on 64-bit machines too, but it needs assembly (taking the upper 64-bit result from the multiply, not possible in C).

hgm · Post by **hgm** » Sun Dec 27, 2009 8:33 pm

bob wrote:There are other tricks as well, using LEA. I am not sure that the divide overhead is significant. I have added a divide in places in the past and could not see the penalty. I will try to run this test in a few minutes to see what happens if I replace the AND with a %, without changing anything else, to see what the cost is.

Yes, LEA can do multiplies by 3, 5, 7 or 9 cheaply too. But the trick is to get rid of the divide. Note that divides by a _constant_ are replaced by the compiler by a multiply with the inverse, and this is probably the case you have been dealing it. This does not work for modulo, though.

jwes · Post by **jwes** » Sun Dec 27, 2009 10:36 pm

bob wrote:
jwes wrote:
hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:

index = 7*(hashKey & ((1<<60)-1)) >> N;

That requires only an AND, multiply and shift, which is usually faster than modulo.

I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
There is an even simpler way that works for any hash size. If n is the number of entries,

index = ((unsigned int32) hashKey * n) >> 32;

the compiler should be able to reduce this to just a multiply (on x86).
How is that going to work? N can be greater than 2^32 in fact on today's boxes. This has to work for all machines to be useful...

You're right. My code only works up to 4G hash entries or 64GB hash memory, but will be faster on 32 bit OSes. It should be

index = ((unsigned int64) hashKey * n) >> 64;

bob · Post by **bob** » Mon Dec 28, 2009 5:24 pm

jwes wrote:
bob wrote:
jwes wrote:
hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:

index = 7*(hashKey & ((1<<60)-1)) >> N;

That requires only an AND, multiply and shift, which is usually faster than modulo.

I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
There is an even simpler way that works for any hash size. If n is the number of entries,

index = ((unsigned int32) hashKey * n) >> 32;

the compiler should be able to reduce this to just a multiply (on x86).
How is that going to work? N can be greater than 2^32 in fact on today's boxes. This has to work for all machines to be useful...
You're right. My code only works up to 4G hash entries or 64GB hash memory, but will be faster on 32 bit OSes. It should be

index = ((unsigned int64) hashKey * n) >> 64;

This is not legal. On a Cray shifting 64 bits is really a 0 bit shift as the shift count is only 6 bits wide. Intel won't allow 64 bit shifts. You can only shift 63 bits on X86_64. There is no 128 bit shifts using the normal registers.

crafty new hash scheme question

crafty new hash scheme question

Re: crafty new hash scheme question

Re: crafty new hash scheme question

Re: crafty new hash scheme question

Re: crafty new hash scheme question

Re: crafty new hash scheme question

Re: crafty new hash scheme question

Re: crafty new hash scheme question

Re: crafty new hash scheme question

Re: crafty new hash scheme question