Hi Bob,
I was wandering what made you switch to the new hash scheme in crafty 23.1.
Was it a 'makes sense and intuitive" decision or do you have some testing data that made you choose the new aproach?
I would apreciate very much your comments.
Best regards,
Alvaro
crafty new hash scheme question
Moderator: Ras
-
- Posts: 363
- Joined: Thu Mar 16, 2006 7:39 pm
- Location: Portugal
- Full name: Alvaro Cardoso
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: crafty new hash scheme question
I was trying to address a cache issue. Previously I had three 16-byte entries in a "bucket" where the first entry was the "depth-preferred" entry and the other two were the "always-replace" entries (only one could be replaced, using an extra bit from the hash signature to select it). That gives 48 bytes, with cache lines typically being either 32 or 64 (yes, some 128's on PIVs and there are even longer lines on other processors besides intel).Cardoso wrote:Hi Bob,
I was wandering what made you switch to the new hash scheme in crafty 23.1.
Was it a 'makes sense and intuitive" decision or do you have some testing data that made you choose the new aproach?
I would apreciate very much your comments.
Best regards,
Alvaro
If you think about what happens on a TT probe, the 3 entries you want will lie completely in a cache line (assume 64 bytes here) half the time, and the other half of the time they will be split across two cache lines, since the set of three is only 48 bytes.
I decided to go to 64 byte buckets and solve this completely, since I now address a 64 byte "bucket" (4 entries) that will always lie exactly within one cache line (I force the address of the first hash entry to lie on a 64 bit boundary with the obvious trick). The net effect is that each hash probe either is a cache hit, or results in a single cache line fill, where before I got about the same hit rate, but 1/2 of the cache misses had to bring in two lines which was slower.
The current replacement scheme is still an "always" store, based on depth-preferred replacement. It is a little faster, performance-wise. I didn't notice anything significantly different with respect to normal hash issues. Just fewer cache line fills which translates to a NPS improvement.
There is a small down-side, in that now the hash table is a perfect power of 2 size, which means you can never use more than 1/2 of total memory. I am going to play with using the % operator to eliminate this restriction as I suspect it won't hurt a thing. With the old version, hash could use 3/4 of memory, leaving room for code and other stuff. Now it can only use 1/2 of memory which can leave a lot of unused memory on big machines.
-
- Posts: 28387
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: crafty new hash scheme question
A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:
index = 7*(hashKey & ((1<<60)-1)) >> N;
That requires only an AND, multiply and shift, which is usually faster than modulo.
I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
index = 7*(hashKey & ((1<<60)-1)) >> N;
That requires only an AND, multiply and shift, which is usually faster than modulo.
I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
-
- Posts: 778
- Joined: Sat Jul 01, 2006 7:11 am
Re: crafty new hash scheme question
There is an even simpler way that works for any hash size. If n is the number of entries,hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:
index = 7*(hashKey & ((1<<60)-1)) >> N;
That requires only an AND, multiply and shift, which is usually faster than modulo.
I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
index = ((unsigned int32) hashKey * n) >> 32;
the compiler should be able to reduce this to just a multiply (on x86).
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: crafty new hash scheme question
There are other tricks as well, using LEA. I am not sure that the divide overhead is significant. I have added a divide in places in the past and could not see the penalty. I will try to run this test in a few minutes to see what happens if I replace the AND with a %, without changing anything else, to see what the cost is.hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:
index = 7*(hashKey & ((1<<60)-1)) >> N;
That requires only an AND, multiply and shift, which is usually faster than modulo.
I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: crafty new hash scheme question
How is that going to work? N can be greater than 2^32 in fact on today's boxes. This has to work for all machines to be useful...jwes wrote:There is an even simpler way that works for any hash size. If n is the number of entries,hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:
index = 7*(hashKey & ((1<<60)-1)) >> N;
That requires only an AND, multiply and shift, which is usually faster than modulo.
I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
index = ((unsigned int32) hashKey * n) >> 32;
the compiler should be able to reduce this to just a multiply (on x86).
-
- Posts: 1922
- Joined: Thu Mar 09, 2006 12:51 am
- Location: Earth
Re: crafty new hash scheme question
At the very least, to work on 32-bit machines, the cast should be to a 64-bit value. It can work on 64-bit machines too, but it needs assembly (taking the upper 64-bit result from the multiply, not possible in C).bob wrote:How is that going to work? N can be greater than 2^32 in fact on today's boxes. This has to work for all machines to be useful...jwes wrote:There is an even simpler way that works for any hash size. If n is the number of entries,hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:
index = 7*(hashKey & ((1<<60)-1)) >> N;
That requires only an AND, multiply and shift, which is usually faster than modulo.
I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
index = ((unsigned int32) hashKey * n) >> 32;
the compiler should be able to reduce this to just a multiply (on x86).
-
- Posts: 28387
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: crafty new hash scheme question
Yes, LEA can do multiplies by 3, 5, 7 or 9 cheaply too. But the trick is to get rid of the divide. Note that divides by a _constant_ are replaced by the compiler by a multiply with the inverse, and this is probably the case you have been dealing it. This does not work for modulo, though.bob wrote:There are other tricks as well, using LEA. I am not sure that the divide overhead is significant. I have added a divide in places in the past and could not see the penalty. I will try to run this test in a few minutes to see what happens if I replace the AND with a %, without changing anything else, to see what the cost is.
-
- Posts: 778
- Joined: Sat Jul 01, 2006 7:11 am
Re: crafty new hash scheme question
You're right. My code only works up to 4G hash entries or 64GB hash memory, but will be faster on 32 bit OSes. It should bebob wrote:How is that going to work? N can be greater than 2^32 in fact on today's boxes. This has to work for all machines to be useful...jwes wrote:There is an even simpler way that works for any hash size. If n is the number of entries,hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:
index = 7*(hashKey & ((1<<60)-1)) >> N;
That requires only an AND, multiply and shift, which is usually faster than modulo.
I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
index = ((unsigned int32) hashKey * n) >> 32;
the compiler should be able to reduce this to just a multiply (on x86).
index = ((unsigned int64) hashKey * n) >> 64;
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: crafty new hash scheme question
This is not legal. On a Cray shifting 64 bits is really a 0 bit shift as the shift count is only 6 bits wide. Intel won't allow 64 bit shifts. You can only shift 63 bits on X86_64. There is no 128 bit shifts using the normal registers.jwes wrote:You're right. My code only works up to 4G hash entries or 64GB hash memory, but will be faster on 32 bit OSes. It should bebob wrote:How is that going to work? N can be greater than 2^32 in fact on today's boxes. This has to work for all machines to be useful...jwes wrote:There is an even simpler way that works for any hash size. If n is the number of entries,hgm wrote:A way to avoid division or modulo instructons, and have a hash size that is, say, 7 times a power of two would be to derive the index as:
index = 7*(hashKey & ((1<<60)-1)) >> N;
That requires only an AND, multiply and shift, which is usually faster than modulo.
I use this method to derive number < 5 from the hash key to index the number of the entry within the bucket, as I have 7 entries per bucket, where one is an overflow entry with always-replace policy, while the primary (depth-preferred) hit goes to one of the five overlapping pairs fitted within the other 6. So the number derived from the hash key decides in which pair the node can be stored, and if both these entries already contan a higher draft, the new node goes to the overflow entry.
index = ((unsigned int32) hashKey * n) >> 32;
the compiler should be able to reduce this to just a multiply (on x86).
index = ((unsigned int64) hashKey * n) >> 64;