Hi,
I'm considering changing my TT implementation.
I currently have 2^21 = 2,097,152 'entries' in the table.
I cannot fit an entry into a single 64-bit long, so I use 2.
So my TT has 2^22 elements where TT[a] and TT[a+1] correspond to one entry.
Still using a 64-bit key, I found I can represent an 'entry' in exactly 80 bits, which is one 64-bit long and one 16-bit short [using Java here].
Since I already have an array of 2^22 longs, creating an array of 2^22 shorts means I can now have 2^22 'entries', doubling what I previously had.
The drawback is that I have to probe/record into 2 big arrays instead of just 1 big one.
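A minimal sketch of the two-array layout described above, assuming a direct-mapped index taken from the low key bits. The split of fields between the long and the short isn't specified in the post, so the one in the comments is hypothetical:

```java
// Sketch of an 80-bit entry split across two parallel arrays.
// Which fields live in the long vs. the short is an assumption.
final class SplitTT {
    static final int BITS = 22;                // 2^22 entries
    static final int MASK = (1 << BITS) - 1;

    final long[] main = new long[1 << BITS];   // 64 bits: e.g. key check + move + score
    final short[] aux = new short[1 << BITS];  // 16 bits: e.g. depth + bound flag

    void record(long zobrist, long packed, short extra) {
        int i = (int) zobrist & MASK;          // index from the low key bits
        main[i] = packed;
        aux[i] = extra;
    }

    long probeMain(long zobrist)  { return main[(int) zobrist & MASK]; }
    short probeAux(long zobrist)  { return aux[(int) zobrist & MASK];  }
}
```

Recording and probing touch both arrays at the same index, which is the two-cache-line concern discussed later in the thread.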
So I wanted to know the best way of comparing the results when the program is run. I guess search depth is what we want improved, but it's hard to measure.
Any thoughts? Thanks
Comparing results
Moderators: hgm, Rebel, chrisw
Re: Comparing results
cms271828 wrote:Still using a 64-bit key, I found I can represent an 'entry' in exactly 80 bits, which is one 64-bit long and one 16-bit short [using Java here].
Since I have an array with 2^22 longs already, creating an array of 2^22 shorts means I can now have 2^22 'entries', doubling what I previously had.
The drawback is that I have to probe/record into 2 big arrays instead of just 1 big one.
I doubt you can completely represent the information about a searched node in 64 bits (unless you are using very small hashes). I use 14 bytes per entry, but I have implemented it as separate arrays of longs, bytes, and shorts. This was definitely a much larger improvement over storing an array of Objects (using Java here too). Java wastes a lot of space storing object references and does not optimize an array of objects (a stand-alone boolean might actually take 4 bytes).
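The parallel-primitive-array idea might look like the sketch below. The post doesn't spell out its exact 14-byte field split, so the choice of fields here is an assumption (it does use longs, shorts, and bytes, and sums to 14 bytes):

```java
// Parallel primitive arrays: each entry costs exactly the declared bytes,
// with no per-object header or array-of-references overhead.
// Field choice is hypothetical; it totals 14 bytes per entry.
final class PrimitiveTT {
    final long[] keys;     // 8 bytes: full Zobrist key
    final short[] scores;  // 2 bytes
    final short[] moves;   // 2 bytes
    final byte[] depths;   // 1 byte
    final byte[] flags;    // 1 byte  -> 14 bytes per entry in total
    PrimitiveTT(int size) {
        keys = new long[size];
        scores = new short[size];
        moves = new short[size];
        depths = new byte[size];
        flags = new byte[size];
    }
}
```

By contrast, an `Entry[]` of objects would add a per-object header (roughly 12-16 bytes on 64-bit HotSpot) plus a 4-8 byte reference per slot, which is the waste described above.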
Last edited by verminox on Mon Jun 01, 2009 8:06 pm, edited 1 time in total.
Re: Comparing results
cms271828 wrote:I cannot fit an entry into a single 64-bit long, so I use 2.
So my TT has 2^22 elements where TT[a] and TT[a+1] correspond to one entry.
Still using a 64-bit key, I found I can represent an 'entry' in exactly 80 bits, which is one 64-bit long and one 16-bit short [using Java here].
Since I have an array with 2^22 longs already, creating an array of 2^22 shorts means I can now have 2^22 'entries', doubling what I previously had.
The drawback is that I have to probe/record into 2 big arrays instead of just 1 big one.
In Java, accessing TT[a] and TT[a+1] will perform almost identically to TTLong[a] and TTShort[a], so I don't see why this improvement in space utilization should have any effect on your search.
-
- Posts: 316
- Joined: Wed Apr 12, 2006 10:47 pm
Re: Comparing results
Thanks. I didn't think that would be the case, though: I assumed it would have to load in the short[] separately, whereas with long[a] and long[a+1], the second is already in cache once you've read long[a].
But I guess I'm using 25% more table space to get 100% more entries, so the bigger TT should give a nice performance boost, I hope.
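The space arithmetic can be checked directly:

```java
// Old layout: 2^22 longs, two per entry. New layout: 2^22 longs + 2^22 shorts,
// one entry per index. Sizes in bytes.
final class SpaceCheck {
    static final long OLD_BYTES   = (1L << 22) * 8;                  // 32 MiB
    static final long OLD_ENTRIES = 1L << 21;                        // 2^21 entries
    static final long NEW_BYTES   = (1L << 22) * 8 + (1L << 22) * 2; // 40 MiB
    static final long NEW_ENTRIES = 1L << 22;                        // 2^22 entries
    // 40 MiB / 32 MiB = 1.25 -> 25% more space for twice the entries.
}
```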
Colin
-
- Posts: 670
- Joined: Mon Dec 03, 2007 3:01 pm
- Location: Barcelona, Spain
Re: Comparing results
First of all, don't expect too much strength improvement from optimizing TT algorithms.
Generally it should be faster to access 16-byte-aligned structs than an unusual 10-byte layout.
I don't agree with your idea of probing 2 arrays. You must be aware that in the majority of cases the entry in question won't be in the small processor caches, so you have to fetch the data directly from RAM. On my computer I measured that one random access to RAM takes more than 100 instruction cycles. Having 2 arrays would double this time. Note that the actual probing of the table is the bottleneck of the operation.
Theoretically it is possible to use just 8 bytes per TT entry, but I don't know of any programs using it:
- 4-byte hash
- 2-byte value
- 1-byte move (e.g. by encoding the move number)
- 6-bit depth
- 2-bit flag
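Packed into a single long, that layout could be encoded as below. The bit positions chosen are one possible arrangement, not taken from any particular engine:

```java
// One possible bit layout for the 8-byte entry above:
// bits 0-31 hash check, 32-47 value, 48-55 move number, 56-61 depth, 62-63 flag.
final class PackedEntry {
    static long pack(int hash, int value, int move, int depth, int flag) {
        return (hash & 0xFFFFFFFFL)
             | ((long) (value & 0xFFFF) << 32)
             | ((long) (move & 0xFF)    << 48)
             | ((long) (depth & 0x3F)   << 56)
             | ((long) (flag & 0x3)     << 62);
    }
    static int hashOf(long e)  { return (int) e; }
    static int valueOf(long e) { return (short) (e >>> 32); } // sign-extend the score
    static int moveOf(long e)  { return (int) ((e >>> 48) & 0xFF); }
    static int depthOf(long e) { return (int) ((e >>> 56) & 0x3F); }
    static int flagOf(long e)  { return (int) ((e >>> 62) & 0x3); }
}
```

Casting through `short` in `valueOf` restores negative scores from the 16-bit field.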
Many other programs even take advantage of the fact that every time RAM is accessed, a whole cache line is fetched (64 bytes on many processors). So it doesn't really matter whether you need only 1 entry (16 bytes) or 4 (64 bytes). This can be exploited by storing 4 entries per index and thus applying a more sophisticated replacement scheme.
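A 4-entry bucket with a depth-preferred replacement rule might be probed like this. The table size, entry encoding (depth in bits 56-61), and replacement rule are illustrative choices, not a description of any specific engine:

```java
// Four entries share one bucket (one 64-byte cache line for 8-byte entries);
// on a store, the shallowest slot in the bucket is replaced.
final class BucketTT {
    static final int BUCKET = 4;
    static final int INDEX_MASK = (1 << 20) - 1;          // 2^20 buckets, illustrative
    static final long[] table = new long[(INDEX_MASK + 1) * BUCKET];

    static int depthOf(long entry) { return (int) ((entry >>> 56) & 0x3F); }

    static void store(long key, long entry) {
        int base = ((int) key & INDEX_MASK) * BUCKET;
        int victim = base;                                // find the shallowest slot
        for (int i = base + 1; i < base + BUCKET; i++)
            if (depthOf(table[i]) < depthOf(table[victim])) victim = i;
        table[victim] = entry;
    }
}
```

All four candidate slots land on the same cache line, so scanning the bucket costs only one RAM access, which is the point being made above.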
Concerning the main question: the changes you propose should return the same results for the same hash-entry count. So if you want to compare things, what really matters is just the nps, or how many seconds are needed to reach a certain depth. Anyway, as I said at the start of my post, I don't expect you will see significant changes, as the performance-critical parts are elsewhere.
Maybe you can get some improvement by changing the replacement scheme. Especially for a small TT, large differences can be measured, but the bigger it gets, the smaller the difference.
Re: Comparing results
Codeman wrote:Generally it should be faster to access 16-byte-aligned structs than an unusual 10-byte layout.
I think the JVM handles things much differently. Without arrays of primitive data, the JVM will just pad extra bytes to align the word in memory, wasting a lot of space.
Codeman wrote:I don't agree with your idea of probing 2 arrays. You must be aware that in the majority of cases the entry in question won't be in the small processor caches, so you have to fetch the data directly from RAM. On my computer I measured that one random access to RAM takes more than 100 instruction cycles. Having 2 arrays would double this time. Note that the actual probing of the table is the bottleneck of the operation.
Again, the JVM internally performs many more instructions than a traditional C program would. For example, for every array access the JVM checks that the index is within bounds or else invokes the exception-handling mechanism. With all this overhead, I doubt accessing two arrays would make much difference over accessing one.
Codeman wrote:Theoretically it is possible to use just 8 bytes per TT entry, but I don't know of any programs using it:
- 4-byte hash
- 2-byte value
- 1-byte move (e.g. by encoding the move number)
- 6-bit depth
- 2-bit flag
Is a 4-byte hash enough? I read somewhere that 32-bit hashes were more prone to full collisions and therefore incorrect search results. How true is this?
-
- Posts: 27811
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Comparing results
With a 4-byte hash, you will have a collision that would have been avoided by the full key once every 2^32 probes. At 1 Mnps, that means about every 4,300 seconds, or roughly once an hour.
Unless a collision can crash your engine, even having many thousands of collisions in a single search does not measurably affect performance. (Perhaps Bob could give us more exact numbers here.)
In Joker I use 9-byte entries, including a 4-byte signature. I never tried to shrink it to 8 bytes. With a 3-byte signature, I would have a collision every 5 seconds (at 1 Mnps), as Joker probes 3 entries per access. This might still be bearable.
But there is very little gain to be expected from this. Currently Joker has 7 entries per 64-byte cache line. That could be increased to 8 entries. That is 14% more entries, which would lead to about a 1% faster search. (Search speed scales as the twelfth root of hash size.)
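As a sanity check on these figures (the 1 Mnps rate, the 3 probes per access, and the one-false-match-per-2^bits-probes collision model are all taken from the post):

```java
// Expected seconds between false signature matches, assuming one per 2^bits probes,
// and the twelfth-root rule relating hash size to search speed.
final class TTFigures {
    static double secondsPerCollision(int signatureBits, double probesPerSecond) {
        return Math.pow(2, signatureBits) / probesPerSecond;
    }
    static double speedup(double hashGrowth) {
        return Math.pow(hashGrowth, 1.0 / 12.0);
    }
}
```

For a 3-byte signature at 1 Mnps with 3 probes per access, `secondsPerCollision(24, 3e6)` is about 5.6 s, matching the estimate above, and `speedup(1.14)` is about 1.011, i.e. roughly 1% for 14% more entries.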
-
- Posts: 670
- Joined: Mon Dec 03, 2007 3:01 pm
- Location: Barcelona, Spain
Re: Comparing results
verminox wrote:I think the JVM handles things much differently. Without arrays of primitive data, the JVM will just pad extra bytes to align the word in memory, wasting a lot of space.
Well, it does this to improve performance. When I am compiling in C++, I usually set the compiler manually to align to 16 bytes.
verminox wrote:Again, the JVM internally performs many more instructions than a traditional C program would. For example, for every array access the JVM checks that the index is within bounds or else invokes the exception-handling mechanism. With all this overhead, I doubt accessing two arrays would make much difference over accessing one.
All this exception handling is cheap compared to the time it takes to randomly access data from RAM.
verminox wrote:Is a 4-byte hash enough? I read somewhere that 32-bit hashes were more prone to full collisions and therefore incorrect search results. How true is this?
Bob published a paper on exactly this topic. Maybe you want to have a look at it: http://www.cis.uab.edu/hyatt/collisions.html
Re: Comparing results
Codeman wrote:Well, it does this to improve performance. When I am compiling in C++, I usually set the compiler manually to align to 16 bytes.
That would be fine in normal circumstances. But for a hash table with millions of entries, isn't that wasting millions of bytes? Wouldn't it be worth the extra effort in return for a larger table?
Codeman wrote:All this exception handling is cheap compared to the time it takes to randomly access data from RAM.
Hmmm, OK, I did not know that. When a local variable is created in a method, does it take just as long to access that memory every time? I have never made such considerations about memory access while trying to optimize the search.
I use a bunch of rather unnecessary counters/stacks for more verbose thinking, never imagining those could be a bottleneck. I mean, compared to all the move generation, evaluation, recursion, etc., these counters should hardly matter..... right?
Codeman wrote:Bob published a paper on exactly this topic. Maybe you want to have a look at it: http://www.cis.uab.edu/hyatt/collisions.html
Thanks, I'll have a look at that.
-
- Posts: 2129
- Joined: Thu May 29, 2008 10:43 am
Re: Comparing results
It really depends on the platform, doesn't it?
For example:
my understanding is that certain processor instruction sets require the data to be 16-byte aligned,
and if so, there can be substantial performance advantages...?
I.e. I don't know why Java would be any different, as the resulting binary would utilize these same instruction sets, and there are many C++ optimization guides that recommend aligning all data.
Last edited by kranium on Tue Jun 02, 2009 12:16 pm, edited 2 times in total.