Hi,
I'm considering changing my TT implementation.
I currently have 2^21 = 2,097,152 'entries' in the table.
I cannot fit an entry into a single 64-bit long, so I use 2.
So my TT has 2^22 elements where TT[a] and TT[a+1] correspond to one entry.
Still using a 64-bit key, I found I can represent an 'entry' in exactly 80 bits, which is one 64-bit long and one 16-bit short [using Java here].
Since I already have an array of 2^22 longs, creating an array of 2^22 shorts means I can now have 2^22 'entries', doubling what I previously had.
The drawback is that I have to probe/record into 2 big arrays instead of just 1 big one.
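A minimal sketch of the two-array layout described above, assuming a direct-mapped index taken from the low key bits. The split of fields between the long and the short isn't specified in the post, so the one in the comments is hypothetical:

```java
// Sketch of an 80-bit entry split across two parallel arrays.
// Which fields live in the long vs. the short is an assumption.
final class SplitTT {
    static final int BITS = 22;                // 2^22 entries
    static final int MASK = (1 << BITS) - 1;

    final long[] main = new long[1 << BITS];   // 64 bits: e.g. key check + move + score
    final short[] aux = new short[1 << BITS];  // 16 bits: e.g. depth + bound flag

    void record(long zobrist, long packed, short extra) {
        int i = (int) zobrist & MASK;          // index from the low key bits
        main[i] = packed;
        aux[i] = extra;
    }

    long probeMain(long zobrist)  { return main[(int) zobrist & MASK]; }
    short probeAux(long zobrist)  { return aux[(int) zobrist & MASK];  }
}
```

Recording and probing touch both arrays at the same index, which is the two-cache-line concern discussed later in the thread.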
So I wanted to know the best way of comparing the results when the program is run. I guess search depth is what we want improved, but it's hard to measure.
Any thoughts? Thanks
Comparing results
Moderators: hgm, Rebel, chrisw
Re: Comparing results
cms271828 wrote:Still using a 64-bit key, I found I can represent an 'entry' in exactly 80 bits, which is one 64-bit long and one 16-bit short [using Java here].
Since I have an array with 2^22 longs already, creating an array of 2^22 shorts means I can now have 2^22 'entries', doubling what I previously had.
The drawback is that I have to probe/record into 2 big arrays instead of just 1 big one.
I doubt you can completely represent the information about a searched node in 64 bits (unless you are using very small hashes). I use 14 bytes per entry, but I have implemented it as separate arrays of longs, bytes, and shorts. This was definitely a much larger improvement over storing an array of Objects (using Java here too). Java wastes a lot of space storing object references and does not optimize an array of objects (a stand-alone boolean might actually take 4 bytes).
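The parallel-primitive-array idea might look like the sketch below. The post doesn't spell out its exact 14-byte field split, so the choice of fields here is an assumption (it does use longs, shorts, and bytes, and sums to 14 bytes):

```java
// Parallel primitive arrays: each entry costs exactly the declared bytes,
// with no per-object header or array-of-references overhead.
// Field choice is hypothetical; it totals 14 bytes per entry.
final class PrimitiveTT {
    final long[] keys;     // 8 bytes: full Zobrist key
    final short[] scores;  // 2 bytes
    final short[] moves;   // 2 bytes
    final byte[] depths;   // 1 byte
    final byte[] flags;    // 1 byte  -> 14 bytes per entry in total
    PrimitiveTT(int size) {
        keys = new long[size];
        scores = new short[size];
        moves = new short[size];
        depths = new byte[size];
        flags = new byte[size];
    }
}
```

By contrast, an `Entry[]` of objects would add a per-object header (roughly 12-16 bytes on 64-bit HotSpot) plus a 4-8 byte reference per slot, which is the waste described above.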
Last edited by verminox on Mon Jun 01, 2009 8:06 pm, edited 1 time in total.
Re: Comparing results
cms271828 wrote:I cannot fit an entry into a single 64-bit long, so I use 2.
So my TT has 2^22 elements where TT[a] and TT[a+1] correspond to one entry.
Still using a 64-bit key, I found I can represent an 'entry' in exactly 80 bits, which is one 64-bit long and one 16-bit short [using Java here].
Since I have an array with 2^22 longs already, creating an array of 2^22 shorts means I can now have 2^22 'entries', doubling what I previously had.
The drawback is that I have to probe/record into 2 big arrays instead of just 1 big one.
In Java, accessing TT[a] and TT[a+1] will perform almost identically to TTLong[a] and TTShort[a], so I don't see why this improvement in space utilization should have any effect on your search.
-
- Posts: 316
- Joined: Wed Apr 12, 2006 10:47 pm
Re: Comparing results
Thanks. I didn't think that would be the case, though: I assumed it would have to load in the short[] separately, whereas with long[a] and long[a+1], the second is already in cache once you've read long[a].
But I guess I'm using 25% more table space to get 100% more entries, so the bigger TT should give a nice performance boost, I hope.
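The space arithmetic can be checked directly:

```java
// Old layout: 2^22 longs, two per entry. New layout: 2^22 longs + 2^22 shorts,
// one entry per index. Sizes in bytes.
final class SpaceCheck {
    static final long OLD_BYTES   = (1L << 22) * 8;                  // 32 MiB
    static final long OLD_ENTRIES = 1L << 21;                        // 2^21 entries
    static final long NEW_BYTES   = (1L << 22) * 8 + (1L << 22) * 2; // 40 MiB
    static final long NEW_ENTRIES = 1L << 22;                        // 2^22 entries
    // 40 MiB / 32 MiB = 1.25 -> 25% more space for twice the entries.
}
```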
Colin
-
- Posts: 670
- Joined: Mon Dec 03, 2007 3:01 pm
- Location: Barcelona, Spain
Re: Comparing results
First of all, don't expect too much strength improvement from optimizing TT algorithms.
Generally it should be faster to access 16-byte-aligned structs than an unusual 10-byte layout.
I don't agree with your idea of probing 2 arrays. You must be aware that in the majority of cases the entry in question won't be in the small processor caches, so you have to fetch the data directly from RAM. On my computer I measured that one random access to RAM takes more than 100 instruction cycles. Having 2 arrays would double this time. Note that the actual probing of the table is the bottleneck of the operation.
Theoretically it is possible to use just 8 bytes per TT entry, but I don't know of any programs using it:
- 4-byte hash
- 2-byte value
- 1-byte move (e.g. by encoding the move number)
- 6-bit depth
- 2-bit flag
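Packed into a single long, that layout could be encoded as below. The bit positions chosen are one possible arrangement, not taken from any particular engine:

```java
// One possible bit layout for the 8-byte entry above:
// bits 0-31 hash check, 32-47 value, 48-55 move number, 56-61 depth, 62-63 flag.
final class PackedEntry {
    static long pack(int hash, int value, int move, int depth, int flag) {
        return (hash & 0xFFFFFFFFL)
             | ((long) (value & 0xFFFF) << 32)
             | ((long) (move & 0xFF)    << 48)
             | ((long) (depth & 0x3F)   << 56)
             | ((long) (flag & 0x3)     << 62);
    }
    static int hashOf(long e)  { return (int) e; }
    static int valueOf(long e) { return (short) (e >>> 32); } // sign-extend the score
    static int moveOf(long e)  { return (int) ((e >>> 48) & 0xFF); }
    static int depthOf(long e) { return (int) ((e >>> 56) & 0x3F); }
    static int flagOf(long e)  { return (int) ((e >>> 62) & 0x3); }
}
```

Casting through `short` in `valueOf` restores negative scores from the 16-bit field.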
Many other programs even take advantage of the fact that every time RAM is accessed, a whole cache line is fetched (64 bytes on many processors). So it doesn't really matter whether you need only 1 entry (16 bytes) or 4 (64 bytes). This can be exploited by storing 4 entries per index and thus applying a more sophisticated replacement scheme.
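A 4-entry bucket with a depth-preferred replacement rule might be probed like this. The table size, entry encoding (depth in bits 56-61), and replacement rule are illustrative choices, not a description of any specific engine:

```java
// Four entries share one bucket (one 64-byte cache line for 8-byte entries);
// on a store, the shallowest slot in the bucket is replaced.
final class BucketTT {
    static final int BUCKET = 4;
    static final int INDEX_MASK = (1 << 20) - 1;          // 2^20 buckets, illustrative
    static final long[] table = new long[(INDEX_MASK + 1) * BUCKET];

    static int depthOf(long entry) { return (int) ((entry >>> 56) & 0x3F); }

    static void store(long key, long entry) {
        int base = ((int) key & INDEX_MASK) * BUCKET;
        int victim = base;                                // find the shallowest slot
        for (int i = base + 1; i < base + BUCKET; i++)
            if (depthOf(table[i]) < depthOf(table[victim])) victim = i;
        table[victim] = entry;
    }
}
```

All four candidate slots land on the same cache line, so scanning the bucket costs only one RAM access, which is the point being made above.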
Concerning the main question: the changes you propose should return the same results for the same hash-entry count. So if you want to compare things, what really matters is just the nps, or how many seconds are needed to reach a certain depth. Anyway, as I said at the start of my post, I don't expect you will see significant changes, as the performance-critical parts are elsewhere.
Maybe you can get some improvement by changing the replacement scheme. Especially for a small TT, large differences can be measured, but the bigger it gets, the smaller the difference.
Re: Comparing results
Codeman wrote:Generally it should be faster to access 16-byte-aligned structs than an unusual 10-byte layout.
I think the JVM handles things much differently. Without arrays of primitive data, the JVM will just pad extra bytes to align the word in memory, wasting a lot of space.
Codeman wrote:I don't agree with your idea of probing 2 arrays. You must be aware that in the majority of cases the entry in question won't be in the small processor caches, so you have to fetch the data directly from RAM. On my computer I measured that one random access to RAM takes more than 100 instruction cycles. Having 2 arrays would double this time. Note that the actual probing of the table is the bottleneck of the operation.
Again, the JVM internally performs many more instructions than a traditional C program would. For example, for every array access the JVM checks that the index is within bounds or else invokes the exception-handling mechanism. With all this overhead, I doubt accessing two arrays would make much difference over accessing one.
Codeman wrote:Theoretically it is possible to use just 8 bytes per TT entry, but I don't know of any programs using it:
- 4-byte hash
- 2-byte value
- 1-byte move (e.g. by encoding the move number)
- 6-bit depth
- 2-bit flag
Is a 4-byte hash enough? I read somewhere that 32-bit hashes were more prone to full collisions and therefore incorrect search results. How true is this?
-
- Posts: 27811
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Comparing results
With a 4-byte hash, you will have a collision that would have been avoided by the full key once every 2^32 probes. At 1 Mnps, that means about every 4,300 seconds, or roughly once an hour.
Unless a collision can crash your engine, even having many thousands of collisions in a single search does not measurably affect performance. (Perhaps Bob could give us more exact numbers here.)
In Joker I use 9-byte entries, including a 4-byte signature. I never tried to shrink it to 8 bytes. With a 3-byte signature, I would have a collision every 5 seconds (at 1 Mnps), as Joker probes 3 entries per access. This might still be bearable.
But there is very little gain to be expected from this. Currently Joker has 7 entries per 64-byte cache line. That could be increased to 8 entries. That is 14% more entries, which would lead to about a 1% faster search. (Search speed scales as the twelfth root of hash size.)
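As a sanity check on these figures (the 1 Mnps rate, the 3 probes per access, and the one-false-match-per-2^bits-probes collision model are all taken from the post):

```java
// Expected seconds between false signature matches, assuming one per 2^bits probes,
// and the twelfth-root rule relating hash size to search speed.
final class TTFigures {
    static double secondsPerCollision(int signatureBits, double probesPerSecond) {
        return Math.pow(2, signatureBits) / probesPerSecond;
    }
    static double speedup(double hashGrowth) {
        return Math.pow(hashGrowth, 1.0 / 12.0);
    }
}
```

For a 3-byte signature at 1 Mnps with 3 probes per access, `secondsPerCollision(24, 3e6)` is about 5.6 s, matching the estimate above, and `speedup(1.14)` is about 1.011, i.e. roughly 1% for 14% more entries.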
-
- Posts: 670
- Joined: Mon Dec 03, 2007 3:01 pm
- Location: Barcelona, Spain
Re: Comparing results
verminox wrote:I think the JVM handles things much differently. Without arrays of primitive data, the JVM will just pad extra bytes to align the word in memory, wasting a lot of space.
Well, it does this to improve performance. When I am compiling in C++, I usually set the compiler manually to align to 16 bytes.
verminox wrote:Again, the JVM internally performs many more instructions than a traditional C program would. For example, for every array access the JVM checks that the index is within bounds or else invokes the exception-handling mechanism. With all this overhead, I doubt accessing two arrays would make much difference over accessing one.
All this exception handling is cheap compared to the time it takes to randomly access data from RAM.
verminox wrote:Is a 4-byte hash enough? I read somewhere that 32-bit hashes were more prone to full collisions and therefore incorrect search results. How true is this?
Bob published a paper on exactly this topic. Maybe you want to have a look at it: http://www.cis.uab.edu/hyatt/collisions.html
Re: Comparing results
Codeman wrote:Well, it does this to improve performance. When I am compiling in C++, I usually set the compiler manually to align to 16 bytes.
That would be fine in normal circumstances. But for a hash table with millions of entries, isn't that wasting millions of bytes? Wouldn't it be worth the extra effort in return for a larger table?
Codeman wrote:All this exception handling is cheap compared to the time it takes to randomly access data from RAM.
Hmmm, OK, I did not know that. When a local variable is created in a method, does it take just as long to access that memory every time? I have never made such considerations about memory access while trying to optimize the search.
I use a bunch of rather unnecessary counters/stacks for more verbose thinking, never imagining those could be a bottleneck. I mean, compared to all the move generation, evaluation, recursion, etc., these counters should hardly matter..... right?
Codeman wrote:Bob published a paper on exactly this topic. Maybe you want to have a look at it: http://www.cis.uab.edu/hyatt/collisions.html
Thanks, I'll have a look at that.
-
- Posts: 2129
- Joined: Thu May 29, 2008 10:43 am
Re: Comparing results
It really depends on the platform, doesn't it?
For example:
my understanding is that certain processor instruction sets require the data to be 16-byte aligned,
and if so, there can be substantial performance advantages...?
I.e. I don't know why Java would be any different, as the resulting binary would utilize these same instruction sets, and there are many C++ optimization guides that recommend aligning all data.
Last edited by kranium on Tue Jun 02, 2009 12:16 pm, edited 2 times in total.