Gaviota EGTBs, interface proposal for programmers

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: Gaviota EGTBs, interface proposal for programmers

Post by IQ »

Hi,

yes, that hit rate seems reasonable. But this leads to another interesting idea: for any probe, the tb code has to decode a whole block (4kb?). Why not then put everything in this block (or a subset of it) into the hash table, instead of just returning one probe result? All that's needed is a "block->hash" converter. This would bypass all the individual probes and eliminate the need for soft probing, as the hash table would already be in alignment with the egtb cache. I wonder how much time, relative to individual probing, would be needed to convert a whole block (or a subset) to hash entries. Any ideas?
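A minimal sketch of such a block->hash converter, with hypothetical names throughout (signature_from_tb_index, the entry layout, and the block size are assumptions for illustration, not Gaviota's actual API):

Code: Select all

#include <stdint.h>

#define BLOCK_POSITIONS 16384   /* positions per decompressed tb block (assumed) */

typedef struct {
    uint64_t key;    /* zobrist signature of the position */
    int16_t  score;  /* tb score; exact by definition */
} tb_hash_entry_t;

extern tb_hash_entry_t *hash_table;   /* the engine's transposition table */
extern uint64_t         hash_mask;    /* table size - 1 */

/* Hypothetical: rebuild the position encoded at index i inside the block
   and compute its zobrist key. This is the potentially expensive step. */
extern uint64_t signature_from_tb_index(uint32_t block_id, uint32_t i);

void block_to_hash(uint32_t block_id, const int16_t *decoded_block)
{
    for (uint32_t i = 0; i < BLOCK_POSITIONS; i++) {
        uint64_t key = signature_from_tb_index(block_id, i);
        tb_hash_entry_t *e = &hash_table[key & hash_mask];
        e->key   = key;
        e->score = decoded_block[i];
    }
}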
michiguel wrote:
IQ wrote:Hi,

I wonder how effective this scheme really is. I would reckon that if tb probe results are also stored in the hash table, the number of "probe_softs" that actually hit the tb cache would be minute. Do you have any numbers on the probe_soft hit rate?
The difference is huge! In the example I posted above you can see this:

GTB CACHE STATS
probes: 144794
hard: 5406
soft: 139388
hits: 64661
misses: 1115
softmisses: 79018
efficiency: 98.3%
average searches 786.4
occupancy: 54.4%

There you have 5406 "hard probes", but the number of hits was 64661 (that is, soft + hard hits). In other words, the total number of hits is 12x the number of "hard" probes themselves (64661 / 5406 ≈ 12).
Half of the soft probes are misses, but that's no big deal because they are relatively cheap.

Many of the soft hits are positions that are not in the hash table: positions that were never visited in the search, but are neighbors within the tb blocks when those blocks are decompressed.

Miguel




The other scheme, with the async I/O, is also flawed, as the probes would have to be queued. It would pile up a huge number of probes in one part of the tree.

Therefore I would propose a prioritized async queue system; the prioritization can easily be done via search depth. This should ensure that the important tb probes actually get done in time to be meaningful for the search.
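A sketch of what such a queue could look like: a plain binary max-heap keyed on remaining depth, so the deepest (most important) probe requests are serviced first. All names are hypothetical and no threading is shown:

Code: Select all

#include <stdint.h>

typedef struct {
    uint64_t position_key;    /* identifies the position to probe */
    int      remaining_depth; /* priority: deeper nodes matter more */
} probe_request_t;

#define QUEUE_MAX 4096

static probe_request_t queue[QUEUE_MAX];
static int queue_size;

void queue_push(probe_request_t r)
{
    if (queue_size == QUEUE_MAX)
        return;                         /* drop requests when full */
    int i = queue_size++;
    while (i > 0) {                     /* sift up by depth */
        int parent = (i - 1) / 2;
        if (queue[parent].remaining_depth >= r.remaining_depth)
            break;
        queue[i] = queue[parent];
        i = parent;
    }
    queue[i] = r;
}

probe_request_t queue_pop(void)         /* caller ensures queue_size > 0 */
{
    probe_request_t top  = queue[0];
    probe_request_t last = queue[--queue_size];
    int i = 0;
    for (;;) {                          /* sift down */
        int child = 2 * i + 1;
        if (child >= queue_size)
            break;
        if (child + 1 < queue_size &&
            queue[child + 1].remaining_depth > queue[child].remaining_depth)
            child++;
        if (last.remaining_depth >= queue[child].remaining_depth)
            break;
        queue[i] = queue[child];
        i = child;
    }
    queue[i] = last;
    return top;
}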

What do you guys think?
It deserves experimentation, but I think the return will be small if the soft probes are implemented, particularly because these queued systems will only work with uncompressed TBs.

Miguel

-IQ
michiguel wrote: It was because of this idea that I started to write my own TBs, so I could implement it. However, I found that something simpler could be done. This is what I already have in Gaviota, and it works very well (not in the released version yet):

tb_probe_hard()
{
    /* The normal blocking probe: probes the cache and, if the info is
       not there, loads the cache with the needed info and also returns
       the info. */
}

tb_probe_soft()
{
    /* Probes the cache only; if the info is not there, returns
       tb_UNKNOWN. */
}

So, when the remaining depth is < 2, I only do tb_probe_soft(). At any other point, I do tb_probe_hard().

I see no decrease in nps when using remaining depth = 2, which indicates that probing is not that bad: we do not probe that often, many positions are in the hashtable, and many are already in cache. If I probe the last two plies "softly", those probes are for free, and the number of probes may increase by an order of magnitude.
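A sketch of how the two probes might sit in a search, following the "remaining depth < 2" rule described above. The position argument, piece counter, and score conversion are assumptions added for illustration (the skeletons above omit them), and tb_UNKNOWN's value is hypothetical:

Code: Select all

typedef struct position position_t;    /* the engine's position type */

extern int tb_probe_hard(const position_t *pos);  /* blocking probe */
extern int tb_probe_soft(const position_t *pos);  /* cache-only probe */
extern int count_all_pieces(const position_t *pos);
extern int score_from_tb(int tbval, int ply);

#define TB_MAX_PIECES 5          /* assumed: largest tablebase available */
#define tb_UNKNOWN    (-32767)   /* hypothetical sentinel */

int search(position_t *pos, int depth, int ply, int alpha, int beta)
{
    if (count_all_pieces(pos) <= TB_MAX_PIECES) {
        /* pay the blocking cost only when remaining depth >= 2;
           near the leaves, use the value only if it is already cached */
        int tbval = (depth < 2) ? tb_probe_soft(pos)
                                : tb_probe_hard(pos);
        if (tbval != tb_UNKNOWN)
            return score_from_tb(tbval, ply);
    }
    /* ... normal alpha-beta search continues here ... */
    (void)alpha; (void)beta;
    return 0;
}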
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Gaviota EGTBs, interface proposal for programmers

Post by michiguel »

The problem is that the info in the TB cache is very compact (2 bytes per position, though I could do 1 for 5-pc), whereas it is generally ~16 bytes per position in the hashtable. Every time I probe one position I fill 16k entries (uncompressed; ~3 kbytes compressed), so we could trash the hashtable pretty quickly (16k entries x 16 bytes = ~256 kbytes of hashtable per decompressed block). Also, the conversion to hash signatures may not be so cheap, because the signatures of the positions we decompress may need to be calculated from scratch.

Miguel
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Gaviota EGTBs, interface proposal for programmers

Post by michiguel »

Dirt wrote:
michiguel wrote:PS: On top of all that, I should have a layer of bitbases, but let's leave that for later...
With bitbases it looks to me like you would only probe the full EGTBs at root. What this means for your optimal implementation of tablebase probing should perhaps be considered now.

Much of this discussion looks like it would be more appropriate for future six piece bitbases.
We can have two internal caches, one for bitbases and one for TBs. In fact, we may not need bitbases at all. We may decompress the TBs as needed and store the info as "bitbases" in the bitbase cache. As this info is very compact, we may end up filling the cache with all the needed bit info. That may be as good as bitbases and completely transparent; it would be as good as increasing the cache of the TBs 8x. I may need to do an experiment, but I think that if we have a half-a-gig TB cache (as good as a 64 Mb bitbase cache), TBs will be accessed in cache all the time.
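A sketch of that transparent two-cache probe path. Every helper name here is an assumption for illustration, not Gaviota's API:

Code: Select all

#include <stdint.h>

extern int            bitbase_cache_lookup(uint64_t tb_key, int *wdl);
extern const int16_t *tb_cache_fetch_block(uint64_t tb_key);  /* may load + decompress */
extern void           bitbase_cache_store_block(uint64_t tb_key,
                                                const int16_t *block);
extern int            wdl_from_block(const int16_t *block, uint64_t tb_key);

int tb_probe_wdl(uint64_t tb_key)
{
    int wdl;

    /* 1) compact bitbase cache: 2 bits instead of 2 bytes per position,
          so the same memory holds 8x more positions */
    if (bitbase_cache_lookup(tb_key, &wdl))
        return wdl;

    /* 2) regular tb cache; as a side effect, keep the decompressed block
          as bits so that later probes stop at step 1 */
    const int16_t *block = tb_cache_fetch_block(tb_key);
    bitbase_cache_store_block(tb_key, block);
    return wdl_from_block(block, tb_key);
}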
hMx
Posts: 61
Joined: Wed Mar 08, 2006 9:40 pm
Location: Germany, Berlin

Re: Gaviota EGTBs, interface proposal for programmers

Post by hMx »

michiguel wrote:
hMx wrote:
michiguel wrote:Is there anything else that a tablebase interface should have?
I'm missing statistics.
Offer some struct and a function that fills numbers like:
#files opened
#probes
#has file
#disk reads
#bytes read
#cache efficiency
#cache hits
etc
That could help tuning memory sizes.
I have that here, so it won't be much effort to provide it. I thought people might not be interested in it, but you proved me wrong. How should I provide this? One function that outputs a structure with all the info, or functions like these:

Code: Select all

typedef uint64_t stat_t;

stat_t   tb_stat_reset (void);
stat_t   tb_stat_probe_hits (void);
stat_t   tb_stat_probe_miss (void);
stat_t   tb_stat_cache_hits (void);
stat_t   tb_stat_cache_miss (void);
etc.
I think I would provide non-redundant information and let the user calculate whatever can be deduced from the values provided (for instance, cache_efficiency = cache_hits / (cache_hits + cache_miss)).

Miguel
I would prefer a struct containing all the statistics data. Otherwise I would make such a struct myself, and it would get out of date when you extend your software...

I don't mind if you also offer the functions for the single values. Might be useful to have.

Offering non-redundant data is ok, I think.
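A sketch of the struct-based variant being asked for here, with hypothetical field names:

Code: Select all

#include <stdint.h>

typedef uint64_t stat_t;

typedef struct {
    stat_t files_opened;
    stat_t probes;
    stat_t probe_hits;
    stat_t probe_misses;
    stat_t disk_reads;
    stat_t bytes_read;
    stat_t cache_hits;
    stat_t cache_misses;
} tb_stats_t;

void tb_stats_get(tb_stats_t *stats);   /* fills the caller's struct */
void tb_stats_reset(void);

/* redundant values are left to the caller, e.g.
   double efficiency = (double)s.cache_hits
                     / (double)(s.cache_hits + s.cache_misses); */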
IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: Gaviota EGTBs, interface proposal for programmers

Post by IQ »

This on-the-fly generation of bitbases is pretty neat. I guess you decode the egtb in <16kb blocks - so for every probe of an egtb position, how much time do you think is needed to convert the complete block internally to bitbases?

Also, one could think about rearranging the egtbs so that each block contains positions that are most likely to occur in the same search. But if your probe_soft hit rate is an indicator, it's already good enough.

Anyhow, this solves the problem with my previous idea - the bitbase format is much more efficient than the hash table at storing the complete egtb block.

User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Gaviota EGTBs, interface proposal for programmers

Post by michiguel »

IQ wrote: This on-the-fly generation of bitbases is pretty neat. I guess you decode the egtb in <16kb blocks - so for every probe of an egtb position, how much time do you think is needed to convert the complete block internally to bitbases?
I don't know, but certainly less than the time it takes to download + decompress it. The lzma decompression speed is ~20 Mb/s, and the average block size (compressed) is about ~3k, so that is about 150 microseconds. The very rudimentary experimental estimations I made put the average time to download + decompress a block at ~600 microseconds with a USB stick (meaning ~450 microsec should be the download). With a solid-state disk or a very fast HD, the download + access time could be reduced 5x (I guess), so it may well be close to the decompression time. In any case, the overall time can hardly be lower than ~200-250 microsec. I think that converting that block into bits should be much faster than that, since it would be a simple loop that takes an int and releases a char ~4000 times. That means 200 microsec / 4000 = 0.05 microsec = 50 ns per char.
At 2 GHz, 50 ns is 100 cycles, which is plenty to do that simple int-to-char packing.
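That inner loop might look like the following sketch (the wdl() helper is hypothetical): 16384 two-byte entries in, 4096 packed bytes out, matching the ~4000 iterations estimated above.

Code: Select all

#include <stdint.h>

extern unsigned wdl(int16_t tbval);   /* hypothetical: maps a tb value to 0..2 */

/* pack 4 positions (2 bits each) into every output byte */
void pack_block(const int16_t *in /* 16384 entries */,
                uint8_t *out      /* 4096 bytes */)
{
    for (int i = 0; i < 4096; i++) {
        out[i] = (uint8_t)( (wdl(in[4*i + 0]) << 0)
                          | (wdl(in[4*i + 1]) << 2)
                          | (wdl(in[4*i + 2]) << 4)
                          | (wdl(in[4*i + 3]) << 6) );
    }
}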

I do not know how this would compete with a real bitbase configuration, but it will certainly be an almost free addition to the tablebases. The real effect will be exactly like increasing the cache size of the TBs 4x or 8x (depending on how the internal storage is done).

It is possible that the bottleneck could be filling up the cache in fast games. If that is the case, maybe the cache should be filled with the most common endgames at startup.

Also, one could think about rearranging the egtbs so that each block contains positions that are most likely to occur in the same search. But if your probe_soft hit rate is an indicator, it's already good enough.
The format was designed so that kings and pawns (the least mobile pieces) cluster together. That should maximize, at least in a crude way, what you say.
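A toy illustration of that layout choice (not Gaviota's real indexing): if the slow-moving pieces form the high-order part of the index, all positions sharing a king + pawn configuration land in the same block, so one decompression serves many probes from the same search.

Code: Select all

#include <stdint.h>

typedef struct position position_t;       /* the engine's position type */

extern uint32_t kp_index(const position_t *p);     /* kings + pawns config */
extern uint32_t piece_index(const position_t *p);  /* remaining pieces */
#define PIECE_SLICE 1806   /* hypothetical: piece configs per kp config */

uint64_t tb_index(const position_t *p)
{
    /* kings and pawns are the high-order "digits", so positions that
       differ only in the fast-moving pieces stay close together */
    return (uint64_t)kp_index(p) * PIECE_SLICE + piece_index(p);
}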

Miguel
