2-level TT

Rein Halbersma · Post by **Rein Halbersma** » Fri Jul 14, 2017 9:51 pm

A transposition table (TT) basically acts as software-based finite cache to relieve pressure on the much slower search. In particular, using the common implementation array of N-entry buckets , the TT mimics an N-way set-associative cache. The average lookup cost of a position is given by (see any lecture notes on caches)

Code: Select all

T_avg = T_lookup * hit_rate + T_search * &#40;1-hit_rate&#41;

Increasing the overall TT size will increase the hit-rate and thereby reduce the number of capacity misses, but will eventually also increase T_lookup (TLB misses). CPU caches benefit from 2-level caches: an L1 cache which is small and has very fast T_lookup (because of relatively low set associativity), with a larger but slower a L2 cache with a larger set-associativity (eg 4-way or 8-way) to reduce conflict misses.

So has anyone tried to do something similar with the TT? Eg a direct mapped TT1 (with always replace) of a few Mb, with a much larger secondary TT2 in the usual 4-bucket implementation. The TT1 might eg make qsearch TT lookups more palatable.

Obviously, the various cost components are depth dependent, so one could make the TT2 lookup in case of a TT1 miss conditional on remaining depth (eg not during qsearch).

Kotlov · Post by **Kotlov** » Fri Jul 14, 2017 10:08 pm

Hmm... Thinkable idea.

hgm · Post by **hgm** » Fri Jul 14, 2017 10:50 pm

I never had much success trying this. I guess one of the reasons is that the hardware caches already do it for you. The access to the least recently used elements of the large TT is already fast, because they are still in the CPU caches. Putting them in a smaller table that is always cached is thus only helpful for increasing their density: the hardware caches work one 64-byte bucket (usually 4 entries) at the time, and usually only a single entry is needed, and the other 3 are just wasting cache space. Only storing the entries you actually need in a small software cache would allow you to hold 4 times as many recent TT entries in the CPU cache. You would have to worry about preventing these to be flushed out of the CPU cache four at the time by the accesses to the larger TT, however.

Perhaps in SMP machines with very many cores, where hashing all nodes up to depth 0 causes a severe memory bottleneck, so that you are better off not storing d=0 (QS) entries, it could be helpful to use a cache-fitting small TT exclusively for d=0 nodes, and not shared between threads (which would also spoil their caching). I never tried that.

Rein Halbersma · Post by **Rein Halbersma** » Fri Jul 14, 2017 11:51 pm

Yes, I realize that the hardware can mask the benefits. For me, being able to do conditional secondary lookups (based on depth, or even on the search bounds) is the main idea. It's similar to how Ronald's endgame database lookups using unconditional memory mapped IO differ from the handwritten (and hence conditional) multilevel caching people use in draughts and checkers programs.

Also a direct mapped TT1 should be cheaper to probe than a 4-bucket TT.

2-level TT

2-level TT

Re: 2-level TT

Re: 2-level TT

Re: 2-level TT