Improving TT replacement scheme

Discussion of chess software programming and technical issues.

Moderator: Ras

jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Improving TT replacement scheme

Post by jmcd »

I'm currently trying to implement a transposition table replacement scheme that can outperform the "always replace" method my engine uses at the moment. I've read about a scheme that stores two entries for every accessible index, one that is replaced by depth and another that is always replaced if the first replacement fails. It is outlined in Jon Pettersson's blog here http://mediocrechess.sourceforge.net/gu ... ables.html

The code I've written to implement it is shown below, but for some reason it still performs at about the same level as, or slightly below, the "always replace" scheme. Am I doing something wrong? I assume there is no aging mechanism that needs to be implemented for this method.

One potential problem I can see, though, is that the "replace by depth" entries will likely become stale and infrequently replaced once they hold a move from an exceptionally high depth, making half of the TT useless after a while.

Code: Select all

    // The depth-preferred entry lives at 'index'; the always-replace entry at
    // 'index + half_tt_size', i.e. in the other half of the table.
    TTEntry* TTable::probe(Key key, bool& found)
    {
        auto index = key % half_tt_size;
        found = (ht[index].key == key);
        if (!found)
        {
            found = (ht[index + half_tt_size].key == key);
            return &ht[index + half_tt_size];
        }
        return &ht[index];
    }

    void TTable::new_entry(Key key, int d, int e, HashFlag f, Move m)
    {
        auto index = key % half_tt_size;
        if (ht[index].depth <= d)
            ht[index] = TTEntry(key, d, e, f, m);                 // depth-preferred slot
        else
            ht[index + half_tt_size] = TTEntry(key, d, e, f, m);  // always-replace slot
    }
Clovis GitHub
User avatar
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Improving TT replacement scheme

Post by hgm »

You store the two entries that correspond to the same 'bucket' half the TT size apart from each other. That is very bad, because the TT is too large to fit in any cache, and accesses to DRAM memory are extremely expensive (they take about the same time as 150 other instructions). And now you do two of those on every probe, or at least on every probe that misses, which is most of them. You make that worse by the order in which you test the entries: you test the depth-preferred entry first on a probe, while most hits will be in the always-replace entry (because a search tree does not search many positions very deep). So even in case of a hit you almost always have to access both entries.

Since every memory access automatically fetches 64 bytes from DRAM into cache, the second probe would be fast if you made sure it was into that same memory block of 64 bytes (a 'cache line'). So one commonly divides the hash table into 64-byte (or 32-byte) 'buckets', and uses a replacement scheme that keeps all possible places where a given position can be stored in the same bucket.

Also note that calculating the modulo by a number that is not known at compile time is very expensive (about as much as 70 instructions), and your TT size is likely not known at compile time if your engine is UCI and supports the Hash option. Bob Hyatt once reported that Crafty slowed down by some 30% when using modulo instead of the usual ANDing with a mask and limiting the TT size to a power of 2. If you really want to allow other TT sizes, there are faster methods to obtain an index than taking the key modulo the TT size.
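One such method, for example (just a sketch, assuming a 64-bit key and a compiler with 128-bit integer support such as GCC or Clang), is the 'multiply-high' mapping, which scales the key onto [0, tt_size) with a single multiplication:

Code: Select all

    #include <cstdint>
    #include <cstddef>

    // Multiply-high mapping of a well-mixed 64-bit key (e.g. a Zobrist key)
    // onto [0, tt_size) without any division or power-of-2 restriction.
    inline std::size_t mul_hash_index(std::uint64_t key, std::size_t tt_size)
    {
        return (std::size_t)(((unsigned __int128)key * tt_size) >> 64);
    }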

Finally, if you use a depth-preferring replacement scheme, this will eventually fill up the depth-preferred part of your table with very high depth entries, which will then linger on forever even if they are no longer relevant (because they correspond to positions early in the game which can no longer be reached with the currently present material). You should always combine it with some system to cleanse the table of stale entries (like aging).
jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Re: Improving TT replacement scheme

Post by jmcd »

Thanks a ton for your advice. I'll look into doing all of those, but buckets are what I don't totally understand at the moment. I'll have to do some reading on how they work.
Clovis GitHub
User avatar
Bo Persson
Posts: 257
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: Improving TT replacement scheme

Post by Bo Persson »

jmcd wrote: Fri Jun 24, 2022 10:23 am Thanks a ton for your advice. I'll look into doing all of those, but buckets are what I don't totally understand at the moment. I'll have to do some reading on how they work.
There is no magic here. A bucket can be just a struct with 2 or 3 entries (depending on their size) that fits within 64 bytes. If you also align it properly, it will then exactly fit one cache line.
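Something along these lines, for example (a sketch; the 16-byte entry layout and the field names are illustrative, not jmcd's actual types):

Code: Select all

    #include <cstdint>

    // Hypothetical 16-byte entry.
    struct Entry16 {
        uint64_t key;    // 8 bytes
        int16_t  score;  // 2 bytes
        uint8_t  depth;  // 1 byte
        uint8_t  flag;   // 1 byte
        uint32_t move;   // 4 bytes -> 16 bytes total
    };

    // Four such entries exactly fill a 64-byte cache line; alignas(64) makes sure
    // a bucket never straddles two lines.
    struct alignas(64) Bucket {
        Entry16 entry[4];
    };
    static_assert(sizeof(Bucket) == 64, "one bucket per cache line");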
User avatar
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Improving TT replacement scheme

Post by hgm »

'Buckets' is basically just a name for the group of entries where a position could be stored. E.g. in a scheme similar to yours you could have probed table entries 2*index and 2*index+1 rather than index and index+half_tt_size. Then each even entry and the following odd entry would form a bucket. They would be next to each other in memory, making the chance quite large that the second entry to probe was already brought into cache as a side effect of probing the first. Depending on the size of the entry you could even make that a certainty, by aligning the TT array with the cache lines (which start at memory addresses that are a multiple of 64). E.g. if an entry occupies 16 bytes, adjacent (even, odd) pairs would always be in the same cache line if the start of the array coincided with the start of a cache line.
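In code that pairing could look roughly like this (a sketch using jmcd's names; probe_pair is a hypothetical function, and half_tt_size is taken to be the number of pairs, not entries). Note that it also tests the always-replace slot first, for the reason given above:

Code: Select all

    // Hypothetical probe for the (2*index, 2*index+1) pairing.
    TTEntry* TTable::probe_pair(Key key, bool& found)
    {
        auto bucket = key % half_tt_size;                // half_tt_size = number of pairs
        TTEntry* depth_preferred = &ht[2 * bucket];      // even slot
        TTEntry* always_replace  = &ht[2 * bucket + 1];  // adjacent odd slot, same cache line
        found = (always_replace->key == key);            // most hits land here, so test it first
        if (found)
            return always_replace;
        found = (depth_preferred->key == key);
        return depth_preferred;
    }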

As Bo says, you could make the bucket structure of the table explicit, by using a struct for it containing a number of entries, or leave it implicit through the way you calculate indexes from the key. A common reason to use an explicit struct is when the size of your entries is such that an integer number of them does not fill an entire 64-byte cache line. E.g. when an entry takes 12 bytes (3 ints), you could organize the Bucket struct such that it contains 5 entries and 4 bytes of 'filler' (which you could use as aging counters, because only depth-preferred entries would need aging). You could use one of the entries (say the first) as always-replace, to which a position goes when it doesn't have the depth to replace the one in the depth-preferred entry where it maps to. So from the key you would then derive both a bucket number and a number N of the entry inside the bucket (1-4). The position then goes into entry 0 or N, depending on whether it has enough depth for the latter. This is just one example of how you could do it.
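A layout along those lines might look like this (a sketch; the 12-byte entry and how its fields are packed are purely illustrative):

Code: Select all

    #include <cstdint>

    // Hypothetical 12-byte entry (3 ints); the field packing is illustrative.
    struct Entry12 {
        uint32_t key32;        // upper bits of the hash key
        uint32_t score_depth;  // packed score and depth
        uint32_t move_flags;   // packed move and bound type
    };

    // Five entries plus 4 bytes of filler fill one 64-byte cache line exactly.
    struct alignas(64) Bucket5 {
        Entry12  slot[5];  // slot[0] = always-replace, slot[1..4] = depth-preferred
        uint32_t age;      // the spare 4 bytes, usable as aging counters
    };
    static_assert(sizeof(Bucket5) == 64, "one bucket per cache line");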
jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Re: Improving TT replacement scheme

Post by jmcd »

Thanks again guys. I've gone with what you suggested and made a struct for buckets with a size of 2 for now. I hadn't read what you said about indexing by multiplying by two before I started, so if I can't get it to work this way I will try that. I must have made a large mistake somewhere, because this is performing worse than my highly inefficient method of index + half tt size. Please explain like you're talking to a complete idiot. Buckets are currently 48 bytes, so I'm not sure if that causes a problem if it isn't exactly 64. The relevant code I've just written is shown below.

Code: Select all

    constexpr size_t tt_size = 4194304; // 2^22
    
    Bucket* ht;
    
    inline int TTable::hash_index(Key key) const {
        return key & (tt_size - 1); // is this the AND method for modulo you were referring to?
    }

    struct Bucket {
    public:
        TTEntry e1;
        TTEntry e2;
    };
    
    TTEntry* TTable::probe(Key key, bool& found) 
    {
        Bucket* b = ht + hash_index(key);
        found = (b->e1.key == key);
        if (found)
            return &b->e1;
        found = (b->e2.key == key);
        return &b->e2;
    }

    void TTable::new_entry(Key key, int d, int e, HashFlag f, Move m)
    {
        Bucket* b = ht + hash_index(key);
        if(b->e1.depth <= d)
            b->e1 = TTEntry(key, d, e, f, m);
        else
            b->e2 = TTEntry(key, d, e, f, m);
    }
    
    void age() { // method for avoiding stale e1 entries in bucket
        for (int i = master; i < tt_size; i += 2)
            if (ht[i].e1.depth > 1)
                ht[i].e1.depth = 1;

        master = !master;
    }
Clovis GitHub
User avatar
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Improving TT replacement scheme

Post by hgm »

Doing the extra pass through the TT for the aging is probably very expensive. I don't know how large your entries are, but even when you age e1 in only every other bucket, you might still touch every cache line in the TT. Besides, the way you age the entries is very coarse: setting their depth to 1 is almost the same as erasing them altogether, because they will very quickly be replaced by positions that have d==2, of which there will be a lot in the tree. If you want to age by adapting the depth, it would be better to just decrease the depth by 1.

You can avoid having to make an extra pass through the table by deciding 'on the fly' whether to age (i.e. while you probe). Then you don't need another access to uncached DRAM. E.g. you could decide to overwrite the depth-preferred entry even with a lower depth occasionally, with a low probability. A very simple method would be to also allow replacement by a depth that is exactly one lower ('undercut replacement') in 1 out of 8 cases (e.g. when your node counter's 3 lowest bits are 0). Then the higher the depth, the longer it will last in the table.
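Applied to your two-entry bucket, that could look roughly like this (a sketch; 'nodes' is assumed to be the engine's global node counter):

Code: Select all

    // Sketch of undercut replacement; 'nodes' is an assumed global node counter.
    void TTable::new_entry(Key key, int d, int e, HashFlag f, Move m)
    {
        Bucket* b = ht + hash_index(key);
        // Occasionally let a depth exactly one lower replace the depth-preferred slot,
        // so that very deep but stale entries are eventually flushed out.
        bool undercut = (d == b->e1.depth - 1) && ((nodes & 7) == 0); // ~1 in 8 stores
        if (d >= b->e1.depth || undercut)
            b->e1 = TTEntry(key, d, e, f, m);   // depth-preferred slot
        else
            b->e2 = TTEntry(key, d, e, f, m);   // always-replace slot
    }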
User avatar
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Improving TT replacement scheme

Post by lithander »

Code: Select all

        static int Index(in ulong hash)
        {
            int index = (int)(hash % (ulong)_table.Length);
            ref HashEntry e0 = ref _table[index];
            ref HashEntry e1 = ref _table[index ^ 1];

            if (e0.Hash == hash)
                return index;

            if (e1.Hash == hash)
                return index ^ 1;

            //raise age of both and choose the older, shallower entry!
            return (++e0.Age - e0.Depth) > (++e1.Age - e1.Depth) ? index : index ^ 1;
        }
I use this way of accessing my hash entries. You could say I have buckets of two, because two consecutive entries in the table are related in such a way that when you don't find the desired entry in one slot you also look into the other, neighbouring slot, just by flipping the least significant bit. But both entries have an equal chance of being utilized, because if the hash is somewhat randomly distributed they have equal probability to become e0 or e1.

On a miss I need to decide on one entry to replace. Usually I would want to keep the deeper entries because their knowledge was derived from deeper searches. The quality is thus higher. But I also want to get old but deep entries out of the system eventually. So whenever I have a table miss I increase the age. Eventually even deep entries will get removed if they haven't had their age counter reset to zero again, which I do after a table hit when the entry is actually usable.

Code: Select all

        public struct HashEntry
        {
            public ulong Hash;       //8 Bytes
            public short Score;      //2 Bytes
            public byte Depth;       //1 Byte
            public byte Age;         //1 Byte
            public int MoveAndType;  //4 Byte
            //=================================
            //                        16 Bytes 
        }
A HashEntry looks like this btw. For the "Score" I need to remember whether it's an upper bound, lower bound or exact, and that information only needs 2 bits. I have two bits to spare in the Move struct, which only requires 28 bits, so I merge the two into the MoveAndType field. The encoding and decoding takes time, but now exactly 4 HashEntries fit into a cache line, and in the end it's a net win performance-wise. (Even in C#, where you don't get much control over the lower-level behavior.)
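The packing amounts to something like this (sketched in C++ for brevity; the function names are illustrative, not the actual Leorik code):

Code: Select all

    #include <cstdint>

    // Illustrative packing of a 28-bit move and a 2-bit bound type into one 32-bit field.
    enum BoundType : uint32_t { EXACT = 0, LOWER = 1, UPPER = 2 };

    inline uint32_t  pack(uint32_t move28, BoundType bound) { return move28 | (bound << 28); }
    inline uint32_t  unpack_move(uint32_t moveAndType)      { return moveAndType & 0x0FFFFFFF; }
    inline BoundType unpack_bound(uint32_t moveAndType)     { return BoundType(moveAndType >> 28); }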
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
User avatar
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Improving TT replacement scheme

Post by hgm »

Your way of determining age seems a bit sub-optimal. Low-depth probes are enormously more frequent than those with very high depth (e.g. d=0 vs d=8), so a d=8 entry would get way more than 256 misses by d=0 probes before it gets a chance of being probed once.

A more common method is to have a global counter that is incremented every search (or every analysis session), and to store that counter in the entry every time you get a hash hit on it, so that it records when the entry was last used. You can then overwrite positions that have not been used in the current or previous search (or the last N searches), by looking at the difference between the global counter and the stored age. Then even a 2-bit counter can be enough (when you do the difference calculation modulo 4).
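In code the staleness test could look something like this (a sketch with assumed names, not taken from any particular engine):

Code: Select all

    #include <cstdint>

    // 'search_age' is an assumed global, incremented once per search (or analysis session).
    uint8_t search_age = 0;

    // An entry is considered stale if it was last used more than one search ago.
    // With a 2-bit age field the subtraction is done modulo 4.
    inline bool is_stale(uint8_t entry_age)
    {
        return ((search_age - entry_age) & 3) > 1;
    }
    // On a hash hit:   entry.age = search_age;
    // When replacing:  prefer slots for which is_stale(entry.age) is true.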

The disadvantage of such methods is that they require extra space in the hash entry for the aging counter, and that they tend to make the cache lines dirty (i.e. modified), so that these have to be written back when they get replaced by a cache line for another probe, doubling the memory bandwidth (which often limits the nps to start with). This can be ameliorated by not writing the age field if it already has the value of the current search. A somewhat less accurate method would only write the age field when the position is first written to the table (when you dirty the cache line anyway).

I also started by using 'shallowest-of-2' replacement, where the bucket size is 2 and the always-replace entry is dynamically decided to be the one containing the lowest depth. But when I realized that almost all hits will be on the entry with the lowest depth, I started preferring a design where you know in advance which of the two entries in the bucket is the always-replace one and which contains the high depth, so that you can test the former for a hit first. This speeds up the recognition of a hit. (And when you don't even get to examine the depth-preferred entry, you also don't have to bother with its age.)

These methods do not help when your table overflows within a single search (or within one analysis session). This is why I usually also include an 'undercut' entry in the bucket. E.g. with a bucket of 5 entries I would reserve one as always-replace, where everything goes that was not able to qualify for any of the other slots. The other 4 can be treated as two pairs, with each probe limited to one pair. One slot of such a pair would be depth-preferred, and would contain the highest depth that ever mapped to that pair. These slots will eventually fill up entirely in a long search, because they do not age. The other slot would be 'undercut', and get replaced by anything with a depth that is larger, equal or one lower. Note that it is usually better to also replace equal depth, even in the depth-preferred slot, because recent entries are more likely to be useful than older ones.
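One way such a bucket's replacement decision could be written (all names are illustrative; this is a sketch of the idea, not a recipe):

Code: Select all

    #include <cstdint>

    // Minimal illustrative slot; a real entry would also hold score, flags and move.
    struct Slot    { uint64_t key; int depth; };
    struct Bucket5 { Slot s[5]; };  // s[0] always-replace; (s[1],s[2]) and (s[3],s[4]) are pairs

    // 'pair' (0 or 1) is assumed to be derived from the hash key.
    void store(Bucket5& b, const Slot& fresh, int pair)
    {
        Slot& dp = b.s[1 + 2 * pair];               // depth-preferred slot of this pair
        Slot& uc = b.s[2 + 2 * pair];               // undercut slot of this pair
        if      (fresh.depth >= dp.depth)     dp = fresh;      // equal depth also replaces
        else if (fresh.depth >= uc.depth - 1) uc = fresh;      // one ply lower still qualifies
        else                                  b.s[0] = fresh;  // everything else: always-replace
    }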
User avatar
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Improving TT replacement scheme

Post by lithander »

hgm wrote: Fri Jun 24, 2022 3:31 pm Your way of determining age seems a bit sub-optimal. Low-depth probes are enormously more frequent than those with very high depth (e.g. d=0 vs d=8), so a d=8 entry would get way more than 256 misses by d=0 probes before it gets a chance of being probed once.
And here I thought I had this one feature working well and finished! :oops: You're exactly right, of course.

Leorik tends to be strong against its peers at fast time controls and loses its edge the longer the time controls get. The transposition table not being able to preserve high-depth entries long enough could very well be a major contributor to that.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess