SIMD methods in TT probing and replacement

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

SIMD methods in TT probing and replacement

Post by hgm »

I start this new thread because an incipient discussion on this topic elsewhere was buried under a silly off-topic flame war.

Warning: no wildly off-topic postings will be tolerated in this thread; those will be deleted without further notice.

To recapitulate what was discussed so far:

A 64-byte hash bucket (aligned with a cache line for efficient memory access) could be organized such that part of the hash signature of each of the entries it contains is packed into one 64-bit word of the bucket. E.g. if there are 7 entries in the bucket, 9 signature bits of each of those could be packed in a 'signature extensions' word (with 1 bit to spare). One could then test for a hit by creating a key that contains 7 similarly packed copies of the corresponding bits of the hash key for the sought entry, and thus compare all 7 signature extensions at once for a match.

In code:

Code: Select all

#include <stdint.h>

typedef struct {
  // whatever you want to store in the TT, amongst which a hash signature of, say, 24 bits
} HashEntry;

typedef struct {
  uint64_t extensions; // 7 x 9 signature-extension bits, packed; 1 bit to spare
  HashEntry e[7];
} HashBucket;

// replicate a value of up to 9 bits into all 7 fields (the cast avoids int-overflow UB on the high shifts)
#define PACKED(X) ( (uint64_t)(X) | (uint64_t)(X)<<9 | (uint64_t)(X)<<2*9 | (uint64_t)(X)<<3*9 | (uint64_t)(X)<<4*9 | (uint64_t)(X)<<5*9 | (uint64_t)(X)<<6*9 )
HashBucket hashTable[SIZE];
uint64_t hashKey, hashMask;

// probe code

uint64_t index = hashKey & hashMask;
uint64_t sought = (hashKey >> (64-9)) * PACKED(1); // 7 copies of the key's top 9 bits
uint64_t diffs = sought ^ hashTable[index].extensions; // a 9-bit field is all 0 on a match
uint64_t tmp = diffs | PACKED(1<<8); // set the top bit of every field
diffs |= ~PACKED(1<<8);              // keep only the top bit of every field
tmp -= PACKED(1);                    // the borrow clears a field's top bit iff its lower 8 bits were all 0
diffs |= tmp;                        // a field's top bit is now 0 iff all 9 bits matched
if(~diffs) { // at least one of the signature extensions did match the hash key
  // use standard bit-extraction technique on ~diffs to see which entries those were, and check their signature to see if they are true hits
  ...
}
The decrement of tmp is used as a trick to collect the OR of all lower bits of a signature-extension difference into the highest bit: the borrow will only propagate into that (previously set) highest bit, clearing it, when all lower bits are 0. The highest bit of the difference itself is then ORed in afterwards, so that a 0 remains only if there were no differences at all (i.e. a match).
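
For concreteness, a minimal sketch of that bit-extraction loop, assuming GCC/Clang's __builtin_ctzll and a hypothetical 24-bit signature field in HashEntry (the slice of hashKey compared against it is also just an assumption):

Code: Select all

uint64_t matches = ~diffs;               // bit 9*i+8 is set iff entry i matched
while(matches) {
  int i = __builtin_ctzll(matches) / 9;  // index of a matching entry
  matches &= matches - 1;                // clear the bit just extracted
  if(hashTable[index].e[i].signature == (hashKey >> 31 & 0xFFFFFF)) {
    // true hit on entry i: the remaining 24 key bits match as well
  }
}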

This makes the treatment of the common case of a hash miss very simple; the chance that a 9-bit key extension matches by accident is only 0.2%, so with 7 entries 98.6% of the misses would not have an accidental match, and would never have to touch any of the entries themselves. This greatly reduces the time the CPU would stall while waiting for those entries to be retrieved from memory, provided we put the packed signature extensions in the first word that would be fetched during a cache-line-fill memory burst access (as we did).

For the 1.4% accidental matches we would examine the single matching entry for the remaining 24 key bits, and conclude we did not have a match after all. In general we would only loop over the entries for which the signature extension matched, to compare the remainder of the signature against the corresponding hashKey bits. So it wouldn't cause much overhead if those entries have to be unpacked first. In the case of a true hit it would be the only matching signature extension in 98.8% of the cases, and in half of the other 1.2% we would happen to try the truly matching one first.

In this example the entries themselves were packed in 64-bit words, which is made easier by 'outsourcing' a number of signature bits to the signature-extension word. Such words are accessed atomically, so in an SMP environment there is no chance a memory race could separate the data in the entry from its signature. There can be a race between the entry write and the signature-extension write, leading to a corruption of the kind where a signature gets combined with the signature extension that belonged to another entry. But that would merely invalidate the total signature. Even if other data from the entry were stored in the signature-extension word (which we did not do here), it would be moderately protected from corruption by a memory race: only if the other entry it came from happened to have the same signature extension could such data be mistaken for genuine. For things like aging bits that seems entirely acceptable.
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SIMD methods in TT probing and replacement

Post by hgm »

Another thing that could be done is to take the depth out of the individual entries, and pack those in a separate word too. This is something one would have to consider anyway if one did not manage to fit the entry in 8 bytes despite the reduced signature size. Then it cannot be avoided that part of the entry is located in a second word, and it would probably mean there is no room for 7 entries in a bucket. Unless you want to switch to four 16-byte entries immediately, this means the entries will need to be packed in an irregular way.

So suppose we have 6 entries per bucket, most of their data together with a sizable hash signature in an atomically accessed 8-byte word. The remaining two words are the packed signature extensions discussed above, and packed depth and aging bits. We have 10 bits per entry available in the latter word (with 4 bits to spare), so assume 7 bits depth and 3 bits aging. For each 10-bit field belonging to an entry we put the aging in the lowest 3 bits, and the depth in the 7 highest.

We can now use this to our advantage during replacement. Suppose we want to do "shallowest-of-6" replacement. This requires us to figure out which of the 6 stored entries has the lowest depth. But we would want to treat stale entries, not used in the current or previous search, as having depth zero, so that we preferentially replace those. This could be done as follows:

Code: Select all

#define PACKED(X) ( (uint64_t)(X) | (uint64_t)(X)<<10 | (uint64_t)(X)<<20 | (uint64_t)(X)<<30 | (uint64_t)(X)<<40 | (uint64_t)(X)<<50 )

uint64_t searchNr; // a packed quantity, decremented mod 8 per field before every search:
                   // searchNr = searchNr + PACKED(7) & PACKED(7)
                   // (adding 7 rather than subtracting 1 avoids a borrow running across fields when they are 0)
uint64_t depths = hashTable[index].depthAndAge;
uint64_t ages = depths & PACKED(7); // the age bits contain 8 minus the searchNr of their last access
depths &= ~PACKED(7); // squash age bits so that only depths remain
ages += searchNr; // calculate the 'life time' of the entries (0 = current search, 1 = previous search, etc.)
ages &= PACKED(6); // enforce modulo-8 arithmetic, and also group pairs ( {0, 1} -> 0 etc. )
ages -= PACKED(1); // goes negative for 0 (formerly 0 or 1): the borrow runs through the whole field, leaving ones in its depth bits
depths &= ages; // squash the depths of the stale entries, keep those of the fresh ones
With "shallowest-of-6" replacement one of the entries (could be anyone) will typically serve as always-replace slot of the bucket, and thus contain depth 0 or 1 almost all the time, while the other entries continue to accumulate ever higher depths. We can test for presence of such a low depth through

Code: Select all

depths += PACKED(2)      // prepare a 'carry fence' in the now unused age fields
        - PACKED(2<<3);  // subtract 2 from the depths; d=0 or 1 now goes negative, its borrow flowing into the fence of the next higher field
depths &= PACKED(1<<10); // squash everything but the borrows that came out of the depth fields
if(depths) { // there was at least one entry with depth 0 or 1
  // use standard bit extraction to figure out which entries these are
  ...
} else { // backup procedure for the rare case that all entries contain a high depth
  ...
}
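
To turn those carry bits into slot numbers, the same bit-extraction idea as in the probe code applies; a minimal sketch, again assuming __builtin_ctzll:

Code: Select all

uint64_t carries = depths;                       // slot i underflowed -> bit 10*(i+1) is set
while(carries) {
  int slot = __builtin_ctzll(carries) / 10 - 1;  // recover the slot index
  carries &= carries - 1;                        // clear the bit just extracted
  // slot holds an entry of depth 0 or 1: a replacement candidate
}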
For more refined replacement schemes we could use a similar technique by subtracting a multi-copy of the depth of the entry to be stored (instead of 2) from all the depths, to see if the new arrival qualifies for storage in a depth-preferred (or undercut) slot:

Code: Select all

depths += PACKED(2) - PACKED(1<<3) - depthToStore*PACKED(1<<3); // a slot's field now underflows iff its depth <= depthToStore
One could tweak the added constant and multiplier away from homogeneously packed quantities to account for always-replace slots (for which the multiplier is then set to 0), and undercut slots (for which you would subtract 2 instead of 1 from the depth), to flag all eligible depth-preferred and undercut slots at once. Only if multiple slots qualify would we have to do additional processing to decide which one we prefer to replace. If none qualifies (which will be the common case, as almost all depths we want to store are 0 or 1), we can write the new entry to the always-replace slot immediately.
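
As a sketch of that tweak, assuming slot 0 is the always-replace slot and slot 1 the undercut slot (the assignment of roles to fields is my choice for illustration):

Code: Select all

uint64_t mult = PACKED(1<<3) - ((uint64_t)1<<3); // depthToStore multiplier: 0 in slot 0's field
uint64_t offs = PACKED(2) - PACKED(1<<3)         // fence plus 'subtract 1' in every depth field...
              - ((uint64_t)1 << (10+3));         // ...and one extra in the undercut field
depths += offs - depthToStore*mult;              // the carries now flag all qualifying slots at once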

A downside of such a scheme is that the depth of the entry (and its age) is not under any protection from SMP corruption. So we might occasionally get an incorrect depth from a hash probe. Now this is a reasonably harmless event: if the depth is lower than it should be, we might not accept the score for a hash cutoff. Which we would not have done anyway if the corruption had been recognized, as we would then have rejected the entire entry, including its hash move (which is in the atomic part, and thus correct). In case the depth corruption gives us too high a depth, we might accept an unreliable score. (But not a completely crazy one, as the score is also in the atomic part. So it was a score from the same position, just not from as deep a search as we are led to believe.) I could live with that.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: SIMD methods in TT probing and replacement

Post by mar »

I think I still like the idea of 7 entries + extended signature better, because in the case of a TT store the 7-entry version does 2 writes instead of 3, while the 6-entry version would require more exotic packing/unpacking (plus I like having 7 entries/bucket rather than 6, even though the 6-entry version could store larger keys).

I thought a bit about what would be the best way to store the packed TT.

I'm not sure how many bits would really be sufficient for age (I've only ever used age to distinguish old searches from the current one anyway),

but there's always the possibility to shuffle the bits around a bit:

Code: Select all

9 bits extended signature in first qword, 1 bit wasted

per entry:
23 bits hash (total 32 bits stored in TT due to 9-bit ext sig)
2 bits bound
7 bits depth

so we have 32 bits left to encode move, score and age
16 bits score
4 bits age
12 bits move
12 bits per move are indeed sufficient the way you proposed, namely reusing free bits for the promotion piece, because for a promotion you only need to encode the file of the destination square (the rank is implied).
castling encoded as king captures own rook to allow chess960
(this implies 8x8 board, regular chess)

assuming separate dedicated eval cache per thread, this is really all that's needed to store in a TT entry I think
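
A sketch of how such a 64-bit entry might be packed; the field order and the helper names are illustrative, not something fixed above:

Code: Select all

#include <stdint.h>

// 23 bits signature | 2 bits bound | 7 bits depth | 16 bits score | 4 bits age | 12 bits move = 64
typedef uint64_t TTEntry;

static inline TTEntry tt_pack(uint32_t sig23, unsigned bound, unsigned depth,
                              int16_t score, unsigned age, unsigned move12) {
  return  (uint64_t)(sig23 & 0x7FFFFF)
       | ((uint64_t)(bound & 3)      << 23)
       | ((uint64_t)(depth & 0x7F)   << 25)
       | ((uint64_t)(uint16_t)score  << 32)
       | ((uint64_t)(age & 0xF)      << 48)
       | ((uint64_t)(move12 & 0xFFF) << 52);
}

static inline int16_t  tt_score(TTEntry e) { return (int16_t)(e >> 32); }
static inline unsigned tt_depth(TTEntry e) { return (e >> 25) & 0x7F; }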
Martin Sedlak
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SIMD methods in TT probing and replacement

Post by hgm »

The minimum number of useful age bits would be 2, because you want to distinguish at least 3 cases:
* already visited in the current search
* last visited in the previous search; could be we will still get to it
* not visited in current or previous search (stale)

I don't see a very strong argument for using more. If you purge stale entries as soon as you visit their bucket, it becomes very unlikely that such an entry survives a completed search, let alone a number of such searches. For an entry that survives too long the age counter would wrap around, and then it would not be recognized as stale anymore for the next two searches. But this would only happen if there is very little demand for that bucket, and it is questionable whether in that case it would hurt much to allow the entry to linger on for two more searches.

With 2 bits you even have room for distinguishing entries that were not used for one and for two full searches, and you could preferentially replace the latter. With 3 bits you have room to distinguish even entries that survived 6 searches without being accessed. Would anything ever survive that long when hash space is at a premium? And if hash space is abundant, would you ever care how well replacement works?
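
A minimal sketch of such 2-bit aging with plain per-entry fields (not the packed variant from earlier; searchNr here is assumed to be a simple counter incremented before every search):

Code: Select all

entry->age = searchNr & 3;                        // stamp on store (and on hits you want to refresh)

unsigned lifetime = (searchNr - entry->age) & 3;  // 0 = current search, 1 = previous search
if(lifetime >= 2) { /* stale: preferred victim for replacement */ }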

BTW, I don't like aging in general. It requires you to modify the bucket even on a hit that caused a hash cutoff. That dirties the cache line, so that it will have to be written back when it finally gets flushed out of the cache. This causes a lot of extra memory bandwidth. Perhaps we should only update the age if the cache line will be written anyway.

As an alternative I often use 'equi-distributed draft' replacement. This keeps account of the number of entries of each depth that are in the table: on replacement I decrement the counter for the overwritten depth, and increment the one for the stored depth. Whether an entry will be considered stale then depends on whether there are already too many items of that depth in the table. This means that entries are flushed from the table at the rate that new entries of the same depth arrive, i.e. the larger the depth, the longer they will survive. Then there need not be any aging bits in the table. With depth-preferred replacement the depth histogram slowly shifts to higher depth; when the number of entries of a higher depth gets larger than the number of entries of the lowest depth, it is a good time to start overwriting the latter. The lowest depth will disappear fast enough on its own account, through replacement by deeper entries.

Your proposed bucket organization looks fine to me. I would probably go for 2 aging bits and 13 move bits; I like my move encoding such that it is trivial to recognize all moves that need special treatment (i.e. castling, e.p. capture, promotion, and double pushes that might need to create e.p. rights). Using a dedicated bit to flag those minimizes the impact on the common case of the normal moves. When a move is flagged as special, I can then figure out what type it actually is, and perform the required special action accordingly. The to-square in that case is used to identify the move, serving as an index into a table that contains the move type, the true to-square of the move, and the info needed to perform the special action: which squares have to be tested for the presence of enemy Pawns to create e.p. rights, where to find the Rook and where to put the King for castling, what piece to promote to, where to find the e.p. victim. There are 3 rank bits available, which could encode 8 possibilities; we only need to distinguish 4 promotions, e.p. capture, double push and castling.
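
A sketch of that 'special move' decoding, with hypothetical table and field names:

Code: Select all

#include <stdint.h>

typedef struct {
  uint8_t type;   // one of: 4 promotions, e.p. capture, double push, castling
  uint8_t toSqr;  // the true to-square of the move
  uint8_t auxSqr; // e.p. victim, Rook square for castling, square to test for e.p. rights, ...
} SpecialMove;

extern SpecialMove specialTab[64]; // indexed by the encoded to-square, filled at startup

void makeHashMove(unsigned move) { // 13-bit move: 1 'special' flag + 6 from-bits + 6 to-bits
  unsigned from = move >> 6 & 63, to = move & 63;
  if(move & 1<<12) {                // flagged as special
    SpecialMove s = specialTab[to]; // the 3 rank bits select among the 7 special cases
    to = s.toSqr;                   // recover the true destination
    // ... perform the special action s.type requires, using s.auxSqr
  }
  // ... normal move execution from 'from' to 'to'
}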
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: SIMD methods in TT probing and replacement

Post by D Sceviour »

It is increasingly difficult to cram everything into a long integer. How about a bigger hash entry?

The first field is an unsigned 64-bit hash key. This cannot be shortened much, mainly because it is about the minimum required to create a near-unique key for a standard 8x8 chess board, and to minimize the possibility of hash collisions.

The next field is an unsigned 16-bit number used to signify the best move and its associated flag conditions for en passant, castling or promotion. [flags: 4 bits] [from sq: 6 bits] [to sq: 6 bits]

A signed 8-bit number is used to hold the last depth searched, to a maximum of 127. Hopefully this is enough, yet occasionally my searches have extended past 100 ply. One would think the 50-move rule would cut short any need for a bigger number.

Another number is used to contain the value or score returned from the search. This is normally a 16-bit signed entry.

Let's reserve another 8 bits for the transposition type or bound, and the transposition age.

That leaves 16 bits left over to cram a lot of extra information.

The first thing that needs saving is the static evaluation. Saving this score can save a lot of time on the horizon evaluation. Theoretically, the static evaluation has a 50% chance of being equal to or better than the actual final value of the returned score. But it can also be used for a hash refutation score: if the static evaluation is greater than the current alpha score, and the hash score is greater than the current alpha score, and it is a principal variation but the depth is very shallow, then one should use a hash refutation window, because the hash move may be unsound. The static evaluation is a signed 16-bit number, and the hash entry is now saturated.

No more room.

---------------------------------------------------

Here are some extra things often needed:

A busy bit to indicate the node is already being searched by another core (1 bit), or ...
A number-of-processes count to indicate how many core processes are already searching this node (8). Perhaps best at the root, or the ply-1 search?

A futility counter. This indicates whether this position has been searched before by the previous move list, which helps determine how fast to prune the move. (8)

A full-width search flag: has the move list been searched full width yet? Do we know what is in the ensuing move list? Maybe there is a winning move, but it is unknown because excessive pruning cut the search off before testing it. (1)

An in-check flag to indicate whether the position is in check. Further, the checking status is a constant, so it can be used as an additional entry for hash verification and collision reduction. (1)

A check flag to indicate whether there is a move to check the opponent. (1)

A hash move check flag to indicate whether the hash move gives check. This could save some legality testing. (1)

The last hash move, kept after a new one is found. Forced positions are rare; evade a check or recapture a piece. Many positions have two or more nearly-as-good moves, but the current method is to overwrite the hash move if a different one is found. Instead, keep an extra spot for the old move, the same way that two killers are saved. (16 bits) One could also consider a separate small hash table just for the PV, or a smaller pv2 list as a second alternative to the main PV. Will this extra information be like rough extra growths, or will it offer a richer search experience? Depth and score difference might have a pattern. If the move list is singular or forced, then the move is insignificant. The move is not stored in the history counter-move list, unless it appears by accident.

Piece and pawn counts for white and black. These are usually needed for horizon evaluations, but also for draw calculations on every node. (16 bits)

Material key. This is a constant, so it can be used as an additional entry for hash verification and collision reduction. The number can be quickly looked up in a table where imbalances can be pre-tabulated. However, it would be used only once, and then the result would already be included in the hash score, so it does not look that important. (32 bits)

Node counts. A higher count means more probability of repeat hits and can assist in replacement schemes. (16 bits)

A learning value (used in Arasan somehow). I don't know what this does. (8 bits)

Generation (6 bits), used by Stockfish and Ethereal. I have no idea what it does, but it might be important.

We are already well over 64 bits so stop here. One thing to avoid is the saving of bitboards. There is still not enough room to satisfy all the demands for occupancy and attack boards. Which ones are the most important? That would be a problematic thing for testing.

Let's say we go to a 3-long-word hash entry. This reduces the effective number of hash entries by 1/3. A side benefit may be an improved hash verification number. The lockless method could now XOR 3 words instead of 2.
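
For reference, a sketch of how the XOR-based lockless trick (in the style of the well-known Hyatt/Crafty scheme) might extend to a 3-word entry; names are illustrative:

Code: Select all

#include <stdint.h>

typedef struct { uint64_t key, data1, data2; } TTEntry3;

// store: XOR both data words into the stored key, so a torn write is detectable
void tt_store(TTEntry3 *slot, uint64_t key, uint64_t data1, uint64_t data2) {
  slot->key   = key ^ data1 ^ data2;
  slot->data1 = data1;
  slot->data2 = data2;
}

// probe: accept the entry only if the three words are mutually consistent
int tt_probe(const TTEntry3 *slot, uint64_t key, uint64_t *data1, uint64_t *data2) {
  uint64_t d1 = slot->data1, d2 = slot->data2;
  if((slot->key ^ d1 ^ d2) != key) return 0; // torn or mismatching entry
  *data1 = d1; *data2 = d2;
  return 1;
}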
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SIMD methods in TT probing and replacement

Post by hgm »

The more you store in an entry, the fewer entries you can store. TT entries can potentially save you a lot of work, when you get a hash cutoff for a large sub-tree. Evaluation scores only save a fixed amount of work: one evaluation. So I seriously wonder whether storing the static evaluation score in the TT will pay off.

Information about material or Pawn structure would be much more efficiently stored in a material or a Pawn hash table, as very many positions in the TT will have the same material composition or Pawn structure. So putting that in the TT would waste an enormous amount of memory, by storing zillions of copies of the same thing. Don't do it!

A flag to indicate the side to move is in check can be very useful, and not very costly. (Just a single bit.) It allows you to search the hash move (which apparently is an evasion) without checking whether you are in check. Which you otherwise would need to do, in order to determine whether you should give an extension. (If you extend evasions, that is.) If you extend checking moves, or search them in some QS levels, a flag indicating whether the hash move also delivers check would also be useful. (E.g. to determine whether the hash move from a hit on a deeper entry is acceptable in QS.) Perhaps the in-check and checking flags can be combined in one, just meaning 'extend this move'. It could then also be used for singular moves.

I am not sure about a second hash move. A move gets to be hash move when it caused a cutoff. (Except in PV nodes, but those are really rare, and could use a completely different hash entry, e.g. storing the entire PV). So if you got a different hash move, it is only because a deeper search found out that the hash move failed low after all. I would say such a move is the least likely to be any good; better try all moves you did not yet search first. Of those you know nothing, and they could be good. The old hash move is already proven to be bad, and hoping it will miraculously recover on deeper search is mostly wishful thinking.

What you call 'generation' could be the aging field we discussed above.

Storing all 64 bits of the hash key is wasteful, as many (nowadays nearly 32) will be used as index to pick the bucket, so their match on probing is guaranteed. Even at 32 bits the signature is the largest item in the entry, and needlessly doubling its size will certainly hurt. I would not recommend it. (Although my engine KingSlayer, which has a very primitive 2-way hash table, does this.)
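
To make the index/signature arithmetic concrete (table size picked purely for illustration):

Code: Select all

// e.g. a 4 GB table of 64-byte buckets contains 2^26 buckets, so the index already
// pins down 26 key bits; only the remaining 38 bits are useful as signature
uint64_t buckets  = (4ULL << 30) / 64; // = 1 << 26
uint64_t hashMask = buckets - 1;       // the bits selected by this mask match by construction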
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: SIMD methods in TT probing and replacement

Post by mar »

hgm wrote: Wed Feb 19, 2020 5:04 pm BTW, I don't like aging in general. It requires you to modify the bucket even on a hit that caused a hash cutoff. That dirties the cache line, so that it will have to be written back when it finally gets flushed out of the cache. This causes a lot of extra memory bandwidth. Perhaps we should only update the age if the cache line will be written anyway.
You're right, it actually didn't occur to me that you should update age in the case of a cutoff (I don't do that at all).
Does it pay off enough to offset that cost? I would think that updating the age doesn't matter at shallow depths anyway. Perhaps a depth threshold?
hgm wrote: As an alternative I often use 'equi-distributed draft' replacement. This keeps account of the number of entries of each depth that are in the table: on replacement I decrement the counter for the overwritten depth, and increment the one for the stored depth. Whether an entry will be considered stale then depends on whether there are already too many items of that depth in the table. This means that entries are flushed from the table at the rate that new entries of the same depth arrive, i.e. the larger the depth, the longer they will survive. Then there need not be any aging bits in the table. With depth-preferred replacement the depth histogram slowly shifts to higher depth; when the number of entries of a higher depth gets larger than the number of entries of the lowest depth, it is a good time to start overwriting the latter. The lowest depth will disappear fast enough on its own account, through replacement by deeper entries.
A very interesting idea. I imagine the counters could be kept thread-local to avoid false sharing. And it would mean those extra bits could be used for other purposes (maybe to extend the key).
I could certainly test equi-distributed draft versus aging once I have some time. Could you give me some pointers as to how to implement it properly? I mean the replacement scoring.
hgm wrote: I like my move encoding such that it is trivial to recognize all moves that need special treatment (i.e. castling, e.p. capture, promotion, and double pushes that might need to create e.p. rights). Using a dedicated bit to flag those minimizes the impact on the common case of the normal moves. When a move is flagged as special, I can then figure out what type it actually is, and perform the required special action accordingly. The to-square in that case is used to identify the move, serving as an index into a table that contains the move type, the true to-square of the move, and the info needed to perform the special action. There are 3 rank bits available, which could encode 8 possibilities; we only need to distinguish 4 promotions, e.p. capture, double push and castling.
I see, so you simply use an extra bit to mark a move as "special", meaning the to-square is interpreted differently.
So you then use that for a "palette" lookup where additional flags/reconstruction info is stored.

There's still plenty of room for bit shuffling, so I don't think it really matters that much whether a move is 12 or 13 bits (especially if equi-distributed draft would save plenty by removing age completely).
Martin Sedlak
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SIMD methods in TT probing and replacement

Post by bob »

hgm wrote: Wed Feb 19, 2020 5:04 pm [...] BTW, I don't like aging in general. It requires you to modify the bucket even on a hit that caused a hash cutoff. That dirties the cache line, so that it will have to be written back when it finally gets flushed out of the cache. This causes a lot of extra memory bandwidth. Perhaps we should only update the age if the cache line will be written anyway.
A note here: you don't have to update the aging info very often. If the current entry was stored during this search, so that the stored age matches the current age, there is no write. That actually covers the major fraction of lookups.

I just added a counter for the number of times I had to update the age on any sort of hash probe. In a 100M+ node search, made after playing the previous move after a 500M node search, Crafty only had to update the age when the probe age did not match the entry age; that happened 170K times. Roughly 0.01% of the probes, so not THAT terrible.
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SIMD methods in TT probing and replacement

Post by hgm »

Equi-distributed draft with 3 entries (one always-replace, two depth-preferred) would look like this:

Code: Select all

extern int histo[];   // histo[d] = number of TT entries of depth d currently in the table
int n = -1;           // slot to write; gets its value on a probe hit

if(n < 0) { // miss: pick a replacement victim
  int d1 = bucket->slot[1].depth;
  int d2 = bucket->slot[2].depth;
  if(histo[d1] > histo[1]) d1 -= 1000; // make over-represented entries vulnerable
  if(histo[d2] > histo[1]) d2 -= 1000; // (histo[1] = count at the lowest depth)
  n = 1 + (d1 > d2); // shallowest depth-preferred entry
  d1 = bucket->slot[n].depth;
  if(depth >= d1) { // new entry qualifies for a depth-preferred slot
    histo[depth]++; // update depth histogram
    histo[d1]--;
  } else n = 0; // otherwise it goes into the always-replace slot
}

// hash store
bucket->slot[n].depth     = depth;
bucket->slot[n].score     = bestScore;
bucket->slot[n].signature = hashKey >> 32;
  ...
With more depth-preferred entries per bucket it would obviously be more work to figure out which has the lowest depth. I came to the conclusion that it doesn't help to try to rescue an overwritten non-stale depth-preferred entry by moving it to the always-replace slot; it would be so short-lived there that you might as well discard it immediately.

Once the TT gets overloaded (i.e. tree > hash size), the always-replace slot will mostly contain d=0 entries, and whatever gets in there will not survive to the next iteration. It is just impossible to save all the nodes of too large a tree, and the QS nodes will be the first victims. They are still very useful for catching transpositions in sub-trees, but they will not see their depth growing in subsequent iterations, as they would no longer be there. Their deeper re-searches would just start without a hash hit, and once they are deep enough, they would automatically be put in the depth-preferred part. It thus also seemed pointless to check whether the always-replace entry should be promoted to a depth-preferred slot when its depth is increased after a hash hit on it. So replacement is only considered after a miss. One could check for a high depth in the always-replace slot, and then move it elsewhere before copying the new entry in there, but I am afraid that would just never happen. The test for it is not that expensive, though.
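
For what it's worth, one possible reading of the 'start overwriting the lowest depth' rule as code; the trigger and the threshold bookkeeping are my own guesses, as the thread does not spell them out:

Code: Select all

int histo[64];        // histo[d] = number of entries of depth d currently in the table
int lowestDepth = 1;  // depths at or below this are treated as expendable

// called between iterations, say: advance the threshold once some higher
// depth outnumbers the entries at the current lowest depth
void updateLowestDepth(void) {
  for(int d = lowestDepth + 1; d < 64; d++)
    if(histo[d] > histo[lowestDepth]) { lowestDepth++; break; }
}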
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SIMD methods in TT probing and replacement

Post by bob »

one point about Crafty:

many years ago I decided to not probe/store hash entries in the q-search. It made a pretty big difference in some positions, where you otherwise overwrite too much. Leaving the q-search out really shrinks the amount of stuff you need to store, at a relatively small expense. I ran a lot of tests, and I could measure no difference between q-search and no-q-search hashing. The decision became obvious, because over-subscribed hashing has always been a problem.

Ending up with a hash table full of depth=0 entries was not what I wanted...