Hash collision?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

silentshark
Posts: 327
Joined: Sat Mar 27, 2010 7:15 pm

Hash collision?

Post by silentshark »

Looking at the recent debug logs of a game, I have a bug which I can't reproduce. All too common!

The bug manifests itself in some weird scores on certain moves. It may well be something else, but I'm wondering whether I've got a hashing problem.

I use a 64 bit int for the main hash of the position, and then calculate the index by AND'ing the hash with the size of the hash table minus one (the size being a power of two). I then store the hash key at that index, with various replacement strategies. This has been in place for 20-odd years.

One thing I've been experimenting with recently is storing fewer bits of the key. I noticed that some of the top engines store just a 16 bit representation of the hash position, e.g. obtained via the >> operator. This seemed a bit low, but I thought I'd try it. I got a speedup and saved some memory. Good news.

The 16 bit representation worries me, though, and I am wondering if this particular bug could really, truly, be a hash collision which has come to bite me. What do folks think? Am I right in worrying here, or is this likely another type of bug?
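For illustration, here is a minimal sketch of the scheme being discussed (power-of-two table, low bits of the key as index, top 16 bits stored for verification). The names and exact layout are my own choices, not anyone's actual engine code.

Code: Select all

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only: index with the low bits of the 64-bit Zobrist
   key (power-of-two size, so the mask is size - 1) and store just the top
   16 bits of the key for verification, as described above. */
enum { TT_SIZE = 1 << 16 };           /* any power of two */

typedef struct {
    uint16_t key16;                   /* top 16 bits of the full hash */
    int16_t  score;
    uint16_t move;
    uint8_t  depth;
    uint8_t  flags;
} TTEntry;                            /* 8 bytes */

static TTEntry table[TT_SIZE];

static void tt_store(uint64_t hash, int16_t score, uint16_t move,
                     uint8_t depth, uint8_t flags)
{
    TTEntry *e = &table[hash & (TT_SIZE - 1)];   /* index = low bits */
    e->key16 = (uint16_t)(hash >> 48);           /* the ">>" trick */
    e->score = score;
    e->move  = move;
    e->depth = depth;
    e->flags = flags;
}

static const TTEntry *tt_probe(uint64_t hash)
{
    const TTEntry *e = &table[hash & (TT_SIZE - 1)];
    return e->key16 == (uint16_t)(hash >> 48) ? e : NULL;
}
```

Note that with 16 index bits plus 16 stored bits only 32 bits of the key are effectively verified, so occasional false matches are expected by construction; the question in this thread is whether they are frequent enough to produce visible score corruption.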
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Hash collision?

Post by Joost Buijs »

Really? I'm not an expert by any means, but a 16 bit key will give you many collisions each second, so it is very likely that this is the problem.
Maybe 16 bits is somewhat faster (which I doubt), and it will save some memory, but it is better to be safe than sorry.
I use a 64 bit key and a 64 bit data field; in my case 4 entries fit exactly in one 64 byte cache line, so using a smaller key won't help me at all.

Nowadays storage is not a problem at all: for blitz games I never use more than 1 GB of hash, and for games with longer time controls never more than 8 GB. My PC has 64 GB of RAM, but with larger hash tables it slows down too much if I don't use large or huge page memory.
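The layout described here can be sketched like this (field names are mine; assuming C11 for alignas):

Code: Select all

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Sketch of the layout described above: a full 64-bit key plus 64 bits
   of data per entry, four entries per cache-line-aligned 64-byte bucket. */
typedef struct {
    uint64_t key;
    uint64_t data;
} TTEntry;                       /* 16 bytes */

typedef struct {
    alignas(64) TTEntry slot[4]; /* exactly one 64-byte cache line */
} TTBucket;
```

Aligning the bucket to the cache line guarantees that a probe touches exactly one line, which is the point of the 4-entry grouping.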
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Hash collision?

Post by mar »

Joost Buijs wrote: Wed Jun 05, 2019 8:31 pm Really? I'm not an expert by any means, but a 16 bit key will give you many collisions each second, so it is very likely that this is the problem.
The idea is that you store say n bits as hash in the table and use another m bits to compute the TT index.
Unless I miscalculated, a 4MB hash table with 16 byte entries and 4-slot buckets gives you effectively a 32 bit hash this way (assuming n=m=16).
You get more effective hash precision as TT size increases. Still 16 bits seems too low to me, but I assume it's been tested.
Martin Sedlak
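Martin's arithmetic can be checked mechanically. A small helper (hypothetical names, power-of-two sizes assumed) that counts the index bits the bucket selection contributes on top of the stored bits:

Code: Select all

```c
#include <assert.h>
#include <stdint.h>

/* Back-of-the-envelope check of the "index bits count too" argument:
   with a power-of-two number of buckets, the bucket index contributes
   log2(buckets) bits of effective hash on top of the n stored bits. */
static int log2u(uint64_t x)
{
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

static int effective_hash_bits(uint64_t tableBytes, int entryBytes,
                               int slotsPerBucket, int storedBits)
{
    uint64_t buckets = tableBytes / entryBytes / slotsPerBucket;
    return storedBits + log2u(buckets);
}
```

For the numbers above: 4 MB / 16 B = 2^18 entries, i.e. 2^16 four-slot buckets, so 16 index bits + 16 stored bits = 32 effective bits, as stated.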
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Hash collision?

Post by mar »

When I think about it, you could do 8-byte TT entry while storing more than 16 bits of hash part:

Code: Select all

16 bits hash
16 bits value
16 bits move (+1 bit free for, say, the lsbit of the bound type: 12 bits to encode from/to, 2 bits to encode promotions and 1 bit for castling; this bit can be saved too if you don't care about chess960, or encode chess960 castling as king captures own rook)
8 bits depth (+1 bit free for the msbit of the bound type; depth up to 127 should do)
8 bits age (+n bits free for more hash bits; if you can get away with say 6 bits of age, then you can store 18 bits worth of hash value (19 if you use clever encoding of castling))
An 8-byte TT entry on a 64-bit architecture has the added benefit that stores/fetches are atomic (at least on x86).
EDIT: why only 2 bits for promotions: 0 = knight, 1 = bishop, 2 = rook, 3 = queen => needed only for pawn moves reaching the last rank
Martin Sedlak
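One possible packing of that 8-byte entry (the field order and where the two bound bits live are my illustrative choices, not necessarily Martin's):

Code: Select all

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packing of the layout above: 16 bits key, 16 bits value,
   15 bits move + lsbit of bound, 7 bits depth + msbit of bound, 8 bits age. */
typedef uint64_t TTWord;

static TTWord tt_pack(uint16_t key16, int16_t score, uint16_t move15,
                      uint8_t depth7, uint8_t bound2, uint8_t age)
{
    return (TTWord)key16
         | (TTWord)(uint16_t)score << 16
         | (TTWord)((move15 & 0x7FFF) | ((bound2 & 1) << 15)) << 32
         | (TTWord)((depth7 & 0x7F) | ((bound2 >> 1) << 7)) << 48
         | (TTWord)age << 56;
}

static uint16_t tt_key(TTWord w)   { return (uint16_t)w; }
static int16_t  tt_score(TTWord w) { return (int16_t)(uint16_t)(w >> 16); }
static uint16_t tt_move(TTWord w)  { return (w >> 32) & 0x7FFF; }
static uint8_t  tt_depth(TTWord w) { return (w >> 48) & 0x7F; }
static uint8_t  tt_bound(TTWord w) { return ((w >> 47) & 1) | (((w >> 55) & 1) << 1); }
static uint8_t  tt_age(TTWord w)   { return (uint8_t)(w >> 56); }
```

Keeping the whole entry in one uint64_t is what makes the atomic store/fetch property mentioned above fall out for free on x86-64.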
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Hash collision?

Post by Joost Buijs »

mar wrote: Thu Jun 06, 2019 12:04 am
Joost Buijs wrote: Wed Jun 05, 2019 8:31 pm Really? I'm not an expert by any means, but a 16 bit key will give you many collisions each second, so it is very likely that this is the problem.
The idea is that you store say n bits as hash in the table and use another m bits to compute the TT index.
Unless I miscalculated, a 4MB hash table with 16 byte entries and 4-slot buckets gives you effectively a 32 bit hash this way (assuming n=m=16).
You get more effective hash precision as TT size increases. Still 16 bits seems too low to me, but I assume it's been tested.
Yes, this is something I did back in the eighties, at a certain point I ditched it because memory was getting so cheap that it really didn't matter anymore.

Of course you also have to store your data. If you manage to fit everything in a 64 bit word (16 bit key, 48 bit data) you will get a 24 bit index when using a TT-size of 1GB, so 16+24 = 40 bit hash total, maybe enough maybe not.

In my case I need more than 48 bit data, so 64 bit entries are out of the question. I decided a long time ago to use 64 bit data with a 64 bit key, a 4-slot bucket fits nicely into a 64 byte cache-line and having native data and key fields makes the lock-free XOR trick very convenient.

Since I run on x64 CPUs only, I use CMPXCHG16B or InterlockedCompareExchange128() to store and load the entries atomically. With the XOR trick this is redundant, but it doesn't hurt either; you only have to make sure that the entries are 16 byte aligned, otherwise bad things will happen.
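The lock-free XOR trick mentioned here (popularised by Robert Hyatt for Crafty) can be sketched as follows; a simplified single-entry version, not Joost's actual code:

Code: Select all

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the lockless "XOR trick": store key ^ data instead of the raw
   key. If another thread tears a key/data pair (non-atomic update), the
   XOR check fails and the entry is treated as a miss instead of returning
   corrupt data. */
typedef struct {
    uint64_t keyXorData;
    uint64_t data;
} TTEntry;

static void tt_store(TTEntry *e, uint64_t key, uint64_t data)
{
    e->keyXorData = key ^ data;
    e->data = data;
}

static int tt_probe(const TTEntry *e, uint64_t key, uint64_t *data)
{
    if ((e->keyXorData ^ e->data) != key)
        return 0;            /* miss, or a torn (corrupted) entry */
    *data = e->data;
    return 1;
}
```

Any mismatch between the two halves, however it arose, just looks like a probe miss, which is why the 128-bit compare-exchange is belt-and-braces on top of it.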
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hash collision?

Post by hgm »

For those who want to squeeze out as many bits as possible: moves can be easily encoded in 13 bits, just adding one bit next to the from- and to-square numbers to flag all special moves. When this flag is set the meaning of the to-square field changes, and should be used as index in a table to specify all other details of the move, the move step always being one of those. This step is then added to the from-square to get the real to-square. This way you need 24 table entries for promotions (3 possible steps for each color, 4 choices), two for a Pawn double-push (of either color, special because it creates e.p. rights), four for e.p. captures (2 directions for each color) and four castlings, only 34 in total.

As most of the moves are quite rare, it doesn't matter much how long it takes to decode them; the overhead on the 'normal' moves is limited to a test-and-branch on the flags bit. The least rare are the double-pushes; to speed up decoding you could even give each of those its own code (say 32-47), and assign them such that to decode them you just have to subtract 8 (plus the flag bit) from the move code to get the normal move encoding, and treat it as such after deriving the e.p. square (as to ^ 8).

Code: Select all

if(move & 1<<12) {
  if(move & 32) { // double push
    move -= 1<<12 | 8; // clear the flag bit and undo the +8 offset
    epSqr = (move ^ 8) & 077;
  } else {
    int nr = move & 31;
    move += (move >> 6) + table[nr].step; // calculate to-square
    switch(table[nr].type) {
      case CASTLING:
        ...
      case EP:
        ...
      case PROMOTION:
        ...
    }
  }
} else epSqr = INVALID;
(Note that there exist even more compact encodings, like 8 bits per move, but these would require extensive decoding even for normal moves, and are not resistant to having very many promoted pieces and such, so I won't consider them.)

For an age field 2 bits seems sufficient (this search, previous search, older). With undercut replacement you actually wouldn't need to keep track of aging at all. But to guarantee survival of deep entries in an analysis run it is sometimes convenient to have it. E.g. reserve two entries in the bucket to keep the highest depths ever encountered in that bucket in the current search, (and only these would need age bits, perhaps only 1 each), and use the other entries for always-replace and undercut.
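For reference, one common formulation of the undercut replacement mentioned above (descriptions of the scheme vary, so treat the exact rule as an assumption): the slot is also surrendered to an entry searched to exactly one ply less than the stored depth, which keeps stale deep entries from occupying a slot forever without any age bookkeeping.

Code: Select all

```c
#include <assert.h>

/* Assumed formulation of undercut replacement: overwrite the slot when
   the incoming depth is >= the stored depth, or undercuts it by one. */
static int undercut_replace(int storedDepth, int newDepth)
{
    return newDepth >= storedDepth - 1;
}
```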
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Hash collision?

Post by mar »

That's a nice idea; this way it would be possible to store 24 bits of hash in an 8-byte TT entry. Not bad at all.
Maybe the move could be encoded in even fewer bits. I like the idea of always encoding the target delta; for queens that would be 14 positions, so it still fits in 4 bits.
For pawn promotions, that's 12 values if I'm not mistaken, plus 4 values for normal moves including a double push.
The EP delta could reuse 3 of the promotion values, based on the pawn rank of the source square.
For castling, 10 values should do (king move + castle qs or ks).
So maybe even 10 bits would be enough to encode a move, implying 27 bits of hash could be stored this way in an 8-byte TT entry.
EDIT: if performance is not an issue and one can afford to generate all legal moves, then of course 8 bits should do as a simple index into the movelist
Martin Sedlak
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Hash collision?

Post by jdart »

I store two 64-bit words in the hash table. 16 bits of the hash field are used to store some other data, so 48 bits of hash are stored. But since the low order bits of the hash code are used to select a hash "bucket", effectively 64 bits of hash are used if the hash size is > 65536 entries.

Empirically with modern search speeds even 64 bits is not enough to ensure zero collisions although collisions will be rare. 32 bits will cause frequent collisions.

--Jon
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hash collision?

Post by hgm »

To ponder on this a bit more: 8-bit encoding would work by encoding piece number and move number for that piece type. Pawns would need 12 codes each (3 moves times 4 promotion choices). For 8 Pawns that consumes 96 codes. Rook and Bishop have 14 moves, for another 56 codes, the Queen gets 28 codes, King and Knights each 8. That brings us up to 204 codes. Plus 2 for castlings. Downside is that the meaning depends on the side to move (but this is usually known), and that for Pawns it depends on their location (promotion to Queen (say) being used for normal moves of non-promoting Pawns, and to Knight for e.p. or double-push). And that it cannot handle positions with super-numerary pieces, although there is room to accommodate 2 Queens and 3 Knights. And that steps in the negative direction must wrap around the board, like it is toroidal. And that it is not resistant to the case where two identical pieces swap location, unless you interpret the piece number as the ordering on the board rather than in the piece list. For a program that has to work on a micro-controller with 512 bytes of RAM this is probably what I would do, but otherwise it would be a little extreme.

On average a Knight has 5.25 moves, a Bishop 8.75, a Rook 14. Together that makes on average 28 possible moves from a given square. So the total number of moves is 64*28. With 11 bits you could enumerate these, and have 256 codes left over for specifying promotions (22 x 4 = 88 combinations of step and promotion choice for each color), 2 x 8 = 16 double-pushes, 2 x 14 = 28 e.p. captures and 4 castlings. So you could use these 11 bits as index into a table that provides from-square, to-square, and e.p.-square (after double-push). That groups Pawn double-pushes with the normal moves, and makes the branch you need to catch the special moves very accurately predictable (as promotion, castling and e.p. capture are really rare).

Code: Select all

fromSqr = fromTable[move];
toSqr = toTable[move];
if(move < 64*28+16) { // normal moves and double-pushes
  epSqr = epTable[move]; // for normal moves the table holds the INVALID code
} else { // move that needs special board update
 epSqr = INVALID;
 if(move >= 64*28+16+28+4) { // promotion
   promoPiece = epTable[move];
   ...
 } else if(move >= 64*28+16+28) { // castling
   rookFrom = epTable[move];
   rookTo = (toSqr + fromSqr) >> 1;
   ...
 } else { // e.p. capture
   victimSqr = epTable[move];
   ...
 }
}
Disadvantage is that you need 6KB in table space, which causes cache pressure; advantage is that the common case only contains one (correctly predicted) branch.

When the board step is encoded rather than the to-square itself, one can do with 12 bits and small tables. A Queen can have 56 different move steps, the Knight has 8 other. That already requires all 64 codes. But even with contiguous 0-63 square numbering, it is easy to have the purely vertical moves wrap, so that moving up 7 ranks is the same as moving down 1: just AND the calculated to-square with 077. So there is then no need for the negative steps, freeing 8 codes for indicating specials. These could be used to indicate left / right / center promotion, left / right e.p. capture and double-push. In combination with the rank of the from-square this fully specifies the type of special move. If the white and black double-pushes get assigned different codes, they would not even have to be treated as special, when we look up the e.p. square in a 58-byte table (relative to the from-square) for all normal moves; this might be faster than an often-mispredicted branch. Promotion implies the rank, so the rank-number of the from-square can be used to indicate the promotion piece. Castling and e.p. capture also have implied ranks, but there are only 2 castling moves (left or right), while the e.p. capture is fully determined by the from-square and the e.p. square (and thus only needs a single code), so that the rank-field of the from-square is not needed, and can keep its own (redundant) value.

Code: Select all

signed char steps[64] = {
 ...
 +16, -16, +2, -2, 0, +1, -1, 0   // entry 56 and further (W & B double push, L & R castling, e.p.,  L, R & M promotion)
 };
signed char epTable[58] = {
 -1, -1, ...        // use -1 as code for no e.p. rights
 ...
 0, 0
};

toCode = move & 077;
fromSqr = move >> 6 & 077;
step = steps[toCode];
if(toCode < 58) { // common case: normal move (incl. double push)
  skipSqr = (fromSqr ^ 030) |  epTable[toCode]; // flips rank
  toSqr = victimSqr =  fromSqr + step & 077;
} else { // rare: castling, promotion or e.p. capture
  static signed char rankCorrect[] = { 0, 060, 020, 040, 040, 020, 060, 0 }; // after XOR produces 0, 070, 0, 070, ...
  static signed char promoTable[] = { n, N, b, B, r, R, q, Q };
    static signed char rookTable[] = { -1, -4, 0, 0, 1, 3, 0 }; // rookTo/rookFrom offsets for QS (step -2) and KS (step +2); zero values unused
  skipSqr = -1; // doesn't generate e.p. rights
  if(toCode == 60) { // e.p. capture
    toSqr = epSqr; // the e.p. square is presumably known
    victimSqr = toSqr ^ 010; // shift one rank
    ...
  } else {
    toSqr = fromSqr + step; // this is only the sideway step of the move (for color-independence)
    if(toCode > 60) { // promotion
      int rank = fromSqr >> 3;
      toSqr ^= rankCorrect[rank];   // 8-byte table, shifts to-square to 1st/8th rank
      promoPiece = promoTable[rank]; // 8-byte table, 4 white and 4 black choices
      fromSqr = toSqr - step; // reconstruct true rank from-square
      victimSqr = toSqr;
      ...
    } else { // castling (a sideway step is just what we needed)
      victimSqr = toSqr;
      rookTo = fromSqr + rookTable[step + 2];
      rookFrom = fromSqr + rookTable[step + 3];
      ...
    }
  }
}
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hash collision?

Post by hgm »

BTW, it occurred to me that if you are prepared to use a 24-bit signature for the benefit of having atomic access to entries (3 byte signature, 1 byte depth, 2 byte move/bound flag, 2 byte score), it might be beneficial to put only 7 such entries in a cache-line-aligned 64-byte bucket in the following way:

Code: Select all

typedef struct {
  unsigned int keyDepth; // packed depth (LSB) and 24-bit signature
  short int score;
  short int moveBounds;  // packed move and bound type
} HashEntry;

typedef struct {
  char filler;
  unsigned char fingerprint[7]; // 8-bit signature extensions for each slot
  HashEntry slot[7];
} HashBucket;
The hash probe could then start by reading the fingerprint word, and comparing the fingerprints to the relevant 8-bit part of the hash key. It doesn't have to wait for any of the other 7 words to be read from RAM to test the fingerprints for all 7 slots, and if none of the fingerprints of the slots it wants to probe matches, it already knows there will be no hit, and can continue without accessing (i.e. waiting for) the actual entry data to come in. If there is a matching fingerprint, there is only a 1-in-256 probability that it is a false match, i.e. you will almost certainly have a hit.

As the huge majority of hits will be on the always-replace entries, you can designate bucket->slot[0], the word that will be brought in second, (and perhaps bucket->slot[1] as well) as always-replace entry. Almost always you will have either concluded a miss, or have the sought data at that point; only rarely you would have to wait for any of the other 6 words to be fetched from RAM. (If you use two always-replace entries it might be an idea to bump slot[0] to slot[1] before storing in slot[0], so that slot[1] always contains the older entry, as recent entries are more likely to be hit again.)

This way you will have restored the signature length to 32 bits. (Good enough for only a 1-in-4-billion collision rate, with a completely filled table.) Access to the full signature will now not be atomic, but in a way that is harmless: if corruption occurs because of competing non-atomic accesses, the worst that can happen is that it spoils the signature, as this is the only item distributed over the two atomic parts. The actual data will always keep its integrity, but it will just not be recognized anymore, or be falsely recognized just like any other key collision can falsely recognize it. If another thread just changed the data after your probe already matched the (now obsolete) fingerprint, you are still using a 24-bit signature atomically locked to the data, and if fewer than 1 in 256 accesses are corrupted that way, this doesn't really drive up the number of false hits you already have from genuine key collisions.

Each bucket even has a byte to spare. This could be used for aging flags. E.g. a bucket could contain 1 always-replace entry, 2 undercut (or equidistributed-draft) entries, and 4 depth-preferred entries (where each probe would only consider the always-replace, 1 undercut and 2 depth-preferred entries). Only the depth-preferred entries would need aging, so you could pack a 2-bit aging flag for each of those in the filler byte. That the age bits are not atomically locked to the data would not cause much hardship. BTW, it would be best not to write the age flags if not needed; this would dirty the cache line, and require it to be written back to memory later, consuming valuable memory bandwidth. So the aging bits should only be set to the code of the current search when they don't have that value already. That makes it convenient that the always-replace and undercut slots have no aging bits, as the depth-preferred entries will almost never be accessed. (But it pays to keep them around, because when they are accessed, they save a ton of work.) As the table fills up they will sooner or later all belong to the current search, so that the overhead of dirtying the cache by aging disappears entirely.

Code: Select all

// probe
const int key24 = 0xFFFFFF00;
HashBucket *bucket = &hashTable[hashKey & hashMask];
int slotNr = (int)hashKey >> 31; // -1 or 0
int half = slotNr; // remember which half of the bucket we are going to use
int signature = hashKey >> 32;
unsigned char fp = signature & 0xFF;
if(bucket->fingerprint[0]==fp && !((bucket->slot[0].keyDepth ^ signature) & key24) && !(slotNr = 0) || // side effect!
   bucket->fingerprint[slotNr+=2]==fp && !((bucket->slot[slotNr].keyDepth ^ signature) & key24) ||
   bucket->fingerprint[slotNr+=2]==fp && !((bucket->slot[slotNr].keyDepth ^ signature) & key24) ||
   bucket->fingerprint[slotNr+=2]==fp && !((bucket->slot[slotNr].keyDepth ^ signature) & key24)   ) {
    // hash hit
    if(slotNr > 2) {
      int ageField = 3 << (2*slotNr - 6);
      int age = bucket->filler;
      int difference = (searchNr ^ age) & ageField;
      if(difference) bucket->filler = age ^ difference; // update age only if needed
    }
    hashDepth = bucket->slot[slotNr].keyDepth & 0xFF; // unpack
    ...
}