hash collisions

bob · Post by **bob** » Fri Jan 31, 2020 6:00 am

jstanback wrote: ↑Thu Jan 30, 2020 4:51 pm Hi Bob, As I mentioned in my previous post, I don't think storing a 64 bit hash signature is enough even now on hardware like at TCEC. Wasp used "only" 16G of hash memory = 1G positions, so 30 bits is used up in the address. Since the table is always full of entries, that leaves 34 bits of useful verification. So with 4 probes I think the collision rate is 4 in 2^34 or 1 in 4 billion. At 100M nps if the HT is probed at every node (not quite the case for Wasp) there should be around 360 collisions per hour of search. So I think the move retrieved from the HT cannot be played without enough verification to ensure it won't cause a crash. I added a full move legality test for each successful hash fetch and surprisingly the slowdown is only about 1%.

The code is below. I could eliminate the test for leaving king in check since that is done in the moveloop as part of my normal move legality test for pseudo-legal moves from the move generator.

John

Code: Select all

int HashMoveIsLegal(POS* pos, int side, MOVE mv)
{
  SQUARE f = moveFROM(mv);
  SQUARE t = moveTO(mv);
  PIECE  fpc = PieceType(Piece(pos,f));
  PIECE  tpc = PieceType(Piece(pos,t));
  BITBOARD bbf = SQ2BB(f);
  BITBOARD bbt = SQ2BB(t);
  BITBOARD occ = Occupied(pos);
  
  // check that the color of the moving piece is correct and that
  // there is not a piece of the same color on the to_square
  if (!(pos->PieceBB[side][0] & bbf) || (pos->PieceBB[side][0] & bbt)) return(0);

  // if capture, test that the bit for the captured piece is set
  if (tpc && !(pos->PieceBB[!side][tpc] & bbt)) return(0);

  // if promotion, test that this is pawn moving to 8th rank
  if ((mv & promoteAny) && (fpc != pawn || Rank(side,t) != RANK8)) return(0);

  int cpc = tpc; // captured piece
  int csq = t;

  switch (fpc)
  {
    case pawn:
      if (BB_PawnRays[side][f] & bbt)
        {
          if (t == pos->epsquare)
            {
              cpc = pawn;
              csq = t-pawn_dir[side];
              if (tpc != 0) return(0);
              if (!(pos->PieceBB[!side][pawn] & SQ2BB(csq))) return(0);
            }
          else if (!tpc) return(0);
        }
      else if (t == f+pawn_dir[side])
        {
          if (tpc) return(0);
        }
      else if (t == f+2*pawn_dir[side] && Rank(side,f) == RANK2)
        {
          if (tpc || Piece(pos,f+pawn_dir[side])) return(0);
        }
      else return(0);
      break;
    case knight:
      if (!(BB_KnightRays[f] & bbt)) return(0);
      break;
    case bishop:
      if (!(BB_BishopRays[f] & bbt) || (BB_Between[f][t] & occ)) return(0);
      break;
    case rook:
      if (!(BB_RookRays[f] & bbt) || (BB_Between[f][t] & occ)) return(0);
      break;
    case queen:
      if (!((BB_BishopRays[f] | BB_RookRays[f]) & bbt) ||
          (BB_Between[f][t] & occ)) return(0);
      break;
    case king:
      if (!(BB_KingRays[f] & bbt))
        {
          if (t == f+2 || t == f-2)
            {
              if (Rank(side,f) != RANK1) return(0);
              SQUARE s1,s2,s3;
              if (t > f) {s1=f+1; s2=f+2; s3=f+3;}
              else       {s1=f-1; s2=f-2; s3=f-4;}
              if (Piece(pos,s1) || Piece(pos,s2)) return(0);
              if (Piece(pos,s3) != Piece(pos,f)-2) return(0);
              if (SqAtakd(pos,f,!side)) return(0);
              if (SqAtakd(pos,t,!side)) return(0);
              if (SqAtakd(pos,s1,!side)) return(0);
            }
          else return(0);
        }
  }

  // Update bitboards to remove captured piece
  if (cpc)
    {
      pos->PieceBB[!side][cpc] ^= SQ2BB(csq);
      pos->PieceBB[!side][0]   ^= SQ2BB(csq);
    }

  // Update fsq & tsq bits of occupied piece bitboard
  BITBOARD bb = (SQ2BB(f) | SQ2BB(t));
  pos->PieceBB[side][0] ^= bb;

  // test whether own king is attacked
  SQUARE ksq = pos->kingsq[side];
  if (ksq == f) ksq = t;
  int legal = !SqAtakd(pos,ksq,!side);

  // restore bitboards
  pos->PieceBB[side][0] ^= bb;
  if (cpc)
    {
      pos->PieceBB[!side][cpc] ^= SQ2BB(csq);
      pos->PieceBB[!side][0]   ^= SQ2BB(csq);
    }

  return(legal);
}

Several years ago Anthony Cozzie and I wrote a paper for the ICGA journal addressing this issue. My original question was "how many hash collisions per search is enough to cause a different (worse) move to be played?" I set up the experiment with Crafty and discovered that our search trees are so large, and so redundant, that we could tolerate far more collisions than I thought possible. I contacted Cozzie and explained how I had tested. He modified Zappa to try the same test and got similar numbers to mine. The bottom line was that one collision out of every 1M positions caused zero problems, which surprised me. In fact, one error every 10K nodes had very little effect, not that I would suggest using such a small hash signature to produce this error rate.

Bottom line, a 32 bit key works more than good enough (I still use 64 bits myself, as one day the search speeds will reach the point where bigger is better since errors to accumulate at ultra-high NPS values. Worrying about it is wasted effort. The only critical aspect is to be certain that a hash collision will not crash your program. IE with Crafty, which uses a Make()/Unmake() design, trying a castling move when it is illegal would wreck things since unmaking the move could leave you with two kings, etc. If you can make certain a collision won't crash your program, you can feel secure in that the hash collisions are NOT going to cause any measurable effect for reasonable hash signature sizes.

bob · Post by **bob** » Fri Jan 31, 2020 6:04 am

D Sceviour wrote: ↑Tue Jan 28, 2020 9:54 pm
Daniel Shawul wrote: ↑Tue Jan 28, 2020 5:38 pm I think lockless hashing was pretty safe at modest number of threads most people have removed the need for legality checking of the hash move. Now it may be necessary to bring that back in.
Hash verification might be worth trying. In a recent 4 CPU gauntlet game with Deep Shredder-Schooner, black was showing a black to mate score and then played Kh7????

[d]5r1k/rb3P2/1n2B1pP/3pP1B1/1pnP4/1Pp5/2P5/1K1R3R b - - 4 36

It would be nice to know how this happened. It is not reproducible from 1 CPU to 8 CPU. I will try some hash position and move verification when using multiple threads to find if this makes a difference.

If I were betting, I would bet on a race-condition type bug rather than a hash collision. These kinds of issues will continue to plague parallel search engines for years, until they are slowly debugged. I personally still believe that the term "debugged parallel search engine" is an oxymoron. I just found such a bug a few weeks back when not really looking at the problem specifically, just encountered an anomaly that caught my eye and I decided to look more closely.

xr_a_y · Post by **xr_a_y** » Fri Jan 31, 2020 7:43 am

Based in Minic experience at TCEC I can tell on this subject that 2 debugging tricks are worth implementing in any engine :

- what ever is your hash size and hash check in TT probing (32, 64, 64^data, ...), instead of returning the data, just try to randomize returned data. This will put a lot of pressure on your "validatePseudoLegalMove" putting forward things that would not happen with only hash collision but also with "corrupted data". Combining this with many assert in "apply" and verifications that bitboard==material==board, this will show some errors quite fast. It can be used during real games or analyzing deeply many positions.

- the second trick is simply to change your perft routine so that instead of looping through generated moves, you generate ALL possible moves based on your move encoding. Be sure to really loop through everything. For instance if all your move encoding fits in one int, you loop from -int_max to int_max. And you only use move that succeed being validated by "validatePseudoLegalMove". You shall of course get the good perft number. This will also verify your validation function is bullet proof.

Those are a 2 or 3 days work of debugging and really really worth it.

jdart · Post by **jdart** » Fri Jan 31, 2020 1:13 pm

I have put in several fixes, see:

https://github.com/jdart1/arasan-chess/commits/master

These were tested by making the hash code equality test drop down to 4 bits (this is within a bucket), so you are going to get a lot of collisions.

There is a validMove function that does a sanity check on moves w/o a full legality check. This has been beefed up to cover more cases.

After a move is made, there is a check for legality of the previous move. This now covers the case I mentioned when a move from hash should be but wasn't a check evasion move. This caused a small rating regression.

This set of fixes appears to have removed the issues - I have a lot of asserts I can turn on and now can run with the artificially high collision rate and full debugging on, and there is no crash or assertion hit. But any kind of issue from moves the hash was already very rare. Arasan competed in the previous TCEC and had zero crashes on their big hardware at long time controls, even without any of these fixes. It was also playing very long time control games on ICC for several months on 24 cores, again with no issues.

--Jon

Daniel Shawul · Post by **Daniel Shawul** » Fri Jan 31, 2020 3:16 pm

This set of fixes appears to have removed the issues - I have a lot of asserts I can turn on and now can run with the artificially high collision rate and full debugging on, and there is no crash or assertion hit. But any kind of issue from moves the hash was already very rare. Arasan competed in the previous TCEC and had zero crashes on their big hardware at long time controls, even without any of these fixes. It was also playing very long time control games on ICC for several months on 24 cores, again with no issues.

You would need good legality testing if you play killer moves without generating non-captures, so I wonder if that does not stress-test your subroutine as corrupted hash moves do. I had left out legality testing thinking 64 bits are safe with lockless hash, but as Jstanback pointed out there are 6 collisions/min on TCEC machines which is pretty unsafe. Scorpio has not crashed on the 88 cores machine because it does not use CPU now, but it also didn't crash with the old 44 cores before without doing any sort of legality testing, so I believe you. Since I have to do legality testing now anyway, the question is what good is the lockless method now?

jdart · Post by **jdart** » Fri Jan 31, 2020 5:03 pm

There are two different things:

1. The possibility that distinct positions produce the same hash.
2. The possibility that two threads compete, one writing the hash entry while another is reading the same entry.

The lockless hash method prevents 2. I don't think it prevents 1 or resolves the possible issues from it, in particular fetching an illegal move.

---Jon

Daniel Shawul · Post by **Daniel Shawul** » Fri Jan 31, 2020 5:39 pm

jdart wrote: ↑Fri Jan 31, 2020 5:03 pm There are two different things:

1. The possibility that distinct positions produce the same hash.
2. The possibility that two threads compete, one writing the hash entry while another is reading the same entry.

The lockless hash method prevents 2. I don't think it prevents 1 or resolves the possible issues from it, in particular fetching an illegal move.

---Jon

Yes, but if you do legality testing, you don't need to solve (2). Many engines do not use lockless hashing including stockfish I think.
Infact, I used to do legality testing only and removed it after I added lockless hashing and it was pretty safe until now.
So lockless hashing looks redundant now. Or one can move to 128 bit hashkeys, as used in perfts, and still use the hash move without legality testing.

chrisw · Post by **chrisw** » Sat Feb 01, 2020 12:04 am

jdart wrote: ↑Fri Jan 31, 2020 5:03 pm There are two different things:

1. The possibility that distinct positions produce the same hash.

Two bad effects. A) Bad continuation move which either gets safely junked or passes with low grade search guidance. B) Bad eval - which is at least worth checking for being within possible eval bounds (some programs might crash on out of bounds eval), but it can still put a wrong eval to a subtree. Not sure how curable that is, probably one just has to live with the erroring and hope.

2. The possibility that two threads compete, one writing the hash entry while another is reading the same entry.

I would guess a double read with cross validation would fix this. Unless two other threads just happened to write to the same memory during the double read.

The lockless hash method prevents 2. I don't think it prevents 1 or resolves the possible issues from it, in particular fetching an illegal move.

---Jon

koedem · Post by **koedem** » Sat Feb 01, 2020 3:08 am

This made me think, what if you use a collision free board hashing + lockless hashing to possibly (?) get rid of these problems.
Here's some thoughts on that http://talkchess.com/forum3/viewtopic.php?f=7&t=72964
I don't think this gains Elo but it might at least be interesting to think about and/or try.

bob · Post by **bob** » Sat Feb 01, 2020 4:56 am

jdart wrote: ↑Fri Jan 31, 2020 5:03 pm There are two different things:

1. The possibility that distinct positions produce the same hash.
2. The possibility that two threads compete, one writing the hash entry while another is reading the same entry.

The lockless hash method prevents 2. I don't think it prevents 1 or resolves the possible issues from it, in particular fetching an illegal move.

---Jon

No, it doesn't prevent the technical collision (positions p1 and p2 produce the same hash signature). But since the early days of Cray Blitz, we always checked the hash move to be "legal enough" to not corrupt anything. We need the same protection since we also try killer moves before generating non-captures and we need to be able to confirm that they are also "legal enough" to not wreck anything. IE I could not tolerate having two (or more) kings for one side, as that would corrupt check detection and such. So I carefully looked at my code to see what was survivable and what was not. And I wrote a fairly minimalistic legal move tester. Crafty/Cray Blitz has always been able to dismiss most of the normal illegal moves (leaving the king in check, capturing a king, etc) so I only worried about the things that were problematic. One good one is en passant capturing since that removes a pawn from an occupied square and then later replaces it when unmaking the move. If that square is empty, or contains anything other than an enemy pawn, it would change the board and break the search later.

So the non-survivable moves are screened and away I go. I can say this. A 32 bit hash signature will give you more than enough protection from game-altering positional collisions. The paper Cozzie and I wrote clearly show that. Using 64 bits is overkill, but at little penalty so I did not change back to 32 bits myself.

I have noticed that many times, over the years, someone would claim a hash collision caused a bogus move. I am a firm believer that hash collision is simply a euphemism for "unexplained bug of unknown origin." I don't know that I have ever seen a real game issue that tracked back to a hash key collision.

hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions