TT: Overhead of incrementally updating Zobrist key

Discussion of chess software programming and technical issues.

Moderator: Ras

lojic
Posts: 71
Joined: Thu Jan 28, 2021 9:50 pm
Full name: Brian Adkins

TT: Overhead of incrementally updating Zobrist key

Post by lojic »

I'm a little surprised at the overhead I'm measuring for incrementally updating the Zobrist key. Before implementing the TT, I figured I'd put the code in place to incrementally update the key and test that it's working properly. I've run through a large perft test suite and verified that incrementally updating the key through all the make/unmake moves results in the same key as the one computed from scratch.

It appears that simply adding the code to incrementally update the key has increased my perft duration by roughly 10%. For comparison, legality checking (the normal case) adds 67% to the duration compared to skipping it.

I used the following FEN to start with, since it seems more typical - both sides have already castled, and some pawns have moved:
2kr3r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/2KR3R w - - 0 1

I tested each of the Zobrist updates separately (many times), here is the overhead for each one for that FEN above:
Castling rights = 1.2%
EP square = 1.2%
Piece movement = 6.6%
Side to move = 1.8%
All active = 10.5%

I'm sure the benefits will greatly outweigh that overhead once I get the complete TT in place, but had I guessed ahead of time, I think I'd have expected something closer to 3% to 4%.
User avatar
hgm
Posts: 28426
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: TT: Overhead of incrementally updating Zobrist key

Post by hgm »

The percentages don't mean much if we don't know how fast your perft is. And it depends a lot on whether you make the last ply or not. I suppose you update the key in MakeMove. And if you are making the last ply, your perft is basically timing the speed of your MakeMove(), as that is what it will be doing virtually the entire time. Now MakeMove() doesn't have to do all that much. So updating the Zobrist key can easily become a large percentage.

If you tell us how many seconds the update adds to perft(6), then, assuming it was done some 120M times there, we can work out how many nsec a single update took, and whether that looks reasonable. It can of course also depend on how exactly you do the update.
lojic
Posts: 71
Joined: Thu Jan 28, 2021 9:50 pm
Full name: Brian Adkins

Re: TT: Overhead of incrementally updating Zobrist key

Post by lojic »

hgm wrote: Sun Feb 14, 2021 10:06 pm The percentages don't mean much if we don't know how fast your perft is. And it depends a lot on whether you make the last ply or not. I suppose you update the key in MakeMove. And if you are making the last ply, your perft is basically timing the speed of your MakeMove(), as that is what it will be doing virtually the entire time. Now MakeMove() doesn't have to do all that much. So updating the Zobrist key can easily become a large percentage.

If you tell us how many seconds the update adds to perft(6), then, assuming it was done some 120M times there, we can work out how many nsec a single update took, and whether that looks reasonable. It can of course also depend on how exactly you do the update.
I make the last ply so I can check the legality of the move. I use perft both for testing and for timing the generate/make/unmake parts of the engine.

I just ran a profile, and the perft spends most of the time as follows:
is-legal? 51.4%
unmake-move 13.8%
make-move 13.5%
generate-moves 10.9%
other 10.4%

so roughly 27% of the time is make/unmake move. I was testing that FEN w/ a depth of 4 to allow more measurements, but the overhead seems to be independent of depth. Computing perft for that FEN w/ a depth of 5 takes 40.5 seconds for 146,963,716 total nodes, or about 3.6M nodes/sec.

Code: Select all

cpu time: 39725 real time: 40529 gc time: 3
perft(5) results:
  Totals:     146963716
  Captures:   28490460
  EP Cap:     58509
  Castles:    0
  Promotions: 24660
AndrewGrant
Posts: 1963
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: TT: Overhead of incrementally updating Zobrist key

Post by AndrewGrant »

lojic wrote: Sun Feb 14, 2021 7:57 pm I'm a little surprised at the overhead I'm measuring for incrementally updating the Zobrist key. Before implementing the TT, I figured I'd put the code in place to incrementally update the key and test that it's working properly. I've run through a large perft test suite and verified that incrementally updating the key through all the make/unmake moves results in the same key as the one computed from scratch.

It appears that simply adding the code to incrementally update the key has increased my perft duration by roughly 10%. For comparison, legality checking (the normal case) adds 67% to the duration compared to skipping it.

I used the following FEN to start with, since it seems more typical - both sides have already castled, and some pawns have moved:
2kr3r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/2KR3R w - - 0 1

I tested each of the Zobrist updates separately (many times), here is the overhead for each one for that FEN above:
Castling rights = 1.2%
EP square = 1.2%
Piece movement = 6.6%
Side to move = 1.8%
All active = 10.5%

I'm sure the benefits will greatly outweigh that overhead once I get the complete TT in place, but had I guessed ahead of time, I think I'd have expected something closer to 3% to 4%.
Any code snippets of how you are updating the Zobrist? 10% seems high even if we assume you have an insanely efficient implementation for everything else. Maybe some egregious cache issues?
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: TT: Overhead of incrementally updating Zobrist key

Post by Sven »

lojic wrote: Sun Feb 14, 2021 10:58 pm
hgm wrote: Sun Feb 14, 2021 10:06 pm The percentages don't mean much if we don't know how fast your perft is. And it depends a lot on whether you make the last ply or not. I suppose you update the key in MakeMove. And if you are making the last ply, your perft is basically timing the speed of your MakeMove(), as that is what it will be doing virtually the entire time. Now MakeMove() doesn't have to do all that much. So updating the Zobrist key can easily become a large percentage.

If you tell us how many seconds the update adds to perft(6), then, assuming it was done some 120M times there, we can work out how many nsec a single update took, and whether that looks reasonable. It can of course also depend on how exactly you do the update.
I make the last ply so I can check the legality of the move. I use perft both for testing and for timing the generate/make/unmake parts of the engine.

I just ran a profile, and the perft spends most of the time as follows:
is-legal? 51.4%
unmake-move 13.8%
make-move 13.5%
generate-moves 10.9%
other 10.4%

so roughly 27% of the time is make/unmake move. I was testing that FEN w/ a depth of 4 to allow more measurements, but the overhead seems to be independent of depth. Computing perft for that FEN w/ a depth of 5 takes 40.5 seconds for 146,963,716 total nodes, or about 3.6M nodes/sec.

Code: Select all

cpu time: 39725 real time: 40529 gc time: 3
perft(5) results:
  Totals:     146963716
  Captures:   28490460
  EP Cap:     58509
  Castles:    0
  Promotions: 24660
OT remark:
Your perft as well as your search will run an order of magnitude faster if you do not check every single move for legality, but just those few moves that can ever be illegal: king moves, replies to check, e.p. captures, and moves of a pinned piece. This only requires the small effort of implementing a pin-detection routine that gets called once per node. Once that is done, the legality check will no longer dominate your runtime profile. Furthermore, it is only a very small step from there to adding bulk counting to perft(), so that you save a lot of time waiting for your perft test suite to complete ...
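The bulk-counting idea can be sketched as follows. This is a self-contained C++ toy, not Jumbo's actual code: the "move generator" is a stand-in that always returns a fixed number of moves, just to make the structural difference testable. Bulk counting assumes the generator emits only legal moves.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-ins for a real engine's move handling (hypothetical names):
// every position simply has BRANCH legal moves.
constexpr int BRANCH = 5;
int GenerateMoves(std::vector<int>& list) { list.assign(BRANCH, 0); return BRANCH; }
void MakeMove(int) {}
void UnmakeMove(int) {}

// Without bulk counting: make/unmake every move, even at the last ply.
uint64_t PerftNaive(int depth) {
    if (depth == 0) return 1;
    std::vector<int> moves;
    int n = GenerateMoves(moves);
    uint64_t nodes = 0;
    for (int i = 0; i < n; ++i) {
        MakeMove(moves[i]);
        nodes += PerftNaive(depth - 1);
        UnmakeMove(moves[i]);
    }
    return nodes;
}

// With bulk counting: once the generator emits only legal moves, the move
// count at depth 1 already IS the leaf count, so the entire last ply of
// make/unmake (and any key updates in it) disappears.
uint64_t PerftBulk(int depth) {
    std::vector<int> moves;
    int n = GenerateMoves(moves);
    if (depth == 1) return n;   // bulk count: no make/unmake at the last ply
    uint64_t nodes = 0;
    for (int i = 0; i < n; ++i) {
        MakeMove(moves[i]);
        nodes += PerftBulk(depth - 1);
        UnmakeMove(moves[i]);
    }
    return nodes;
}
```

Both functions return the same node counts; the bulk version just never descends into the leaves, which is exactly where a last-ply perft spends almost all of its time.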

Now regarding the hash-key update in MakeMove(): in Jumbo I avoid updating things that are not needed for perft. For that purpose I use C++ templates, so the compiler in fact produces two different MakeMove() functions: a complete version for search and a restricted version for perft.
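A minimal sketch of that template trick (hypothetical Board/MakeMove shapes, not Jumbo's real code; `if constexpr` requires C++17):

```cpp
#include <cstdint>

struct Board {
    uint64_t hashKey = 0;
    int squares[64] = {};
};

// PERFT is a compile-time parameter, so the compiler instantiates two
// separate MakeMove() functions: the perft version has the hash-key
// update compiled out entirely, with no runtime branch.
template <bool PERFT>
void MakeMove(Board& b, int from, int to, uint64_t keyDelta) {
    b.squares[to] = b.squares[from];
    b.squares[from] = 0;
    if constexpr (!PERFT)          // dead code in the perft instantiation
        b.hashKey ^= keyDelta;
}
```

Search would call MakeMove<false>(...), perft MakeMove<true>(...); both share one source text, so the two versions cannot drift apart.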
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
User avatar
hgm
Posts: 28426
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: TT: Overhead of incrementally updating Zobrist key

Post by hgm »

So roughly 150M moves take 40 sec, which makes 266 ns per move. If 10% of that was due to the key update, the update apparently takes 27 ns. Assuming your machine runs at around 3 GHz, that is 80 clock cycles, in which a modern CPU should be able to execute 320 instructions.

So yes, that is excessively long, for something as simple as a key update.

My key update typically looks like

hashKey = oldHashKey ^ Zobrist[piece][fromSqr] ^ Zobrist[promotedPiece][toSqr] ^ Zobrist[victim][captureSqr];

and there is nothing in UnMake (just a one-time copy, oldHashKey = hashKey;, when I enter the node). Zobrist[] is then an array of pointers to board-size tables of 64-bit integers, which logically is part of the piece list. So even assuming none of the variables was already in a register (which seems unlikely during MakeMove(), as we just used them for updating the board), this boils down to

6 memory reads of the piece and the square numbers (bytes or 32-bit ints local variables, so indexed by the frame pointer, which is kept in a register)
3 memory reads of 64-bit pointers from the global Zobrist[] tables, indexed by the piece numbers
3 memory reads of 64-bit keys, indexed by the just fetched pointers and the square numbers
1 memory read of the 64-bit oldHashKey
4 XOR instructions
1 memory store

The load/store unit can typically do 1 64-bit read per clock, or two 32-bit reads, so it seems you have 10 clock cycles worth of memory reading. And 1 memory store. So it should be doable in 11 clocks. Which is a lot shorter than 80. The XOR instructions can be executed in parallel (as well as a lot of other stuff for the MakeMove, as you have room for 33 ALU instructions next to the load/stores in these 11 clocks, but the whole MakeMove code could be dominated by load/stores).

BTW, it always annoyed me to do the key update in MakeMove(), because one thing I need the key for is to detect repetitions. And when I detect a repetition only after I have made the move, I must immediately UnMake() it. I'd rather not make the move at all in that case. It seems hard to separate the key update from MakeMove() in an efficient way, though, as both might involve move decoding to get the square and piece numbers from a move that has packed them into an integer. So I guess the best way is to make MakeMove() abortable: let it start with decoding the move and updating the key, then check for repetition, and immediately return with an error code before it does anything else (including storing the newly calculated key). Then you can write something like if(MakeMove()) RecursiveSearch(); else DetermineDrawScore();.
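Such an abortable MakeMove() might be sketched like this. It is a self-contained toy with hypothetical names: a "move" here is just its precomputed key delta, and a multiset stands in for the game's repetition history.

```cpp
#include <cstdint>
#include <set>

static uint64_t hashKey = 0;
static std::multiset<uint64_t> history = {0};   // keys seen on the game path

// Stand-in: a real engine would decode the move and look up Zobrist keys.
uint64_t ZobristDelta(uint64_t move) { return move; }

bool MakeMove(uint64_t move) {
    uint64_t newKey = hashKey ^ ZobristDelta(move);
    if (history.count(newKey))   // repetition: abort before touching the board
        return false;            // caller scores a draw; nothing to unmake
    hashKey = newKey;
    history.insert(newKey);
    // ... the rest of the board update would go here ...
    return true;
}
```

The key is computed and tested first, so on a repetition the board state is never modified and no UnMake() is needed.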

As to the excessively large time spent in legality testing: even when you do not test for pinned pieces before move generation, testing whether a piece could be pinned doesn't have to be that expensive either. If you have to evade an existing check it is another matter (only a small fraction of the nodes have that), but normally you can just test for the piece being a King, or being aligned with the King. (And if you arrange it such that a King is considered aligned with itself, this can be done in a single test.) Most pieces will not be aligned, and only if they are aligned do you have to do a full legality test (figuring out whether the path between King and piece is free, and what was behind that piece). A simple array access aligned[fromSqr-kingSqr] should be able to tell you whether the move could be illegal, and that should be far cheaper than a hash-key update. In most cases you wouldn't have to do anything else, as the moved piece would not be aligned.
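On a 0x88 board (sq = 16*rank + file) the difference between two valid squares uniquely determines the direction, so the aligned[] table above can be precomputed per square difference. This is a sketch with hypothetical names, offset by 0x77 to keep the index non-negative; difference 0 automatically makes a King count as aligned with itself.

```cpp
#include <cstdlib>

// aligned[a - b + 0x77] is true iff 0x88 squares a and b share a rank,
// file, or diagonal (or are the same square). Max diff is 0x77 - 0 = 0x77,
// so indices span 0 .. 0xEE.
bool aligned[240];

void InitAligned() {
    for (int a = 0; a < 128; ++a) {
        if (a & 0x88) continue;                      // skip off-board squares
        for (int b = 0; b < 128; ++b) {
            if (b & 0x88) continue;
            int dr = std::abs((a >> 4) - (b >> 4));  // rank distance
            int df = std::abs((a & 7) - (b & 7));    // file distance
            if (dr == 0 || df == 0 || dr == df)
                aligned[a - b + 0x77] = true;
        }
    }
}
```

In the legality filter this becomes one lookup, `aligned[fromSqr - kingSqr + 0x77]`; only when it is true does the full pin test (ray between piece and King, slider behind it) need to run.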
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: TT: Overhead of incrementally updating Zobrist key

Post by mvanthoor »

lojic wrote: Sun Feb 14, 2021 10:58 pm so roughly 27% of the time is make/unmake move. I was testing that FEN w/ a depth of 4 to allow more measurements, but the overhead seems to be independent of depth. Computing perft for that FEN w/ a depth of 5 takes 40.5 seconds for 146,963,716 total nodes, or about 3.6M nodes/sec.
What language is your engine written in; something like Python? Does it use copy/make instead of make/unmake? (Or both... an interpreted language, and copy/make with lots of board copying?)

Rustic runs perft(5) on this FEN in roughly 4 seconds, at a speed of 36.1M leaves/sec, without a hash table. If I initialize a 4 MB hash table, execution time drops to roughly 1.67 seconds. I haven't yet implemented the pin/legality-check optimizations, or the bulk counting Sven mentions. (Those are planned for a later date.)
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Joost Buijs
Posts: 1663
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: TT: Overhead of incrementally updating Zobrist key

Post by Joost Buijs »

Nightmare runs perft(5) for this FEN in 2 seconds (without bulk counting).

Timings according to Intel VTune are:

generate_captures: 93 ms.
generate_quiet_moves: 200 ms.
legality_check: 200 ms.
do_move: 578 ms.
undo_move: 722 ms.
perft: 200 ms.

Most of the time is spent in do_move and undo_move.

I use a partial copy/make. Things like the hashkeys are copied, other things like the board array are updated/downdated.
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: TT: Overhead of incrementally updating Zobrist key

Post by mvanthoor »

Joost Buijs wrote: Mon Feb 15, 2021 2:13 pm I use a partial copy/make. Things like the hashkeys are copied, other things like the board array are updated/downdated.
Same; I copy the game-state to and from history, but all the rest is updated incrementally.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
lojic
Posts: 71
Joined: Thu Jan 28, 2021 9:50 pm
Full name: Brian Adkins

Re: TT: Overhead of incrementally updating Zobrist key

Post by lojic »

AndrewGrant wrote: Mon Feb 15, 2021 4:32 am
lojic wrote: Sun Feb 14, 2021 7:57 pm I'm a little surprised at the overhead I'm measuring for incrementally updating the Zobrist key. Before implementing the TT, I figured I'd put the code in place to incrementally update the key and test that it's working properly. I've run through a large perft test suite and verified that incrementally updating the key through all the make/unmake moves results in the same key as the one computed from scratch.

It appears that simply adding the code to incrementally update the key has increased my perft duration by roughly 10%. For comparison, legality checking (the normal case) adds 67% to the duration compared to skipping it.

I used the following FEN to start with, since it seems more typical - both sides have already castled, and some pawns have moved:
2kr3r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/2KR3R w - - 0 1

I tested each of the Zobrist updates separately (many times), here is the overhead for each one for that FEN above:
Castling rights = 1.2%
EP square = 1.2%
Piece movement = 6.6%
Side to move = 1.8%
All active = 10.5%

I'm sure the benefits will greatly outweigh that overhead once I get the complete TT in place, but had I guessed ahead of time, I think I'd have expected something closer to 3% to 4%.
Any code snippets of how you are updating the Zobrist? 10% seems high even if we assume you have an insanely efficient implementation for everything else. Maybe some egregious cache issues?
It's been decades since I did much serious bit twiddling (and never in lisp :) ), and I had a misunderstanding about whether xor operations needed to be ordered, so I'm doing a lot of unnecessary work. Here's an example:

Code: Select all

(define-inline (move-piece! squares from-piece from-idx to-piece to-idx dst-piece)
  (xor-piece! from-piece from-idx)            ; Remove piece from zobrist
  (when (not (fx= dst-piece empty-square))
    (xor-piece! dst-piece to-idx))            ; Remove captured from zobrist
  (xor-piece! to-piece to-idx)                ; Add piece to zobrist
  (bytes-set! squares to-idx to-piece)        ; Place piece at destination
  (bytes-set! squares from-idx empty-square)) ; Remove piece from source

(define-inline (xor-piece! piece idx)
  (set! hash-key (xor-piece hash-key piece idx)))

(define-inline (xor-piece key piece idx)
  (fxxor key (fxvector-ref square-keys
                           (fx+ (fx* 120 (piece-zobrist piece))
                                idx))))

Taking advantage of order independence and the fact that xor'ing 0 is a no-op will allow me to trim that down considerably.
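One way to exploit the no-op property (sketched here in C++ with hypothetical names; the same idea carries over to the Racket `square-keys` table): give the empty square an all-zero key row, and the "was this a capture?" test disappears, because XOR-ing a zero key changes nothing.

```cpp
#include <cstdint>

// Row EMPTY is left all zero, so XOR-ing the "victim" key is a no-op on
// quiet moves and the conditional vanishes from the update path.
enum { EMPTY = 0, NPIECES = 13, NSQUARES = 64 };
uint64_t zobrist[NPIECES][NSQUARES] = {};   // zobrist[EMPTY][*] stays 0

uint64_t UpdateKey(uint64_t key, int piece, int from, int to, int victim) {
    return key ^ zobrist[piece][from]       // remove mover from origin
               ^ zobrist[piece][to]         // place mover on destination
               ^ zobrist[victim][to];       // remove victim; no-op if EMPTY
}
```

Because XOR is associative and commutative, the three lookups can also be issued in any order, which gives the compiler and the CPU's out-of-order machinery more freedom than a sequenced, branchy version.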