TT: Overhead of incrementally updating Zobrist key

hgm · Post by **hgm** » Mon Feb 15, 2021 4:54 pm

Knowing that branches can slow down the CPU a lot I usually treat empty squares as just another piece type, of zero value and with zero key, so that every move can be treated as a capture. Due to the predominance of QS nodes most moves are true captures anyway.

BTW, you can use a trick here to eliminate the need for separate application of a side-to-move key: use the key you intended for that as the Zobrist key for the empty square (independent of location). Strictly speaking, to get exactly the same result, you should then also XOR all the Zobrist keys of the other pieces with this same key, but since they are all randomly chosen anyway that would not resort in any effect. But if you imagine that every key (including that for the empty square, which would normally be 0) is XOR'ed with a constant C, then the incremental update would do (C^fromKey) ^(C^toKey) ^ (C^victimKey) = C^(fromKey^toKey^victimKey), so that you get the XOR with C for free.

lojic · Post by **lojic** » Mon Feb 15, 2021 8:41 pm

Joost Buijs wrote: ↑Mon Feb 15, 2021 2:13 pm Nightmare runs perft(5) for this fen in 2 seconds (without bulk-counting).

Timings according to Intel VTune are:

generate_captures: 93 ms.
generate_quiet_moves: 200 ms.
legality_check: 200 ms.
do_move: 578 ms.
undo_move: 722 ms.
perft: 200 ms.

Most of the time is spend in do_move and undo_move.

I use a partial copy/make. Things like the hashkeys are copied, other things like the board array are updated/downdated.

Thanks for those timings - I need to work on my legality check!

lojic · Post by **lojic** » Mon Feb 15, 2021 8:52 pm

mvanthoor wrote: ↑Mon Feb 15, 2021 12:45 pm
lojic wrote: ↑Sun Feb 14, 2021 10:58 pm so roughly 27% of the time is make/unmake move. I was testing that FEN w/ a depth of 4 to allow more measurements, but the overhead seems irrespective of depth. Computing perft for that FEN w/ a depth of 5 takes 40.5 seconds for 146,963,716 total nodes for 3.6M/sec.
What language is your engine written in; something like Python? Does it use copy/make instead of make/unmake? (Or both... an interpreted language, and copy/make with lots of board copying?)
...

I'm using Racket, which is a Scheme dialect. It's compiled, and it's one of the faster Scheme dialects. Significantly faster than interpreted langs, such as Ruby/Python, but also significantly slower than C/C++ for code, such as a chess engine, that's in the sweet spot for C/C++.

I do make/unmake for the board, but I use a stack for the following state variables which are all packed into a single integer:

Code: Select all

(fixnum-fields state ([ ep-idx          7        ]
                      [ black-king-idx  7        ]
                      [ white-king-idx  7        ]
                      [ whites-move     1 #:flag ]
                      [ w-queenside-ok  1 #:flag ]
                      [ w-kingside-ok   1 #:flag ]
                      [ b-queenside-ok  1 #:flag ]
                      [ b-kingside-ok   1 #:flag ]))

mvanthoor · Post by **mvanthoor** » Mon Feb 15, 2021 10:58 pm

Oh yes, the engine in Racket. Sorry, I forgot. Then the other one bey user eigolf must be the Python one.

lojic · Post by **lojic** » Tue Feb 16, 2021 3:53 am

lojic wrote: ↑Sun Feb 14, 2021 10:58 pm I make the last ply so I can check the legality of the move. I use perft both for testing and for timing the generate/make/unmake parts of the engine.

I just ran a profile, and the perft spends most of the time as follows:
is-legal? 51.4%
unmake-move 13.8%
make-move 13.5%
generate-moves 10.9%
other 10.4%

so roughly 27% of the time is make/unmake move. I was testing that FEN w/ a depth of 4 to allow more measurements, but the overhead seems irrespective of depth. Computing perft for that FEN w/ a depth of 5 takes 40.5 seconds for 146,963,716 total nodes for 3.6M/sec.
Code: Select all
cpu time: 39725 real time: 40529 gc time: 3
perft(5) results:
  Totals:     146963716
  Captures:   28490460
  EP Cap:     58509
  Castles:    0
  Promotions: 24660

I made a ton of performance enhancements today, mostly using Racket's "unsafe" operations which avoid some runtime checking overhead. This dropped the time from 40.5 seconds down to 26.7 seconds, so that's nice. I haven't even begun to address the Zobrist key update issues that several of you have provided tips for. I think I may be able to get it down to ~ 20 seconds with those changes. Being 10x slower than a tuned C implementation wouldn't be a huge surprise in the context of chess engine programming.

I have no idea what the engine's current rating against other engines is, but playing many 10+5 and 15+10 games against humans puts it close to 1,900 (with me micro-managing the time a lot). While I do have a TT in place now, it's not using the best move from the TT yet, so that should help a *lot* I think when I add that.

Profiling after the performance tweaks gives:

Code: Select all

is-legal?      = 15,268 ms
unmake         =  3,472 ms
generate-moves =  3,128 ms
make-move      =  3,032 ms
perft itself   =  2,448 ms
Total          = 27,348 ms

So actually, given that profile and the tips you all have provided, is-legal? should probably be cut in half, at least, so sub-20 seconds seems very possible now w/o even being very clever.

hgm · Post by **hgm** » Tue Feb 16, 2021 10:20 am

The best speedup of a code section is usually achieved by not exectuting it at all.

mvanthoor · Post by **mvanthoor** » Tue Feb 16, 2021 10:32 am

lojic wrote: ↑Tue Feb 16, 2021 3:53 am ...

I'd be happy to run a quick test for you with the engine. I managed to compile it, but It doesn't behave like a UCI (or XBoard) engine as it tries to play in the console.

There's no further information "chess.exe --help" does nothing, and I'm not able to quickly find out if "raco exe ..." actually creates a release executable. The docs say that the program is compiled into byte code and the packaged within the racket executable; it does create a 30 MB exe. I have that now... so how can I run perft, or a tournament?

lojic · Post by **lojic** » Tue Feb 16, 2021 4:49 pm

mvanthoor wrote: ↑Tue Feb 16, 2021 10:32 am
lojic wrote: ↑Tue Feb 16, 2021 3:53 am ...
I'd be happy to run a quick test for you with the engine. I managed to compile it, but It doesn't behave like a UCI (or XBoard) engine as it tries to play in the console.

There's no further information "chess.exe --help" does nothing, and I'm not able to quickly find out if "raco exe ..." actually creates a release executable. The docs say that the program is compiled into byte code and the packaged within the racket executable; it does create a 30 MB exe. I have that now... so how can I run perft, or a tournament?

Yes, I decided to add the transposition table feature before the UCI feature. So it currently only plays in the console. When you start it, enter w or b to indicate which color the engine should play. When you enter a move (PGN format e.g. Qd4 Nge7 O-O) you can optionally enter a time in minutes for it to think. It starts with 5 seconds, but if you enter: e4 30 it will think for 30 seconds, and think time will stay at 30 until changed. No need to make an executable, you can just run "racket chess.rkt". Also, I use "rlwrap" on my system to wrap the program w/ readline support so you can backspace, etc. e.g. rlwrap racket chess.rkt

I realize I'm probably the only one motivated to go through that at this point, so I'll let you know when UCI support is in :)

I'm suspecting that the ELO rankings for engines playing each other is much lower than human ELO rankings because my engine is consistently beating folks around 1,900, and got a draw with a 2,000+, but it got crushed easily by wukong even when I gave it 40 seconds to think and wukong was moving much faster.

lojic · Post by **lojic** » Tue Feb 16, 2021 4:53 pm

mvanthoor wrote: ↑Tue Feb 16, 2021 10:32 am ... so how can I run perft, or a tournament?

I missed the perft question. To run perft: racket perft.rkt

The depth and start are currently hardcoded here - you can change max-level to another value, and it currently has that FEN starting position from earlier in this thread. You can replace that with another FEN (it does require the half-move and full-move at the end, you can just use 0 1)

mvanthoor · Post by **mvanthoor** » Tue Feb 16, 2021 6:15 pm

lojic wrote: ↑Tue Feb 16, 2021 4:49 pm I realize I'm probably the only one motivated to go through that at this point, so I'll let you know when UCI support is in

OK; at that point I can run the engine through my own gauntlet, which I use for testing Rustic.

My advice would be to build the UCI (or XBoard, or both) function as soon as possible, because then you will be able to determine the strength of each improvement you make. This is why I released my engine as Rustic Alpha 1, with no features apart from MVV-LVA move sorting. (This is very easy to implement, and not adding it just makes the engine do massive amounts of useless work, and thus it will search less deep and be weaker.)

I'm suspecting that the ELO rankings for engines playing each other is much lower than human ELO rankings because my engine is consistently beating folks around 1,900, and got a draw with a 2,000+, but it got crushed easily by wukong even when I gave it 40 seconds to think and wukong was moving much faster.

I've just started a tournament against WukongJS 1.5. I see the engine already has a hash table, much better sorting, and a more extensive evaluation than Rustic. Even in the middle game, WukonJS outcalculates Rustic by 2-3 ply (this increases up to 5-6 ply in the endgame) even though Rustic is literally almost 8-10x as fast.

I assume that the hash table, better move sorting and better evaluation gives WukongJS too much of a head-start right now; both against Rustic, and your engine. (Is your engine called "RacketChess", "Racket", or something else, by the way?)

If Wukong, as a Javascript engine, can reach depth 13-14 without breaking a sweat, I'm curious to see what Rustic can do once I finally get the time to figure out how to implement a hash table in Rust in such a way that I can re-use it for Perft, playing, and pawn hashes without having to implement it three times.

TT: Overhead of incrementally updating Zobrist key

Re: TT: Overhead of incrementally updating Zobrist key

Re: TT: Overhead of incrementally updating Zobrist key

Re: TT: Overhead of incrementally updating Zobrist key

Re: TT: Overhead of incrementally updating Zobrist key

Re: TT: Overhead of incrementally updating Zobrist key

Re: TT: Overhead of incrementally updating Zobrist key

Re: TT: Overhead of incrementally updating Zobrist key

Re: TT: Overhead of incrementally updating Zobrist key

Re: TT: Overhead of incrementally updating Zobrist key

Re: TT: Overhead of incrementally updating Zobrist key