I would concur with the others here, 2Mnps seems quite good to me. My magic bitboard engine, Blunder, usually gets ~1Mnps, although its eval is decently bulky, having PST, material, mobility, outposts, pawn structure, rook structure, king safety, and drawn and drawish endgame detection.
The majority of time in my engine is spent in quiescence search during a search, and I use a basic sorting algorithm where I scan for the move with the next highest score and place it at the current index (I can't take credit for this idea, I came across it on here). It works well in practice for both the main search and quiescence search.
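For anyone who has not seen the trick, here is a minimal C++ sketch of that pick-the-best-remaining-move idea; ScoredMove and pick_next are illustrative names, not code taken from Blunder or Peacekeeper. Each time the search wants the next move, the unsearched tail of the list is scanned once and the best candidate is swapped to the front, so a cutoff on an early move means the rest of the list never gets ordered at all.
Code:
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative move-with-score pair; real engines pack this differently.
struct ScoredMove {
    std::uint32_t move;
    int score;
};

// Scan the not-yet-searched tail [current, end) for the highest score
// and swap it into position `current`. Each call is one linear pass,
// and if an early move fails high the remaining moves are never
// touched, which is why this lazy selection tends to beat sorting the
// whole list up front.
inline ScoredMove pick_next(std::vector<ScoredMove>& moves, std::size_t current) {
    std::size_t best = current;
    for (std::size_t i = current + 1; i < moves.size(); ++i)
        if (moves[i].score > moves[best].score) best = i;
    std::swap(moves[current], moves[best]);
    return moves[current];
}

// Typical use: in the move loop, iteration i calls pick_next(moves, i)
// and searches the returned move, breaking on a beta cutoff.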
Once you've sorted out these issues, I wouldn't worry too much about raw speed. Speed will make your engine stronger, and by all means, continue to optimize, but there's only so much to be done speed-wise, and I think it can often be much more practical to focus on just improving the search through refactoring and selective search techniques, and the evaluation through adding important themes, like some of the ones I listed above.
Over the past year, my engine has gone from ~1400 to ~2700, so about 1300 Elo. Of that 1300 Elo, I would attribute maybe 100 of it to me improving the raw make/unmake and move generation speed of Blunder, and even that's being a bit generous. Of course, gaining Elo isn't my only goal, and I assume it isn't yours either, but it's just illustrative of the cost-to-benefit ratio of trying to squeeze every last drop of speed from the make/unmake and move generation phase.
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: engine speed issues
It might be worth pointing out that I managed to reach a single-threaded search speed of 8Mnps on a 2.2GHz CPU in The Mailbox Trials. (Someone else using the code on an AMD Ryzen reached 11.6Mnps with it.) But this was without a hash table, and only using PST evaluation. Benchmark was a 12-ply search on the KiwiPete position.
And of course it was based on mailbox, not bitboard.
-
- Posts: 66
- Joined: Thu Dec 09, 2021 8:26 pm
- Full name: Kyle Zhang
Re: engine speed issues
I figured out the suspicious number of Qs nodes.
hgm wrote: ↑Tue Jul 19, 2022 8:52 pm
Well, sorting algorithms usually scale worse than linear with the number of moves, and there are many more quiets than captures. I am not sure though if std::sort would not be mentioned separately in the profile (as part of the 3%). Low number of QS nodes is certainly suspect; most of the search nodes must be leaves, so you have very few QS calls per leaf.
What does search do anyway, if you count move generation, make/unmake and check testing separately? It sounds like it should just be a tight loop over moves where an iteration does almost nothing. (Because if(score > alpha) only rarely would be true.) Surely make + unmake should do by far the most work in that loop?
Apparently I was not incrementing the qnodes counter when I made a move from depth 1 into quiescence search.

Anyways, now on a depth 10 search I have 5.25 million Qs nodes and 6.09 million total nodes, for a healthy 86.2%.
However, Qsearch is still only taking 2-3% of the time so I am looking at issues with the profiler.
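One common way to make that kind of accounting bug impossible is to count at the entry point of the quiescence function rather than at each call site. A tiny sketch with illustrative counter names, not Peacekeeper's actual code:
Code:
#include <cstdint>

// Illustrative counters; engines usually keep these in a search struct.
static std::uint64_t nodes = 0;
static std::uint64_t qnodes = 0;

int quiescence(int alpha, int beta /*, position, ply, ... */) {
    // Incrementing here counts every quiescence node exactly once,
    // whether it was entered from the main search at depth <= 0 or
    // from a recursive quiescence call.
    ++nodes;
    ++qnodes;
    // ... stand-pat, generate captures, make/search/unmake loop ...
    return alpha;
}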
Peacekeeper: https://github.com/Sazgr/peacekeeper/
-
- Posts: 66
- Joined: Thu Dec 09, 2021 8:26 pm
- Full name: Kyle Zhang
Re: engine speed issues
I have tested my perft: it goes at around 3-5 Mnps without bulk-counting, and around 60 Mnps with it.
clayt wrote: ↑Tue Jul 19, 2022 10:28 pm
2Mnps is quite fast for single-threaded search, assuming you're doing evaluations. If you want a more accurate estimate as to whether your make-unmake process is fast enough, consider writing a perft routine and benchmarking that, so that time spent on updating PST values and other such things isn't wasted.
Sazgr wrote: ↑Tue Jul 19, 2022 7:45 pm
1. Yes my search is single threaded. So it is only slightly unreasonable? What would you expect for a single threaded search?
clayt wrote: ↑Tue Jul 19, 2022 6:55 pm
I can't say I know too much about what's going on here, but I can offer some guesses.
1. Is your search single-threaded? In this case, 2Mnps is actually not super unreasonable.
2. You seem to be calling `move.start()` and `move.end()` a lot. I assume that each time, it masks and shifts some information from your move - in this case, I recommend storing those values to save some effort.
Spending about that much time on making moves seems roughly correct, but 9.6% of your time on undo is a bit high. Are you doing copy-and-make or are you actively updating the board on undo?
Lastly: how much time is spent on move generation? I suggest that you use your profiling data to generate a flamegraph, since that will make it a little easier to see where time is being spent.
2. OK I will try that.
I am doing make-unmake; my unmake function looks pretty similar, except I just pop castling and ep rights off the stacks.
25-30% of the time is for move generation, and it is mostly generating the attack map (6% of the 25-30%, for king move and castling legality checking) and slider attacks.
And yes, I will try the flamegraph!
You might find that you can get performance improvements from copy-on-make in some circumstances. It helps to have some perft data to verify which approach is better for you.
For some data, my engine, which uses copy-on-make and likely spends a little more time on evaluation, runs at 1.5-1.9Mnps in a single thread. In terms of proportional calls observed by perf, here's an example case of my engine searching to a depth of 10:
- 59.6%: Quiescence.
- 22.1%: move generation (of which 12.7% of the total is spent tagging moves with their effect on incremental evaluation).
- 14.7%: Move make.
- 9.9%: Transposition table probe.
- 9.1%: Leaf evaluation.
- 7.7%: Game-over detection.
- 5.8%: Move sorting (using a lazy insertion sort).
- 3.9%: Transposition table storage.
- 3.6%: Move unmake.
Thanks for the data!
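Regarding the bulk-counting figures at the top of this post: at the last ply a bulk-counting perft simply returns the length of the legal move list instead of making and unmaking each move, which is why it looks roughly an order of magnitude faster. A rough C++ sketch, assuming a legal move generator and make/unmake routines; the names are placeholders, not any engine's actual API:
Code:
#include <cstdint>
#include <vector>

// Placeholder declarations standing in for the engine's real API.
struct Position;
using Move = std::uint32_t;
std::vector<Move> generate_legal_moves(Position& pos);
void make_move(Position& pos, Move m);
void unmake_move(Position& pos, Move m);

// Perft with bulk counting: at depth 1, return the size of the legal
// move list instead of making and unmaking every move. The node count
// is identical, but a whole level of make/unmake calls disappears,
// which is why the Mnps figure looks so much higher than without it.
std::uint64_t perft(Position& pos, int depth) {
    if (depth == 0) return 1;
    std::vector<Move> moves = generate_legal_moves(pos);
    if (depth == 1) return moves.size();  // bulk count
    std::uint64_t total = 0;
    for (Move m : moves) {
        make_move(pos, m);
        total += perft(pos, depth - 1);
        unmake_move(pos, m);
    }
    return total;
}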
Peacekeeper: https://github.com/Sazgr/peacekeeper/
-
- Posts: 66
- Joined: Thu Dec 09, 2021 8:26 pm
- Full name: Kyle Zhang
Re: engine speed issues
When I last tried to implement the move picker, my speed decreased.
algerbrex wrote: ↑Wed Jul 20, 2022 7:29 am
I would concur with the others here, 2Mnps seems quite good to me. My magic bitboard engine, Blunder, usually gets ~1Mnps, although its eval is decently bulky, having PST, material, mobility, outposts, pawn structure, rook structure, king safety, and drawn and drawish endgame detection.
The majority of time in my engine is spent in quiescence search during a search, and I use a basic sorting algorithm where I scan for the move with the next highest score and place it at the current index (I can't take credit for this idea, I came across it on here). It works well in practice for both the main search and quiescence search.
Once you've sorted out these issues, I wouldn't worry too much about raw speed. Speed will make your engine stronger, and by all means, continue to optimize, but there's only so much to be done speed-wise, and I think it can often be much more practical to focus on just improving the search through refactoring and selective search techniques, and the evaluation through adding important themes, like some of the ones I listed above.
Over the past year, my engine has gone from ~1400 to ~2700, so about 1300 Elo. Of that 1300 Elo, I would attribute maybe 100 of it to me improving the raw make/unmake and move generation speed of Blunder, and even that's being a bit generous. Of course, gaining Elo isn't my only goal, and I assume it isn't yours either, but it's just illustrative of the cost-to-benefit ratio of trying to squeeze every last drop of speed from the make/unmake and move generation phase.
If you don't mind, can I take a look at your implementation?
Peacekeeper: https://github.com/Sazgr/peacekeeper/
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: engine speed issues
That indeed sounds more like it should be!
Sazgr wrote: ↑Wed Jul 20, 2022 4:15 pm
I figured out the suspicious number of Qs nodes.
hgm wrote: ↑Tue Jul 19, 2022 8:52 pm
Well, sorting algorithms usually scale worse than linear with the number of moves, and there are many more quiets than captures. I am not sure though if std::sort would not be mentioned separately in the profile (as part of the 3%). Low number of QS nodes is certainly suspect; most of the search nodes must be leaves, so you have very few QS calls per leaf.
What does search do anyway, if you count move generation, make/unmake and check testing separately? It sounds like it should just be a tight loop over moves where an iteration does almost nothing. (Because if(score > alpha) only rarely would be true.) Surely make + unmake should do by far the most work in that loop?
Apparently I was not incrementing the qnodes counter when I made a move from depth 1 into quiescence search.
Anyways, now on a depth 10 search I have 5.25 million Qs nodes and 6.09 million total nodes, for a healthy 86.2%.
However, Qsearch is still only taking 2-3% of the time so I am looking at issues with the profiler.
The high time usage of Search is still suspect, though. Normally search routines do very little; the actual work is done by move generation and make/unmake, while hash probing is slow because of the large DRAM access time (which stalls the CPU). Of course it is true that in the full-width search you have to loop through quiets as well as captures, and there usually are far fewer of the latter. But even then, about half of the nodes should be cut-nodes, which only search very few moves, often only the null move. And all moves that are searched must be made and unmade. It is very strange that this takes less time than what the search routine itself spends on them.
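To make that argument concrete, the loop being described is roughly the following generic fail-hard alpha-beta sketch (placeholder names, not code from any engine in this thread). Everything the loop does besides make, recurse and unmake is a couple of comparisons, which is why the search routine itself should barely register next to move generation and make/unmake in a profile.
Code:
#include <vector>

// Placeholder types and functions standing in for an engine's real ones.
struct Position;
using Move = unsigned;
std::vector<Move> generate_moves(Position& pos);
void make_move(Position& pos, Move m);
void unmake_move(Position& pos, Move m);
int evaluate(Position& pos);

// A bare fail-hard alpha-beta loop. Per iteration, the real work is
// make_move, the recursive call and unmake_move; the surrounding
// bookkeeping is only two comparisons, and at a cut-node the loop
// often ends after the very first move.
int search(Position& pos, int depth, int alpha, int beta) {
    if (depth == 0) return evaluate(pos);     // or drop into quiescence
    for (Move m : generate_moves(pos)) {
        make_move(pos, m);
        int score = -search(pos, depth - 1, -beta, -alpha);
        unmake_move(pos, m);
        if (score >= beta) return beta;       // cutoff: node is done
        if (score > alpha) alpha = score;     // "only rarely true"
    }
    return alpha;
}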
-
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: engine speed issues
Sure! To be honest, I actually don't have a move picker in my current versions of Blunder. I implemented it but never got much of an Elo gain, so I didn't keep it in, although I did get a decent speed increase. I still have the code though, available on this branch here: https://github.com/algerbrex/blunder/bl ... movegen.go, and where I use it in the search here: https://github.com/algerbrex/blunder/bl ... ch.go#L450. Feel free to look through the code!
-
- Posts: 467
- Joined: Fri Dec 16, 2016 11:04 am
- Location: France
- Full name: Richard Delorme
Re: engine speed issues
Of course, your engine speed depends on the complexity of the algorithm used, the eval, etc.
Dumb is a simple program using a simple PST-based eval, simple search algorithms, and a simple bitboard legal move generator; so it is far from optimized, but simple enough to be pretty fast.
Its speed is about 4.27 Mnps on my computer (an old Ryzen 1700X).
Personally, I always have trouble understanding such numbers. For example, SEE (Static Exchange Evaluation) consumes 13.22% of the overall time, but if I remove it, the speed climbs up to about 9 Mnps. So the result of an optimization is always hard to predict.
The time spent in the (non-inlined) functions is distributed in the following way:
Code:
20.39% void move.Moves.generate!(true).generate(board.Board, ref const(move.History), const(ushort), const(ushort[2]))
18.70% int search.Search.qs(int, int)
14.74% int search.Search.pvs(int, int, const(int))
13.22% const int board.Board.see(const(ushort))
9.57% void board.Board.setPinsCheckers(ref ulong, ref ulong)
7.15% const void board.Board.generateMoves!(true).generateMoves(ref move.Moves)
5.39% void board.Board.update!(true).update(const(ushort))
5.05% void eval.Eval.update(const(board.Board), const(ushort))
3.51% void board.Board.restore(const(ushort))
1.20% const bool board.Board.isDraw()
0.92% @trusted void search.TranspositionTable.store(const(board.Key), const(int), const(int), const(search.Bound), const(int), const(ushort))
0.05% void move.History.update(const(board.Board), const(ushort), const(uint), ref ushort[64][14])
Richard Delorme
-
- Posts: 66
- Joined: Thu Dec 09, 2021 8:26 pm
- Full name: Kyle Zhang
Re: engine speed issues
It appears now that the time spent in PVS has decreased to a reasonable amount, so I think I will now concentrate on selectivity like you all suggest.
I have futility pruning, LMR, delta pruning, and null move pruning in my engine; does 6 million nodes at depth 10 sound about right, or is there a probable bug?
At least the move ordering seems OK: in a KiwiPete search there were beta cutoffs on the first move 97.25% of the time.

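For anyone wanting to collect the same move-ordering statistic, one simple way is a pair of counters bumped in the fail-high branch of the search; the names below are illustrative, not Peacekeeper's actual code.
Code:
#include <cstdint>

// Illustrative move-ordering counters.
static std::uint64_t fail_highs = 0;        // total beta cutoffs
static std::uint64_t fail_high_first = 0;   // cutoffs on the first move tried

// Call from the search at the point where score >= beta; move_index is
// the zero-based index of the move that produced the cutoff.
inline void record_fail_high(int move_index) {
    ++fail_highs;
    if (move_index == 0) ++fail_high_first;
}

// Ordering quality is then 100.0 * fail_high_first / fail_highs,
// reported once the search finishes.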
Peacekeeper: https://github.com/Sazgr/peacekeeper/
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: engine speed issues
To the OP: what language is the engine written in? I see the ".back()" method a lot. Are you using vectors for everything? If so, all of your stuff is stored on the heap, which is a lot slower to access than the stack. Personally I'd gladly use vectors (and some other nice/convenient structures) for things like move lists, but when I tried it, the engine speed dropped by 8-10% for each array that I replaced with a vector. The only place where I use a vector is for the PV, because it is very convenient and not slower than an array; if the array turns out to be too small (i.e. the PV gets too long for it), having to copy it manually is more of a hassle than trusting the compiler to extend the vector. (And I start it with a reserved capacity to begin with.)
So if you're now using vectors everywhere, the first thing I'd try is to replace those with arrays where possible.
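As an illustration of the array-instead-of-vector idea, here is a minimal sketch of a fixed-size move list (the type name and the 256 limit are illustrative choices): it lives on the stack, needs no heap allocation, and the only extra bookkeeping is a count.
Code:
#include <cstddef>
#include <cstdint>

// A stack-allocated move list. 256 comfortably exceeds the maximum
// number of legal moves in any chess position, so no growth is needed.
struct MoveList {
    std::uint32_t moves[256];
    std::size_t count = 0;

    void push(std::uint32_t m) { moves[count++] = m; }
    std::size_t size() const { return count; }
    std::uint32_t operator[](std::size_t i) const { return moves[i]; }
};

// Compared with std::vector there is no per-node allocation and no
// pointer indirection; the list is created and discarded on the stack
// each time move generation runs.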
My engine searches at ~6.8 million nodes/second on an i7-6700K from 2015. Granted, it still has a very simple PST-only evaluation, but it does keep Zobrist hashes and two sets of PSTs incrementally. There are several newer engines I'm aware of that are MUCH faster than 2 Mnps, even some that are written in (relatively) slow programming languages. (Slow from the point of view of C/C++/Rust, that is.)
Chessnut1071 wrote: ↑Wed Jul 20, 2022 3:20 am
2Mnps is about the same as my PEXT bitboard engine, which includes evaluation. If you're using an older chip, single thread i7, most of the recently developed bitboard engines are very close to your speed. Obviously, there are developers on this board with significantly faster engines, but that comes after years of tweaking and adjusting.
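On the incremental PST updates mentioned a few lines up, a minimal sketch of the idea (table layout and names are illustrative): rather than re-summing the whole board after every move, the running score is adjusted by the difference between the destination and origin squares, with extra terms for captures and promotions; engines with a tapered eval keep two such running scores, one per PST set.
Code:
// Illustrative piece-square tables: pst[piece][square] is the bonus for
// that piece standing on that square. Engines with a tapered eval keep
// two of these (midgame and endgame) and two running scores.
static int pst[12][64] = {};

// Incremental update for a quiet move: adjust the running score by the
// difference between the destination and origin squares. A capture
// additionally subtracts the captured piece's entry, and a promotion
// swaps the pawn's entry for the promoted piece's.
inline void pst_make_quiet(int& score, int piece, int from, int to) {
    score += pst[piece][to] - pst[piece][from];
}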