engine speed issues

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: engine speed issues

Post by hgm »

mvanthoor wrote: Fri Jul 22, 2022 12:09 am My engine searches at ~6.8 million nodes/second on an i7-6700K from 2015. Granted, it still has a very simple PST-only evaluation, but it does keep Zobrist hashes and two sets of PSTs incrementally. There are several newer engines I'm aware of that are MUCH faster than 2 Mnps, even some that are written in (relatively) slow programming languages. (Slow from the point of view of C/C++/Rust, that is.)
That is for single threaded? And do you use futility (delta) pruning? Such pruning usually speeds up the engine a lot (time-to-depth-wise) by pruning about half the QS nodes, but at the same time lowers the nps a lot, because all the nodes you prune were nodes where you basically did nothing other than increment the node counter, so they are almost infinitely fast. (You would stand pat there on the incremental evaluation score, even before you generated any moves.)
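
In code the idea is roughly this (a minimal sketch, not taken from any particular engine; Position, Move and the helpers Eval(), GenerateCaptures(), CapturedValue(), Make() and Unmake() are assumed, and the 200 cp margin is just an example value):

Code:

int Quiesce(Position& pos, int alpha, int beta) {
    int standPat = Eval(pos);               // incrementally updated evaluation
    if (standPat >= beta) return standPat;  // stand pat: no moves generated at all
    if (standPat > alpha) alpha = standPat;

    const int MARGIN = 200;                 // example safety margin in centipawns
    for (Move m : GenerateCaptures(pos)) {
        // futility (delta) pruning: even winning this piece cannot bring the
        // score up to alpha, so the child node is never entered (or counted)
        if (standPat + CapturedValue(m) + MARGIN <= alpha) continue;

        Make(pos, m);
        int score = -Quiesce(pos, -beta, -alpha);
        Unmake(pos, m);

        if (score >= beta) return score;
        if (score > alpha) alpha = score;
    }
    return alpha;
}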
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: engine speed issues

Post by dangi12012 »

make/unmake 23%

This should be halved or free.

I don't know why all examples do this, but you are walking a recursion: unmake is not needed.
Just have your apply-move emplace the next state into the next call.
This works perfectly with bitboards; with mailbox it's too expensive to copy. In any case, the stack implicitly remembers your current move along with alpha and beta.
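
Roughly like this (a hypothetical sketch; Move, GenerateMoves(), ApplyMove() and Evaluate() are assumed helpers):

Code:

#include <cstdint>

struct Board {
    uint64_t white, black;      // occupancy per side
    uint64_t pieces[6];         // one bitboard per piece type
    uint64_t hash;              // side to move, castling rights, etc. omitted
};

int Search(const Board& board, int alpha, int beta, int depth) {
    if (depth == 0) return Evaluate(board);
    for (Move m : GenerateMoves(board)) {
        Board child = board;    // copy the whole state into this stack frame
        ApplyMove(child, m);    // mutate only the copy; "unmake" is just returning
        int score = -Search(child, -beta, -alpha, depth - 1);
        if (score >= beta) return score;
        if (score > alpha) alpha = score;
    }
    return alpha;
}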
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: engine speed issues

Post by hgm »

With mailbox, make and unmake together are usually a lot faster than copy-make for a bitboard stack, because the typical make only alters two memory locations (two copies from the board to 'undoInfo' plus two stores to the board; for unmake, two copies from undoInfo back to the board), while copying the bitboard stack alone takes eight copy operations (white, black plus 6 piece types, or perhaps even 12 if you separate by color).
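
In code the make/unmake pair looks roughly like this (a hypothetical sketch; a Move with from/to fields and an EMPTY constant are assumed):

Code:

struct UndoInfo { int movedPiece, capturedPiece; };

void Make(int board[64], Move m, UndoInfo& u) {
    u.movedPiece    = board[m.from];   // two copies from board to undoInfo
    u.capturedPiece = board[m.to];
    board[m.to]   = board[m.from];     // two stores to board
    board[m.from] = EMPTY;
}

void Unmake(int board[64], Move m, const UndoInfo& u) {
    board[m.from] = u.movedPiece;      // two copies from undoInfo back to board
    board[m.to]   = u.capturedPiece;
}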
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: engine speed issues

Post by mvanthoor »

hgm wrote: Fri Jul 22, 2022 10:08 am That is for single threaded?
Yes.
And do you use futility (delta) pruning? Such pruning usually speeds up the engine a lot (time-to-depth-wise) by pruning about half the QS nodes, but at the same time lowers the nps a lot.
No. I'll see what happens when I get there. Whether NPS is lowered depends on where you count. If you count every node you hit, even if you then exit it because of pruning, your NPS won't drop. That is the reason why I count every node I hit, even if it is then immediately exited: the node existed, the engine visited it and decided it didn't need to do anything with it, but it still counts.
Because all the nodes you prune were nodes where you basically did nothing else than incrementing the node counter, so that they are almost infinitely fast. (You would stand pat there on the incremental evaluation score, even before you generated any moves.)
Correct; as said, I'm of the opinion that if an engine creates a node, enters it, and then decides (for whatever pruning reason) "Nah, not going to bother", it should STILL be counted.

Of course, if you only count the nodes you're doing actual work in, such as generating and then searching moves (= put your nodes++ after the pruning instead of before it), then each new pruning option will drop your NPS. That is another reason why I decided to count at the top of the A/B and QS functions (and take into account that I don't count an A/B node and then immediately count it again when dropping into QS); that way pruning does not lower the engine's NPS; in fact, it should increase it. That seems logical to me, because I hit the node, count it, decide that I don't want or need to do anything with it, and I'm done. But the engine DID hit the node. The more pruning is done, the less work per node the engine needs to do (on average), and thus more nodes can be hit and NPS should increase.

The only time when NPS should decrease is if I actually do more work, such as:
- Add stuff to the evaluation
- Keep a second set of Zobrists for an opening book (I don't want to use the polyglot Zobrists for the main engine)
- Add a pawn hash table that needs to be probed and updated
- etc...

As far as I can see my philosophy is correct, but I can also see the viewpoint of only counting nodes where actual work is done. That's another (also defensible) philosophy one could argue for.
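
To make the two counting conventions concrete (a generic sketch, not Rustic's actual code; Position, Quiesce() and a SearchMoves() helper standing in for the normal move loop are assumed):

Code:

long nodes = 0;

int AlphaBeta(Position& pos, int alpha, int beta, int depth) {
    if (depth <= 0)
        return Quiesce(pos, alpha, beta);  // QS counts this node itself, so no double count

    nodes++;                               // convention A: count every node on entry

    // Any pruning test placed below this point prunes a node that has already been
    // counted, so adding pruning cannot lower NPS under convention A.
    // Convention B would move the nodes++ below the pruning tests instead,
    // counting only nodes that go on to generate and search moves.

    return SearchMoves(pos, alpha, beta, depth);   // hypothetical: the usual move loop
}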
hgm wrote: Fri Jul 22, 2022 10:32 am With mailbox make and unmake together are usually a lot faster than the copy-make for a bitboard stack. Because the typical make only alters two memory locations (two copy from board to 'undoInfo', plus two stores to board, or for unmake two copies from undoInfo to board). While copying the bitboard stack alone takes eight copy operations (white, black plus 6 piece types, or perhaps even 12 if you separate by color).
But a bitboard engine can perfectly well use make/unmake? Mine does; I don't copy complete board representations or sets of bitboards. Why would anyone even do copy/make? It has to be slower by definition, because you're copying lots of stuff that isn't even changing.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: engine speed issues

Post by hgm »

mvanthoor wrote: Fri Jul 22, 2022 11:12 am Correct; as said, I'm of the opinion that if an engine creates a node, enters it, and then decides (for whatever pruning reason) "Nah, not going to bother", it should STILL be counted.

Of course, if you only count the nodes you're doing actual work in, such as generating and then searching moves (= put your nodes++ after the pruning instead of before it), then each new pruning option will drop your NPS. That is another reason why I decided to count at the top of the A/B and QS functions (and take into account that I don't count an A/B node and then immediately count it again when dropping into QS); that way pruning does not lower the engine's NPS; in fact, it should increase it. That seems logical to me, because I hit the node, count it, decide that I don't want or need to do anything with it, and I'm done. But the engine DID hit the node. The more pruning is done, the less work per node the engine needs to do (on average), and thus more nodes can be hit and NPS should increase.
But the point is that nodes that are pruned are not visited, and nodes that are visited are (apparently) not pruned. Futility (delta) pruning is a method to speed up the engine by eliminating predictably unnecessary make/unmakes to nodes that would immediately return without doing anything. But by doing that it would eliminate these very cheap nodes from the count, driving up the average time needed to process a node.
But a bitboard engine can perfectly well be make/unmake? Mine is; I don't copy complete board representations or sets of bitboards. Why would anyone even do a copy/make? It has to be slower by definition because you're copying lots of stuff that's not even changing.
Indeed it can, but I was reacting to Dangi's post, which said that bitboard was ideally suited for copy-make.
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: engine speed issues

Post by mvanthoor »

hgm wrote: Fri Jul 22, 2022 1:21 pm But the point is that nodes that are pruned are not visited, and nodes that are visited are (apparently) not pruned. Futility (delta) pruning is a method to speed up the engine by eliminating predictably unnecessary make/unmakes to nodes that would immediately return without doing anything. But by doing that it would eliminate these very cheap nodes from the count, driving up the average time needed to process a node.
Oh, like that. Indeed. What I was thinking about would probably better be called an early exit; as in, the engine visits a node, determines that nothing needs to be done, and returns, much like the stand-pat in QS. You can count at the beginning of the function, which will count a stand-pat aborted node, or you can count after the stand-pat, to count only the nodes which actually do work.

I prefer to count all nodes, even if they are exited before doing anything.
Indeed it can, but I was reacting to Dangi's post, which said that bitboard was ideally suited for copy-make.
I see. Clear. IMHO copy/make would always be slower than make/unmake because you copy lots of unchanged stuff on every move.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: engine speed issues

Post by hgm »

mvanthoor wrote: Fri Jul 22, 2022 3:40 pm I prefer to count all nodes, even if they are exited before doing anything.
Sure, I am not saying that one way is better than another. But it is important to know exactly what people are counting when you try to compare the raw speed of different engines. A stand-pat, especially one based on a 'lazy eval', takes almost no time at all. So these nodes very much drive up the average speed. But when the engine uses futility pruning, it is exactly these nodes that are no longer visited.

I am still puzzled that you can achieve this speed with a hash table. I remember that when I added hash probing to micro-Max the speed dropped from 6Mnps to 2Mnps, just because the hash probes were so slow. So it seems the hash probes took 2/3 of the time, suggesting that these would limit you to 3Mnps if nothing else had to be done. Micro-Max' hash table might not be entirely optimal, though (no alignment with cache lines).
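
For reference, cache-line alignment of the table buckets is usually achieved like this (a hypothetical layout; the 16-byte entry is just an example):

Code:

#include <cstdint>

struct TTEntry {                 // 16 bytes: key, score, depth, flags, move
    uint64_t key;
    int16_t  score;
    uint8_t  depth, flags;
    uint32_t move;
};

struct alignas(64) TTBucket {    // one bucket = one 64-byte cache line
    TTEntry entry[4];            // so a probe touches a single cache line
};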

Do you probe the TT in every counted node, or do you test for stand-pat cutoff even before probing the hash?
Bo Persson
Posts: 257
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: engine speed issues

Post by Bo Persson »

mvanthoor wrote: Fri Jul 22, 2022 3:40 pm
HGM wrote: Indeed it can, but I was reacting to Dangi's post, which said that bitboard was ideally suited for copy-make.
I see. Clear. IMHO copy/make would always be slower than make/unmake because you copy lots of unchanged stuff on every move.
This assumes that you have "lots of stuff" to copy. A set of bitboards can be very compact. And some of them (half(?), for a capture) will be modified anyway; and then you modify them again in unmake.

Not doing anything in unmake seems optimal for that part. :)
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: engine speed issues

Post by hgm »

So the bottom line is that copy-make for bitboards might even be slightly faster than make-unmake, but still a lot slower than the make + unmake updates of a mailbox board, for which copy-make would be disastrously slow.

Of course this still tells only part of the story, because game states are often represented redundantly for efficiency's sake. Mailbox will suck if you don't have a piece list for quickly locating the whereabouts of a given piece, while bitboard engines usually also maintain a mailbox board to quickly deduce which piece type is captured by a given move. So the actual difference would be the update of a piece list versus that of the bitboard stack (which is sort of a piece-type list). And the make-unmake update of a piece list is even faster than that of a bitboard: copy the to-square of the move to it on make, and the from-square on unmake. (Just two stores, as the square numbers pass through the registers for other reasons anyway.) And perhaps some effort to remove the captured piece.

Even here hybrid representations are possible: you can accompany a piece list with a piece-set: an int where each bit indicates whether a certain piece is present or not. This can be conveniently used to loop through the piece list without visiting any captured pieces, and would remove the need to actually alter the location of a captured piece in the piece list. You would just clear its bit in this 'presence mask'.
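
A hypothetical sketch of such a piece list plus presence mask (the names and the 16-entry size are illustrative; __builtin_ctz is the GCC/Clang intrinsic):

Code:

#include <cstdint>

struct PieceList {
    uint8_t  square[16];    // board square of each piece of one side
    uint32_t present;       // bit i set = piece i is still on the board
};

void MakeCapture(PieceList& own, PieceList& enemy, int mover, int victim, int to) {
    own.square[mover] = (uint8_t)to;     // make: store the to-square
    enemy.present &= ~(1u << victim);    // captured piece: just clear its presence bit
}

void UnmakeCapture(PieceList& own, PieceList& enemy, int mover, int victim, int from) {
    own.square[mover] = (uint8_t)from;   // unmake: store the from-square back
    enemy.present |= 1u << victim;       // and set the presence bit again
}

void ForEachPiece(const PieceList& pl) {
    for (uint32_t m = pl.present; m; m &= m - 1) {   // loops over surviving pieces only
        int sq = pl.square[__builtin_ctz(m)];
        // ... generate moves for the piece on sq ...
    }
}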

I can add that even for the bulky attack map that I used in The Mailbox Trials (32 64-bit words), copy-make turned out to be the fastest solution. This is because so many elements of it get changed by a typical move.
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: engine speed issues

Post by mvanthoor »

hgm wrote: Fri Jul 22, 2022 4:06 pm I am still puzzled that you can achieve this speed with a hash table. I remember that when I added hash probing to micro-Max the speed dropped from 6Mnps to 2Mnps, just because the hash probes were so slow. So it seems the hash probes took 2/3 of the time, suggesting that these would limit you to 3Mnps if nothing else had to be done. Micro-Max' hash table might not be entirely optimal, though (no alignment with cache lines).

Do you probe the TT in every counted node, or do you test for stand-pat cutoff even before probing the hash?
At this time I don't probe the TT in the QSearch. If I did, I'd probe it after the stand-pat.

In the A/B-method, I probe the TT immediately after all the early exits (such as exit on MAX_PLY, out of time, etc...)

The engine doesn't have any pruning yet, except the A/B cutoffs, obviously.
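
A generic sketch of that ordering (not Rustic's actual code; ProbeTT(), StoreTT(), OutOfTime() and the other helpers are assumed, and the probe is simplified by ignoring bound types):

Code:

int AlphaBeta(Position& pos, int alpha, int beta, int depth, int ply) {
    if (ply >= MAX_PLY || OutOfTime())                 // the cheap early exits first
        return Eval(pos);
    if (depth == 0)
        return Quiesce(pos, alpha, beta);

    TTEntry* entry = ProbeTT(pos.hash);                // probe only after the exits
    if (entry != nullptr && entry->depth >= depth)
        return entry->score;

    int best = -INFINITE;
    for (Move m : GenerateMoves(pos)) {
        Make(pos, m);
        int score = -AlphaBeta(pos, -beta, -alpha, depth - 1, ply + 1);
        Unmake(pos, m);
        if (score > best)  best  = score;
        if (score > alpha) alpha = score;
        if (alpha >= beta) break;
    }
    StoreTT(pos.hash, best, depth);                    // save the search result
    return best;
}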
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL