fast(er) movegen

Sven · Post by **Sven** » Tue Dec 11, 2018 12:34 am

flok wrote: ↑Mon Dec 10, 2018 3:00 pm The move generator of stockfish is ~14x faster than the one of embla (compared via perft 6).
So I'm thinking there's some room for improvement in Embla.

Currently Embla implements an 8x8 array of pointers. Those pointers point to the 2 x 16 piece-objects. Those objects calculate the moves for their object-type and state. This is then stored in an array.

There are several alternative implementations that I know of. Bitboards, mailbox, 0x88, etc.
Question now is: which is faster?
Strictly for (pseudo legal) move generation. I would like to keep search & eval as they are (for now).

You are talking about three very different points:
1) board representation and its impact on overall engine speed,
2) move generation speed in the context of tree search,
3) perft speed.

Of course 1) influences 2) and 3). But, despite your remark "strictly for (...) move generation", 1) also influences search and eval. The amount of changes required in search and eval code when switching to a different board representation depends on several factors, including the level of abstraction you have chosen to apply there. If your search and eval is written as "high level code" then you would not need many changes there to switch to bitboards. Basically the implementation of loops over all pieces (or all of a certain group, like all black knights) will be different for mailbox vs bitboards. (Nearby, in Jumbo I have implemented a "BoardIterator" class that serves as an abstraction of the board representation and is one of several elements allowing me to support two Jumbo variants with two different board representations: bitboards and 0x88, while keeping the whole search and most of the eval code common to both variants.)

Many programmers, basically all but one

, agree that bitboards as board representation are our current first choice regarding overall engine speed. For 2) and 3), as two very special "disciplines", I think it is possible that mailbox-based representations, like 0x88, or attack-map based ones, can result in a faster movegen + perft implementation if they are used in a clever way. But no chess engine is "only generating moves", and the speed of evaluation is much more important than movegen.

Other points that have already been mentioned in other replies are
- a proper implementation of legality checking and
- the use of "bulk counting" for perft.

Comparing perft speed of your engine to that of SF or other top engines should be postponed until these two have been achieved, at least, otherwise you are comparing the well-known apples to oranges.

Michael Sherwin · Post by **Michael Sherwin** » Tue Dec 11, 2018 6:41 am

Ratosh wrote: ↑Mon Dec 10, 2018 10:50 pm Oh, i assumed it since most top engines use bitboard. What would be the reason why they would chose it over other options?

if(!(wPassed[sq] & bPawns)) score += PASSED_PAWN_BONUS;

hgm · Post by **hgm** » Tue Dec 11, 2018 12:02 pm

I just wrote a generator for random games, and I solved the legality problem there as follows:

Basically it is just doing pseudo-legal move generation, before selecting one of the moves at random. It then just plays the move, and generates moves for the opponent. Which is the thing it would have to do anyway if the move was legal, which is usually the case. If the move generator stumbles on a capture of the King, it aborts, and the caller then takes back the illegal move, for selecting another one.

To make this efficient, the move generator passes back some information on the King capture, though. When a moving piece hits a King, and this King was not the previously moved piece, it reverses the direction of stepping, and marks every square on a 'checkRay' board with the current node counter (which counts MoveGens), up to and including the from-square. For convenience it also stores the location of the captured King.

The caller can then distinguish 3 cases:
1) The illegal move that bounced was a King move that stumbled into check. We just delete that move from the move list (filling the hole with the move that was at the end), and make a new random selection from the remaining list.
2) The illegal move started on the check ray. Apparently we moved a pinned piece. We remember its location. Before actually trying a move, we compare the from-square with this location of the pinned piece (set to an invalid square initially), to see if the checkRay board for that square equals the node counter. If it is the same, only moves that land on the marked check ray carry on to the MakeMove phase; others are immediately deleted, before we loop back to selecting another move.
3) If neither (1) nor (2) is the case, we must have been in check even before the move. As most moves will now not resolve that, discarding those one at the time by an opponent MoveGen will be very costly. So we first cleanse the move list once and for all of all moves that will certainly not help. Only moves with the King, or moves for which the to-square is marked on the checkRay board (which includes capture of the checker) will remain. Then we try again.

This continues until the a move isn't aborted, or the move list gets empty. In the latter case we are check- or stalemated; running the opponent MoveGen in the current position will tell us which is the case, and thus how the game ended.

Usually the failure of a move will tell us everything we need to know to make it almost certain that the next pick will give us a legal one. Only in the case of a double check the moves onto the check ray left after the cleansing due to the first check we encountered will still be illegal. A second cleansing will then remove all but the King moves. King moves, alas, have to be tried independently, so in case of a seiged King it could take a number of tries to find a legal evasion with it. If multiple pieces are pinned handling a move with the second would make us forget info about the first, but in general we must be really unlucky to pick the moves of the other pinned piece immediately after we happened to pick the first.

Ratosh · Post by **Ratosh** » Tue Dec 11, 2018 12:55 pm

Michael Sherwin wrote: ↑Tue Dec 11, 2018 6:41 am
Ratosh wrote: ↑Mon Dec 10, 2018 10:50 pm Oh, i assumed it since most top engines use bitboard. What would be the reason why they would chose it over other options?
if(!(wPassed[sq] & bPawns)) score += PASSED_PAWN_BONUS;

Thank you.

hgm · Post by **hgm** » Tue Dec 11, 2018 1:36 pm

Not really a relevant example, though, as everything to do with Pawns you would normally get for free from the Pawn hash table, and the hit rate on that would be so hight that it is completely immaterial how much time you have to spend for calculating stuff when you have a miss.

Ronald · Post by **Ronald** » Thu Dec 20, 2018 8:54 pm

I did an experiment with rofChade, in order to get some insight in the influence of the speed of the movegenerator. I created a loop around the pure movegeneration code, so that it can be executed multiple times, and thus simulate different movegenerator speeds.

With different loop multipliers I measured the (difference in) speed with perft and a “go” from the startposition. I also did a gauntlet (20 seconds
+0.2 per move) against the normal version to get some insight in the elo losses (in selfplay).

First some background info. In the movegenerator the pseudo legal moves(or legal when inCheck) are generated together with the MVV/LVA value of the moves.The perft routine executes a makeMove / unmakeMove for every move except for the leaf nodes, for which a movecount of legal moves is executed.Inside the makeMove the move is executed, (bit)boards are updated, together with the hash keys. Also an incremental update of material values and piece square values is done.

The test: With different multipliers of the movegen loop firstly a perft 6 was executed. A 2 times execution of the movegen loop doesn’t take exactly twice as much time, because of data that is already in cache the second time, but that doesn’t spoil the fun, by measuring the effects on perft we already get some feeling for the effect of the slower movegen. Usually when movegen is slower, makemove etc. will also be slower so maybe a better indicator for the speed of the movegen might be the speed of perft The used multipliers are 1x, 2x, 10x, and 24x.

The results are as follows:

Code: Select all

Perft 6 from opening position			
Moveloop	Time		Factor
1x		0,838358	100%	
2x		1,22373		146%
10x		3,84053		458%
24x		8,30599		991%

So, a doubling of the movegen loop makes the perft 6 nearly 50 % slower. With the multiplier 24 perft 6 is nearly 10 times slower than normal.

To look at the influence on a regular search , a "go depth 24" is executed from the startposition. The results are as follows:

Code: Select all

Opening position: go depth 24			
Moveloop	Time		Slowdown	Elo difference
1x		12,41		0%		0
2x		13,544		9%		-11
10x		17,401		40%		-51
24x		23,696		91%		-91

In the last column the estimated elo loss with selfplay is shown. For every version 400 games where played.

With a multiplier of 2 the slowdown in the startposition is 9% and the calculated elo loss is 11. With 400 games played there still is an error margin of 13 so we can conclude that the difference is not much.
With multiplier 24, perft 6 was nearly 10 times slower, which resulted in a nearly doubling of the time for the search. So even with a 10 times slower perft the elo loss is slightly below 100.

Conclusion: As expected there is a gain with using a faster movegen, but it's not that much.

Afterthought: The latest Embla version I have is version 1.0.1. I ran a perft 6 with Embla and it took Embla 107.3 seconds to finish. That is 128 times slower than rofChade. I don't know how far the perft routines are comparable and if the current version takes the same amount of time, but maybe there is some serious elo gain for Embla with a faster movegen

Fulvio · Post by **Fulvio** » Sat Dec 22, 2018 5:24 pm

flok wrote: ↑Mon Dec 10, 2018 3:00 pm Currently Embla implements an 8x8 array of pointers.

You can try to use int8_t (representing indexes in an array) instead of pointers.
This way the board will be read with a single cache request ( https://stackoverflow.com/questions/392 ... lines-work ).
The ChessPiece objects should also be stored as close in memory as possible (for example in an array of 32 elements; the first 16 for white pieces and the next 16 for black pieces).

hgm · Post by **hgm** » Sat Dec 22, 2018 6:53 pm

If you want a fast mailbox engine it can also be handy to maintain a 'neighbor' board, which for every occupied square stores the square number of the nearest occupied square in each of the 8 directions. This will speed up capture generation, as you won't have to scan board rays to see what a slider hits, but can look up the target in a single table access.

It will of course generate some overhead to update this table, but this can be kept quite low by only doing it (in QS nodes) after move generation has produced a move that you actually want to search; if all captures are futile you can simply return a fail low without having to restore an update done in vain. The only downside is that while generating slider captures with the not-updated table you might actually hit a square that the previous move (assumed to also be a capture) would have evacuated. So you would have to test for that, and look up the next-nearest neighbor when it happens.

flok · Post by **flok** » Sun Dec 23, 2018 2:55 pm

Fulvio wrote: ↑Sat Dec 22, 2018 5:24 pm
flok wrote: ↑Mon Dec 10, 2018 3:00 pm Currently Embla implements an 8x8 array of pointers.
You can try to use int8_t (representing indexes in an array) instead of pointers.
This way the board will be read with a single cache request ( https://stackoverflow.com/questions/392 ... lines-work ).
The ChessPiece objects should also be stored as close in memory as possible (for example in an array of 32 elements; the first 16 for white pieces and the next 16 for black pieces).

True but it gives another level of indirection.

Ratosh · Post by **Ratosh** » Sun Dec 23, 2018 3:56 pm

Great info Ronald, i expected a bigger influence on slower move generation.

Thanks

fast(er) movegen

Re: fast(er) movegen

Re: fast(er) movegen

Re: fast(er) movegen

Re: fast(er) movegen

Re: fast(er) movegen

Re: fast(er) movegen

Re: fast(er) movegen

Re: fast(er) movegen

Re: fast(er) movegen

Re: fast(er) movegen