You should not care too much on the perft speed, 1-2 millon of nps in perft is a quite good speed, now you should focus on making an optimized AI search, that's the hardest part i think, because it has so much stuff involved in it, evaluation, search features, extensions, reductions, move ordering, and so on..eligolf wrote: ↑Thu Feb 17, 2022 12:07 pm Short update for those interested:
- Move generation completed
- Make and unmake move functions done
- All perfts from https://www.chessprogramming.org/Perft_Results completed (with depths that gives nodes < 50.000.000 or so
- Nodes/s is around 1-2 million right now.
Should I try to make the move generation and make/unmake move functions faster than this before continuing with next phases of incorporating the engine itself? Or is this good enough in C# to continue?
Thanks all for the support so far, the helpfullness of people here always amazes me. I hope to one day be able to help people too here when being more experienced![]()
Move generation for bitboards
Moderator: Ras
-
- Posts: 157
- Joined: Fri Apr 30, 2021 7:19 am
- Full name: Pedro Duran
Re: Move generation for bitboards
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Move generation for bitboards
Such copying would typically be done one 8-byte memory word at a time, when using the memcpy function, or when you specify it by hand (e.g. a union of an array of char and of uint64_t), or automatically by a self-respecting compiler. So yes, it should be faster. (Just by one store + load, of course, but definitely not a reason to prefer the slower one.) When *using* the array, you of course only use the relevant element. Even on a 100x100 board a mailbox MakeMove would not be any slower as on 8x8. But now try MakeMove for a 10,00 square bitboard...lithander wrote: ↑Thu Feb 17, 2022 2:25 pm Does copying 9 ulong take the same amount of time as copying 72 bytes? Probably, if you can copy the 72 bytes as a memory block. Looping over an array of 72 bytes, incrementing and index and then copying each byte value by value is something else entirely. And that's exactly the problem - not only when copying but when *using* the array in practice.
Well, in microMax it would be all. But I think we are getting into a case of 'derailed thinking' here. You either care about speed, or you don't. If you care, you would not use copy-make on an 8x8 mailbox, because it is quite slow (even when copying in chunks of 8) compared to the incremental make-unmake (which is only a single copy of the moving piece on make, and two (restoring mover and victim) on unmake). With just a mailbox board looping over your own pieces is a performance spoiler, so normally you would combine it with a piece list. (Which also allows for very fast make-unmake.) But that would be pointless if you would update that piece list by copy-make, as the copy-make would likely take more time that the piece list saves you in looping over your pieces. Of course the looping can also be sped up by maintaing white & black bitboards next to your mailbox array. That is 16 bytes extra.But it's not really an either/or thing, right? You wouldn't use just the mailbox array (without bitboards and piece lists etc) and expect it to be as fast as a bitboard-only engine like mine? In the end I don't know how much data you have to copy for a practical engine but it's not just 64 bytes I'm sure. Except that in a bitboard only approach like QBB takes it you can get away with less.
Well, so you are just moving the the code to do the same task elsewhere. That doesn't make it any faster. Point is that for each capture you are looping through 6 bitboards to find what occupies the to-square, using branches that are usually mispredicted. (I guess there would be a faster, branchless way by extracting all the 6 to-square bits, packing those into a 6-bit integer, and use that as an index in a 64-byte table to get the piece type.) That takes many times the time it would take to update (and restore) an auxiliary mailbox board, plus just fetching the victim from its to-square element. But of course most moves would not be made at all. On those moves the only work would be the access to pieceType[toSqr] during move generation.The move knows which piece was captured! But updating the zobrist hash has nothing to do with the bitboards that store the pieces. All you need is the old hash and information that is already stored in the move.
Okay... how does the move generator know the type of the captured piece without looking at the bitboards. Yes... it looks. It has to touch 6 + 1 bitboards. And it's not pretty but it's also not big enough an issue that I'd need a mailbox array to speed it up. The bitboards are in the cache as the move generator is using them just now to generate moves. Also I need the information in the moves for MVV-LVA sorting anyway even before playing these moves so that's not driving up the cost of the incremental updates.
The mailbox board would of course belong to the core data that is always cached too. It is even smaller than your bitboard stack.
I really don't want to claim that this is the best way to do it. I didn't read tutorials or study engines. I just wrote something mostly from scratch and if some of my reinvented wheels are squares... well... I guess I had some fun exploring off the beaten path nonetheless. And really I think 60-70M nps in perft with a pseudo legal movegen (so you have to invoke IsSquareAttacked on every node in addition to generating the move and playing it) is pretty good for a C# engine. So most wheels must be fairly round.
Of course now that I add features to the search the speed goes down but I don't think it's due to the board representation. Otherwise... maybe in my 3rd engine I'll get everything right.![]()
-
- Posts: 915
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Move generation for bitboards
So my Piece type uses 5 bits. By "OR"ing different bitboards together masking with the square I want to get the piece from and shifting the result to the the right slot I can create the correct Piece value directly from the bitboards. No looping and no branch prediction.hgm wrote: ↑Fri Feb 18, 2022 7:40 pm Point is that for each capture you are looping through 6 bitboards to find what occupies the to-square, using branches that are usually mispredicted. (I guess there would be a faster, branchless way by extracting all the 6 to-square bits, packing those into a 6-bit integer, and use that as an index in a 64-byte table to get the piece type.)
Code: Select all
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public Piece GetPiece(int square)
{
/*
2nd Bit = White or Black? (implies the Piece bit to be set to)
Black = 1, //01
White = 3, //11
3rd+ Bits = Type of Piece
Pawn = 4, //00100
Knight = 8, //01000
Bishop = 12, //01100
Rook = 16, //10000
Queen = 20, //10100
King = 24, //11000
*/
return (Piece)(Bit(Black | White, square, 0) |
Bit(White, square, 1) |
Bit(Pawns | Bishops | Queens, square, 2) |
Bit(Knights | Bishops | Kings, square, 3) |
Bit(Kings | Rooks | Queens, square, 4));
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private int Bit(ulong bb, int square, int shift) => (int)((bb >> square) & 1) << shift;

I will first re-implement the features of MinimalChess (which also used Copy/Make but with mailbox) in Leorik as I'm curious to compare the speed and strength of both engines at that point where they are most similar and then I'm going to implement an unmake method and an auxiliary mailbox and see what difference that makes. You have me convinced that it's at least worth a try!
Thanks for the detailed explanation!
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Move generation for bitboards
Perhaps a faster way to do this is:
code = Pawns >> square & 1;
code |= Knights >> --square & 2;
code |= Bishops >> --square & 4;
code |= Rooks >> --square & 8;
code |= Queens >> --square & 16;
return piece[ code];
That replaces all the ORing you do with the bitboards to get a specific code for each piece by just a single lookup at the end. It also avoids using two shifts per bitboard, replacing one of those by a decrement. (CPUs typically can do 3 logical operations per clock, but only a single shift.) It also assumes that you would only apply it to a case where you already know the square contains an enemy piece, so that you only have to calculate type, not color. Note that the King board is not tested, so that a King would result in code = 0; this assumes there would be no confusion with empty squares, because you would never apply it to those either. I.e., you would intersect the bitboard of pseudo-legal moves for a piece with the enemy color bitboard first, and extract captures from that. While the non-captures would be extracted in a separate loop after intersecting with ~Occupied.
I guess that you could do even better on i386 or amd64 architecture, making use of the 'load effective address' instruction that does A = B + 2*C (with registers A, B and C):
code = Pawns >> square & 1;
code = 2*code + Knights >> square & 1;
code = 2*code + Bishops >> square & 1;
code = 2*code + Rooks >> square & 1;
code = 2*code + Queens >> square & 1;
return piece[ code];
I still count 20 instructions there. Which you would have to do for every capture. Even the relatively inefficient copy-make of a 64-slot mailbox board would only take 16 instructions (8 load, 8 store) for the copy. And 3 for the make (assuming the fromSqr, toSqr have already been brought into a register for the purpose of updating the bitboard stack, so just a load and two stores). And you would only do that for the moves that are actually searched.
The only merit of copy-make is that it saves you the unmake. But unmake of a normal move for a mailbox is very cheap: load fromSqr and toSqr, load and store the moving piece, and load and store the victim (from move to board). That is just 6 instructions.
I am not trying to argue that copy-make is always bad. I used it myself for the attack map in The Mailbox Trials. This map was also a mailbox structure (i.e. an array of individually accessible elements, but not a board indexed by square number, but a piece list indexed by victim), and quite large (as the elements were ulong). The problem there was that a 'make' typically updated many elements of the array, and in an irreversible way. So that you would have to remember which elements were changed, and what their old value was. Storing and retrieving that information proved just as bad as the copy-make: per changed element I had to load the old value and store it somehwere, as well as the index. And during unmake load both again and store back. Both with some loop overhead as well (decrement index and branch). So that is 10 instructions per changed element, while copying the element unconditionally before make was just 2 instructions.
Mailbox on a game like Reversi/Othello would not be nearly such a good idea as for Chess...
code = Pawns >> square & 1;
code |= Knights >> --square & 2;
code |= Bishops >> --square & 4;
code |= Rooks >> --square & 8;
code |= Queens >> --square & 16;
return piece[ code];
That replaces all the ORing you do with the bitboards to get a specific code for each piece by just a single lookup at the end. It also avoids using two shifts per bitboard, replacing one of those by a decrement. (CPUs typically can do 3 logical operations per clock, but only a single shift.) It also assumes that you would only apply it to a case where you already know the square contains an enemy piece, so that you only have to calculate type, not color. Note that the King board is not tested, so that a King would result in code = 0; this assumes there would be no confusion with empty squares, because you would never apply it to those either. I.e., you would intersect the bitboard of pseudo-legal moves for a piece with the enemy color bitboard first, and extract captures from that. While the non-captures would be extracted in a separate loop after intersecting with ~Occupied.
I guess that you could do even better on i386 or amd64 architecture, making use of the 'load effective address' instruction that does A = B + 2*C (with registers A, B and C):
code = Pawns >> square & 1;
code = 2*code + Knights >> square & 1;
code = 2*code + Bishops >> square & 1;
code = 2*code + Rooks >> square & 1;
code = 2*code + Queens >> square & 1;
return piece[ code];
I still count 20 instructions there. Which you would have to do for every capture. Even the relatively inefficient copy-make of a 64-slot mailbox board would only take 16 instructions (8 load, 8 store) for the copy. And 3 for the make (assuming the fromSqr, toSqr have already been brought into a register for the purpose of updating the bitboard stack, so just a load and two stores). And you would only do that for the moves that are actually searched.
The only merit of copy-make is that it saves you the unmake. But unmake of a normal move for a mailbox is very cheap: load fromSqr and toSqr, load and store the moving piece, and load and store the victim (from move to board). That is just 6 instructions.
I am not trying to argue that copy-make is always bad. I used it myself for the attack map in The Mailbox Trials. This map was also a mailbox structure (i.e. an array of individually accessible elements, but not a board indexed by square number, but a piece list indexed by victim), and quite large (as the elements were ulong). The problem there was that a 'make' typically updated many elements of the array, and in an irreversible way. So that you would have to remember which elements were changed, and what their old value was. Storing and retrieving that information proved just as bad as the copy-make: per changed element I had to load the old value and store it somehwere, as well as the index. And during unmake load both again and store back. Both with some loop overhead as well (decrement index and branch). So that is 10 instructions per changed element, while copying the element unconditionally before make was just 2 instructions.
Mailbox on a game like Reversi/Othello would not be nearly such a good idea as for Chess...
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Move generation for bitboards
I never had a situation where I wanted to know which piece I took - because for hashing I am using a quadboard which just multiplies with a fixed random AVX2 register (32x8bit) and horizontal XORs those bits together to get a good hash in two clock cycles. (probably that is even faster than a lookup into the big zobrist table to be honest and with the correct 16 piece encoding you even get castling and EP hashed in there)hgm wrote: ↑Sat Feb 19, 2022 12:03 pm Perhaps a faster way to do this is:
code = Pawns >> square & 1;
code |= Knights >> --square & 2;
code |= Bishops >> --square & 4;
code |= Rooks >> --square & 8;
code |= Queens >> --square & 16;
return piece[ code];
Mailbox on a game like Reversi/Othello would not be nearly such a good idea as for Chess...
If you were really keen on knowing which piece you just took because your algo demands it in a bitboard situation you can either have a 6x64 board and directly mask compare with two avx2 instructions - or a quadbard and there you can literally just mask the bit and _movemask_epi8 to get the piececode in two intructions.
Array access is something to be avoided - and if avoided - the gains are often bigger than expected. The lookup cost is comparable to a conditional branch if the index needs to be calculated.
Last edited by dangi12012 on Sat Feb 19, 2022 2:30 pm, edited 5 times in total.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Move generation for bitboards
Well, if your chess engine cannot see the difference between capturing a Queen and capturing a Pawn, I guess it isn't very strong. 

-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Move generation for bitboards
No thats a misunderstanding - I said that hashing with QBB is two unconditional instructions and that is objectively superiour to zobrist - scoring that resulting position which corresponds to that hash is another matter entirely.
The cool thing with BB you can move multiple pieces at once with XOR for moves and even castling and EP unconditionally and impossibly fast to beat by any other method.
Also you want to know how many pieces of a type there are? With BB and popcount it is a 0.25 clock cycle instruction. Maintaining a piececount list or looping over the 64 squares will be at least 10x slower.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Move generation for bitboards
The problem is that you don't just want to know how many pieces there are of one type, but of all 10 types. And also where the are. The evaluation requires you to know SUM PST[piece][location[piece]] over all 32 pieces. Calculating that from scratch would be 10-100 times slower than the hand full of instructions that would be needed to keep track of it incrementally.
-
- Posts: 467
- Joined: Fri Dec 16, 2016 11:04 am
- Location: France
- Full name: Richard Delorme
Re: Move generation for bitboards
Bitboards work great for Reversi/Othello but a mailbox is not that bad. Many strong Othello programs like Logistello used a mailbox.
I guess for bigger board size like 10x10 or more, having a mailbox representation is simpler also.
Richard Delorme
-
- Posts: 157
- Joined: Fri Apr 30, 2021 7:19 am
- Full name: Pedro Duran
Re: Move generation for bitboards
Are there a significant increase in performance using quad bitboards instead of 8 bitboards?
Denser board is what i'm talking about
https://www.chessprogramming.org/Bitboa ... nser_Board
Denser board is what i'm talking about
https://www.chessprogramming.org/Bitboa ... nser_Board