Sure, it can be used. But Reversi moves typically change many squares. It is expensive to keep track of what (but at least you don't have to remember how; it always flips, except for the dropped stone). Copy-make might not be su h a bad idea there.
Move generation for bitboards
Moderator: Ras
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Move generation for bitboards
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Move generation for bitboards
Denser board is typically much faster than QBB because you dont use vertical compressed nibbles there. But it highly depends on how your move stuct is structured.pedrojdm2021 wrote: ↑Sat Feb 19, 2022 4:55 pm Are there a significant increase in performance using quad bitboards instead of 8 bitboards?
Denser board is what i'm talking about
https://www.chessprogramming.org/Bitboa ... nser_Board
As always - just test it. You can have any backing struct with the interface functions: WhiteQueen() BlackPawn() etc.. and just measure your perft nps.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 157
- Joined: Fri Apr 30, 2021 7:19 am
- Full name: Pedro Duran
Re: Move generation for bitboards
Great, thank you!dangi12012 wrote: ↑Sat Feb 19, 2022 7:34 pmDenser board is typically much faster than QBB because you dont use vertical compressed nibbles there. But it highly depends on how your move stuct is structured.pedrojdm2021 wrote: ↑Sat Feb 19, 2022 4:55 pm Are there a significant increase in performance using quad bitboards instead of 8 bitboards?
Denser board is what i'm talking about
https://www.chessprogramming.org/Bitboa ... nser_Board
As always - just test it. You can have any backing struct with the interface functions: WhiteQueen() BlackPawn() etc.. and just measure your perft nps.
-
- Posts: 915
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Move generation for bitboards
This does not work because the rightshift operator does not support negative operands and square could legally have a starting value smaller 4.
This is true. The context in which GetPiece() is usually called during move generation allows some simplifications based on what you already know must be true.hgm wrote: ↑Sat Feb 19, 2022 12:03 pm It also assumes that you would only apply it to a case where you already know the square contains an enemy piece, so that you only have to calculate type, not color. Note that the King board is not tested, so that a King would result in code = 0; this assumes there would be no confusion with empty squares, because you would never apply it to those either. I.e., you would intersect the bitboard of pseudo-legal moves for a piece with the enemy color bitboard first, and extract captures from that.
So instead of
Code: Select all
public Piece GetPiece(int square)
{
return (Piece)(Bit(Black | White, square, 0) |
Bit(White, square, 1) |
Bit(Pawns | Bishops | Queens, square, 2) |
Bit(Knights | Bishops | Kings, square, 3) |
Bit(Kings | Rooks | Queens, square, 4));
}
Code: Select all
public Piece GetCaptureTarget(int square)
{
return (Piece)(Bit(Pawns | Bishops | Queens, square, 2) |
Bit(Knights | Bishops, square, 3) |
Bit(Rooks | Queens, square, 4));
}
Code: Select all
Piece captured = GetCaptureTarget(square) | Piece.Black;
Now, let's talk about real world performance impacts measured in my tests:
For some reason that I don't fully understand searching even depths generates a lot more (4:1) captures than quiet moves. Unlike odd depths where the capture/quiet ratio is much more balanced (~3:2).
But I used that strange fact for my tests: On depth 8 over 300 test-positions there 2.3B captures generated and only 0.5B quiets. That's a lot of GetPiece() calls. The cumulative time to depth 8 is 240s (+/-) 2 seconds variance between runs.
And sadly all the difference that different GetPiece implementations were making where smaller than the variance so it's kinda hard to draw conclusions which implementation is best. Using GetCaptureTarget instead of GetPiece was a small win on average. But less <1% influence on the total runtime of the search.
I wasn't aware of the LEA instruction to be honest. I'm not usually thinking about machine code much tbh. So I tried it!
Performance wise it was competitive but not better. And looking at the generated machine code it isn't really shorter than my GetCaptureTarget() implementation especially as you only get a piece code from that calculation that you still have to convert into the correct Piece value via a lookup table. I don't really like lookup tables in C# because they either have to go to the stack or to the managed heap and both feels wrong! And it will perform range checks, too, that I don't really want.
You can look at the machine code here:
https://godbolt.org/z/E7xs89rEe
...and if you resolve the code via lookup table to Piece the methods get extended by this:
Code: Select all
call [CORINFO_HELP_READYTORUN_STATIC_BASE]
mov rax, gword ptr [rax]
cmp r14d, dword ptr [rax+8]
jae SHORT G_M28135_IG04
movsxd rdi, r14d
mov eax, dword ptr [rax+4*rdi+16]
add rsp, 8
pop rbx
pop r14
ret
G_M28135_IG04:
call [CORINFO_HELP_RNGCHKFAIL]
int3

...so I think it either comes down to continuing to use the GetPiece() method that I originally had. Or using a stripped down GetCaptureTarget version that uses less bitboards. No KING (because capturing a king doesn't happen), no BLACK and no WHITE boards because from the context you know whether you expect to capture a black or a white piece.
I have never implemented an Unmake method. Maybe it's not as hard as I think!hgm wrote: ↑Sat Feb 19, 2022 12:03 pm Even the relatively inefficient copy-make of a 64-slot mailbox board would only take 16 instructions (8 load, 8 store) for the copy. And 3 for the make (assuming the fromSqr, toSqr have already been brought into a register for the purpose of updating the bitboard stack, so just a load and two stores). And you would only do that for the moves that are actually searched.
The only merit of copy-make is that it saves you the unmake. But unmake of a normal move for a mailbox is very cheap: load fromSqr and toSqr, load and store the moving piece, and load and store the victim (from move to board). That is just 6 instructions.

No, seriously. That's a good suggestion. I'll first focus on establishing feature parity with MinimalChess and then give it a try.
Sometimes it's just convenient to have the old and the new boardstate around at the same time. So for example when updating the zobrist hash most of the relevant information is encoded in the move and so it doesn't have to be my Play() method that updates the zobrist hash but a seperate method. (That I can then just *not* call when in qsearch where I don't use TT lookups anyway)
But not everything you need is stored in the move. For example when making the move caused the CastleFlags to change or the EnPassant square is set you need the Zobrist hash to reflect that. And here it's just convenient to have the old and the new boardstate still available because you can do:
Code: Select all
//En passent & Castling
for (ulong bits = (CastleFlags ^ previous.CastleFlags) | (EnPassant ^ previous.EnPassant); bits != 0; bits = ClearLSB(bits))
ZobristHash ^= Zobrist.Flags(LSB(bits));
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Move generation for bitboards
Lithander you can use System.Numerics Vector256<uint64_t> for ultra fast code.
When structures are aligned in a 4xuint64_t way something like shift mask on 4 elements at once. You can also cast your existing memory without overhead in C# since C# 8.0 with this:
All the functions:
https://docs.microsoft.com/en-us/dotnet ... ew=net-6.0
I am bringing this up because vertical compressed nibbles can be converted to a piececode with 2 instrictions:
mask + System.Numerics.Intrsinsics.x64.AVX2.MoveMask which is _mm256_movemask_epi8.
Also with AVX2 Zobrist hashing is not needed IMO because there you can calculate a hash in two instruction without any lookups. (if your board structure supports it) I might do a in depth thread on that one (because its many x faster than zobrist and collisions are equally as unprobable)
https://docs.microsoft.com/en-us/dotnet ... 2-movemask
When structures are aligned in a 4xuint64_t way something like shift mask on 4 elements at once. You can also cast your existing memory without overhead in C# since C# 8.0 with this:
Code: Select all
MemoryMarshal.Cast<UInt64, Vector256<UInt64>>(data)
https://docs.microsoft.com/en-us/dotnet ... ew=net-6.0
I am bringing this up because vertical compressed nibbles can be converted to a piececode with 2 instrictions:
mask + System.Numerics.Intrsinsics.x64.AVX2.MoveMask which is _mm256_movemask_epi8.
Also with AVX2 Zobrist hashing is not needed IMO because there you can calculate a hash in two instruction without any lookups. (if your board structure supports it) I might do a in depth thread on that one (because its many x faster than zobrist and collisions are equally as unprobable)
https://docs.microsoft.com/en-us/dotnet ... 2-movemask
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 467
- Joined: Fri Dec 16, 2016 11:04 am
- Location: France
- Full name: Richard Delorme
Re: Move generation for bitboards
I am not sure to understand either. What is the quality of the hash you get? The problem is that a bitboard from a quad-bitboard or a dense-biitboard is not a random number, and multiplying it with a random number does not give a very good hashing number. With such a method I am not sure all the slots of a transposition table will be equally used. For example it is possible all of your hash number are even and when you use it as an index to access your table, you just fill half the slots. Did you test it on a real chess engine? Did you check the usage of the hash table you got?dangi12012 wrote: ↑Sat Feb 19, 2022 2:31 pmNo thats a misunderstanding - I said that hashing with QBB is two unconditional instructions and that is objectively superiour to zobrist - scoring that resulting position which corresponds to that hash is another matter entirely.
The cool thing with BB you can move multiple pieces at once with XOR for moves and even castling and EP unconditionally and impossibly fast to beat by any other method.
No, this is not something you necessarily need in a chess engine. I have no usage of that number in my chess engine Dumb for example.Also you want to know how many pieces of a type there are?
If you maintain a piece count number this will be as fast as bitboard and popcount. If you write a chess engine one day, be aware that people will ask you a version without popcount to run on their computer. I am not sure new ARM based mac have an efficient popcount instruction for example. Then your 0.25 cycle will be a 5 cycle long set of instructions.With BB and popcount it is a 0.25 clock cycle instruction. Maintaining a piececount list or looping over the 64 squares will be at least 10x slower.
Richard Delorme
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Move generation for bitboards
To some extent I think the chess engine development world is going to too much trouble trying to cater to everyone. I don't understand why I'm seeing people asking for versions that support super-old hardware like the first Core2Duo's from 2006, or sometimes even older. I can't believe anyone is still using something like that as their main computer.
Even engines supporting things such as BMI2 and PEXT already run on Haswell, from 2013, which is 8 years old now.
At the moment I'm providing these versions:
- ancient (for the very first consumer x64 cpu's such as the Opteron)
- old (for CPU's such as the Core2Duo)
- modern (for popcount-enabled CPU's such as Nehalem)
- BMI2 (newest version at the moment)
- PEXT (in the future at some point, when I add this functionality)
I need to provide those for Windows, Linux, the Intel Macs. Those are 15 versions. I'm not even counting the 32-bit version for Windows, the several Raspberry versions that oculd be created, and the version for the M1 Mac.
To be honest, I'm often thinking about just tossing everything, and just provide ancient and BMI2 / PEXT, because the speed difference between the slowest and fastest version is about 5%. (The one version that is clearly slower, at half the speed, is the Windows 32-bit version.)
-
- Posts: 915
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Move generation for bitboards
I've never worked with the AVX2 intrinsics yet. At some point I want to try to make NNUEs work in my C# engine and I guess that's the time where AVX2 will become truly relevant.dangi12012 wrote: ↑Tue Feb 22, 2022 1:43 pm Lithander you can use System.Numerics Vector256<uint64_t> for ultra fast code.
When structures are aligned in a 4xuint64_t way something like shift mask on 4 elements at once. You can also cast your existing memory without overhead in C# since C# 8.0 with this:All the functions:Code: Select all
MemoryMarshal.Cast<UInt64, Vector256<UInt64>>(data)
https://docs.microsoft.com/en-us/dotnet ... ew=net-6.0
I am bringing this up because vertical compressed nibbles can be converted to a piececode with 2 instrictions:
mask + System.Numerics.Intrsinsics.x64.AVX2.MoveMask which is _mm256_movemask_epi8.
Also with AVX2 Zobrist hashing is not needed IMO because there you can calculate a hash in two instruction without any lookups. (if your board structure supports it) I might do a in depth thread on that one (because its many x faster than zobrist and collisions are equally as unprobable)
https://docs.microsoft.com/en-us/dotnet ... 2-movemask
But I still wanted to try and understand what you were suggesting. I guess I got... maybe... half of it?
Code: Select all
struct Data
{
public ulong A;
public ulong B;
public ulong C;
public ulong D;
}
Span<ulong> = MemoryMarshal.CreateSpan(ref data.A, 4);
Vector256<byte> vector256 = MemoryMarshal.Cast<ulong, Vector256<byte>>(span)[0];
//???
int pieceCode = Avx2.MoveMask(vector256);
And then I can get a 32bit piece-code with Avx2.MoveMask(vector256)... but first I need to get the nibbles as you call them into the right slots. And it's operating on byte level not on the ulong level that I really need.
So... what is the other AVX2 instruction you had in mind that does the equivalent of shifting the 4 bitboards by square and then clears all the most significant bits that I'm not interested in?
And finally... how do I get that 32bit piececode converted to my Piece enumeration member efficiently? 32bits is too large for a lookup table.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Move generation for bitboards
I will send a code sample! Could not do it in compiler explorer 
To get started with intrinsics its good to read every tile once: https://db.in.tum.de/~finis/x86%20intri ... 20v1.0.pdf
Many operations do not work like one expects - and have pitfalls.
But you can expect multiply, shift, AND, OR, XOR and other instructions to be available for all vector sizes.
For chess most of the time it will be a vector256<ulong> like yours. If you have a struct like that you can even replace it fully by Vector256<ulong> and have static functions to modify that "struct"
You can think of AVX2 like doing normal operations but on 4-32 elements at once (and you dont get normal operators like + and - you have to write that out via the functions)

To get started with intrinsics its good to read every tile once: https://db.in.tum.de/~finis/x86%20intri ... 20v1.0.pdf
Many operations do not work like one expects - and have pitfalls.
But you can expect multiply, shift, AND, OR, XOR and other instructions to be available for all vector sizes.
For chess most of the time it will be a vector256<ulong> like yours. If you have a struct like that you can even replace it fully by Vector256<ulong> and have static functions to modify that "struct"
You can think of AVX2 like doing normal operations but on 4-32 elements at once (and you dont get normal operators like + and - you have to write that out via the functions)
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer