Move generation for bitboards

hgm · Post by **hgm** » Sat Feb 19, 2022 7:32 pm

abulmo2 wrote: ↑Sat Feb 19, 2022 4:48 pm
hgm wrote: ↑Sat Feb 19, 2022 12:03 pm Mailbox on a game like Reversi/Othello would not be nearly such a good idea as for Chess...
Bitboards work great for Reversi/Othello but a mailbox is not that bad. Many strong Othello programs like Logistello used a mailbox.
I guess for bigger board size like 10x10 or more, having a mailbox representation is simpler also.

Sure, it can be used. But Reversi moves typically change many squares. It is expensive to keep track of what (but at least you don't have to remember how; it always flips, except for the dropped stone). Copy-make might not be su h a bad idea there.

dangi12012 · Post by **dangi12012** » Sat Feb 19, 2022 7:34 pm

pedrojdm2021 wrote: ↑Sat Feb 19, 2022 4:55 pm Are there a significant increase in performance using quad bitboards instead of 8 bitboards?

Denser board is what i'm talking about
https://www.chessprogramming.org/Bitboa ... nser_Board

Denser board is typically much faster than QBB because you dont use vertical compressed nibbles there. But it highly depends on how your move stuct is structured.
As always - just test it. You can have any backing struct with the interface functions: WhiteQueen() BlackPawn() etc.. and just measure your perft nps.

pedrojdm2021 · Post by **pedrojdm2021** » Sun Feb 20, 2022 4:14 am

dangi12012 wrote: ↑Sat Feb 19, 2022 7:34 pm
pedrojdm2021 wrote: ↑Sat Feb 19, 2022 4:55 pm Are there a significant increase in performance using quad bitboards instead of 8 bitboards?

Denser board is what i'm talking about
https://www.chessprogramming.org/Bitboa ... nser_Board
Denser board is typically much faster than QBB because you dont use vertical compressed nibbles there. But it highly depends on how your move stuct is structured.
As always - just test it. You can have any backing struct with the interface functions: WhiteQueen() BlackPawn() etc.. and just measure your perft nps.

Great, thank you!

lithander · Post by **lithander** » Tue Feb 22, 2022 1:23 pm

hgm wrote: ↑Sat Feb 19, 2022 12:03 pm Perhaps a faster way to do this is:

code = Pawns >> square & 1;
code |= Knights >> --square & 2;
code |= Bishops >> --square & 4;
code |= Rooks >> --square & 8;
code |= Queens >> --square & 16;
return piece[ code];

This does not work because the rightshift operator does not support negative operands and square could legally have a starting value smaller 4.

hgm wrote: ↑Sat Feb 19, 2022 12:03 pm It also assumes that you would only apply it to a case where you already know the square contains an enemy piece, so that you only have to calculate type, not color. Note that the King board is not tested, so that a King would result in code = 0; this assumes there would be no confusion with empty squares, because you would never apply it to those either. I.e., you would intersect the bitboard of pseudo-legal moves for a piece with the enemy color bitboard first, and extract captures from that.

This is true. The context in which GetPiece() is usually called during move generation allows some simplifications based on what you already know must be true.

So instead of

Code: Select all

public Piece GetPiece(int square)
{
	return (Piece)(Bit(Black | White, square, 0) |
				   Bit(White, square, 1) |
				   Bit(Pawns | Bishops | Queens, square, 2) |
				   Bit(Knights | Bishops | Kings, square, 3) |
				   Bit(Kings | Rooks | Queens, square, 4));
}

...I can also do just do:

Code: Select all

public Piece GetCaptureTarget(int square)
{
    return (Piece)(Bit(Pawns | Bishops | Queens, square, 2) |
                   Bit(Knights | Bishops, square, 3) |
                   Bit(Rooks | Queens, square, 4));
}

and use it like this

Code: Select all

Piece captured = GetCaptureTarget(square) | Piece.Black;

if I'm generating white captures.

Now, let's talk about real world performance impacts measured in my tests:

For some reason that I don't fully understand searching even depths generates a lot more (4:1) captures than quiet moves. Unlike odd depths where the capture/quiet ratio is much more balanced (~3:2).
But I used that strange fact for my tests: On depth 8 over 300 test-positions there 2.3B captures generated and only 0.5B quiets. That's a lot of GetPiece() calls. The cumulative time to depth 8 is 240s (+/-) 2 seconds variance between runs.

And sadly all the difference that different GetPiece implementations were making where smaller than the variance so it's kinda hard to draw conclusions which implementation is best. Using GetCaptureTarget instead of GetPiece was a small win on average. But less <1% influence on the total runtime of the search.

hgm wrote: ↑Sat Feb 19, 2022 12:03 pm code = Pawns >> square & 1;
code = 2*code + Knights >> square & 1;
code = 2*code + Bishops >> square & 1;
code = 2*code + Rooks >> square & 1;
code = 2*code + Queens >> square & 1;
return piece[ code];

I wasn't aware of the LEA instruction to be honest. I'm not usually thinking about machine code much tbh. So I tried it!

Performance wise it was competitive but not better. And looking at the generated machine code it isn't really shorter than my GetCaptureTarget() implementation especially as you only get a piece code from that calculation that you still have to convert into the correct Piece value via a lookup table. I don't really like lookup tables in C# because they either have to go to the stack or to the managed heap and both feels wrong! And it will perform range checks, too, that I don't really want.

You can look at the machine code here:
https://godbolt.org/z/E7xs89rEe

...and if you resolve the code via lookup table to Piece the methods get extended by this:

Code: Select all

       call     [CORINFO_HELP_READYTORUN_STATIC_BASE]
       mov      rax, gword ptr [rax]
       cmp      r14d, dword ptr [rax+8]
       jae      SHORT G_M28135_IG04
       movsxd   rdi, r14d
       mov      eax, dword ptr [rax+4*rdi+16]
       add      rsp, 8
       pop      rbx
       pop      r14
       ret      
G_M28135_IG04:
       call     [CORINFO_HELP_RNGCHKFAIL]
       int3

So, when you look at my C# code and wonder why I'm not using a table but arithmetics. That's the reason!

...so I think it either comes down to continuing to use the GetPiece() method that I originally had. Or using a stripped down GetCaptureTarget version that uses less bitboards. No KING (because capturing a king doesn't happen), no BLACK and no WHITE boards because from the context you know whether you expect to capture a black or a white piece.

hgm wrote: ↑Sat Feb 19, 2022 12:03 pm Even the relatively inefficient copy-make of a 64-slot mailbox board would only take 16 instructions (8 load, 8 store) for the copy. And 3 for the make (assuming the fromSqr, toSqr have already been brought into a register for the purpose of updating the bitboard stack, so just a load and two stores). And you would only do that for the moves that are actually searched.

The only merit of copy-make is that it saves you the unmake. But unmake of a normal move for a mailbox is very cheap: load fromSqr and toSqr, load and store the moving piece, and load and store the victim (from move to board). That is just 6 instructions.

I have never implemented an Unmake method. Maybe it's not as hard as I think!

No, seriously. That's a good suggestion. I'll first focus on establishing feature parity with MinimalChess and then give it a try.

hgm wrote: ↑Sat Feb 19, 2022 12:03 pm I am not trying to argue that copy-make is always bad.

Sometimes it's just convenient to have the old and the new boardstate around at the same time. So for example when updating the zobrist hash most of the relevant information is encoded in the move and so it doesn't have to be my Play() method that updates the zobrist hash but a seperate method. (That I can then just *not* call when in qsearch where I don't use TT lookups anyway)

But not everything you need is stored in the move. For example when making the move caused the CastleFlags to change or the EnPassant square is set you need the Zobrist hash to reflect that. And here it's just convenient to have the old and the new boardstate still available because you can do:

Code: Select all

            //En passent & Castling
            for (ulong bits = (CastleFlags ^ previous.CastleFlags) | (EnPassant ^ previous.EnPassant); bits != 0; bits = ClearLSB(bits))
                ZobristHash ^= Zobrist.Flags(LSB(bits));

dangi12012 · Post by **dangi12012** » Tue Feb 22, 2022 1:43 pm

Lithander you can use System.Numerics Vector256<uint64_t> for ultra fast code.
When structures are aligned in a 4xuint64_t way something like shift mask on 4 elements at once. You can also cast your existing memory without overhead in C# since C# 8.0 with this:

Code: Select all

MemoryMarshal.Cast<UInt64, Vector256<UInt64>>(data)

All the functions:
https://docs.microsoft.com/en-us/dotnet ... ew=net-6.0

I am bringing this up because vertical compressed nibbles can be converted to a piececode with 2 instrictions:
mask + System.Numerics.Intrsinsics.x64.AVX2.MoveMask which is _mm256_movemask_epi8.

Also with AVX2 Zobrist hashing is not needed IMO because there you can calculate a hash in two instruction without any lookups. (if your board structure supports it) I might do a in depth thread on that one (because its many x faster than zobrist and collisions are equally as unprobable)

https://docs.microsoft.com/en-us/dotnet ... 2-movemask

abulmo2 · Post by **abulmo2** » Tue Feb 22, 2022 3:59 pm

dangi12012 wrote: ↑Sat Feb 19, 2022 2:31 pm
hgm wrote: ↑Sat Feb 19, 2022 2:29 pm Well, if your chess engine cannot see the difference between capturing a Queen and capturing a Pawn, I guess it isn't very strong.
No thats a misunderstanding - I said that hashing with QBB is two unconditional instructions and that is objectively superiour to zobrist - scoring that resulting position which corresponds to that hash is another matter entirely.
The cool thing with BB you can move multiple pieces at once with XOR for moves and even castling and EP unconditionally and impossibly fast to beat by any other method.

I am not sure to understand either. What is the quality of the hash you get? The problem is that a bitboard from a quad-bitboard or a dense-biitboard is not a random number, and multiplying it with a random number does not give a very good hashing number. With such a method I am not sure all the slots of a transposition table will be equally used. For example it is possible all of your hash number are even and when you use it as an index to access your table, you just fill half the slots. Did you test it on a real chess engine? Did you check the usage of the hash table you got?

Also you want to know how many pieces of a type there are?

No, this is not something you necessarily need in a chess engine. I have no usage of that number in my chess engine Dumb for example.

With BB and popcount it is a 0.25 clock cycle instruction. Maintaining a piececount list or looping over the 64 squares will be at least 10x slower.

If you maintain a piece count number this will be as fast as bitboard and popcount. If you write a chess engine one day, be aware that people will ask you a version without popcount to run on their computer. I am not sure new ARM based mac have an efficient popcount instruction for example. Then your 0.25 cycle will be a 5 cycle long set of instructions.

mvanthoor · Post by **mvanthoor** » Tue Feb 22, 2022 4:42 pm

abulmo2 wrote: ↑Tue Feb 22, 2022 3:59 pm ]If you maintain a piece count number this will be as fast as bitboard and popcount. If you write a chess engine one day, be aware that people will ask you a version without popcount to run on their computer.

To some extent I think the chess engine development world is going to too much trouble trying to cater to everyone. I don't understand why I'm seeing people asking for versions that support super-old hardware like the first Core2Duo's from 2006, or sometimes even older. I can't believe anyone is still using something like that as their main computer.

Even engines supporting things such as BMI2 and PEXT already run on Haswell, from 2013, which is 8 years old now.

At the moment I'm providing these versions:
- ancient (for the very first consumer x64 cpu's such as the Opteron)
- old (for CPU's such as the Core2Duo)
- modern (for popcount-enabled CPU's such as Nehalem)
- BMI2 (newest version at the moment)
- PEXT (in the future at some point, when I add this functionality)

I need to provide those for Windows, Linux, the Intel Macs. Those are 15 versions. I'm not even counting the 32-bit version for Windows, the several Raspberry versions that oculd be created, and the version for the M1 Mac.

To be honest, I'm often thinking about just tossing everything, and just provide ancient and BMI2 / PEXT, because the speed difference between the slowest and fastest version is about 5%. (The one version that is clearly slower, at half the speed, is the Windows 32-bit version.)

lithander · Post by **lithander** » Wed Feb 23, 2022 12:36 pm

dangi12012 wrote: ↑Tue Feb 22, 2022 1:43 pm Lithander you can use System.Numerics Vector256<uint64_t> for ultra fast code.
When structures are aligned in a 4xuint64_t way something like shift mask on 4 elements at once. You can also cast your existing memory without overhead in C# since C# 8.0 with this:
Code: Select all
MemoryMarshal.Cast<UInt64, Vector256<UInt64>>(data)
All the functions:
https://docs.microsoft.com/en-us/dotnet ... ew=net-6.0

I am bringing this up because vertical compressed nibbles can be converted to a piececode with 2 instrictions:
mask + System.Numerics.Intrsinsics.x64.AVX2.MoveMask which is _mm256_movemask_epi8.

Also with AVX2 Zobrist hashing is not needed IMO because there you can calculate a hash in two instruction without any lookups. (if your board structure supports it) I might do a in depth thread on that one (because its many x faster than zobrist and collisions are equally as unprobable)

https://docs.microsoft.com/en-us/dotnet ... 2-movemask

I've never worked with the AVX2 intrinsics yet. At some point I want to try to make NNUEs work in my C# engine and I guess that's the time where AVX2 will become truly relevant.

But I still wanted to try and understand what you were suggesting. I guess I got... maybe... half of it?

Code: Select all

struct Data
{
        public ulong A;
        public ulong B;
        public ulong C;
        public ulong D;
}

Span<ulong> = MemoryMarshal.CreateSpan(ref data.A, 4);
Vector256<byte> vector256 = MemoryMarshal.Cast<ulong, Vector256<byte>>(span)[0];
//???
int pieceCode = Avx2.MoveMask(vector256);

So first I have to create a span from my struct. Than cast that span to one of Vector256<byte> so that I can apply AVX2 instructions on it.
And then I can get a 32bit piece-code with Avx2.MoveMask(vector256)... but first I need to get the nibbles as you call them into the right slots. And it's operating on byte level not on the ulong level that I really need.

So... what is the other AVX2 instruction you had in mind that does the equivalent of shifting the 4 bitboards by square and then clears all the most significant bits that I'm not interested in?

And finally... how do I get that 32bit piececode converted to my Piece enumeration member efficiently? 32bits is too large for a lookup table.

dangi12012 · Post by **dangi12012** » Wed Feb 23, 2022 6:46 pm

I will send a code sample! Could not do it in compiler explorer

To get started with intrinsics its good to read every tile once: https://db.in.tum.de/~finis/x86%20intri ... 20v1.0.pdf
Many operations do not work like one expects - and have pitfalls.
But you can expect multiply, shift, AND, OR, XOR and other instructions to be available for all vector sizes.

For chess most of the time it will be a vector256<ulong> like yours. If you have a struct like that you can even replace it fully by Vector256<ulong> and have static functions to modify that "struct"

You can think of AVX2 like doing normal operations but on 4-32 elements at once (and you dont get normal operators like + and - you have to write that out via the functions)

Move generation for bitboards

Re: Move generation for bitboards

Re: Move generation for bitboards

Re: Move generation for bitboards

Re: Move generation for bitboards

Re: Move generation for bitboards

Re: Move generation for bitboards

Re: Move generation for bitboards

Re: Move generation for bitboards

Re: Move generation for bitboards