Bitboards ?: still confused where the time savings is

Discussion of chess software programming and technical issues.

Moderator: Ras

Mergi
Posts: 127
Joined: Sat Aug 21, 2021 9:55 pm
Full name: Jen

Re: Bitboards ?: still confused where the time savings is

Post by Mergi »

dangi12012 wrote: Sat Nov 06, 2021 2:47 pm
What the huge advantage is: When you change N squares in a mailslot you have to index every square sequentially. When you change N squares in a bitboard it is a single x64 instruction.
That allowes move pruning and check evasion and pins - and nasty pins like the horizontal EP pins to never take more than 2 instructions to be resolved. And all that with less space and time compared to arrays.

Not that its useful but my prime example is : Moving all 8 pawns up one rank on an empty board:

Code: Select all

Wpawns <<= 8;
Checking how many are ready to be promoted now:

Code: Select all

int promo = BitCount(WPawns & LastRank);
Now how would that look in mailslot? I am asking because I am sure it would be more than 3 instructions long for 8 pawns and 8 checks for all 8 promotion squares :)
The code you provided is like the absolute basics of any bitboard engine. The slow part is generating and storing the moves. I would be very curious if you could go into more detail about how you are able to avoid that and perhaps provide some pseudo code as well.

Code: Select all

// after generating the possible_moves bitboard with one or two instructions, this is the slow part
while(possible_moves) {
	Square to_square = PopLsb(possible_moves);
	Square from_square = to_square - push_dir;
	PutMoveToList(to_square, from_square);
}
I mean even in your pseudo code you use "WPanws" bitboard, so you probably have a bitboard for each piece? Then if a capture occurs, you still need to remove a piece from the corresponding bitboard, so it's really not just as simple as piece ^= (from | to) , but you also need to take the captured piece away using something like bitboards_array[captured_piece_type] ^= to_square. In mailbox you might not even need to do that, as moving the piece will overwrite whatever piece is on the destination square in the array.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Bitboards ?: still confused where the time savings is

Post by dangi12012 »

Mergi wrote: Sat Nov 06, 2021 4:04 pm
dangi12012 wrote: Sat Nov 06, 2021 2:47 pm
What the huge advantage is: When you change N squares in a mailslot you have to index every square sequentially. When you change N squares in a bitboard it is a single x64 instruction.
That allowes move pruning and check evasion and pins - and nasty pins like the horizontal EP pins to never take more than 2 instructions to be resolved. And all that with less space and time compared to arrays.

Not that its useful but my prime example is : Moving all 8 pawns up one rank on an empty board:

Code: Select all

Wpawns <<= 8;
Checking how many are ready to be promoted now:

Code: Select all

int promo = BitCount(WPawns & LastRank);
Now how would that look in mailslot? I am asking because I am sure it would be more than 3 instructions long for 8 pawns and 8 checks for all 8 promotion squares :)
The code you provided is like the absolute basics of any bitboard engine. The slow part is generating and storing the moves. I would be very curious if you could go into more detail about how you are able to avoid that and perhaps provide some pseudo code as well.

Code: Select all

// after generating the possible_moves bitboard with one or two instructions, this is the slow part
while(possible_moves) {
	Square to_square = PopLsb(possible_moves);
	Square from_square = to_square - push_dir;
	PutMoveToList(to_square, from_square);
}
I mean even in your pseudo code you use "WPanws" bitboard, so you probably have a bitboard for each piece? Then if a capture occurs, you still need to remove a piece from the corresponding bitboard, so it's really not just as simple as piece ^= (from | to) , but you also need to take the captured piece away using something like bitboards_array[captured_piece_type] ^= to_square. In mailbox you might not even need to do that, as moving the piece will overwrite whatever piece is on the destination square in the array.
the from_square is known since its the piece you just in this moment got the possible_moves for - to_square is correct.
The answer to why you dont need to put it on a list is under The Visitor Pattern: https://www.codeproject.com/Articles/53 ... egenerator

The answer to what needs to be done on a taking move is there:
https://github.com/Gigantua/Gigantua/bl ... e.hpp#L630

In short - the algo knows if the move is taking and what piece is moving (constexpr parameter is not a runtime construct). The whole block of code you see there is always one line of code after preprocessing:

If its not taking it boils down to a single xor instruction.
If its a taking move you need to Andnot with the bit you jumped to in the enemy. So thats 5 and + 1 xor.

What you dont need is an array of 64 slots - which doesnt fit in the 16 general purpouse registers of x86-64.
One bitboards fits inside the registers perfectly - with room left for simple algorithms.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mergi
Posts: 127
Joined: Sat Aug 21, 2021 9:55 pm
Full name: Jen

Re: Bitboards ?: still confused where the time savings is

Post by Mergi »

dangi12012 wrote: Sat Nov 06, 2021 4:30 pm
the from_square is known since its the piece you just in this moment got the possible_moves for - to_square is correct.
Im confused. How do you know the from square if you just did

Code: Select all

Wpawns <<= 8;
that was in your example.
User avatar
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Bitboards ?: still confused where the time savings is

Post by hgm »

dangi12012 wrote: Sat Nov 06, 2021 4:30 pm the from_square is known since its the piece you just in this moment got the possible_moves for - to_square is correct.
The answer to why you dont need to put it on a list is under The Visitor Pattern: https://www.codeproject.com/Articles/53 ... egenerator

The answer to what needs to be done on a taking move is there:
https://github.com/Gigantua/Gigantua/bl ... e.hpp#L630

In short - the algo knows if the move is taking and what piece is moving (constexpr parameter is not a runtime construct). The whole block of code you see there is always one line of code after preprocessing:

If its not taking it boils down to a single xor instruction.
If its a taking move you need to Andnot with the bit you jumped to in the enemy. So thats 5 and + 1 xor.

What you dont need is an array of 64 slots - which doesnt fit in the 16 general purpouse registers of x86-64.
One bitboards fits inside the registers perfectly - with room left for simple algorithms.
So you just clear the to-square in all 6 opponent boards, because you don't know what piece is captured at that point. This is possible, although it doesn't sound very efficient. But having all these boards permanently in a register might make it bearable.

Problem is: how do you now update the piece-square evaluation, not knowing the piece type of the victim? And the hash key? And the pawn-hash key? And the material key?
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Bitboards ?: still confused where the time savings is

Post by dangi12012 »

hgm wrote: Sat Nov 06, 2021 5:30 pm
dangi12012 wrote: Sat Nov 06, 2021 4:30 pm the from_square is known since its the piece you just in this moment got the possible_moves for - to_square is correct.
The answer to why you dont need to put it on a list is under The Visitor Pattern: https://www.codeproject.com/Articles/53 ... egenerator

The answer to what needs to be done on a taking move is there:
https://github.com/Gigantua/Gigantua/bl ... e.hpp#L630

In short - the algo knows if the move is taking and what piece is moving (constexpr parameter is not a runtime construct). The whole block of code you see there is always one line of code after preprocessing:

If its not taking it boils down to a single xor instruction.
If its a taking move you need to Andnot with the bit you jumped to in the enemy. So thats 5 and + 1 xor.

What you dont need is an array of 64 slots - which doesnt fit in the 16 general purpouse registers of x86-64.
One bitboards fits inside the registers perfectly - with room left for simple algorithms.
So you just clear the to-square in all 6 opponent boards, because you don't know what piece is captured at that point. This is possible, although it doesn't sound very efficient. But having all these boards permanently in a register might make it bearable.

Problem is: how do you now update the piece-square evaluation, not knowing the piece type of the victim? And the hash key? And the pawn-hash key? And the material key?
Keep in mind: the 5bit clearing of the bits is the worst case for moves where we take (which is known from the algo so its free). Literally all non taking moves take one instruction.

The other points - is what i am working on. I have some working solutions. Some may be efficient - some may be slow. Time will tell.
One thing I can say is that a neural network with 1 layer is mathematically identical to a piece square evaluation with the advantage that the entries are learned. - Networks with multilayer can give you eval for forks etc too.
I dont wanna get ahead of myself - but zobrist is not the future with bitbaords. You can get 1 bit neural network implemented quite efficiently and that way you wouldnt need zobrist keys anymore. (A few xors on a smaller 6 register representation suffices)

Incremental work with zobrist might be slower than a hash that works directly on the bitboards.
Because two lookups into tables living in L1 cache for incremental xoring will be slower than 2-3 instructions working on registers in terms of throughput.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Bitboards ?: still confused where the time savings is

Post by hgm »

dangi12012 wrote: Sat Nov 06, 2021 7:22 pmKeep in mind: the 5bit clearing of the bits is the worst case for moves where we take (which is known from the algo so its free). Literally all non taking moves take one instruction.
Right. But that is 95% of all moves, or so.
The other points - is what i am working on. I have some working solutions. Some may be efficient - some may be slow. Time will tell.
One thing I can say is that a neural network with 1 layer is mathematically identical to a piece square evaluation with the advantage that the entries are learned. - Networks with multilayer can give you eval for forks etc too.
I dont wanna get ahead of myself - but zobrist is not the future with bitbaords. You can get 1 bit neural network implemented quite efficiently and that way you wouldnt need zobrist keys anymore. (A few xors on a smaller 6 register representation suffices)

Incremental work with zobrist might be slower than a hash that works directly on the bitboards.
Because two lookups into tables living in L1 cache for incremental xoring will be slower than 2-3 instructions working on registers in terms of throughput.
I think you would need a lot more than 2-3 instructions, because the hash key must depend on all 12 bitboards. And you cannot do the capture part incrementally on the modified opponent bitboard, because you are unaware which one that is. So you would have to recalculate at least the entire opponent part from scratch.

It will cause tremendous difficulties in many places when you do not know the type of the captured piece. While it is relatively cheap to know that piece type. And then you could use it to solve all these problems.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Bitboards ?: still confused where the time savings is

Post by dangi12012 »

hgm wrote: Sat Nov 06, 2021 7:38 pm
dangi12012 wrote: Sat Nov 06, 2021 7:22 pmKeep in mind: the 5bit clearing of the bits is the worst case for moves where we take (which is known from the algo so its free). Literally all non taking moves take one instruction.
Right. But that is 95% of all moves, or so.
The other points - is what i am working on. I have some working solutions. Some may be efficient - some may be slow. Time will tell.
One thing I can say is that a neural network with 1 layer is mathematically identical to a piece square evaluation with the advantage that the entries are learned. - Networks with multilayer can give you eval for forks etc too.
I dont wanna get ahead of myself - but zobrist is not the future with bitbaords. You can get 1 bit neural network implemented quite efficiently and that way you wouldnt need zobrist keys anymore. (A few xors on a smaller 6 register representation suffices)

Incremental work with zobrist might be slower than a hash that works directly on the bitboards.
Because two lookups into tables living in L1 cache for incremental xoring will be slower than 2-3 instructions working on registers in terms of throughput.
I think you would need a lot more than 2-3 instructions, because the hash key must depend on all 12 bitboards. And you cannot do the capture part incrementally on the modified opponent bitboard, because you are unaware which one that is. So you would have to recalculate at least the entire opponent part from scratch.

It will cause tremendous difficulties in many places when you do not know the type of the captured piece. While it is relatively cheap to know that piece type. And then you could use it to solve all these problems.
95% of moves of all chessmoves are taking moves? Probably a little bit overestimaged - no?
I think sending back and forth does not advance chess. Better implement ideas and see which one ends up with +elo or more nps.
Greetings!
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Bitboards ?: still confused where the time savings is

Post by hgm »

Well, what do you expect? About 85% of all nodes are QS nodes, and QS searches only captures. And most cut-nodes in the full-width search are not making (or generating) a move at all, because of null-move pruning.

Elo of course is not a good metric, because it will be dominated by the quality of the evaluation. And it is hard to measure (thousands of games...). As this is about speed, measuring nps in a search of one or a few representative testpositions, with some standard evaluation, is the way to go. This is the benchmark I used in the mailbox trials. The exact statistics of the search tree you can find there(*).

If you want to join in with a bitboard implementation, you are of course welcome.

*) Actually the relevant number came from here; after implementation of null-move pruning only 6% of the real (= non-null) moves was a non-capture.