Bitboards ?: still confused where the time savings is

Chessnut1071 · Post by **Chessnut1071** » Sat Nov 06, 2021 3:17 am

I realize the space savings using bit boards; however, I'm still confused where the large time savings comes in based on pre-computed array tables. Both methods compute all possible moves from a given position so there's no computation going on, just an array lookup table. The argument that bit shifting takes only one machine cycle is moot since there's no computation or bit shifting on move generation. Even the check mask and capture boards are generated from the lookup tables; however, AND, OR and XOR can be used instead of iterative loops; however, I can't find any large time advantage, especially when you have to identify the pieces and squares affected by that shifting and the time savings is diminished with the translation.

Am I missing something here. If there is a significant time savings I'd gladly dive in and convert, but, I still don't see where the big advantage is on time. I could care less about space, especially when I have tera bytes available. Those of you who developed engines using both approaches, where did you get the time savings?

brianr · Post by **brianr** » Sat Nov 06, 2021 8:59 am

Bitboards do not save time for move generation.
In the past, the savings was thought to be more in position evaluation; however, the savings is not a huge amount.
These days, many engines have moved from hand-crafted evaluation to NNUE, so it is somewhat moot.

hgm · Post by **hgm** » Sat Nov 06, 2021 9:26 am

Bitboards do not save space, do they? At least, the most-popular 'magic bitboards' method is very space intensive, requiring hundreds of KB in tables. Most mailbox implementations hardly use any tables.

phhnguyen · Post by **phhnguyen** » Sat Nov 06, 2021 10:26 am

Everything you can do with BB, you can do with mailbox too

IMO, BB has some advantages over mailbox:
- find/match patterns: typically BB is faster
- cleaner, shorter code in general, especially about finding, matching patterns: that is not about speed but developers can focus on other issues, and they tend to use more with more complicated conditions (for matching patterns). On the other hand, mailbox developers tend to use fewer with simpler conditions
- since 1 & 2, BB can calculate many tasks when needed only. To have the same speed, the mailbox requests many extra data and updates them on the fly. That is much harder to develop and maintain code

After the core (basic functions) is completed and changed to twisting period, BB developers can focus on ideas (implementation is not the problem), but mailbox developers still need to work with both ideas and how to implement them.

dangi12012 · Post by **dangi12012** » Sat Nov 06, 2021 11:05 am

Chessnut1071 wrote: ↑Sat Nov 06, 2021 3:17 am I realize the space savings using bit boards; however, I'm still confused where the large time savings comes in based on pre-computed array tables. Both methods compute all possible moves from a given position so there's no computation going on, just an array lookup table. The argument that bit shifting takes only one machine cycle is moot since there's no computation or bit shifting on move generation. Even the check mask and capture boards are generated from the lookup tables; however, AND, OR and XOR can be used instead of iterative loops; however, I can't find any large time advantage, especially when you have to identify the pieces and squares affected by that shifting and the time savings is diminished with the translation.

Am I missing something here. If there is a significant time savings I'd gladly dive in and convert, but, I still don't see where the big advantage is on time. I could care less about space, especially when I have tera bytes available. Those of you who developed engines using both approaches, where did you get the time savings?

No one answered this correctly.
The time savings in slider lookup is NULL and the speed there is comparable.
With bitboards you have a mobility map in terms of uint64_t and when a piece is pinned or can only move a certain way because of check it takes a single ANDNOT instruction to remove the forbidden squares.
With array based methods you would have to loop or compare (if statement inside a loop) to get the same result.

So the answer is: Large precomputed arrays are not faster - its the other stuff you can do with bitboards thats much faster.
With array based methods you are stuck calculating indices and comparing the array entries. With bitboards you can shift inside a single register and have the result!
Not to mention you can move one piece by:

Code: Select all

//Bitboard - this method can also change multiple squares at once for EP/Castling
piece ^= (from | to) 

//Mailslot - this is much slower!
board[from] = 0 board[to] = piece

You can read more in my signature. The fastest Mailslot method is 6x slower than the fastest bitboard implementation currently for a pure movegenerator.

hgm · Post by **hgm** » Sat Nov 06, 2021 11:59 am

dangi12012 wrote: ↑Sat Nov 06, 2021 11:05 am

Code: Select all

//Bitboard - this method can also change multiple squares at once for EP/Castling
piece ^= (from | to) 

//Mailslot - this is much slower!
board[from] = 0 board[to] = piece

I thought bitboard implementations also used a mailbox board, for quick lookup of the piece type that has to be removed in case of a capture. In which case they would have to do the stuff you call 'much slower' anyway, in addition to the bitboard-specific stuff.

And the piece ^= ... for handling the bitboard part of the game state seems over-symplified: in the usual bitboard game-state representation the locations of the pieces are split by type, so you would have an array locations[pieceType] in stead of a scalar variable 'piece'. And you would not only have to do that for the moved piece, but also for the captured piece (locations[victimType] ^= to). And you would also have to deal with the color occupied board, either regenerating them by ORing the locations[] boards for all 6 piece types together, or incrementally. And then OR the color boards to get the total 'occupied' state needed for calculating slider attacks.

So I think the bitboard update alone is much slower than the mailbox update. And how would you get the victimType if you did not do the mailbox stuff in addition? Loop over the locations[] of the 6 types?

dangi12012 · Post by **dangi12012** » Sat Nov 06, 2021 12:35 pm

hgm wrote: ↑Sat Nov 06, 2021 11:59 am
dangi12012 wrote: ↑Sat Nov 06, 2021 11:05 am
Code: Select all
//Bitboard - this method can also change multiple squares at once for EP/Castling
piece ^= (from | to) 

//Mailslot - this is much slower!
board[from] = 0 board[to] = piece
I thought bitboard implementations also used a mailbox board, for quick lookup of the piece type that has to be removed in case of a capture. In which case they would have to do the stuff you call 'much slower' anyway, in addition to the bitboard-specific stuff.

And the piece ^= ... for handling the bitboard part of the game state seems over-symplified: in the usual bitboard game-state representation the locations of the pieces are split by type, so you would have an array locations[pieceType] in stead of a scalar variable 'piece'. And you would not only have to do that for the moved piece, but also for the captured piece (locations[victimType] ^= to). And you would also have to deal with the color occupied board, either regenerating them by ORing the locations[] boards for all 6 piece types together, or incrementally. And then OR the color boards to get the total 'occupied' state needed for calculating slider attacks.

So I think the bitboard update alone is much slower than the mailbox update. And how would you get the victimType if you did not do the mailbox stuff in addition? Loop over the locations[] of the 6 types?

Nope the bitboard does not need to maintain an additional array. All that is needed is 2*6 uint64_t.
In the case of moving you only need to change one register of the 12 and that can be done with ^=
The algorithm already knows which piece and movetype it currently looks at: https://github.com/Gigantua/Gigantua/bl ... gantua.cpp
Kingcaste - Pawnmove etc... - and that is inlined. So it really boiles down to a XOR Assign inside a loop. I made sure to not fall into the trap to declare my board as an array and have piecetype as an index. The algorithm knows what piece its moving - so that cost becomes void.

So in case of moving its literally one instruction and no branches are needed beforehand. Castling is also just a move.

In case of Enpassant we know that the enemy is a black pawn so its one XOR + one ANDNOT.

In the case of general taking - we have a little bit more - we never ever do branching for chess (to gain the insane speed I achieved) - so we just clear the bit in all 5 opponent bitboards. (king cannot be taken so its 5 not 6). clearing a bit in 5 registers is very cheap. (probably the compiler does that with one avx + one andnot instruction)

The important thing is: Bitboards use no array. Bitboards can be made 100% branchless. Bitboards do not need any indirection and work on a single processor register at any given time.

Nothing can beat bitboards in my optinion!

Chessnut1071 · Post by **Chessnut1071** » Sat Nov 06, 2021 12:56 pm

dangi12012 wrote: ↑Sat Nov 06, 2021 11:05 am
Chessnut1071 wrote: ↑Sat Nov 06, 2021 3:17 am I realize the space savings using bit boards; however, I'm still confused where the large time savings comes in based on pre-computed array tables. Both methods compute all possible moves from a given position so there's no computation going on, just an array lookup table. The argument that bit shifting takes only one machine cycle is moot since there's no computation or bit shifting on move generation. Even the check mask and capture boards are generated from the lookup tables; however, AND, OR and XOR can be used instead of iterative loops; however, I can't find any large time advantage, especially when you have to identify the pieces and squares affected by that shifting and the time savings is diminished with the translation.

Am I missing something here. If there is a significant time savings I'd gladly dive in and convert, but, I still don't see where the big advantage is on time. I could care less about space, especially when I have tera bytes available. Those of you who developed engines using both approaches, where did you get the time savings?
No one answered this correctly.
The time savings in slider lookup is NULL and the speed there is comparable.
With bitboards you have a mobility map in terms of uint64_t and when a piece is pinned or can only move a certain way because of check it takes a single ANDNOT instruction to remove the forbidden squares.
With array based methods you would have to loop or compare (if statement inside a loop) to get the same result.

So the answer is: Large precomputed arrays are not faster - its the other stuff you can do with bitboards thats much faster.
With array based methods you are stuck calculating indices and comparing the array entries. With bitboards you can shift inside a single register and have the result!
Not to mention you can move one piece by:
Code: Select all
//Bitboard - this method can also change multiple squares at once for EP/Castling
piece ^= (from | to) 

//Mailslot - this is much slower!
board[from] = 0 board[to] = piece
You can read more in my signature. The fastest Mailslot method is 6x slower than the fastest bitboard implementation currently for a pure movegenerator.

hgm · Post by **hgm** » Sat Nov 06, 2021 2:10 pm

dangi12012 wrote: ↑Sat Nov 06, 2021 12:35 pm
hgm wrote: ↑Sat Nov 06, 2021 11:59 am
dangi12012 wrote: ↑Sat Nov 06, 2021 11:05 am
Code: Select all
//Bitboard - this method can also change multiple squares at once for EP/Castling
piece ^= (from | to) 

//Mailslot - this is much slower!
board[from] = 0 board[to] = piece
I thought bitboard implementations also used a mailbox board, for quick lookup of the piece type that has to be removed in case of a capture. In which case they would have to do the stuff you call 'much slower' anyway, in addition to the bitboard-specific stuff.

And the piece ^= ... for handling the bitboard part of the game state seems over-symplified: in the usual bitboard game-state representation the locations of the pieces are split by type, so you would have an array locations[pieceType] in stead of a scalar variable 'piece'. And you would not only have to do that for the moved piece, but also for the captured piece (locations[victimType] ^= to). And you would also have to deal with the color occupied board, either regenerating them by ORing the locations[] boards for all 6 piece types together, or incrementally. And then OR the color boards to get the total 'occupied' state needed for calculating slider attacks.

So I think the bitboard update alone is much slower than the mailbox update. And how would you get the victimType if you did not do the mailbox stuff in addition? Loop over the locations[] of the 6 types?
Nope the bitboard does not need to maintain an additional array. All that is needed is 2*6 uint64_t.
In the case of moving you only need to change one register of the 12 and that can be done with ^=
The algorithm already knows which piece and movetype it currently looks at: https://github.com/Gigantua/Gigantua/bl ... gantua.cpp
Kingcaste - Pawnmove etc... - and that is inlined. So it really boiles down to a XOR Assign inside a loop. I made sure to not fall into the trap to declare my board as an array and have piecetype as an index. The algorithm knows what piece its moving - so that cost becomes void.

So in case of moving its literally one instruction and no branches are needed beforehand. Castling is also just a move.

In case of Enpassant we know that the enemy is a black pawn so its one XOR + one ANDNOT.

In the case of general taking - we have a little bit more - we never ever do branching for chess (to gain the insane speed I achieved) - so we just clear the bit in all 5 opponent bitboards. (king cannot be taken so its 5 not 6). clearing a bit in 5 registers is very cheap. (probably the compiler does that with one avx + one andnot instruction)

The important thing is: Bitboards use no array. Bitboards can be made 100% branchless. Bitboards do not need any indirection and work on a single processor register at any given time.

Nothing can beat bitboards in my optinion!

Well, like you say, that is just an opinion which doesn't necessarily make it true. One can easily have a different opinion.

Much of what you say here is demonstrably false. A capture would need to change two of the 12 'registers', not one. In a chess engine the algorithm would not magically know which piece and movetype it is looking at; moves are sorted by MVV/LVA or history, and can come in any order as far as the moving piece is concerned. So this info has to be stored somewhere, (either on a board or in the move encoding), and loaded when you need it. And both a mailbox board and a move list are arrays, which somehow have to be indexed. The move generator will need to know which squares are occupied, and whether these are occupied by friend or foe, to account for blocking. You would have to access all 12 'registers' to conclude a square is empty, if you have no redundancy in the board state. Pieces have a number of moves that changes with board location and blocking, so looping over the moves will require a branch instruction to terminate the loop after the correct number of iterations.

Your perft code is fast, but it is only a perft. So its speed hasn't really any relevance for an engine, which does completely different things. And from what you first wrote about it, it is fast because you use incremental techniques. So comparing its speed with other perft programs, which do not use incremental techniques, doesn't show anything about the speed of the underlying board representation. It just shows that incremental techniques can be highly superior to from-scratch calculation in every node.

dangi12012 · Post by **dangi12012** » Sat Nov 06, 2021 2:47 pm

hgm wrote: ↑Sat Nov 06, 2021 2:10 pm
dangi12012 wrote: ↑Sat Nov 06, 2021 12:35 pm
hgm wrote: ↑Sat Nov 06, 2021 11:59 am
dangi12012 wrote: ↑Sat Nov 06, 2021 11:05 am
Code: Select all
//Bitboard - this method can also change multiple squares at once for EP/Castling
piece ^= (from | to) 

//Mailslot - this is much slower!
board[from] = 0 board[to] = piece
I thought bitboard implementations also used a mailbox board, for quick lookup of the piece type that has to be removed in case of a capture. In which case they would have to do the stuff you call 'much slower' anyway, in addition to the bitboard-specific stuff.

And the piece ^= ... for handling the bitboard part of the game state seems over-symplified: in the usual bitboard game-state representation the locations of the pieces are split by type, so you would have an array locations[pieceType] in stead of a scalar variable 'piece'. And you would not only have to do that for the moved piece, but also for the captured piece (locations[victimType] ^= to). And you would also have to deal with the color occupied board, either regenerating them by ORing the locations[] boards for all 6 piece types together, or incrementally. And then OR the color boards to get the total 'occupied' state needed for calculating slider attacks.

So I think the bitboard update alone is much slower than the mailbox update. And how would you get the victimType if you did not do the mailbox stuff in addition? Loop over the locations[] of the 6 types?
Nope the bitboard does not need to maintain an additional array. All that is needed is 2*6 uint64_t.
In the case of moving you only need to change one register of the 12 and that can be done with ^=
The algorithm already knows which piece and movetype it currently looks at: https://github.com/Gigantua/Gigantua/bl ... gantua.cpp
Kingcaste - Pawnmove etc... - and that is inlined. So it really boiles down to a XOR Assign inside a loop. I made sure to not fall into the trap to declare my board as an array and have piecetype as an index. The algorithm knows what piece its moving - so that cost becomes void.

So in case of moving its literally one instruction and no branches are needed beforehand. Castling is also just a move.

In case of Enpassant we know that the enemy is a black pawn so its one XOR + one ANDNOT.

In the case of general taking - we have a little bit more - we never ever do branching for chess (to gain the insane speed I achieved) - so we just clear the bit in all 5 opponent bitboards. (king cannot be taken so its 5 not 6). clearing a bit in 5 registers is very cheap. (probably the compiler does that with one avx + one andnot instruction)

The important thing is: Bitboards use no array. Bitboards can be made 100% branchless. Bitboards do not need any indirection and work on a single processor register at any given time.

Nothing can beat bitboards in my optinion!
Well, like you say, that is just an opinion which doesn't necessarily make it true. One can easily have a different opinion.

Much of what you say here is demonstrably false. A capture would need to change two of the 12 'registers', not one. In a chess engine the algorithm would not magically know which piece and movetype it is looking at; moves are sorted by MVV/LVA or history, and can come in any order as far as the moving piece is concerned. So this info has to be stored somewhere, (either on a board or in the move encoding), and loaded when you need it. And both a mailbox board and a move list are arrays, which somehow have to be indexed. The move generator will need to know which squares are occupied, and whether these are occupied by friend or foe, to account for blocking. You would have to access all 12 'registers' to conclude a square is empty, if you have no redundancy in the board state. Pieces have a number of moves that changes with board location and blocking, so looping over the moves will require a branch instruction to terminate the loop after the correct number of iterations.

Your perft code is fast, but it is only a perft. So its speed hasn't really any relevance for an engine, which does completely different things. And from what you first wrote about it, it is fast because you use incremental techniques. So comparing its speed with other perft programs, which do not use incremental techniques, doesn't show anything about the speed of the underlying board representation. It just shows that incremental techniques can be highly superior to from-scratch calculation in every node.

Let me just say that MCTS engines would beg to differ for your last point. Second I dont think we have different opinions at all - a few sentences of english are just not precice enough to convey a full idea - and leaves too much for interpretation. I am still waiting for neuralink on that one.

Also currently I am writing an engine and I dont store moves at all (I dont have a movelist either) but sorting for the best move etc is still possible - a movelist is not needed anymore => the nodes in the tree are positions and evaluations. I could still implement this in terms of 16 bits but that would be slower.

I am telling you - with bitboards you dont need any indexes to arrays anymore ever. Except when looking up sliders etc - everything mailslots do in 0...63 space a bitboard implementation does in (1ull << 0..63) space. (So one set bit in a register corresponds to a index in an array)

What the huge advantage is: When you change N squares in a mailslot you have to index every square sequentially. When you change N squares in a bitboard it is a single x64 instruction.
That allowes move pruning and check evasion and pins - and nasty pins like the horizontal EP pins to never take more than 2 instructions to be resolved. And all that with less space and time compared to arrays.

Not that its useful but my prime example is : Moving all 8 pawns up one rank on an empty board:

Code: Select all

Wpawns <<= 8;

Checking how many are ready to be promoted now:

Code: Select all

int promo = BitCount(WPawns & LastRank);

Now how would that look in mailslot? I am asking because I am sure it would be more than 3 instructions long for 8 pawns and 8 checks for all 8 promotion squares

Bitboards ?: still confused where the time savings is

Bitboards ?: still confused where the time savings is

Re: Bitboards ?: still confused where the time savings is

Re: Bitboards ?: still confused where the time savings is

Re: Bitboards ?: still confused where the time savings is

Re: Bitboards ?: still confused where the time savings is

Re: Bitboards ?: still confused where the time savings is

Re: Bitboards ?: still confused where the time savings is

Re: Bitboards ?: still confused where the time savings is

Re: Bitboards ?: still confused where the time savings is

Re: Bitboards ?: still confused where the time savings is