In my bitboard engine for move generation I'm currently doing something like this.
u64 pieces[2];
u64 fsBits = pieces[stm];
do {
} while (fsBits);
As opposed to something like this.
while (pawns[stm]) {}
while (knights[stm]) {}
while (bishops[stm]) {}
etc.
Am I justified to believe the first way is faster?
Branch Prediction Efficiency
Moderator: Ras
Michael J Sherwin (Posts: 965, Joined: Fri Aug 21, 2020 1:25 am, Location: Planet Earth, Sol system)
Jamie Whiting (Posts: 127, Joined: Sat Jul 30, 2022 12:12 pm)
Re: Branch Prediction Efficiency
What are you currently doing in the body of the first loop? I’m pretty sure most people use the second method.
I would definitely bet that the second case is faster if you’re having to work out what piece is being moved for each iteration of the loop in the first case.
Michael J Sherwin (Posts: 965, Joined: Fri Aug 21, 2020 1:25 am, Location: Planet Earth, Sol system)
Re: Branch Prediction Efficiency
JacquesRW wrote: ↑Wed Dec 28, 2022 2:46 am
    What are you currently doing in the body of the first loop? I’m pretty sure most people use the second method. I would definitely bet that the second case is faster if you’re having to work out what piece is being moved for each iteration of the loop in the first case.

Inside the do loop there is one switch statement to jump to the code for the piece type. The piece types are enumerated like so.
enum { WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK };
There is also a board[64] that contains the pieces. The switch statement should generate a jump table.
Michael J Sherwin (Posts: 965, Joined: Fri Aug 21, 2020 1:25 am, Location: Planet Earth, Sol system)
Re: Branch Prediction Efficiency
The enum I gave was simplistic. Here are the actual enum, the GenMoves function, and the MakeMove function.
None of this has been debugged yet. The GenMoves function does not make a list of moves. It only stores the ts bits in f[fs]. Actual moves will be extracted later.
Edit: The philosophy is qsearch efficiency. Since in qsearch almost all moves are captures, the capture is done automatically, like bMat -= pieceVal[m->s.tt]. If the piece captured is an empty square, then bMat -= 0.
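The empty-square trick in that edit can be sketched in isolation. Everything here is a reduced illustration: the enum mirrors the post's ordering up to ES, and the centipawn values are hypothetical; the only load-bearing detail is that pieceVal[ES] is 0, so "capturing" an empty square leaves the material counter unchanged with no branch.

```cpp
#include <cassert>

// Reduced piece codes in the post's order up to the empty square.
enum { WP, WN, WB, WR, WRC, WQ, WK, WKC, ES };

// Hypothetical centipawn values; kings and the empty square are 0 so the
// same subtraction works for quiet moves, captures, and king moves alike.
static const int pieceVal[] = { 100, 300, 300, 500, 500, 900, 0, 0, 0 };

// Branchless capture accounting: subtract whatever sits on the target square.
int applyCapture(int bMat, int capturedType) {
    return bMat - pieceVal[capturedType];
}
```

If the target square holds a queen the counter drops by 900; if it holds ES it drops by 0, which is exactly the "bMat -= 0" case from the post.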
Code:
enum { WP, WN, WB, WR, WRC, WQ, WK, WKC, ES, BP, BN, BB, BR, BRC, BQ, BK, BKC,
WPD, BPD, WPE, BPE, WPP, BPP, WCS, WCL, BCS, BCL };
s32 GenMoves(Thread* t, u64* f) {
u64 atk;
u64 fbits = pieces[stm];
u64 enemy = pieces[1 - stm];
u64 occ = fbits | enemy;
u64 empty = ~occ;
u64 notme = empty | enemy;
f[ATK] = 0;
do {
s08 fs = std::countr_zero(fbits);
u64 type = board[fs];
switch (type) {
case WP:
switch (fs >> 3) {
case RANK1:
// can't get here
break;
case RANK2:
f[fs] = ((0x0000000000000100ull << fs) & empty);
f[fs] |= ((f[fs] << 8) & empty);
atk = M42::pawn_attacks(0, fs);
f[fs] |= atk & enemy; f[ATK] |= atk;
break;
case RANK3:
f[fs] = ((0x0000000000000100ull << fs) & empty);
atk = M42::pawn_attacks(0, fs);
f[fs] |= atk & enemy; f[ATK] |= atk;
break;
case RANK4:
f[fs] = ((0x0000000000000100ull << fs) & empty);
atk = M42::pawn_attacks(0, fs);
f[fs] |= atk & enemy; f[ATK] |= atk;
break;
case RANK5:
f[fs] = ((0x0000000000000100ull << fs) & empty);
atk = M42::pawn_attacks(0, fs);
f[fs] |= atk & (enemy | epbit[ply]); f[ATK] |= atk;
break;
case RANK6:
f[fs] = ((0x0000000000000100ull << fs) & empty);
atk = M42::pawn_attacks(0, fs);
f[fs] |= atk & enemy; f[ATK] |= atk;
break;
case RANK7:
f[fs] = ((0x0000000000000100ull << fs) & empty);
atk = M42::pawn_attacks(0, fs);
f[fs] |= atk & enemy; f[ATK] |= atk;
break;
}
break;
case WN:
atk = M42::KnightAttacks[fs];
f[fs] = atk & notme; f[ATK] |= atk;
break;
case WB:
atk = M42::bishop_attacks(fs, occ);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case WR:
atk = M42::rook_attacks(fs, occ);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case WRC:
atk = M42::rook_attacks(fs, occ);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case WQ:
atk = M42::queen_attacks(fs, occ);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case WK:
atk = M42::king_attacks(fs);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case WKC:
atk = M42::king_attacks(fs);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case ES:
// can't get here
break;
case BP:
switch (fs >> 3) {
case RANK1:
// can't get here
break;
case RANK2:
f[fs] = (1ull << (fs - 8)) & empty;
atk = M42::pawn_attacks(1, fs);
f[fs] |= atk & enemy; f[ATK] |= atk;
break;
case RANK3:
f[fs] = (1ull << (fs - 8)) & empty; // single push was missing; also initializes f[fs] instead of or-ing into stale data
atk = M42::pawn_attacks(1, fs);
f[fs] |= atk & enemy; f[ATK] |= atk;
break;
case RANK4:
f[fs] = (1ull << (fs - 8)) & empty;
atk = M42::pawn_attacks(1, fs);
f[fs] |= atk & (enemy | epbit[ply]); f[ATK] |= atk;
break;
case RANK5:
f[fs] = (1ull << (fs - 8)) & empty;
atk = M42::pawn_attacks(1, fs);
f[fs] |= atk & enemy; f[ATK] |= atk;
break;
case RANK6:
f[fs] = (1ull << (fs - 8)) & empty;
atk = M42::pawn_attacks(1, fs);
f[fs] |= atk & enemy; f[ATK] |= atk;
break;
case RANK7:
f[fs] = (1ull << (fs - 8)) & empty;
f[fs] |= (f[fs] >> 8) & empty; // double push from the black home rank, mirroring the white RANK2 case
atk = M42::pawn_attacks(1, fs);
f[fs] |= atk & enemy; f[ATK] |= atk;
break;
}
break;
case BN:
atk = M42::knight_attacks(fs);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case BB:
atk = M42::bishop_attacks(fs, occ);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case BR:
atk = M42::rook_attacks(fs, occ);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case BRC:
atk = M42::rook_attacks(fs, occ);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case BQ:
atk = M42::queen_attacks(fs, occ);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case BK:
atk = M42::king_attacks(fs);
f[fs] = atk & notme; f[ATK] |= atk;
break;
case BKC:
atk = M42::king_attacks(fs);
f[fs] = atk & notme; f[ATK] |= atk;
break;
}
fbits ^= 1ull << fs;
} while (fbits);
return (kings[1 - stm] & f[ATK]) != 0; // compare, don't truncate the 64-bit mask to s32
}
void MakeMove(Thread* t, uMove* m) {
s32 sq, et;
m->s.tt = board[m->s.ts];
switch (m->s.ft) {
case WP:
bMat -= pieceVal[m->s.tt];
pieces[BLACK] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = WP;
pieces[WHITE] ^= 1ull << m->s.fs;
pieces[WHITE] ^= 1ull << m->s.ts;
break;
case WN:
bMat -= pieceVal[m->s.tt];
pieces[BLACK] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = WN;
pieces[WHITE] ^= 1ull << m->s.fs;
pieces[WHITE] ^= 1ull << m->s.ts;
break;
case WB:
bMat -= pieceVal[m->s.tt];
pieces[BLACK] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = WB;
pieces[WHITE] ^= 1ull << m->s.fs;
pieces[WHITE] ^= 1ull << m->s.ts;
break;
case WRC:
bMat -= pieceVal[m->s.tt];
pieces[BLACK] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = WR;
pieces[WHITE] ^= 1ull << m->s.fs;
pieces[WHITE] ^= 1ull << m->s.ts;
break;
case WR:
bMat -= pieceVal[m->s.tt];
pieces[BLACK] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = WR;
pieces[WHITE] ^= 1ull << m->s.fs;
pieces[WHITE] ^= 1ull << m->s.ts;
break;
case WQ:
bMat -= pieceVal[m->s.tt];
pieces[BLACK] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = WQ;
pieces[WHITE] ^= 1ull << m->s.fs;
pieces[WHITE] ^= 1ull << m->s.ts;
break;
case WKC:
bMat -= pieceVal[m->s.tt];
pieces[BLACK] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = WK;
pieces[WHITE] ^= 1ull << m->s.fs;
pieces[WHITE] ^= 1ull << m->s.ts;
kings[WHITE] ^= 1ull << m->s.fs;
kings[WHITE] ^= 1ull << m->s.ts;
break;
case WK:
bMat -= pieceVal[m->s.tt];
pieces[BLACK] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = WK;
pieces[WHITE] ^= 1ull << m->s.fs;
pieces[WHITE] ^= 1ull << m->s.ts;
kings[WHITE] ^= 1ull << m->s.fs;
kings[WHITE] ^= 1ull << m->s.ts;
break;
case ES:
// can't get here
break;
case BP:
wMat -= pieceVal[m->s.tt];
pieces[WHITE] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = BP;
pieces[BLACK] ^= 1ull << m->s.fs;
pieces[BLACK] ^= 1ull << m->s.ts;
break;
case BN:
wMat -= pieceVal[m->s.tt];
pieces[WHITE] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = BN;
pieces[BLACK] ^= 1ull << m->s.fs;
pieces[BLACK] ^= 1ull << m->s.ts;
break;
case BB:
wMat -= pieceVal[m->s.tt];
pieces[WHITE] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = BB;
pieces[BLACK] ^= 1ull << m->s.fs;
pieces[BLACK] ^= 1ull << m->s.ts;
break;
case BRC:
wMat -= pieceVal[m->s.tt];
pieces[WHITE] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = BR;
pieces[BLACK] ^= 1ull << m->s.fs;
pieces[BLACK] ^= 1ull << m->s.ts;
break;
case BR:
wMat -= pieceVal[m->s.tt];
pieces[WHITE] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = BR;
pieces[BLACK] ^= 1ull << m->s.fs;
pieces[BLACK] ^= 1ull << m->s.ts;
break;
case BQ:
wMat -= pieceVal[m->s.tt];
pieces[WHITE] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = BQ;
pieces[BLACK] ^= 1ull << m->s.fs;
pieces[BLACK] ^= 1ull << m->s.ts;
break;
case BKC:
wMat -= pieceVal[m->s.tt];
pieces[WHITE] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = BK;
pieces[BLACK] ^= 1ull << m->s.fs;
pieces[BLACK] ^= 1ull << m->s.ts;
break;
case BK:
wMat -= pieceVal[m->s.tt];
pieces[WHITE] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = BK;
pieces[BLACK] ^= 1ull << m->s.fs;
pieces[BLACK] ^= 1ull << m->s.ts;
break;
case WPD:
board[m->s.fs] = ES;
board[m->s.ts] = WP;
pieces[WHITE] ^= 1ull << m->s.fs;
pieces[WHITE] ^= 1ull << m->s.ts;
epbit[ply + 1] = (u64)(m->s.ts - m->s.fs == 16) << (m->s.fs + 8);
break;
case BPD:
board[m->s.fs] = ES;
board[m->s.ts] = BP;
pieces[BLACK] ^= 1ull << m->s.fs;
pieces[BLACK] ^= 1ull << m->s.ts;
epbit[ply + 1] = (u64)(m->s.fs - m->s.ts == 16) << (m->s.fs - 8);
break;
case WPE:
sq = m->s.ts - ((epbit[ply] == (1ull << m->s.ts)) << 3);
m->s.tt = board[sq];
board[sq] = ES;
bMat -= pieceVal[m->s.tt];
pieces[BLACK] ^= 1ull << sq;
board[m->s.fs] = ES;
board[m->s.ts] = WP;
pieces[WHITE] ^= 1ull << m->s.fs;
pieces[WHITE] ^= 1ull << m->s.ts;
break;
case BPE:
sq = m->s.ts + ((epbit[ply] == (1ull << m->s.ts)) << 3);
m->s.tt = board[sq];
board[sq] = ES;
wMat -= pieceVal[m->s.tt]; // was bMat: en passant by Black removes a white pawn
pieces[WHITE] ^= 1ull << sq;
board[m->s.fs] = ES;
board[m->s.ts] = BP;
pieces[BLACK] ^= 1ull << m->s.fs;
pieces[BLACK] ^= 1ull << m->s.ts;
break;
case WPP:
bMat -= pieceVal[m->s.tt & 63];
pieces[BLACK] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = wProTyp[m->s.tt >> 6];
pieces[WHITE] ^= 1ull << m->s.fs;
pieces[WHITE] ^= 1ull << m->s.ts;
break;
case BPP:
wMat -= pieceVal[m->s.tt & 63];
pieces[WHITE] ^= 1ull << m->s.ts;
board[m->s.fs] = ES;
board[m->s.ts] = bProTyp[m->s.tt >> 6];
pieces[BLACK] ^= 1ull << m->s.fs;
pieces[BLACK] ^= 1ull << m->s.ts;
break;
case WCS:
board[e1] = ES;
board[g1] = WK;
pieces[WHITE] ^= 1ull << e1;
pieces[WHITE] ^= 1ull << g1;
kings[WHITE] ^= 1ull << e1;
kings[WHITE] ^= 1ull << g1;
board[h1] = ES;
board[f1] = WR;
pieces[WHITE] ^= 1ull << h1;
pieces[WHITE] ^= 1ull << f1;
break;
case WCL:
board[e1] = ES;
board[c1] = WK;
pieces[WHITE] ^= 1ull << e1;
pieces[WHITE] ^= 1ull << c1;
kings[WHITE] ^= 1ull << e1;
kings[WHITE] ^= 1ull << c1;
board[a1] = ES;
board[d1] = WR;
pieces[WHITE] ^= 1ull << a1;
pieces[WHITE] ^= 1ull << d1;
break;
case BCS:
board[e8] = ES;
board[g8] = BK;
pieces[BLACK] ^= 1ull << e8;
pieces[BLACK] ^= 1ull << g8;
kings[BLACK] ^= 1ull << e8;
kings[BLACK] ^= 1ull << g8;
board[h8] = ES;
board[f8] = BR;
pieces[BLACK] ^= 1ull << h8;
pieces[BLACK] ^= 1ull << f8;
break;
case BCL:
board[e8] = ES;
board[c8] = BK;
pieces[BLACK] ^= 1ull << e8;
pieces[BLACK] ^= 1ull << c8;
kings[BLACK] ^= 1ull << e8;
kings[BLACK] ^= 1ull << c8;
board[a8] = ES;
board[d8] = BR;
pieces[BLACK] ^= 1ull << a8;
pieces[BLACK] ^= 1ull << d8;
break;
}
ply++;
stm = 1 - stm;
}
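The en-passant bookkeeping in the WPD case can be sanity-checked in isolation. This is a sketch of just that one expression, with square indices in the usual a1 = 0 layout; the comparison folds to 0 or 1, so a non-double push yields an empty ep bitboard.

```cpp
#include <cstdint>

using u64 = uint64_t;

// White double push fs -> fs + 16: the en-passant target is the skipped
// square fs + 8. Any other step width zeroes the whole expression.
u64 whiteEpBit(int fs, int ts) {
    return (u64)(ts - fs == 16) << (fs + 8);
}
```

The BPD case is the mirror image, shifting by fs - 8 and testing fs - ts == 16.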
H G Muller (Posts: 28353, Joined: Fri Mar 10, 2006 10:06 am, Location: Amsterdam)
Re: Branch Prediction Efficiency
Mike Sherwin wrote: ↑Wed Dec 28, 2022 3:35 am
    Inside the do loop there is one switch statement to jump to the code for the piece type. The piece types are enumerated like so.
    enum { WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK };
    There is also a board[64] that contains the pieces. The switch statement should generate a jump table.

Indirect jumps are almost guaranteed to be mispredicted, though. For normal branches you would still predict correctly in 50% of the cases if you just predicted randomly.
Michael J Sherwin (Posts: 965, Joined: Fri Aug 21, 2020 1:25 am, Location: Planet Earth, Sol system)
Re: Branch Prediction Efficiency
hgm wrote: ↑Wed Dec 28, 2022 1:25 pm
    Indirect jumps are almost guaranteed to be mispredicted, though. For normal branches you would still predict correctly in 50% of the cases if you just predicted randomly.

I found this for an AMD EPYC 7642:
Tip 1: On this CPU a branch instruction that is taken but not predicted, costs ~7 cycles more than one that is taken and predicted. Even if the branch was unconditional.
Tip 2: Conditional branches never-taken are basically free - at least on this CPU.
Tip 3: In the hot code you want to have less than 2K function calls - on this CPU.
Tip 4: On this CPU it's possible to get <1 cycle per predicted jmp when the hot loop fits in ~32KiB.
So tips 1, 2 and 4 seem to favor the while (pawns) {} way. I'm not sure what tip 3 even means. So everything positive that I've read over the last decade or so about how fast jump tables are compared to conditional jumps is bogus? Absolutely amazing!
H G Muller (Posts: 28353, Joined: Fri Mar 10, 2006 10:06 am, Location: Amsterdam)
Re: Branch Prediction Efficiency
Mike Sherwin wrote: ↑Wed Dec 28, 2022 2:55 pm
    So tips 1, 2 and 4 seem to favor the while (pawns) {} way. I'm not sure what tip 3 even means. So everything positive over the last decade or so that I've read about how fast jump tables are compared to conditional jumps is bogus? Absolutely amazing!

It is partly bogus: leftover lore from the era when branches did not have to be predicted. But you should keep in mind that a single jump from a table replaces a tree of two-way branches, each of which can potentially be mispredicted. So the worst case there is definitely worse than the worst case with a jump table (which happens to be the typical case there).
Branch prediction on most CPUs works by predicting, for each N-byte block of code, which N-byte block most likely has to be executed next, keeping statistics for each such block. When the block contains an indirect jump, execution can continue in many places, yet the prediction can only be one of those, based on the statistics for that single block. With a tree of two-way branches, the different branches can be distributed over several N-byte blocks, so the prediction for the sequence of blocks you traverse is based on more collected statistics.
Tip 2 probably means that branches are executed in parallel with other instructions, perhaps even folded away while translating the x86 instructions to micro-ops. Tip 4 probably has to do with the size of the L1 code cache: statistics on branching are only collected for branches that are in this cache.
Michael J Sherwin (Posts: 965, Joined: Fri Aug 21, 2020 1:25 am, Location: Planet Earth, Sol system)
Re: Branch Prediction Efficiency
hgm wrote: ↑Wed Dec 28, 2022 3:27 pm
    It is partly bogus, leftover lore from the era that branches did not have to be predicted. But you should keep in mind that a single jump from a table does replace a tree of two-way branches, each of which can potentially be mispredicted. So the worst case there is definitely worse than the worst case with a jump table.

So if the choices are a fairly large tree of binary conditional jumps or a jump table, use the jump table. Unless there is a third option like:
while(pawns){}
while(knights){}
while(Bishops){}
etc
This last way should be easily predictable, thus avoiding the worst case while giving a very good chance for the best case.
Daniel Infuehr (Posts: 1062, Joined: Tue Apr 28, 2020 10:03 pm)
Re: Branch Prediction Efficiency
Lucky, I can definitely answer this for the Zen architecture, but you didn't say that you put a switch inside the inner loop.
This (with the switch inside) is definitely slower:
Code:
u64 fsBits = pieces[stm];
do {
    // switch
} while (fsBits);
This is many times faster (depending on the code, of course):
Code:
while (pawns) {}
while (knights) {}
while (bishops) {}
The biggest reason is that from the context of the loops it is clear that you are moving, for example, a knight, and that has implications for more specific code paths in the engine. For example, you may already prefill the move struct with some knight-specific values.
I know an even faster way than all of the above.
Let's assume a position is not degenerate (less than or equal to 2 pieces of any type per color). If it is degenerate, you call the loop version above.
Have a template accepting the configuration of pieces in the form of rooks, bishops, queens and knights, allowing up to and including 2 of each, and the color. That is a triplet of 3 values times the color, so a very small number of configurations.
Since the template is compile-time, you can pass this config in as a single integer.
What you get is a nice call graph of templated functions that are not "too" many and that call toward 0 in a nice, predictable manner when trading down. Best of all: inside each instantiation the number of each piece (except pawns) is known as a compile-time constant.
Fastest way (unpublished):
Code:
template <int config> void MoveGen(Board& brd)
    uint64_t rooks = brd.rooks();
    if constexpr (num_rooks(config) == 2)
    {
        int sq0 = poplsb(rooks);
        int sq1 = poplsb(rooks);
    }
    if constexpr (num_rooks(config) == 1)
    ...
Which will all compile away into the perfect code for the current board state.
I have shown in another thread that positions with 3 queens or 3 rooks per color are really quite rare across hundreds of millions of games, and all this loop-vs-switch is a red herring. In truth you could assume 2, 1 or 0 depending on the ply and be correct *almost* 100% of the time. Which means you can pop bits greedily or go with the template solution.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Michael J Sherwin (Posts: 965, Joined: Fri Aug 21, 2020 1:25 am, Location: Planet Earth, Sol system)
Re: Branch Prediction Efficiency
dangi12012 wrote: ↑Sat Dec 31, 2022 1:50 am
    Fastest way (unpublished):
    template <int config> void MoveGen(Board& brd)
        uint64_t rooks = brd.rooks();
        if constexpr (num_rooks(config) == 2)
        {
            int sq0 = poplsb(rooks);
            int sq1 = poplsb(rooks);
        }
        ...

Believe it or not, I understand everything except where poplsb() comes from. I can't find it on the web. I found ls1b. Is that the same thing? But that probably does not pop the bit. And what is the assembler mnemonic?