Branch Prediction Efficiency

Discussion of chess software programming and technical issues.

Moderator: Ras

Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Branch Prediction Efficiency

Post by Mike Sherwin »

In my bitboard engine for move generation I'm currently doing something like this.

u64 pieces[2];
u64 fsBits = pieces[stm];
do {

} while (fsBits);

As opposed to something like this.

while (pawns[stm]) {}
while (knights[stm]) {}
while (bishops[stm]) {}
etc.

Am I justified in believing the first way is faster?
JacquesRW
Posts: 127
Joined: Sat Jul 30, 2022 12:12 pm
Full name: Jamie Whiting

Re: Branch Prediction Efficiency

Post by JacquesRW »

What are you currently doing in the body of the first loop? I’m pretty sure most people use the second method.

I would definitely bet that the second case is faster if you’re having to work out what piece is being moved for each iteration of the loop in the first case.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Branch Prediction Efficiency

Post by Mike Sherwin »

JacquesRW wrote: Wed Dec 28, 2022 2:46 am What are you currently doing in the body of the first loop? I’m pretty sure most people use the second method.

I would definitely bet that the second case is faster if you’re having to work out what piece is being moved for each iteration of the loop in the first case.
Inside the do loop there is one switch statement to jump to the code for the piece type. The piece types are enumerated like so.

enum { WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK };

There is also a board[64] that contains the pieces. The switch statement should generate a jump table.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Branch Prediction Efficiency

Post by Mike Sherwin »

The enum I gave was simplistic. Here are the actual enum, the GenMoves function, and the MakeMove function.
None of this has been debugged yet. GenMoves does not make a list of moves; it only stores the ts bits in f[fs]. Actual moves will be extracted later.

Code: Select all

enum { WP, WN, WB, WR, WRC, WQ, WK, WKC, ES, BP, BN, BB, BR, BRC, BQ, BK, BKC,
WPD, BPD, WPE, BPE, WPP, BPP, WCS, WCL, BCS, BCL };

s32 GenMoves(Thread* t, u64* f) {
  u64 atk;

  u64 fbits = pieces[stm];
  u64 enemy = pieces[1 - stm];
  u64 occ = fbits | enemy;
  u64 empty = ~occ;
  u64 notme = empty | enemy;

  f[ATK] = 0;

  do {
    s08 fs = std::countr_zero(fbits);
    u64 type = board[fs];
    switch (type) {
    case WP:
      switch (fs >> 3) {
      case RANK1:
        // can't get here
        break;
      case RANK2:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        f[fs] |= ((f[fs] << 8) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK3:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK4:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK5:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & (enemy | epbit[ply]); f[ATK] |= atk;
        break;
      case RANK6:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK7:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      }
      break;
    case WN:
      atk = M42::KnightAttacks[fs];
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WB:
      atk = M42::bishop_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WR:
      atk = M42::rook_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WRC:
      atk = M42::rook_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WQ:
      atk = M42::queen_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WK:
      atk = M42::king_attacks(fs);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WKC:
      atk = M42::king_attacks(fs);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case ES:
      // can't get here
      break;
    case BP:
      switch (fs >> 3) {
      case RANK1:
        // can't get here
        break;
      case RANK2:
        f[fs] = (1ull << (fs - 8)) & empty;
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK3:
        f[fs] = (1ull << (fs - 8)) & empty;   // single push
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK4:
        f[fs] = (1ull << (fs - 8)) & empty;   // single push
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & (enemy | epbit[ply]); f[ATK] |= atk;
        break;
      case RANK5:
        f[fs] = (1ull << (fs - 8)) & empty;   // single push
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK6:
        f[fs] = (1ull << (fs - 8)) & empty;   // single push
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK7:
        f[fs] = (1ull << (fs - 8)) & empty;   // single push from the start rank
        f[fs] |= ((f[fs] >> 8) & empty);      // double push
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      }
      break;
    case BN:
      atk = M42::knight_attacks(fs);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BB:
      atk = M42::bishop_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BR:
      atk = M42::rook_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BRC:
      atk = M42::rook_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BQ:
      atk = M42::queen_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BK:
      atk = M42::king_attacks(fs);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BKC:
      atk = M42::king_attacks(fs);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    }
    fbits ^= 1ull << fs;
  } while (fbits);
  return (kings[1 - stm] & f[ATK]) != 0;  // nonzero if the enemy king is attacked (avoid truncating the u64 to s32)
}

void MakeMove(Thread* t, uMove* m) {
  s32 sq, et;

  m->s.tt = board[m->s.ts];

  switch (m->s.ft) {
  case WP:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WP;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WN:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WN;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WB:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WB;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WRC:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WR;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WR:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WR;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WQ:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WQ;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WKC:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WK;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	kings[WHITE] ^= 1ull << m->s.fs;
	kings[WHITE] ^= 1ull << m->s.ts;
	break;
  case WK:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WK;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	kings[WHITE] ^= 1ull << m->s.fs;
	kings[WHITE] ^= 1ull << m->s.ts;
	break;
  case ES:
	// can't get here
	break;
  case BP:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BP;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BN:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BN;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BB:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BB;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BRC:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BR;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BR:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BR;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BQ:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BQ;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BKC:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BK;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BK:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BK;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case WPD:
	board[m->s.fs] = ES;
	board[m->s.ts] = WP;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	epbit[ply + 1] = (u64)(m->s.ts - m->s.fs == 16) << (m->s.fs + 8);
	break;
  case BPD:
	board[m->s.fs] = ES;
	board[m->s.ts] = BP;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	epbit[ply + 1] = (u64)(m->s.fs - m->s.ts == 16) << (m->s.fs - 8);
	break;
  case WPE:
	sq = m->s.ts - ((epbit[ply] == (1ull << m->s.ts)) << 3);
	m->s.tt = board[sq];
	board[sq] = ES;
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << sq;
	board[m->s.fs] = ES;
	board[m->s.ts] = WP;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case BPE:
	sq = m->s.ts + ((epbit[ply] == (1ull << m->s.ts)) << 3);
	m->s.tt = board[sq];
	board[sq] = ES;
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << sq;
	board[m->s.fs] = ES;
	board[m->s.ts] = BP;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case WPP:
	bMat -= pieceVal[m->s.tt & 63];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = wProTyp[m->s.tt >> 6];
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case BPP:
	wMat -= pieceVal[m->s.tt & 63];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = bProTyp[m->s.tt >> 6];
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case WCS:
	board[e1] = ES;
	board[g1] = WK;
	pieces[WHITE] ^= 1ull << e1;
	pieces[WHITE] ^= 1ull << g1;
	kings[WHITE] ^= 1ull << e1;
	kings[WHITE] ^= 1ull << g1;
	board[h1] = ES;
	board[f1] = WR;
	pieces[WHITE] ^= 1ull << h1;
	pieces[WHITE] ^= 1ull << f1;
	break;
  case WCL:
	board[e1] = ES;
	board[c1] = WK;
	pieces[WHITE] ^= 1ull << e1;
	pieces[WHITE] ^= 1ull << c1;
	kings[WHITE] ^= 1ull << e1;
	kings[WHITE] ^= 1ull << c1;
	board[a1] = ES;
	board[d1] = WR;
	pieces[WHITE] ^= 1ull << a1;
	pieces[WHITE] ^= 1ull << d1;
	break;
  case BCS:
	board[e8] = ES;
	board[g8] = BK;
	pieces[BLACK] ^= 1ull << e8;
	pieces[BLACK] ^= 1ull << g8;
	kings[BLACK] ^= 1ull << e8;
	kings[BLACK] ^= 1ull << g8;
	board[h8] = ES;
	board[f8] = BR;
	pieces[BLACK] ^= 1ull << h8;
	pieces[BLACK] ^= 1ull << f8;
	break;
  case BCL:
	board[e8] = ES;
	board[c8] = BK;
	pieces[BLACK] ^= 1ull << e8;
	pieces[BLACK] ^= 1ull << c8;
	kings[BLACK] ^= 1ull << e8;
	kings[BLACK] ^= 1ull << c8;
	board[a8] = ES;
	board[d8] = BR;
	pieces[BLACK] ^= 1ull << a8;
	pieces[BLACK] ^= 1ull << d8;
	break;
  }
  ply++;
  stm = 1 - stm;
}
Edit: The philosophy is qsearch efficiency. Since in qsearch almost all moves are captures, the capture is handled automatically, as in bMat -= pieceVal[m->s.tt]. If the "captured" piece is an empty square, then bMat -= 0.
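For concreteness, a minimal sketch of what that relies on (the values and the trimmed enum here are only illustrative; the point is that the ES slot is zero, so "capturing" an empty square subtracts nothing and needs no branch):

Code: Select all

// Illustrative only: pieceVal indexed by the same piece enum, with the ES slot
// set to zero so a "capture" of an empty square subtracts nothing.
#include <cstdint>

enum { WP, WN, WB, WR, WRC, WQ, WK, WKC, ES, BP, BN, BB, BR, BRC, BQ, BK, BKC };

static const int32_t pieceVal[17] = {
    100, 325, 350, 500, 500, 950, 0, 0,   // white pawn .. king (castling-rights variants share values)
    0,                                     // ES: empty square
    100, 325, 350, 500, 500, 950, 0, 0     // black pawn .. king
};

// In MakeMove:  bMat -= pieceVal[m->s.tt];   // m->s.tt == ES  =>  bMat -= 0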
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Branch Prediction Efficiency

Post by hgm »

Mike Sherwin wrote: Wed Dec 28, 2022 3:35 am
JacquesRW wrote: Wed Dec 28, 2022 2:46 am What are you currently doing in the body of the first loop? I’m pretty sure most people use the second method.

I would definitely bet that the second case is faster if you’re having to work out what piece is being moved for each iteration of the loop in the first case.
Inside the do loop there is one switch statement to jump to the code for the piece type. The piece types are enumerated like so.

enum { WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK };

There is also a board[64] that contains the pieces. The switch statement should generate a jump table.
Indirect jumps are almost guaranteed to be mispredicted, though. For normal branches you would still predict correctly in 50% of the cases if you just predicted randomly.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Branch Prediction Efficiency

Post by Mike Sherwin »

hgm wrote: Wed Dec 28, 2022 1:25 pm
Mike Sherwin wrote: Wed Dec 28, 2022 3:35 am
JacquesRW wrote: Wed Dec 28, 2022 2:46 am What are you currently doing in the body of the first loop? I’m pretty sure most people use the second method.

I would definitely bet that the second case is faster if you’re having to work out what piece is being moved for each iteration of the loop in the first case.
Inside the do loop there is one switch statement to jump to the code for the piece type. The piece types are enumerated like so.

enum { WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK };

There is also a board[64] that contains the pieces. The switch statement should generate a jump table.
Indirect jumps are almost guaranteed to be mispredicted, though. For normal branches you would still predict correctly in 50% of the cases if you just predicted randomly.
I found this for an AMD EPYC 7642.
Tip 1: On this CPU a branch instruction that is taken but not predicted, costs ~7 cycles more than one that is taken and predicted. Even if the branch was unconditional.
Tip 2: Conditional branches never-taken are basically free - at least on this CPU.
Tip 3: In the hot code you want to have less than 2K function calls - on this CPU.
Tip 4: On this CPU it's possible to get <1 cycle per predicted jmp when the hot loop fits in ~32KiB.

So tips 1, 2 and 4 seem to favor the while (pawns) {} way. I'm not sure what tip 3 even means. So everything positive over the last decade or so that I've read about how fast jump tables are compared to conditional jumps is bogus? Absolutely amazing!
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Branch Prediction Efficiency

Post by hgm »

Mike Sherwin wrote: Wed Dec 28, 2022 2:55 pm So tips 1, 2 and 4 seem to favor the while (pawns) {} way. I'm not sure what tip 3 even means. So everything positive over the last decade or so that I've read about how fast jump tables are compared to conditional jumps is bogus? Absolutely amazing!
It is partly bogus, leftover lore from the era when branches did not have to be predicted. But you should keep in mind that a single jump from a table replaces a whole tree of two-way branches, each of which can potentially be mispredicted. So the worst case there is definitely worse than the worst case with a jump table (which, for the jump table, also happens to be the typical case).

Branch prediction on most CPUs works by predicting, for each N-byte block of code, which N-byte block will most likely have to be executed next, and keeps statistics on that for each such block. When there is an indirect jump in the block, execution can continue in many places, and the prediction can only be one of those, based on statistics for just that one block. With a tree of two-way branches, the different branches can be distributed over several N-byte blocks, so that the prediction for the sequence of blocks you have to traverse can be based on more collected statistics.

Tip 2 probably means that branches are executed in parallel with other instructions, perhaps even combined with those in the process of translating the x86 instructions to micro-ops. Tip 4 probably has to do with the size of the L1 code cache: collection of statistics on branching is only done for branches that are in this cache.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Branch Prediction Efficiency

Post by Mike Sherwin »

hgm wrote: Wed Dec 28, 2022 3:27 pm
Mike Sherwin wrote: Wed Dec 28, 2022 2:55 pm So tips 1, 2 and 4 seem to favor the while (pawns) {} way. I'm not sure what tip 3 even means. So everything positive over the last decade or so that I've read about how fast jump tables are compared to conditional jumps is bogus? Absolutely amazing!
It is partly bogus, leftover lore from the era when branches did not have to be predicted. But you should keep in mind that a single jump from a table replaces a whole tree of two-way branches, each of which can potentially be mispredicted. So the worst case there is definitely worse than the worst case with a jump table (which, for the jump table, also happens to be the typical case).

Branch prediction on most CPUs works by predicting, for each N-byte block of code, which N-byte block will most likely have to be executed next, and keeps statistics on that for each such block. When there is an indirect jump in the block, execution can continue in many places, and the prediction can only be one of those, based on statistics for just that one block. With a tree of two-way branches, the different branches can be distributed over several N-byte blocks, so that the prediction for the sequence of blocks you have to traverse can be based on more collected statistics.

Tip 2 probably means that branches are executed in parallel with other instructions, perhaps even combined with those in the process of translating the x86 instructions to micro-ops. Tip 4 probably has to do with the size of the L1 code cache: collection of statistics on branching is only done for branches that are in this cache.
So if the choices are a fairly large tree of binary conditional jumps or a jump table, use the jump table. Unless there is a third option like:

while(pawns){}
while(knights){}
while(Bishops){}
etc

This last way should be easily predictable, thus avoiding the worst case while giving a very good chance of the best case.
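As a rough sketch, one of those per-piece-type loops could look like the following, mirroring the f[fs]/f[ATK] convention from the GenMoves code above. The bitboards, the ATK slot and knightAttacks() here are stand-ins, not the engine's actual code:

Code: Select all

#include <bit>
#include <cstdint>

using u64 = uint64_t;

// Hypothetical engine state standing in for whatever the real engine uses:
u64 knights[2];        // per-side knight bitboards
u64 pieces[2];         // per-side occupancy
u64 f[65];             // per-from-square target sets; slot 64 collects all attacks
constexpr int ATK = 64;

// Classic computed knight attacks so the sketch does not depend on any attack-table library.
u64 knightAttacks(int sq) {
    u64 b  = 1ull << sq;
    u64 l1 = (b >> 1) & 0x7f7f7f7f7f7f7f7fULL;
    u64 l2 = (b >> 2) & 0x3f3f3f3f3f3f3f3fULL;
    u64 r1 = (b << 1) & 0xfefefefefefefefeULL;
    u64 r2 = (b << 2) & 0xfcfcfcfcfcfcfcfcULL;
    u64 h1 = l1 | r1, h2 = l2 | r2;
    return (h1 << 16) | (h1 >> 16) | (h2 << 8) | (h2 >> 8);
}

void GenKnightMoves(int stm) {
    u64 notme = ~pieces[stm];
    u64 bits  = knights[stm];
    // One loop per piece type: the only branch is the loop condition, which
    // follows a short, stable pattern, so it predicts well. No per-iteration
    // dispatch on board[fs] is needed.
    while (bits) {
        int fs  = std::countr_zero(bits);   // from-square = lowest set bit
        u64 atk = knightAttacks(fs);
        f[fs]   = atk & notme;
        f[ATK] |= atk;
        bits   &= bits - 1;                 // clear the processed bit
    }
}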
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Branch Prediction Efficiency

Post by dangi12012 »

Luckily I can definitely answer this for the Zen architecture, but you didn't say that you put a switch inside the inner loop.
This (with the switch inside) is definitely slower:

Code: Select all

u64 fsBits = pieces[stm];
do {
 //Switch
} while (fsBits);
This is many times faster (depending on the code ofc)

Code: Select all

while(pawns){}
while(knights){}
while(Bishops){}
The biggest reason is that from the context of the loops it's clear that you are moving, for example, a knight - and that will have implications or allow more specific code paths in the engine. For example, you may already prefill the move struct with some knight-specific values, etc.

I know an even faster way than all of the above.
Let's assume a position is not degenerate (i.e. no more than 2 pieces of any type per color).
If it is degenerate, you call the loop version above.

Have a template accepting the configuration of pieces in the form of rooks, bishops, queens, and knights, allowing up to and including 2 of each, plus the color. That is just a handful of small counts times the two colors, so a very small number of combinations.
Since the template parameter is compile-time, you can pass this config in as a single integer.
What you get is a nice call graph of templated functions that are not "too" many and that call down towards 0 in a nice, predictable manner when trading down. Best of all: inside each invocation the number of each piece (except pawns) is known as a compile-time constant.

Fastest way (unpublished):

Code: Select all

template <int config> void MoveGen(Board& brd) {
uint64_t rooks = brd.rooks();
if constexpr (num_rooks(config) == 2)
{
    int sq0 = poplsb(rooks); 
    int sq1 = poplsb(rooks);
}
if constexpr (num_rooks(config) == 1)
...
Which will all compile away into the perfect code for the current board state.

I have shown in another thread that positions with 3 queens or 3 rooks per color are really quite rare in 100s of millions of games, and all this loop-vs-switch is a red herring. In truth you could assume 2, 1, or 0 depending on the ply and be correct *almost* 100% of the time. Which means you can pop bits greedily or go with the template solution.
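A sketch of how the runtime piece counts might be packed into that single config integer and dispatched to the right instantiation; everything here - Board, pack, num_rooks, the chosen cases - is an illustrative stand-in, not the unpublished code:

Code: Select all

#include <cstdint>

struct Board {
    uint64_t rooks_ = 0, queens_ = 0, knights_ = 0;
    uint64_t rooks()   const { return rooks_; }
    uint64_t queens()  const { return queens_; }
    uint64_t knights() const { return knights_; }
};

// Pack per-type counts (0..2) into one small integer, two bits per type.
constexpr int pack(int rooks, int queens, int knights) {
    return rooks | (queens << 2) | (knights << 4);
}
constexpr int num_rooks(int cfg)   { return cfg & 3; }
constexpr int num_queens(int cfg)  { return (cfg >> 2) & 3; }
constexpr int num_knights(int cfg) { return (cfg >> 4) & 3; }

template <int config>
void MoveGen(Board& brd) {
    // ... per-type code guarded by `if constexpr (num_rooks(config) == 2)` etc. ...
    (void)brd;
}

// Runtime dispatch: compute the packed config once, then jump straight to the
// specialised instantiation. A real engine would cover every configuration
// (e.g. via a table of function pointers); a few cases are shown for brevity.
void MoveGenDispatch(Board& brd, int rooks, int queens, int knights) {
    switch (pack(rooks, queens, knights)) {
    case pack(2, 1, 2): MoveGen<pack(2, 1, 2)>(brd); break;
    case pack(1, 1, 2): MoveGen<pack(1, 1, 2)>(brd); break;
    case pack(1, 0, 1): MoveGen<pack(1, 0, 1)>(brd); break;
    default:            MoveGen<pack(2, 2, 2)>(brd); break; // fall back to a generic path
    }
}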
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Branch Prediction Efficiency

Post by Mike Sherwin »

dangi12012 wrote: Sat Dec 31, 2022 1:50 am Luckily I can definitely answer this for the Zen architecture, but you didn't say that you put a switch inside the inner loop.
This (with the switch inside) is definitely slower:

Code: Select all

u64 fsBits = pieces[stm];
do {
 //Switch
} while (fsBits);
This is many times faster (depending on the code ofc)

Code: Select all

while(pawns){}
while(knights){}
while(Bishops){}
The biggest reason is that from the context of the loops it's clear that you are moving, for example, a knight - and that will have implications or allow more specific code paths in the engine. For example, you may already prefill the move struct with some knight-specific values, etc.

I know an even faster way than all of the above.
Let's assume a position is not degenerate (i.e. no more than 2 pieces of any type per color).
If it is degenerate, you call the loop version above.

Have a template accepting the configuration of pieces in the form of rooks, bishops, queens, and knights, allowing up to and including 2 of each, plus the color. That is just a handful of small counts times the two colors, so a very small number of combinations.
Since the template parameter is compile-time, you can pass this config in as a single integer.
What you get is a nice call graph of templated functions that are not "too" many and that call down towards 0 in a nice, predictable manner when trading down. Best of all: inside each invocation the number of each piece (except pawns) is known as a compile-time constant.

Fastest way (unpublished):

Code: Select all

template <int config> void MoveGen(Board& brd) {
uint64_t rooks = brd.rooks();
if constexpr (num_rooks(config) == 2)
{
    int sq0 = poplsb(rooks); 
    int sq1 = poplsb(rooks);
}
if constexpr (num_rooks(config) == 1)
...
Which will all compile away into the perfect code for the current board state.

I have shown in another thread that positions with 3 queens or 3 rooks per color are really quite rare in 100s of millions of games, and all this loop-vs-switch is a red herring. In truth you could assume 2, 1, or 0 depending on the ply and be correct *almost* 100% of the time. Which means you can pop bits greedily or go with the template solution.
Believe it or not, I understand everything except where poplsb() comes from. I can't find it on the web. I found LS1B. Is that the same thing? But that probably does not pop the bit. And what is the assembler mnemonic?
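For reference, poplsb usually just means "pop the least-significant set bit": return its index (the LS1B) and clear it from the word. Assuming that meaning, a typical definition is sketched below; std::countr_zero compiles to TZCNT (or BSF) on x86-64, and the b &= b - 1 step compiles to BLSR when BMI1 is available.

Code: Select all

#include <bit>
#include <cstdint>

// Common poplsb idiom (assumed meaning: return and clear the lowest set bit).
// Call only with b != 0.
inline int poplsb(uint64_t& b) {
    int sq = std::countr_zero(b);  // index of the least-significant one bit
    b &= b - 1;                    // clear that bit
    return sq;
}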