Branch Prediction Efficiency

Discussion of chess software programming and technical issues.

Moderator: Ras

Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Branch Prediction Efficiency

Post by Mike Sherwin »

In my bitboard engine for move generation I'm currently doing something like this.

u64 pieces[2];
u64 fsBits = pieces[stm];
do {

} while (fsBits);

As opposed to something like this.

while (pawns[stm]) {}
while (knights[stm]) {}
while (bishops[stm]) {}
etc.

Am I justified in believing the first way is faster?
JacquesRW
Posts: 127
Joined: Sat Jul 30, 2022 12:12 pm
Full name: Jamie Whiting

Re: Branch Prediction Efficiency

Post by JacquesRW »

What are you currently doing in the body of the first loop? I’m pretty sure most people use the second method.

I would definitely bet that the second case is faster if you’re having to work out what piece is being moved for each iteration of the loop in the first case.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Branch Prediction Efficiency

Post by Mike Sherwin »

JacquesRW wrote: Wed Dec 28, 2022 2:46 am What are you currently doing in the body of the first loop? I’m pretty sure most people use the second method.

I would definitely bet that the second case is faster if you’re having to work out what piece is being moved for each iteration of the loop in the first case.
Inside the do loop there is one switch statement to jump to the code for the piece type. The piece types are enumerated like so.

enum { WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK };

There is also a board[64] that contains the pieces. The switch statement should generate a jump table.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Branch Prediction Efficiency

Post by Mike Sherwin »

The enum I gave was simplistic. Here are the actual enum, the GenMoves function, and the MakeMove function.
None of this has been debugged yet. GenMoves does not make a list of moves; it only stores the ts bits in f[fs]. Actual moves will be extracted later.

Code: Select all

enum { WP, WN, WB, WR, WRC, WQ, WK, WKC, ES, BP, BN, BB, BR, BRC, BQ, BK, BKC,
WPD, BPD, WPE, BPE, WPP, BPP, WCS, WCL, BCS, BCL };

s32 GenMoves(Thread* t, u64* f) {
  u64 atk;

  u64 fbits = pieces[stm];
  u64 enemy = pieces[1 - stm];
  u64 occ = fbits | enemy;
  u64 empty = ~occ;
  u64 notme = empty | enemy;

  f[ATK] = 0;

  do {
    s08 fs = std::countr_zero(fbits);
    u64 type = board[fs];
    switch (type) {
    case WP:
      switch (fs >> 3) {
      case RANK1:
        // can't get here
        break;
      case RANK2:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        f[fs] |= ((f[fs] << 8) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK3:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK4:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK5:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & (enemy | epbit[ply]); f[ATK] |= atk;
        break;
      case RANK6:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK7:
        f[fs] = ((0x0000000000000100ull << fs) & empty);
        atk = M42::pawn_attacks(0, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      }
      break;
    case WN:
      atk = M42::KnightAttacks[fs];
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WB:
      atk = M42::bishop_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WR:
      atk = M42::rook_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WRC:
      atk = M42::rook_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WQ:
      atk = M42::queen_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WK:
      atk = M42::king_attacks(fs);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case WKC:
      atk = M42::king_attacks(fs);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case ES:
      // can't get here
      break;
    case BP:
      switch (fs >> 3) {
      case RANK1:
        // can't get here
        break;
      case RANK2:
        f[fs] = (1ull << (fs - 8)) & empty;
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK3:
        f[fs] = (1ull << (fs - 8)) & empty;   // single push
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK4:
        f[fs] = (1ull << (fs - 8)) & empty;   // single push
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & (enemy | epbit[ply]); f[ATK] |= atk;
        break;
      case RANK5:
        f[fs] = (1ull << (fs - 8)) & empty;   // single push
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK6:
        f[fs] = (1ull << (fs - 8)) & empty;   // single push
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      case RANK7:
        f[fs] = (1ull << (fs - 8)) & empty;   // single push from the start rank
        f[fs] |= ((f[fs] >> 8) & empty);      // double push
        atk = M42::pawn_attacks(1, fs);
        f[fs] |= atk & enemy; f[ATK] |= atk;
        break;
      }
      break;
    case BN:
      atk = M42::knight_attacks(fs);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BB:
      atk = M42::bishop_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BR:
      atk = M42::rook_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BRC:
      atk = M42::rook_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BQ:
      atk = M42::queen_attacks(fs, occ);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BK:
      atk = M42::king_attacks(fs);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    case BKC:
      atk = M42::king_attacks(fs);
      f[fs] = atk & notme; f[ATK] |= atk;
      break;
    }
    fbits ^= 1ull << fs;
  } while (fbits);
  return (kings[1 - stm] & f[ATK]) != 0;  // nonzero if the enemy king is attacked (avoid truncating the u64 to s32)
}

void MakeMove(Thread* t, uMove* m) {
  s32 sq, et;

  m->s.tt = board[m->s.ts];

  switch (m->s.ft) {
  case WP:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WP;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WN:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WN;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WB:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WB;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WRC:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WR;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WR:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WR;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WQ:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WQ;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case WKC:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WK;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	kings[WHITE] ^= 1ull << m->s.fs;
	kings[WHITE] ^= 1ull << m->s.ts;
	break;
  case WK:
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = WK;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	kings[WHITE] ^= 1ull << m->s.fs;
	kings[WHITE] ^= 1ull << m->s.ts;
	break;
  case ES:
	// can't get here
	break;
  case BP:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BP;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BN:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BN;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BB:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BB;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BRC:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BR;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BR:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BR;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BQ:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BQ;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BKC:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BK;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case BK:
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = BK;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case WPD:
	board[m->s.fs] = ES;
	board[m->s.ts] = WP;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	epbit[ply + 1] = (u64)(m->s.ts - m->s.fs == 16) << (m->s.fs + 8);
	break;
  case BPD:
	board[m->s.fs] = ES;
	board[m->s.ts] = BP;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	epbit[ply + 1] = (u64)(m->s.fs - m->s.ts == 16) << (m->s.fs - 8);
	break;
  case WPE:
	sq = m->s.ts - ((epbit[ply] == (1ull << m->s.ts)) << 3);
	m->s.tt = board[sq];
	board[sq] = ES;
	bMat -= pieceVal[m->s.tt];
	pieces[BLACK] ^= 1ull << sq;
	board[m->s.fs] = ES;
	board[m->s.ts] = WP;
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case BPE:
	sq = m->s.ts + ((epbit[ply] == (1ull << m->s.ts)) << 3);
	m->s.tt = board[sq];
	board[sq] = ES;
	wMat -= pieceVal[m->s.tt];
	pieces[WHITE] ^= 1ull << sq;
	board[m->s.fs] = ES;
	board[m->s.ts] = BP;
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case WPP:
	bMat -= pieceVal[m->s.tt & 63];
	pieces[BLACK] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = wProTyp[m->s.tt >> 6];
	pieces[WHITE] ^= 1ull << m->s.fs;
	pieces[WHITE] ^= 1ull << m->s.ts;
	break;
  case BPP:
	wMat -= pieceVal[m->s.tt & 63];
	pieces[WHITE] ^= 1ull << m->s.ts;
	board[m->s.fs] = ES;
	board[m->s.ts] = bProTyp[m->s.tt >> 6];
	pieces[BLACK] ^= 1ull << m->s.fs;
	pieces[BLACK] ^= 1ull << m->s.ts;
	break;
  case WCS:
	board[e1] = ES;
	board[g1] = WK;
	pieces[WHITE] ^= 1ull << e1;
	pieces[WHITE] ^= 1ull << g1;
	kings[WHITE] ^= 1ull << e1;
	kings[WHITE] ^= 1ull << g1;
	board[h1] = ES;
	board[f1] = WR;
	pieces[WHITE] ^= 1ull << h1;
	pieces[WHITE] ^= 1ull << f1;
	break;
  case WCL:
	board[e1] = ES;
	board[c1] = WK;
	pieces[WHITE] ^= 1ull << e1;
	pieces[WHITE] ^= 1ull << c1;
	kings[WHITE] ^= 1ull << e1;
	kings[WHITE] ^= 1ull << c1;
	board[a1] = ES;
	board[d1] = WR;
	pieces[WHITE] ^= 1ull << a1;
	pieces[WHITE] ^= 1ull << d1;
	break;
  case BCS:
	board[e8] = ES;
	board[g8] = BK;
	pieces[BLACK] ^= 1ull << e8;
	pieces[BLACK] ^= 1ull << g8;
	kings[BLACK] ^= 1ull << e8;
	kings[BLACK] ^= 1ull << g8;
	board[h8] = ES;
	board[f8] = BR;
	pieces[BLACK] ^= 1ull << h8;
	pieces[BLACK] ^= 1ull << f8;
	break;
  case BCL:
	board[e8] = ES;
	board[c8] = BK;
	pieces[BLACK] ^= 1ull << e8;
	pieces[BLACK] ^= 1ull << c8;
	kings[BLACK] ^= 1ull << e8;
	kings[BLACK] ^= 1ull << c8;
	board[a8] = ES;
	board[d8] = BR;
	pieces[BLACK] ^= 1ull << a8;
	pieces[BLACK] ^= 1ull << d8;
	break;
  }
  ply++;
  stm = 1 - stm;
}
Edit: The philosophy is qsearch efficiency. Since in qsearch almost all moves are captures, the capture is handled automatically, as in bMat -= pieceVal[m->s.tt]. If the "captured" piece is an empty square, then bMat -= 0.
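For concreteness, a minimal sketch of what that relies on (the values and the trimmed enum here are only illustrative; the point is that the ES slot is zero, so "capturing" an empty square subtracts nothing and needs no branch):

Code: Select all

// Illustrative only: pieceVal indexed by the same piece enum, with the ES slot
// set to zero so a "capture" of an empty square subtracts nothing.
#include <cstdint>

enum { WP, WN, WB, WR, WRC, WQ, WK, WKC, ES, BP, BN, BB, BR, BRC, BQ, BK, BKC };

static const int32_t pieceVal[17] = {
    100, 325, 350, 500, 500, 950, 0, 0,   // white pawn .. king (castling-rights variants share values)
    0,                                     // ES: empty square
    100, 325, 350, 500, 500, 950, 0, 0     // black pawn .. king
};

// In MakeMove:  bMat -= pieceVal[m->s.tt];   // m->s.tt == ES  =>  bMat -= 0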
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Branch Prediction Efficiency

Post by hgm »

Mike Sherwin wrote: Wed Dec 28, 2022 3:35 am
JacquesRW wrote: Wed Dec 28, 2022 2:46 am What are you currently doing in the body of the first loop? I’m pretty sure most people use the second method.

I would definitely bet that the second case is faster if you’re having to work out what piece is being moved for each iteration of the loop in the first case.
Inside the do loop there is one switch statement to jump to the code for the piece type. The piece types are enumerated like so.

enum { WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK };

There is also a board[64] that contains the pieces. The switch statement should generate a jump table.
Indirect jumps are almost guaranteed to be mispredicted, though. For normal branches you would still predict correctly in 50% of the cases if you just predicted randomly.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Branch Prediction Efficiency

Post by Mike Sherwin »

hgm wrote: Wed Dec 28, 2022 1:25 pm
Mike Sherwin wrote: Wed Dec 28, 2022 3:35 am
JacquesRW wrote: Wed Dec 28, 2022 2:46 am What are you currently doing in the body of the first loop? I’m pretty sure most people use the second method.

I would definitely bet that the second case is faster if you’re having to work out what piece is being moved for each iteration of the loop in the first case.
Inside the do loop there is one switch statement to jump to the code for the piece type. The piece types are enumerated like so.

enum { WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK };

There is also a board[64] that contains the pieces. The switch statement should generate a jump table.
Indirect jumps are almost guaranteed to be mispredicted, though. For normal branches you would still predict correctly in 50% of the cases if you just predicted randomly.
I found this for an AMD EPYC 7642.
Tip 1: On this CPU a branch instruction that is taken but not predicted, costs ~7 cycles more than one that is taken and predicted. Even if the branch was unconditional.
Tip 2: Conditional branches never-taken are basically free - at least on this CPU.
Tip 3: In the hot code you want to have less than 2K function calls - on this CPU.
Tip 4: On this CPU it's possible to get <1 cycle per predicted jmp when the hot loop fits in ~32KiB.

So tips 1, 2 and 4 seem to favor the while (pawns) {} way. I'm not sure what tip 3 even means. So everything positive over the last decade or so that I've read about how fast jump tables are compared to conditional jumps is bogus? Absolutely amazing!
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Branch Prediction Efficiency

Post by hgm »

Mike Sherwin wrote: Wed Dec 28, 2022 2:55 pm So tips 1, 2 and 4 seem to favor the while (pawns) {} way. I'm not sure what tip 3 even means. So everything positive over the last decade or so that I've read about how fast jump tables are compared to conditional jumps is bogus? Absolutely amazing!
It is partly bogus, leftover lore from the era when branches did not have to be predicted. But you should keep in mind that a single jump from a table replaces a whole tree of two-way branches, each of which can potentially be mispredicted. So the worst case there is definitely worse than the worst case with a jump table (which, for the jump table, also happens to be the typical case).

Branch prediction on most CPUs works by predicting, for each N-byte block of code, which N-byte block will most likely have to be executed next, and keeps statistics on that for each such block. When there is an indirect jump in the block, execution can continue in many places, and the prediction can only be one of those, based on statistics for just that one block. With a tree of two-way branches, the different branches can be distributed over several N-byte blocks, so that the prediction for the sequence of blocks you have to traverse can be based on more collected statistics.

Tip 2 probably means that branches are executed in parallel with other instructions, perhaps even combined with those in the process of translating the x86 instructions to micro-ops. Tip 4 probably has to do with the size of the L1 code cache: collection of statistics on branching is only done for branches that are in this cache.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Branch Prediction Efficiency

Post by Mike Sherwin »

hgm wrote: Wed Dec 28, 2022 3:27 pm
Mike Sherwin wrote: Wed Dec 28, 2022 2:55 pm So tips 1, 2 and 4 seem to favor the while (pawns) {} way. I'm not sure what tip 3 even means. So everything positive over the last decade or so that I've read about how fast jump tables are compared to conditional jumps is bogus? Absolutely amazing!
It is partly bogus, leftover lore from the era when branches did not have to be predicted. But you should keep in mind that a single jump from a table replaces a whole tree of two-way branches, each of which can potentially be mispredicted. So the worst case there is definitely worse than the worst case with a jump table (which, for the jump table, also happens to be the typical case).

Branch prediction on most CPUs works by predicting, for each N-byte block of code, which N-byte block will most likely have to be executed next, and keeps statistics on that for each such block. When there is an indirect jump in the block, execution can continue in many places, and the prediction can only be one of those, based on statistics for just that one block. With a tree of two-way branches, the different branches can be distributed over several N-byte blocks, so that the prediction for the sequence of blocks you have to traverse can be based on more collected statistics.

Tip 2 probably means that branches are executed in parallel with other instructions, perhaps even combined with those in the process of translating the x86 instructions to micro-ops. Tip 4 probably has to do with the size of the L1 code cache: collection of statistics on branching is only done for branches that are in this cache.
So if the choices are a fairly large tree of binary conditional jumps or a jump table, use the jump table. Unless there is a third option like:

while(pawns){}
while(knights){}
while(Bishops){}
etc

This last way should be easily predictable, thus avoiding the worst case while giving a very good chance of the best case.
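As a rough sketch, one of those per-piece-type loops could look like the following, mirroring the f[fs]/f[ATK] convention from the GenMoves code above. The bitboards, the ATK slot and knightAttacks() here are stand-ins, not the engine's actual code:

Code: Select all

#include <bit>
#include <cstdint>

using u64 = uint64_t;

// Hypothetical engine state standing in for whatever the real engine uses:
u64 knights[2];        // per-side knight bitboards
u64 pieces[2];         // per-side occupancy
u64 f[65];             // per-from-square target sets; slot 64 collects all attacks
constexpr int ATK = 64;

// Classic computed knight attacks so the sketch does not depend on any attack-table library.
u64 knightAttacks(int sq) {
    u64 b  = 1ull << sq;
    u64 l1 = (b >> 1) & 0x7f7f7f7f7f7f7f7fULL;
    u64 l2 = (b >> 2) & 0x3f3f3f3f3f3f3f3fULL;
    u64 r1 = (b << 1) & 0xfefefefefefefefeULL;
    u64 r2 = (b << 2) & 0xfcfcfcfcfcfcfcfcULL;
    u64 h1 = l1 | r1, h2 = l2 | r2;
    return (h1 << 16) | (h1 >> 16) | (h2 << 8) | (h2 >> 8);
}

void GenKnightMoves(int stm) {
    u64 notme = ~pieces[stm];
    u64 bits  = knights[stm];
    // One loop per piece type: the only branch is the loop condition, which
    // follows a short, stable pattern, so it predicts well. No per-iteration
    // dispatch on board[fs] is needed.
    while (bits) {
        int fs  = std::countr_zero(bits);   // from-square = lowest set bit
        u64 atk = knightAttacks(fs);
        f[fs]   = atk & notme;
        f[ATK] |= atk;
        bits   &= bits - 1;                 // clear the processed bit
    }
}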
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Branch Prediction Efficiency

Post by dangi12012 »

Luckily I can definitely answer this for the Zen architecture, but you didn't say that you put a switch inside the inner loop.
This (with the switch inside) is definitely slower:

Code: Select all

u64 fsBits = pieces[stm];
do {
 //Switch
} while (fsBits);
This is many times faster (depending on the code ofc)

Code: Select all

while(pawns){}
while(knights){}
while(Bishops){}
The biggest reason is that from the context of the loops it's clear that you are moving, for example, a knight - and that will have implications or allow more specific code paths in the engine. For example, you may already prefill the move struct with some knight-specific values, etc.

I know an even faster way than all of the above.
Let's assume a position is not degenerate (i.e. no more than 2 pieces of any type per color).
If it is degenerate, you call the loop version above.

Have a template accepting the configuration of pieces in the form of rooks, bishops, queens, and knights, allowing up to and including 2 of each, plus the color. That is just a handful of small counts times the two colors, so a very small number of combinations.
Since the template parameter is compile-time, you can pass this config in as a single integer.
What you get is a nice call graph of templated functions that are not "too" many and that call down towards 0 in a nice, predictable manner when trading down. Best of all: inside each invocation the number of each piece (except pawns) is known as a compile-time constant.

Fastest way (unpublished):

Code: Select all

template <int config> void MoveGen(Board& brd) {
uint64_t rooks = brd.rooks();
if constexpr (num_rooks(config) == 2)
{
    int sq0 = poplsb(rooks); 
    int sq1 = poplsb(rooks);
}
if constexpr (num_rooks(config) == 1)
...
Which will all compile away into the perfect code for the current board state.

I have shown in another thread that positions with 3 queens or 3 rooks per color are really quite rare in 100s of millions of games, and all this loop-vs-switch is a red herring. In truth you could assume 2, 1, or 0 depending on the ply and be correct *almost* 100% of the time. Which means you can pop bits greedily or go with the template solution.
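A sketch of how the runtime piece counts might be packed into that single config integer and dispatched to the right instantiation; everything here - Board, pack, num_rooks, the chosen cases - is an illustrative stand-in, not the unpublished code:

Code: Select all

#include <cstdint>

struct Board {
    uint64_t rooks_ = 0, queens_ = 0, knights_ = 0;
    uint64_t rooks()   const { return rooks_; }
    uint64_t queens()  const { return queens_; }
    uint64_t knights() const { return knights_; }
};

// Pack per-type counts (0..2) into one small integer, two bits per type.
constexpr int pack(int rooks, int queens, int knights) {
    return rooks | (queens << 2) | (knights << 4);
}
constexpr int num_rooks(int cfg)   { return cfg & 3; }
constexpr int num_queens(int cfg)  { return (cfg >> 2) & 3; }
constexpr int num_knights(int cfg) { return (cfg >> 4) & 3; }

template <int config>
void MoveGen(Board& brd) {
    // ... per-type code guarded by `if constexpr (num_rooks(config) == 2)` etc. ...
    (void)brd;
}

// Runtime dispatch: compute the packed config once, then jump straight to the
// specialised instantiation. A real engine would cover every configuration
// (e.g. via a table of function pointers); a few cases are shown for brevity.
void MoveGenDispatch(Board& brd, int rooks, int queens, int knights) {
    switch (pack(rooks, queens, knights)) {
    case pack(2, 1, 2): MoveGen<pack(2, 1, 2)>(brd); break;
    case pack(1, 1, 2): MoveGen<pack(1, 1, 2)>(brd); break;
    case pack(1, 0, 1): MoveGen<pack(1, 0, 1)>(brd); break;
    default:            MoveGen<pack(2, 2, 2)>(brd); break; // fall back to a generic path
    }
}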
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Branch Prediction Efficiency

Post by Mike Sherwin »

dangi12012 wrote: Sat Dec 31, 2022 1:50 am Luckily I can definitely answer this for the Zen architecture, but you didn't say that you put a switch inside the inner loop.
This (with the switch inside) is definitely slower:

Code: Select all

u64 fsBits = pieces[stm];
do {
 //Switch
} while (fsBits);
This is many times faster (depending on the code ofc)

Code: Select all

while(pawns){}
while(knights){}
while(Bishops){}
The biggest reason is that from the context of the loops it's clear that you are moving, for example, a knight - and that will have implications or allow more specific code paths in the engine. For example, you may already prefill the move struct with some knight-specific values, etc.

I know an even faster way than all of the above.
Let's assume a position is not degenerate (i.e. no more than 2 pieces of any type per color).
If it is degenerate, you call the loop version above.

Have a template accepting the configuration of pieces in the form of rooks, bishops, queens, and knights, allowing up to and including 2 of each, plus the color. That is just a handful of small counts times the two colors, so a very small number of combinations.
Since the template parameter is compile-time, you can pass this config in as a single integer.
What you get is a nice call graph of templated functions that are not "too" many and that call down towards 0 in a nice, predictable manner when trading down. Best of all: inside each invocation the number of each piece (except pawns) is known as a compile-time constant.

Fastest way (unpublished):

Code: Select all

template <int config> void MoveGen(Board& brd) {
uint64_t rooks = brd.rooks();
if constexpr (num_rooks(config) == 2)
{
    int sq0 = poplsb(rooks); 
    int sq1 = poplsb(rooks);
}
if constexpr (num_rooks(config) == 1)
...
Which will all compile away into the perfect code for the current board state.

I have shown in another thread that positions with 3 queens or 3 rooks per color are really quite rare in 100s of millions of games, and all this loop-vs-switch is a red herring. In truth you could assume 2, 1, or 0 depending on the ply and be correct *almost* 100% of the time. Which means you can pop bits greedily or go with the template solution.
Believe it or not, I understand everything except where poplsb() comes from. I can't find it on the web. I found LS1B. Is that the same thing? But that probably does not pop the bit. And what is the assembler mnemonic?
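For reference, poplsb usually just means "pop the least-significant set bit": return its index (the LS1B) and clear it from the word. Assuming that meaning, a typical definition is sketched below; std::countr_zero compiles to TZCNT (or BSF) on x86-64, and the b &= b - 1 step compiles to BLSR when BMI1 is available.

Code: Select all

#include <bit>
#include <cstdint>

// Common poplsb idiom (assumed meaning: return and clear the lowest set bit).
// Call only with b != 0.
inline int poplsb(uint64_t& b) {
    int sq = std::countr_zero(b);  // index of the least-significant one bit
    b &= b - 1;                    // clear that bit
    return sq;
}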