Compiler Optimization Question

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Look
Posts: 364
Joined: Thu Jun 05, 2014 2:14 pm
Location: Iran
Full name: Mehdi Amini

Re: Compiler Optimization Question

Post by Look »

Michael Sherwin wrote: Mon Apr 13, 2020 1:11 am A really large function has many local variables and accesses many global variables.
[...]

I recommend breaking a large function into smaller ones. Each new function doing what it's name suggests. You should not use a function that has many arguments.
Farewell.
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Compiler Optimization Question

Post by Sopel »

I'd advise using clang-cl toolchain for Vistual studio. Especially for heavy numerical work. https://godbolt.org/z/id9Ujf

Breaking the function into smaller ones may harm performance if they don't get inlined, especially because you use global variables which would have to be reloaded after every call. You should consider creating smaller functions though if it makes architectural sense, the performance difference would likely be immesurable. Most likely there are more performance critical places in your code
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Compiler Optimization Question

Post by mvanthoor »

I've just quickly scrolled through your GenAllMoves() function. My reaction is basically this: :shock:

A function almost 300 lines long, pure code, and you 'just' need to add the castling code? What sort of move generator is this based on? My gen::all_moves() file/module is barely 200 lines long (and split up into functions), including blank lines and comments.

If you want to keep that readable, you should certainly refactor this. If you're worried that the result will be slower due to function call overhead, just add an "inline" annotation to it to make sure the compiler inlines the parts again. Then you'll have the best of both worlds.

I've seen some of your code snippets around the forums. Please don't take this the wrong way, but they often read like mathematical formulas wrapped in if/switch statements. If I compare both coding styles, I know which function I can still understand a year after I've written it. I'm sorry to say it isn't yours. The reasons are all the nested if/switch statements, and the use of one-letter variables.

Your function:

Code: Select all

void GenAllMovesQ(s32 t, s32 alpha, s32 beta, s32 depth) {
  s32 pce, typ, target, adjtyp, score, i;
  u64 bb, pieces, wPieces, bPieces, aPieces, empty, notme;
  u32 fs, ts, sq;

  i = first[t][ply[t]];

  pieces = piece[t][wtm[t]];
  wPieces = piece[t][WHITE];
  bPieces = piece[t][BLACK];
  aPieces = wPieces | bPieces;
  empty = 0xffffffffffffffff ^ aPieces;
  notme = 0xffffffffffffffff ^ pieces;

  do {
    _BitScanForward64(&fs, pieces);
    pce = board[t][fs];
    switch (pce) {
      case OO: break;
      case WP:
        typ = fs >> 3;
        switch (typ) {
          case RANK1: break;
          case RANK2:
            _BitScanForward64(&sq, wPawnMoves[fs] & aPieces);
            bb = (wPawnMoves[fs] & below[sq]) | (wPawnCapts[fs] & bPieces);
            typ = WDOU;
            break;
          case RANK3:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WSGL;
            break;
          case RANK4:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WSGL;
            break;
          case RANK5:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & (bPieces | epbit[t][ply[t]]));
            typ = WCEP;
            break;
          case RANK6:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WSGL;
            break;
          case RANK7:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WPRO;
            break;
        }
        break;
      case WN:
        bb = knightMoves[fs] & notme;
        typ = WNMV;
        break;
      case WB:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & bob[fs]
           & notme;
        typ = WBMV;
        break;
      case WR:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & rob[fs]
           & notme;
        typ = WRMV;
        break;
      case WQ:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & notme;
        typ = WQMV;
        break;
      case WK:
        bb = kingMoves[fs] & notme;
        typ = WKMV;
        break;
      case BP:
        typ = fs >> 3;
        switch (typ) {
          case RANK1: break;
          case RANK2:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            typ = BPRO;
          break;
          case RANK3:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            typ = BSGL;
            break;
          case RANK4:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & (wPieces | epbit[t][ply[t]]));
            typ = BCEP;
            break;
          case RANK5:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            BSGL;
            break;
          case RANK6:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            typ = BSGL;
            break;
          case RANK7:
            _BitScanForward64(&sq, bPawnMoves[fs] & aPieces);
            bb = (bPawnMoves[fs] & above[sq]) | (bPawnCapts[fs] & wPieces);
            typ = BPRO;
            break;
        }
        break;
      case BN:
        bb = knightMoves[fs] & notme;
        typ = BNMV;
        break;
      case BB:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & bob[fs]
           & notme;
        typ = BBMV;
        break;
      case BR:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & rob[fs]
           & notme;
        typ = BRMV;
        break;
      case BQ:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & notme;
        typ = BQMV;
        break;
      case BK:
        bb = kingMoves[fs] & notme;
        typ = BKMV;
        break;
    }

    while (bb) {
      _BitScanForward64(&ts, bb);
      board[t][fs] = OO;
      piece[t][wtm[t]] ^= one << fs;
      piece[t][wtm[t]] ^= one << ts;
      adjtyp = adjTyp[typ];
      switch (adjtyp) {
        case WMOV:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = pce;
          break;
        case WDUO:
          board[t][ts] = pce;
          epbit[t][ply[t] + 1] = (u64)(ts - fs == 16) << (fs + 8); // method from H.G.Muller
          break;
        case WEPC:
          sq = ts - ((u64)(epbit[t][ply[t]] == one << ts) << 3); // method from H.G.Muller
          target = board[t][sq];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << sq;
          board[t][ts] = pce;
          break;
        case WPRM:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = WQ;
          mat[t] += 800;
          break;
        case BMOV:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = pce;
          break;
        case BDUO:
          board[t][ts] = pce;
          epbit[t][ply[t] + 1] = (u64)(fs - ts == 16) << (fs - 8); // method from H.G.Muller
          break;
        case BEPC:
          sq = ts + ((u64)(epbit[t][ply[t]] == one << ts) << 3); // method from H.G.Muller
          target = board[t][sq];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << sq;
          board[t][ts] = pce;
          break;
        case BPRM:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = WQ;
          mat[t] -= 800;
          break;
      }

      wtm[t] = 1 - wtm[t];
      ply[t]++;

      score = Qsearch(t, alpha, beta);

      ply[t]--;
      wtm[t] = 1 - wtm[t];

      mat[t] -= pceval[target];
      board[t][fs] = OO;
      piece[t][wtm[t]] ^= one << fs;
      piece[t][wtm[t]] ^= one << ts;

      switch (adjtyp) {
        case WMOV:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          break;
        case WDUO:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          epbit[t][ply[t] + 1] = 0;
          break;
        case WEPC:
          board[t][sq] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << sq;
          break;
        case WPRM:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          mat[t] -= 800;
          break;
        case BMOV:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          break;
        case BDUO:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          epbit[t][ply[t] + 1] = 0;
          break;
        case BEPC:
          board[t][sq] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << sq;
          break;
        case BPRM:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          mat[t] += 800;
          break;
      }
    }

    if (score != -INFINITY && (score > alpha || depth > 0)) {
      frsq[t][i] = fs;
      tosq[t][i] = ts;
      trgt[t][i] = target;
      type[t][i] = typ;
      scor[t][i] = score;
      i++;
    }

    pieces ^= (one << fs);
  } while (pieces);

  // castling code goes here

}
My file:
(As you can see, instead of long comments, I prefer using a lot of descriptive "let statements" (variable declarations in Rust). The compiler will optimize most of them away, possibly even all of them.)

Code: Select all

< a few "use" statements here (similar to "include"  or "import") >

const PROMOTION_PIECES: [usize; 4] = [QUEEN, ROOK, BISHOP, KNIGHT];

// This function generates all pseudo-legal moves for the current board and side to move.
pub fn all_moves(board: &Board, list: &mut MoveList) {
    piece(KING, board, list);
    piece(KNIGHT, board, list);
    piece(ROOK, board, list);
    piece(BISHOP, board, list);
    piece(QUEEN, board, list);
    pawns(board, list);
    castling(board, list);
}

/// This function generates pseudo-legal moves for the given piece type.
fn piece(piece: Piece, board: &Board, list: &mut MoveList) {
    let side = board.game_state.active_color as usize;
    let bb_occupancy = board.occupancy();
    let bb_own_pieces = board.bb_pieces[side];
    let mut bb_pieces = board.get_pieces(piece, side);

    // Generate moves for each piece of the type passed into the function.
    while bb_pieces > 0 {
        let from = bits::next(&mut bb_pieces);
        let bb_target = match piece {
            QUEEN | ROOK | BISHOP => board.get_slider_attacks(piece, from, bb_occupancy),
            KING | KNIGHT => board.get_non_slider_attacks(piece, from),
            _ => 0,
        };

        // A piece can move to where there is no piece of our own.
        let bb_moves = bb_target & !bb_own_pieces;
        add_move(board, piece, from, bb_moves, list);
    }
}

// This function generates all the pawn moves.
fn pawns(board: &Board, list: &mut MoveList) {
    const UP: i8 = 8;
    const DOWN: i8 = -8;

    let side = board.game_state.active_color as usize;
    let bb_opponent_pieces = board.bb_pieces[side ^ 1];
    let bb_empty = !board.occupancy();
    let bb_fourth = if side == WHITE {
        BB_RANKS[RANK_4]
    } else {
        BB_RANKS[RANK_5]
    };
    let mut bb_pawns = board.get_pieces(PAWN, side);
    let direction = if side == WHITE { UP } else { DOWN };

    // As long as there are pawns, generate moves for each of them.
    while bb_pawns > 0 {
        let from = bits::next(&mut bb_pawns);
        let bb_push = 1u64 << (from as i8 + direction);
        let bb_one_step = bb_push & bb_empty;
        let bb_two_step = bb_one_step.rotate_left((64 + direction) as u32) & bb_empty & bb_fourth;
        let bb_targets = board.get_pawn_attacks(side, from);
        let bb_captures = bb_targets & bb_opponent_pieces;
        let bb_ep_capture = if let Some(ep) = board.game_state.en_passant {
            bb_targets & (1u64 << ep)
        } else {
            0
        };

        // Gather all moves for the pawn into one bitboard.
        let bb_moves = bb_one_step | bb_two_step | bb_captures | bb_ep_capture;
        add_move(board, PAWN, from, bb_moves, list);
    }
}

// This function generates castling moves (king part only).
fn castling(board: &Board, list: &mut MoveList) {
    let side = board.game_state.active_color as usize;
    let opponent = side ^ 1;
    let castle_perms_white = (board.game_state.castling & (CASTLE_WK | CASTLE_WQ)) > 0;
    let castle_perms_black = (board.game_state.castling & (CASTLE_BK | CASTLE_BQ)) > 0;
    let bb_occupancy = board.occupancy();
    let mut bb_king = board.get_pieces(KING, side);
    let from = bits::next(&mut bb_king);

    if side == WHITE && castle_perms_white {
        // Kingside
        if board.game_state.castling & CASTLE_WK > 0 {
            let bb_kingside_blockers: u64 = (1u64 << F1) | (1u64 << G1);
            let is_kingside_blocked = (bb_occupancy & bb_kingside_blockers) > 0;

            if !is_kingside_blocked
                && !info::square_attacked(board, opponent, E1)
                && !info::square_attacked(board, opponent, F1)
            {
                let to = (1u64 << from) << 2;
                add_move(board, KING, from, to, list);
            }
        }

        if board.game_state.castling & CASTLE_WQ > 0 {
            // Queenside
            let bb_queenside_blockers: u64 = (1u64 << B1) | (1u64 << C1) | (1u64 << D1);
            let is_queenside_blocked = (bb_occupancy & bb_queenside_blockers) > 0;

            if !is_queenside_blocked
                && !info::square_attacked(board, opponent, E1)
                && !info::square_attacked(board, opponent, D1)
            {
                let to = (1u64 << from) >> 2;
                add_move(board, KING, from, to, list);
            }
        }
    } else if side == BLACK && castle_perms_black {
        // Kingside
        if board.game_state.castling & CASTLE_BK > 0 {
            let bb_kingside_blockers: u64 = (1u64 << F8) | (1u64 << G8);
            let is_kingside_blocked = (bb_occupancy & bb_kingside_blockers) > 0;

            if !is_kingside_blocked
                && !info::square_attacked(board, opponent, E8)
                && !info::square_attacked(board, opponent, F8)
            {
                let to = (1u64 << from) << 2;
                add_move(board, KING, from, to, list);
            }
        }

        // Queenside
        if board.game_state.castling & CASTLE_BQ > 0 {
            let bb_queenside_blockers: u64 = (1u64 << B8) | (1u64 << C8) | (1u64 << D8);
            let is_queenside_blocked = (bb_occupancy & bb_queenside_blockers) > 0;

            if !is_queenside_blocked
                && !info::square_attacked(board, opponent, E8)
                && !info::square_attacked(board, opponent, D8)
            {
                let to = (1u64 << from) >> 2;
                add_move(board, KING, from, to, list);
            }
        }
    }
}

// This function turns the given parameters into actual moves and puts them into the move list.
fn add_move(board: &Board, piece: Piece, from: u8, to: Bitboard, list: &mut MoveList) {
    let mut bb_to = to;
    let side = board.game_state.active_color as usize;
    let promotion_rank = (if side == WHITE { RANK_8 } else { RANK_1 }) as u8;

    // As long as there are still to-squres in bb_to, this piece has moves to add.
    while bb_to > 0 {
        let to_square = bits::next(&mut bb_to);
        let capture = board.piece_list[to_square as usize];
        let promotion = (piece == PAWN) && board::square_on_rank(to_square, promotion_rank);
        let en_passant = if let Some(square) = board.game_state.en_passant {
            (piece == PAWN) && (square == to_square)
        } else {
            false
        };
        let double_step = (piece == PAWN) && ((to_square as i8 - from as i8).abs() == 16);
        let castling = (piece == KING) && ((to_square as i8 - from as i8).abs() == 2);

        // Gather all data for this move into one 64-bit integer.
        let move_data = (piece as u64)
            | ((from as u64) << Shift::FromSq as u64)
            | ((to_square as u64) << Shift::ToSq as u64)
            | ((capture as u64) << Shift::Capture as u64)
            | ((en_passant as u64) << Shift::EnPassant as u64)
            | ((double_step as u64) << Shift::DoubleStep as u64)
            | ((castling as u64) << Shift::Castling as u64);

        if !promotion {
            // No promotion: add this move.
            let m = Move {
                data: move_data | ((PNONE as u64) << Shift::Promotion as u64),
            };
            list.push(m);
        } else {
            // Promotion. Add one move for each piece to promote to.
            for piece in PROMOTION_PIECES.iter() {
                let m = Move {
                    data: move_data | ((*piece as u64) << Shift::Promotion as u64),
                };
                list.push(m);
            }
        }
    }
}
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin »

What sort of move generator is this based on?
This is SISSY bitboards. https://www.chessprogramming.org/SISSY_Bitboards Right now I am just using the queen code for all sliders because it is debugged and working properly. I'll write separate code for the bishop and rook once the engine is up and running.

I guess my coding style is because of how I think and my work with assembler. Yes, the code is 300+ lines but only a small amount of that code gets executed each loop. So in execution it is not all that big. Also it includes inline make move and unmake move. And there is only one "if" statement in all that code! Of course the castling code will require more "if" statements but they will be outside the generation loop. To each their own, right? Anyway the ever changing state of a chess engine makes branch prediction probamatic. Also there are many other factors that can disrupt branch prediction. Cache thrashing and memory alignment and even the linker and link order can disrupt branch prediction.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin »

Sopel wrote: Tue Apr 14, 2020 1:36 pm I'd advise using clang-cl toolchain for Vistual studio. Especially for heavy numerical work. https://godbolt.org/z/id9Ujf

Breaking the function into smaller ones may harm performance if they don't get inlined, especially because you use global variables which would have to be reloaded after every call. You should consider creating smaller functions though if it makes architectural sense, the performance difference would likely be immesurable. Most likely there are more performance critical places in your code
I tried that about a month ago and I could not get it working. Clang does a much better job optimising SISSY bitboards than other compilers. On the other hand I have written my own assembler code for SISSY that I prefer. But, I won't know for sure until it can be tested.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Compiler Optimization Question

Post by bob »

One more note for profiling. Profile the code you are going to run... the complete chess engine. Don't take a snippet, rewrite it in a way you think might be faster, then just profile the two snippets with a simple driver program. That can be highly misleading since you are not executing all the other chess code between passes through the snippets. This hides memory access delays, cache thrashing, cache invalidations, etc, etc and etc. Profile it like you plan on running it and you will get useful information.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin »

bob wrote: Wed Apr 15, 2020 2:40 am One more note for profiling. Profile the code you are going to run... the complete chess engine. Don't take a snippet, rewrite it in a way you think might be faster, then just profile the two snippets with a simple driver program. That can be highly misleading since you are not executing all the other chess code between passes through the snippets. This hides memory access delays, cache thrashing, cache invalidations, etc, etc and etc. Profile it like you plan on running it and you will get useful information.
Thanks Bob, I have been studying these factors and therefore I understand the reasons behind everything you mentioned. That is why I know what questions to ask. Still it never hurts to have confirmation by someone with your experience!

When RomiChess was written I was using MSVC 2005. RomiChess would not run in debug mode. So I had to get it working without using a debugger. And I did not know about the assert function. So, that was a lot of fun -- NOT, lol.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Compiler Optimization Question

Post by bob »

I know even more. I developed code when there WAS no profiler or debugger except for me. :)

Those were really fun days...