Fast lookup of pieces for move generation without bitboards

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

User avatar
hgm
Posts: 27702
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Fast lookup of pieces for move generation without bitboa

Post by hgm »

zd3nik wrote:But I'm not going to use a captured flag on each piece. I'm going to do the compacting thing - compaction only occurs within each inner list. And I'm grouping all the sliding pieces into one inner list. That will make MVV/LVA logic more complicated, but it solves other problems specific to the engine I'm tinkering with.
Beware that I expect the solution with the presence mask to be a faster, and definitely a lot cleaner. It would save you from updating list[victim].location; you could just leave that at the square where it was captured, as it would never be accessed again until the 'present' bit of the mask would be set again (i.e. on UnMake of the capture). Then you don't have to alter codes of pieces on the board, you would not have piece numbering change in the course of searching the same node (which IMO would make the engine extremely bug prone). You can assign the bits to the pieces in such a way that list[n].bit*DEBRUIJN>>27 would immediately give you the piece number, without the need for a table lookup.

Note that the loops for individual piece types you get typically just do 0, 1 or 2 iterations, so that the loop branch will become hard to predict and cause lots of mispredictions. It would no longer be true that it is mostly taken, and only causes a mispredict at the end of the loop.

As an aside, using index to determine type doesn't seem very efficient, as you would have to test for a range of values, which takes more than twice as long as testing for a single one. Also note that it is very ill advised to test for things in general: it implies branches, which could be mispredicted. Your pseudocode in the first post is extremely inefficient (if you can say such a thing from pseudo-code). Rather than having a switch(board[sqr]) and separate functions for each piece type you can do

Code: Select all

nr = board[sqr];
type = list[nr].type;
GenerateMoves(type, sqr);
where GenerateMoves then would just take the possible moves from a type-dependent list:

Code: Select all

while((step = steps[type++]) != 0) {
  // generate moves with board-step 'step'
}
The 'type' would then be encoded as the offset in the array 'steps' where the moves for that piece begin. Something like

Code: Select all

int steps[] ={
1, -1, 16, -16, 0, // Rook moves
15, 17, -15, -17, 0, // Bishop moves
14, 31, 33, 18, -14, -31, -33, -18, 0, // Knight moves
...
};
and type=0 would mean Rook, type=5 would mean Bishop, type=10 would mean Knight, etc.
Anyway, we'll see how it works, if I can get it to work. I suspect all the extra incremental maintenance and de-referencing is going to slow things down enough that it won't be any faster in the end than what I have now (which is basically just a perft routine at present - no qsearch and no positional eval). The piece lists should help tremendously in qsearch move generation and positional eval though, so hopefully it's worth the extra complexity.
I expect you will see a significant speedup if the code for other things in your engine isn't unnecessary slow, so that that would be the bottleneck. Especially in the end-game when the board population starts thinning the board-scan will start to soak up a lot of time.
zd3nik
Posts: 193
Joined: Wed Mar 11, 2015 3:34 am
Location: United States

Re: Fast lookup of pieces for move generation without bitboa

Post by zd3nik »

hgm wrote:As an aside, using index to determine type doesn't seem very efficient, as you would have to test for a range of values, which takes more than twice as long as testing for a single one.
I must have misunderstood your original description. I thought you implied the fixed offset ranges where each piece type is stored was used to determine piece type (in some cases). But I was indeed a bit skeptical about using that method. As you point out it would require two tests instead of one.

Anyway, I'm only using the offset to determine whether a piece is a slider (_board[H1] >= FirstSliderOffset). So this is a moot point for my project.
hgm wrote:Also note that it is very ill advised to test for things in general: it implies branches, which could be mispredicted. Your pseudocode in the first post is extremely inefficient (if you can say such a thing from pseudo-code). Rather than having a switch(board[sqr]) and separate functions for each piece type you can do

Code: Select all

nr = board[sqr];
type = list[nr].type;
GenerateMoves(type, sqr);
where GenerateMoves then would just take the possible moves from a type-dependent list:

Code: Select all

while((step = steps[type++]) != 0) {
  // generate moves with board-step 'step'
}
The 'type' would then be encoded as the offset in the array 'steps' where the moves for that piece begin. Something like

Code: Select all

int steps[] ={
1, -1, 16, -16, 0, // Rook moves
15, 17, -15, -17, 0, // Bishop moves
14, 31, 33, 18, -14, -31, -33, -18, 0, // Knight moves
...
};
and type=0 would mean Rook, type=5 would mean Bishop, type=10 would mean Knight, etc.
I was under the impression that any modern compiler worth spit will turn switch statements on small sets of integer values into an indexed jump rather than a bunch of branches. Something like this:

Code: Select all

goto label[value_as_index]
label_for_value_1:
...
label_for_value_2:
...
etc
Isn't this essentially how switch statements on integer values work? And if so does this still cause occasional pipeline stalls? And if your using compile time constants in the switch statement (which isn't the case in the code in question, but something I often do) the compiler can do it's magic to eliminate all but the matched case.

All that being said, I like your suggested alternative to using switch statements. But it would require an almost complete rewrite of my project.
hgm wrote:I expect you will see a significant speedup if the code for other things in your engine isn't unnecessary slow, so that that would be the bottleneck. Especially in the end-game when the board population starts thinning the board-scan will start to soak up a lot of time.
This project is essentially nothing more than a perft analyzer at present. No qsearch, no positional eval, etc. So there isn't any code other than the move generation to slow things down (yet).

I just finished the first draft of the change to use piece lists. Since I'm adapting existing code to use the piece list concept rather than writing from scratch with piece lists in mind I'm sure my implementation isn't 100% optimal. But if it isn't a net loss the piece lists will definitely come in handy in various parts of the engine (which aren't written yet).

I start and stop a lot of experimental chess engine implementations. And I don't like to make code public until it's at least feature complete. But if anyone would like to see the code in it's current state to see how I implemented piece lists let me know.
zd3nik
Posts: 193
Joined: Wed Mar 11, 2015 3:34 am
Location: United States

Re: Fast lookup of pieces for move generation without bitboa

Post by zd3nik »

Initial results are in and it's a general success!

The GenerateMoves function is no where near the top in terms of cost any more. before this change it was 2nd most expensive function. After the change I only see the functions I expect at the top of the list now (GenerateMoves nowhere near the top).

* Average KLeafs/sec before change: 50,197
* Average KLeafs/sec after change: 53,917

Not a huge gain in terms of raw leafs/sec, but considering the extra code complexity and additional memory updates in move Exec and Undo calls it's impressive that it was a net gain at all. And the piece lists will come in very handy in future routines such as QSearchMoveGen and SEE etc.

As for the tiny net improvement in leafs/sec keep in mind I adapted an existing move generator to use piece lists. I didn't write the move generator with piece lists in mind. So my implementation doesn't use all the tricks. For example, my code relies on board slots being set to empty when there's no piece there, so I'm not using the trick of leaving unoccupied board slots as-is. This means my Exec and Undo calls must do a lot more writes to memory than a pure piece-lists implementation might.

And I'm using piece pointers instead of indexes in the board array. Meaning board[square] contains a pointer to a piece struct, which lives somewhere in the piece list. So I can lookup the type of piece on a square via board[square]->type. NOTE: piece[0] is reserved as the EMPTY piece so I assign that piece to empty board slots instead of assigning them as NULL. This way I don't have to do NULL checks before de-referencing board slots. Also keep in mind it's not always necessary to de-reference the pointer contained within a board slot. Sometimes all you need to know is it's relative position within the piece list (e.g. if (board[square] >= FirstSliderPointer) do slider stuff).

The examples shown so far in this conversation have board[square] containing an index within the piece list. Which means you would lookup piece type via piece[board[square]].type. The way I did it makes my head hurt less, but maybe it's slower for some reason - I don't care.

To complete the loop here is the old GenerateMoves routine vs the new routine (in pseudo code of course).

Without piece lists:

Code: Select all

  int pieces = pieceCount[color]; 
  for &#40;int sqr = A1; &#40;pieces > 0&#41; && &#40;sqr <= H8&#41;; ++sqr&#41; &#123; 
    if &#40;sqr & 0x88&#41; &#123; 
      sqr += 7; 
    &#125; 
    else switch &#40;board&#91;sqr&#93;) &#123; 
      case &#40;color|Pawn&#41;&#58;   --pieces; generatePawnMoves&#40;sqr&#41;; break; 
      case &#40;color|Knight&#41;&#58; --pieces; generateKnightMoves&#40;sqr&#41;; break; 
      case &#40;color|Bishop&#41;&#58; --pieces; generateBishopMoves&#40;sqr&#41;; break; 
      case &#40;color|Rook&#41;&#58;   --pieces; generateRookMoves&#40;sqr&#41;; break; 
      case &#40;color|Queen&#41;&#58;  --pieces; generateQueenMoves&#40;sqr&#41;; break; 
      case &#40;color|King&#41;&#58;   --pieces; generateKingMoves&#40;sqr&#41;; break; 
    &#125; 
  &#125; 
With piece lists:

Code: Select all

  for &#40;int i = 0; i < _pcount&#91;color|Pawn&#93;; ++i&#41; &#123;
    sqr = _piece&#91;PawnOffset + i&#93;.sqr;
    generatePawnMoves&#40;sqr&#41;;
  &#125;
  for &#40;int i = 0; i < _pcount&#91;color|Knight&#93;; ++i&#41; &#123;
    sqr = _piece&#91;KnightOffset + i&#93;.sqr;
    generateKnightMoves&#40;sqr&#41;;
  &#125;
  for &#40;int i = 0; i < _pcount&#91;color|Bishop&#93;; ++i&#41; &#123;
    sqr = _piece&#91;BishopOffset + i&#93;.sqr;
    generateBishopMoves&#40;sqr&#41;;
  &#125;
  for &#40;int i = 0; i < _pcount&#91;color|Rook&#93;; ++i&#41; &#123;
    sqr = _piece&#91;RookOffset + i&#93;.sqr;
    generateRookMoves&#40;sqr&#41;;
  &#125;
  for &#40;int i = 0; i < _pcount&#91;color|Queen&#93;; ++i&#41; &#123;
    sqr = _piece&#91;QueenOffset + i&#93;.sqr;
    generateQueenMoves&#40;sqr&#41;;
  &#125;
  sqr = _piece&#91;KingOffset&#93;.sqr;
  generateKingMoves&#40;sqr&#41;;
Optimized code isn't always smaller/cleaner code.
User avatar
hgm
Posts: 27702
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Fast lookup of pieces for move generation without bitboa

Post by hgm »

zd3nik wrote:I must have misunderstood your original description. I thought you implied the fixed offset ranges where each piece type is stored was used to determine piece type (in some cases). But I was indeed a bit skeptical about using that method. As you point out it would require two tests instead of one.
What I meant is that when you want to loop over all pieces of a specific type, you can just loop over that section. In loops you test for the end of the range only, and using the fixed end of the range for that is most efficient. If you want to do incompatibly different things with various piece types, you can then split it in loops over one type that only do the thing that applies to those, relieving you of the need to test inside the loop what piece type you have.

As a side note: by clever assignment of the numbers it can often be arranged that being inside a certain range coincides with certain bits being set, which can be eaisly tested for with just a single test. E.g. Bishops = 4-7, Rooks = 8-11, Queens = 12-15 can test for diagonal sliders as (pieceNr & 8) and for orthogonal sliders as (pieceNr & 4). If it would beimportant to also specifically test for Queens, you could move their codes to 28-31, so that (pieceNr & 16) will tell you if you have a Queen. This of course leaves many holes in the list
I was under the impression that any modern compiler worth spit will turn switch statements on small sets of integer values into an indexed jump rather than a bunch of branches. Something like this:

Code: Select all

goto label&#91;value_as_index&#93;
label_for_value_1&#58;
...
label_for_value_2&#58;
...
etc
Isn't this essentially how switch statements on integer values work? And if so does this still cause occasional pipeline stalls? And if your using compile time constants in the switch statement (which isn't the case in the code in question, but something I often do) the compiler can do it's magic to eliminate all but the matched case.
In any case it almost always causes pipeline stalls. What code it compiles to (branch-tree or indirect jump) alsodepends on whether the range of cases is reasonably contiguous and small, and whether the compiler can know that. It cannot reserve a table of 4G elements (all but 6 pointing to the defined cases, and all others to the 'default' case) when the value to switch on can be an arbitrary integer without any clue to its value. So it would have a limited table, and would have to ensure that the index is not out of bounds (which might require another test and branch).

But the indirect jump almost always will be a mispredict. Two-way branches can be reasonably well predicted as being taken or non-taken, both for branches that have a large preference for one of the two (the CPU keeps a 2-bit counter for each cached branch for that), as well as for branches that are unpredictable in themselves, but correlate very well with branches taken shortly before it. And when the prediction is 'taken', the CPU remembers where it jumped. Indirect branches, however, are always taken, but to different locations all the time. There is no way a CPU is going to remember all the possible destinations, and have some clever algorithm to decide which one is the most likely to occur next, based on the history for this branch or others. The best it can do is predict it will be the same as last time; then it has to remember only one address, in line with the prediction for two-way branches. This catches the sitution where one of the cases dominates very much over the other, but almost certainly is a mispredict when the cases vary. For a tree of two-way branches the CPU would be able to remember the destination of each of them,and try to predict whether it will be used based on the outcome of previous branches, which can cause much lower misprediction rate.
All that being said, I like your suggested alternative to using switch statements. But it would require an almost complete rewrite of my project.
Well, I guess that cannot be helped. You should decide for yourself if that is worth it. Note that you can still speed up the code I gave by making use that the number of directions in which chess pieces move tends to be a multiple of 4, so that you could 'unroll' the loop over directions so that you only test for the end of the list every 4 directions. That would not work for Pawns, but generating Pawn moves needs so many things not needed for pieces (diverging capture/non-capture, conditional double-push, promotion...) that you'd better do it in a separate loop over Pawns. So it definitely pays to have a separate Pawn section in your piece list.

Other tricks that help to speed up code is to combine branches as much as possible. E.g. if the same loop over directions is going to handle both sliders and leapers, you would want to avoid to have an if(slider) part somewhere in the loop. So you could write the loop for sliders, with an inner do-while loop inside it that iterates over the distance travelled. There would be a necessary branch at the while that decides if more steps should be added, based on if the to-square just treated was occupied (or off-board). So it would read something like while(victim == 0) if 0 encodes an empty square. You can piggy-back the test for leaper on that by writing while((victim | isLeaper) == 0), so that it will also not any more steps when the piece is a leaper, even when there is no victim. This just requires an extra OR instruction, and uses the same branch. When pieces can be dichotomized into sliders and leapers, as in orthodox Chess, the isLeaper variable can be set outside the loop over directions.

For mixed leaper-slider pieces, as you have in Capablanca Chess and Shogi, it would have to depend on the move direction you are currently treating. In Fairy-Max this is done by having a second table (next to the board step) that for each direction specifies if it is a leaper or slider (or hopper...) move. In Shokidoki I just do that based on the direction index itself, e.g. isLeaper = (direction & 16) would alternate 16 slider directions with 16 leaper directions in the steps table, and I would align the allowed-move lists of the mixed pieces such that they straddle the boundaries in the right place to declare some of their moves as leaping, and others as sliding.
User avatar
hgm
Posts: 27702
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Fast lookup of pieces for move generation without bitboa

Post by hgm »

zd3nik wrote:For example, my code relies on board slots being set to empty when there's no piece there, so I'm not using the trick of leaving unoccupied board slots as-is.
That would not work, would it? If you would not set the slot empty, the next piece landing on it would think that it captured something. If you could afford to not update the board, there would be no point in having a board at all. The only purpose of the board array is to be able to know what you capture (in terms of piece-list index) with a single memory access. Like the use of the piece list is to get the location (in terms of square number) of every piece in a single access.

Note that writing a 0 (or some other constant) to memory is hardly a performace bottleneck; in a typical program the store unit is idle most of the time, so that a store can almost always be done in parallel with other stuff.
zd3nik
Posts: 193
Joined: Wed Mar 11, 2015 3:34 am
Location: United States

Re: Fast lookup of pieces for move generation without bitboa

Post by zd3nik »

hgm wrote:As a side note: by clever assignment of the numbers it can often be arranged that being inside a certain range coincides with certain bits being set, which can be eaisly tested for with just a single test. E.g. Bishops = 4-7, Rooks = 8-11, Queens = 12-15 can test for diagonal sliders as (pieceNr & 8) and for orthogonal sliders as (pieceNr & 4). If it would beimportant to also specifically test for Queens, you could move their codes to 28-31, so that (pieceNr & 16) will tell you if you have a Queen. This of course leaves many holes in the list
Good idea. Unfortunately my decision to use piece pointers instead of piece numbers in the board array eliminates that possibility in my particular case. But in my particular case there's only one place I every do a lookup of piece type from the board array (if board[square]->type == (color|Knight) ...). In all other cases I'm either starting from the piece list, or testing for sliders (if (board[square] >= FirstSliderOffset) ...).
hgm wrote:In any case it almost always causes pipeline stalls. What code it compiles to (branch-tree or indirect jump) alsodepends on whether the range of cases is reasonably contiguous and small, and whether the compiler can know that. It cannot reserve a table of 4G elements (all but 6 pointing to the defined cases, and all others to the 'default' case) when the value to switch on can be an arbitrary integer without any clue to its value. So it would have a limited table, and would have to ensure that the index is not out of bounds (which might require another test and branch).

But the indirect jump almost always will be a mispredict. Two-way branches can be reasonably well predicted as being taken or non-taken, both for branches that have a large preference for one of the two (the CPU keeps a 2-bit counter for each cached branch for that), as well as for branches that are unpredictable in themselves, but correlate very well with branches taken shortly before it. And when the prediction is 'taken', the CPU remembers where it jumped. Indirect branches, however, are always taken, but to different locations all the time. There is no way a CPU is going to remember all the possible destinations, and have some clever algorithm to decide which one is the most likely to occur next, based on the history for this branch or others. The best it can do is predict it will be the same as last time; then it has to remember only one address, in line with the prediction for two-way branches. This catches the sitution where one of the cases dominates very much over the other, but almost certainly is a mispredict when the cases vary. For a tree of two-way branches the CPU would be able to remember the destination of each of them,and try to predict whether it will be used based on the outcome of previous branches, which can cause much lower misprediction rate.
That makes sense. Thanks for the explanation. Branch prediction has always been a mystery to me, now I at least have an idea of how it works.
hgm wrote:Note that you can still speed up the code I gave by making use that the number of directions in which chess pieces move tends to be a multiple of 4, so that you could 'unroll' the loop over directions so that you only test for the end of the list every 4 directions. That would not work for Pawns, but generating Pawn moves needs so many things not needed for pieces (diverging capture/non-capture, conditional double-push, promotion...) that you'd better do it in a separate loop over Pawns. So it definitely pays to have a separate Pawn section in your piece list.
My move generator is already doing something very similar to this (although doing it precisely the way you suggest would still be a re-write). I didn't want to confuse the issue by elaborating more in my first reply to this suggestion.

I'm using arrays that tell me the end point in each valid direction from a given starting square. For example the array for a bishop on A1 would only contain one end point: H8. On B1 it contains 2 end points: A2 and H7. The end point array for a knight on B1 contains A3, C3, and D2. etc. King moves, knight moves, and all slider moves use this array of end points logic. So all they have to do is expand in the the direction toward each end point until they hit the end point or a piece. King and knight moves will always hit the end point on the first try in each direction. Pawn moves and castling of course use different logic.
zd3nik
Posts: 193
Joined: Wed Mar 11, 2015 3:34 am
Location: United States

Re: Fast lookup of pieces for move generation without bitboa

Post by zd3nik »

hgm wrote:
zd3nik wrote:For example, my code relies on board slots being set to empty when there's no piece there, so I'm not using the trick of leaving unoccupied board slots as-is.
That would not work, would it? If you would not set the slot empty, the next piece landing on it would think that it captured something. If you could afford to not update the board, there would be no point in having a board at all. The only purpose of the board array is to be able to know what you capture (in terms of piece-list index) with a single memory access. Like the use of the piece list is to get the location (in terms of square number) of every piece in a single access.

Note that writing a 0 (or some other constant) to memory is hardly a performace bottleneck; in a typical program the store unit is idle most of the time, so that a store can almost always be done in parallel with other stuff.
Apparently it would work, because that is how it is suggested in the pseudo code on the chessprogramming wiki. https://chessprogramming.wikispaces.com/Piece-Lists but perhaps the extra lines to clear the vacated squares were omitted by mistake or simply to keep the pseudo code as small/simple as possible?

But I swear there was some specific verbiage I read somewhere the specificaly stated it was unnecessary to clear the vacated square. Although now I can't find it.

In any case I agree with you that it seems rather pointless to have a board array if you can't use it to determine whether there's a piece on a given square or not.
User avatar
hgm
Posts: 27702
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Fast lookup of pieces for move generation without bitboa

Post by hgm »

Well, I have been writing something like that for the 'viewDistance board', but the purpose of that data structure was to hold data for occupied squares only. So there,when a square is emptied, the data for that square becomes a don't care,so that you might as well leave the original view distances there. Because this information would be very helpful during UnMake, because it tells you straight away where the nearest occupied neighbors that looking across the square are (whose viewing distance would be cut when the square is re-occupied). So you would want to save and restore the information whenever the square gets occupied again by another piece deeper in the tree, so you can be sure it is still there when the you get to the UnMake for the original move. It is better to postpone that saving to the MakeMove of a later move that occupies it, than doing it straight away on the MakeMove of the move that evacuated it, because it might never get occupied. So saving it pre-emptively might be a waste of time. In particular, in QS empty square never get occupied.

This is the reason for only keeping data on the occupied squares. Updating the viewDistances for a non-capture becomes than a pain, because you land on a square that contains no information at all on what its nearest occupied neighbors are. (At best it contains info on what these neighbors were when it got evacuated, but that could be completely invalid by now.) So you will have to go through the painful process of scanning the board in all directions from the square to get at the nearest neighbors.

But fortunately most moves in the tree are captures, where this problem does not exist. And also keeping info for all the empty squares would require you to update the view distance for all empty squares between A and C when a square C somewhere in between gets evacuated. Which can be pretty expensive when the distances are large and many squares ly in between. And the information might never be used, if no moves deeper in the tree land on that ray (which is sort of guaranteed when only captures are searched).