tcusr wrote: ↑Sat Nov 13, 2021 4:05 pm
sorry to revive this thread but i don't know where else to ask these stupid questions.
1) what is the kogge-stone algorithm used for?
of all the engines i've read, none uses this algorithm; only https://github.com/ankan-ban/perft_gpu seems to use it, and only for test purposes, i think. that's not even a good idea, since gerd said there are faster techniques to generate attacks on the fly from a single square.
2) apart from a hand-crafted eval, where do bitboards really prove to be better than a mailbox representation?
pruning/move-ordering techniques do not depend on the board representation, so why should i use bitboards when i only use a tapered eval with PSQT?
thanks in advance
Let's say moves are being generated for a queen that attacks the enemy king. In mailbox that may not be discovered until the very last move is generated and stored in a move list. In bitboard, as soon as you have the bits they can be tested, and no queen moves will be added. In RomiChess, during the normal search, when moves are generated only the bits are generated and stored. Only after the ht move, promotions, winning captures, killers etc. are the remaining bits spun into a move list. Keeping and using the bits makes for fast verification that a ht or killer move is legal. Most of the time no move list is even created.
is this some kind of attack table with staged move generation? it's too advanced for me now
Not at all, it is just delaying spinning the bitboards into a move list. Some code from RomiChess.
Edit: Well, yes it is staged but not in a complicated way.
nice! i don't have a very complicated search function: i just generate captures, sort them with mvv/lva, generate quiets (in random order) and then proceed as usual. in my new engine i want to implement an ID loop and history so i'll get better ordering, but first i have to solve this move generation problem.
btw i think i found a solution. to generate quiet moves i will do something like this:
for (rooks = getRooks(); rooks; rooks &= rooks - 1) {
    U64 r = rooks & -rooks;
    // i have a template function for each direction, it considers wraps
    while ((r <<= 8) & empty) addMove(...);
}
then whenever i need a target mask i will use the KS algo (i know that for a single bit it's a waste but i decided that i don't care)
now that i think about it, i can use KS to generate ALL squares attacked by rooks/bishops/queens and use them 1) in my is_king_in_check function and 2) to find pinned pieces, which will be useful for my eval (like olithink) and for move generation for some pieces (pawns pinned horizontally, and knights)
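For reference, the set-wise Kogge-Stone fill for one direction looks like this (a sketch of the standard technique, north only; the other seven directions follow the same pattern with different shift amounts and wrap masks):

```cpp
#include <cstdint>

using U64 = std::uint64_t;

// Kogge-Stone occluded fill, north: computes every square each slider can
// reach going north through empty squares, for ALL sliders at once.
// 'gen' is the set of rooks/queens, 'pro' (propagator) the empty squares.
U64 northOccludedFill(U64 gen, U64 pro) {
    gen |= pro & (gen << 8);   // 1 step
    pro &=       (pro << 8);
    gen |= pro & (gen << 16);  // 2 more steps
    pro &=       (pro << 16);
    gen |= pro & (gen << 32);  // 4 more steps
    return gen;
}

// Shift the fill once more to get the attack set proper
// (attacks exclude the origin but include the first blocker square).
U64 northAttacks(U64 sliders, U64 empty) {
    return northOccludedFill(sliders, empty) << 8;
}
```

No wrap mask is needed for north/south; the east/west and diagonal variants additionally AND the propagator with ~file_a or ~file_h, exactly like the direction template mentioned above.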
This looks nice. Think about using the intrinsics for optimal speed, because not all compilers can figure out that X &= X - 1 has a dedicated instruction (BLSR, from BMI1).
If you go for generic C++ I still recommend keeping a macro that uses _blsr_u64 and falls back to X &= X - 1 if unsupported. Can you share your direction template?
I am quite certain that this can be fully inlined with a switch on the square (obtained via _tzcnt_u64).
The output would look like the code in this: http://www.talkchess.com/forum3/viewtop ... t=espresso
I didn't see the while loop around addMove before. This could be interesting performance-wise, because you are not building a uint64_t but directly enumerating inside a loop, which is as fast as possible.
Good work!
thank you!!
i don't care about these micro-optimizations (at the moment); on my 8 year old laptop X &= X - 1 compiles to blsr anyway (g++).
addMove just adds the destination square to the move list because i want to write an AB chess engine. from what i understood you want MCTS, so instead of adding the move you can immediately play it and recurse. that does not seem like a bad idea
btw shift just branchlessly shifts left/right depending on the direction's sign (north = 8, south = -8, east = 1, west = -1) and then, if it's going east or west, it applies the correct mask (~file_a for east, ~file_h for west)
what do you mean by inlining it with _tzcnt_u64? i use that (std::countr_zero in C++20) just to find the "to" square after it's been shifted
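A sketch of what such a direction template might look like (this is my guess at the described design, not the poster's actual code; compass values as in the post, plus the diagonal combinations):

```cpp
#include <cstdint>

using U64 = std::uint64_t;

constexpr U64 FILE_A = 0x0101010101010101ULL;
constexpr U64 FILE_H = 0x8080808080808080ULL;

// Shift left for positive directions, right for negative ones, and mask off
// the wrapped file for any eastward or westward component.
// north = 8, south = -8, east = 1, west = -1; diagonals are their sums.
template <int Dir>
constexpr U64 shift(U64 b) {
    constexpr U64 mask =
        (Dir ==  1 || Dir ==  9 || Dir == -7) ? ~FILE_A :  // eastward: drop wraps onto the a-file
        (Dir == -1 || Dir == -9 || Dir ==  7) ? ~FILE_H :  // westward: drop wraps onto the h-file
        ~0ULL;                                             // pure north/south: no wrap possible
    if constexpr (Dir > 0)
        return (b << Dir) & mask;
    else
        return (b >> -Dir) & mask;
}
```

Because Dir is a template parameter, the sign test and the mask selection are resolved at compile time, so each instantiation is a single shift and (at most) one AND, which matches the "branchless" description.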
Yes, i mean std::countr_zero, but on the rooks bitboard. This gives you the square as an int in 0...63 form.
That can be a parameter for a function which knows how many steps you can take in each direction from that square.
For example, square 0 only has positive rays, and you can unroll all the loops in a template.
Then you don't need masks at all, and no while loop: it becomes if () makemove; else if () makemove; etc., repeated exactly the right number of times.
Just some things that came to mind. But as always, don't optimize too early. Implement your functionality FULLY first.
it's a nice idea (which follows your code style) but i hate long compile times, and maximum speed is not what i am looking for.
in a few days (i hope) i'll come back with perft results for this hybrid move generator!
PS. i had a similar idea to branchlessly add pawn promotions: use the pawns on the 7th rank (after shifting them to the first rank) as the index into an array of functions that just add the correct number of pawn moves to the move list.
but considering that promotions are rare, this is overly complicated for no reason, so i gave up. it could also be applied to double pawn pushes and castling though
tcusr wrote: ↑Mon Nov 08, 2021 5:19 pm
UPDATE:
while i was getting frustrated writing this new move generator i remembered that i still have my old engine which uses magic bitboards, so why not run some tests?
i decided to search kiwipete at depth 8 with 2 generators:
1) fancy magic bitboards (with a table of 800kb)
2) kogge-stone generator
these are the results (the engine uses only mvv/lva for ordering and psqt tables for evaluation, so it's pretty lightweight):
magic bitboards: 6.2 seconds at 2.95 Mnodes/sec
kogge-stone: 8.3 seconds at 2.2 Mnodes/sec
tbh i expected worse. even though kogge-stone was always slower in all the other tests, at higher depths the difference was, at most, 3 seconds.
in my generator i plan to use kogge-stone only for generating captures, so in the end it will be a tiny bit faster than pure kogge-stone
Kogge-Stone is intended for a direction-wise, set-wise approach (similar to set-wise pawn attacks left and right) with multiple sliding pieces, i.e. rooks and queens for orthogonal directions and bishops and queens for diagonal directions. For single pieces there are faster approaches, also with low memory consumption.
hi gerd, i have a question for you (i'm replying to this post so you get a notification)
do you have the source code for dirgolem? i'd like to see how you use SIMD instructions in your move generator
i have never written SIMD code so i don't even know where to start... i want to write portable code and i think i can use gcc/clang vector extensions (https://gcc.gnu.org/onlinedocs/gcc/Vect ... sions.html) or std::experimental::simd (probably not, though; there's not even documentation for it).
i have a vague idea of how i can use them. does this look reasonable? i don't even know if it's correct
// vec8 is a vector of 8 U64
vec8 northAttacks; // to be filled with the bitboards i want to move
northAttacks[0] = rooks & -rooks;   // lowest rook
rooks &= rooks - 1;                 // clear it
northAttacks[1] = rooks & -rooks;   // next rook
// and so on with queens and king and pawns (only pieces that go north)
northAttacks <<= 8;
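With the gcc/clang vector extensions linked above, the core of that sketch could look roughly like this (a guess at the approach, not code from any real engine; `vec8` and `northOne` are names made up here):

```cpp
#include <cstdint>

using U64 = std::uint64_t;

// gcc/clang vector extension: 8 bitboards packed into one 64-byte vector,
// which the compiler maps onto SSE/AVX/AVX-512 registers as available.
typedef U64 vec8 __attribute__((vector_size(64)));

// Shift all 8 direction sets one rank north in a single vector operation;
// the scalar shift count is broadcast to every lane.
vec8 northOne(vec8 b) {
    return b << 8;
}
```

Subscripting (`v[0]`) and the usual operators all work on these vector types, so filling the lanes one piece at a time, as in the pseudocode above, compiles as written.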
tcusr wrote: ↑Sun Nov 14, 2021 7:13 pm
i have never written SIMD code so i don't even know where to start... i want to write portable code and i think i can use gcc/clang vector extensions (https://gcc.gnu.org/onlinedocs/gcc/Vect ... sions.html) or std::experimental::simd (probably not though, there's not even documentation for this).
Use Agner Fog's excellent vector class library. It compiles portably, and you get optimal SIMD instructions where supported, via operator overloading. https://www.agner.org/optimize/vcl_manual.pdf
I recommend the library, as there are many stumbling blocks in getting SIMD right, and with it you can write code like you normally would. Just be prepared to get used to thinking in bulk elements of 4/8.
To see what I mean it's best to just read the manual:
Agner Fog wrote:"Permute, blend, and gather functions. There are many different machine instructions that move
data between different vector elements. Some of these instructions can only do very specific
data permutations. The VCL uses quite a lot of metaprogramming to find the instruction or
sequence of instructions that best fits the specific permutation pattern specified. Often, the
higher instruction sets give more efficient results"
thanks!
i prefer not to use external libraries when i have native options from gcc/clang or (in the future) from the C++ standard library...
I like a coding style that's portable and also compiles with Visual Studio. Then I recommend writing wrapper classes with the correct overloads and fallbacks. You will learn a lot and the code stays maintainable and readable. Raw immintrin.h code was never really intended by Intel to be written by hand in C++ and is truly unreadable. The wrappers will be just as fast, since C++ is a zero-overhead-abstraction language.
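A minimal sketch of that wrapper idea (scalar fallback only, names invented here; an AVX2 specialization would replace the array with a __m256i behind the same interface, leaving call sites untouched):

```cpp
#include <cstdint>

// Portable 4-wide bitboard vector with overloaded operators. Compilers
// typically auto-vectorize these fixed-count loops; a platform-specific
// build can specialize the storage without changing the interface.
struct Vec4x64 {
    std::uint64_t v[4];

    Vec4x64 operator<<(int s) const {
        Vec4x64 r;
        for (int i = 0; i < 4; ++i) r.v[i] = v[i] << s;  // lane-wise shift
        return r;
    }
    Vec4x64 operator&(const Vec4x64& o) const {
        Vec4x64 r;
        for (int i = 0; i < 4; ++i) r.v[i] = v[i] & o.v[i];  // lane-wise AND
        return r;
    }
};
```

Code written against such a class reads like ordinary bitboard code, which is the maintainability point being made above.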