Split Index Super Set Yielding (SISSY) Bitboards

Michael Sherwin · Post by **Michael Sherwin** » Sat Feb 15, 2020 3:57 am

The good news is is that the Chess4.5 classic style bitboards have been completely replaced in RomiChess and it is still running flawlessly. The not bad but not great news is that using the queen code to generate bishop and rook moves is rather slow. About a 300,000 nodes per second reduction. But, when the day is done there is a new bitboard method to explore!

Michael Sherwin · Post by **Michael Sherwin** » Sat Feb 15, 2020 6:06 am

I guess the day was not quite at an end, yet. The bishop code was so similar to the queen code that it did not take much effort to write it. And some speed was recovered. Once the rook code is written and the queen made more efficient with 7 lookups instead of the eight that there is now I expect to gain about 400k nodes per second, about a 6% speedup. That is over classic bitboards. Over magic I think only the queen code will be faster. On the other hand experimentation can be done having two bitboard accumulators bb1 and bb2 operating in parallel pipes at the same time. I'm not an expert in doing that but it is worth a try. Another consideration is that my processor only has 12 megs of on chip cache and all this extra data to drive these bitboards may be hindering performance. Someone with one of the new Rysens should test this new bitboard method.

mar · Post by **mar** » Sat Feb 15, 2020 12:33 pm

Hmm, I still have my doubts your queen attacks will outperform fancy magics (14 instructions, 10 lookups),
but I found that gcc and msc have a hard time optimizing your queen attack code (clang gets it).
However, I've found that it's possible to rewrite the code a bit to help the compiler (now gcc generates 32 instructions for your queen attack code):

Code: Select all

uint64_t qss[64][256*8];

uint64_t qatt(int fs, uint64_t aPieces)
{
    uint64_t *tmp = qss[fs];
    return tmp[(aPieces & 255)*8+0] 
                     & tmp[((aPieces >>  (8-3)) & 255*8)+1]
                     & tmp[((aPieces >> (16-3)) & 255*8)+2]
                     & tmp[((aPieces >> (24-3)) & 255*8)+3]
                     & tmp[((aPieces >> (32-3)) & 255*8)+4]
                     & tmp[((aPieces >> (40-3)) & 255*8)+5]
                     & tmp[((aPieces >> (48-3)) & 255*8)+6]
                     & tmp[((aPieces >> (56-3)) & 255*8)+7];
}

I was unable to get it working in C using typedef to keep 3 dimensions instead of 2 so the code now looks a bit cryptic, but I guess the idea is obvious.
This also helps microsoft compiler to generate shorter code.

Here's a link to godbolt: https://godbolt.org/z/QXD-BW

Michael Sherwin · Post by **Michael Sherwin** » Sat Feb 15, 2020 2:11 pm

mar wrote: ↑Sat Feb 15, 2020 12:33 pm Hmm, I still have my doubts your queen attacks will outperform fancy magics (14 instructions, 10 lookups),
but I found that gcc and msc have a hard time optimizing your queen attack code (clang gets it).
However, I've found that it's possible to rewrite the code a bit to help the compiler (now gcc generates 32 instructions for your queen attack code):
Code: Select all
uint64_t qss[64][256*8];

uint64_t qatt(int fs, uint64_t aPieces)
{
    uint64_t *tmp = qss[fs];
    return tmp[(aPieces & 255)*8+0] 
                     & tmp[((aPieces >>  (8-3)) & 255*8)+1]
                     & tmp[((aPieces >> (16-3)) & 255*8)+2]
                     & tmp[((aPieces >> (24-3)) & 255*8)+3]
                     & tmp[((aPieces >> (32-3)) & 255*8)+4]
                     & tmp[((aPieces >> (40-3)) & 255*8)+5]
                     & tmp[((aPieces >> (48-3)) & 255*8)+6]
                     & tmp[((aPieces >> (56-3)) & 255*8)+7];
}
I was unable to get it working in C using typedef to keep 3 dimensions instead of 2 so the code now looks a bit cryptic, but I guess the idea is obvious.
This also helps microsoft compiler to generate shorter code.

Here's a link to godbolt: https://godbolt.org/z/QXD-BW

Thanks Martin! Interesting and really cool stuff.

I just have three thoughts. One, can we get it down to just one dimension? Two, there is a caveat about the fact that since the full 8 bit index is used that the & 255 can be done away with. I tried my original union idea and it is faster.

Code: Select all

       case WQUEEN:
        h->moves[id] = qss[fs][occ.b08.rank1][0] 
                     & qss[fs][occ.b08.rank2][1]
                     & qss[fs][occ.b08.rank3][2]
                     & qss[fs][occ.b08.rank4][3]
                     & qss[fs][occ.b08.rank5][4]
                     & qss[fs][occ.b08.rank6][5]
                     & qss[fs][occ.b08.rank7][6]
                     & qss[fs][occ.b08.rank8][7];
        h->attacked |= h->moves[id];
        h->moves[id] &= notMe;
        continue;

That would be at least 8 less instructions, right? And three, the optimised (hopefully) idea requires only 7 lookups instead of the eight I initially settled on. It is strange but extracting the rank first allows only 6 more lookups after that even though one of the 6 may be redundant. Very strange!

mar · Post by **mar** » Sat Feb 15, 2020 2:34 pm

Michael Sherwin wrote: ↑Sat Feb 15, 2020 2:11 pm Thanks Martin! Interesting and really cool stuff. I just have three thoughts. One, can we get it down to just one dimension? Two, there is a caveat about the fact that since the full 8 bit index is used that the & 255 can be done away with. I tried my original union idea and it is faster.
Code: Select all
       case WQUEEN:
        h->moves[id] = qss[fs][occ.b08.rank1][0] 
                     & qss[fs][occ.b08.rank2][1]
                     & qss[fs][occ.b08.rank3][2]
                     & qss[fs][occ.b08.rank4][3]
                     & qss[fs][occ.b08.rank5][4]
                     & qss[fs][occ.b08.rank6][5]
                     & qss[fs][occ.b08.rank7][6]
                     & qss[fs][occ.b08.rank8][7];
        h->attacked |= h->moves[id];
        h->moves[id] &= notMe;
        continue;
That would be at least 8 less instructions, right? And three, the optimised (hopefully) idea requires only 7 lookups instead of the eight I initially settled on. It is strange but extracting the rank first allows only 6 more lookups after that even though one of the 6 may be redundant. Very strange!

For your union idea gcc generates 44 instructions (12 more). Of course counting instructions is not so great, because you have to take into account what instructions are executed and how they're scheduled, I've seen code with "more instructions" actually running faster.
(but if a compiler generates less instructions then that's usually a good sign).

So even if you think you're helping the compiler, it will still use shifts+masking instead of extra 8 x byte-lookups.

I don't think a single array would help, you still need to compute the initial index using from square (and all compilers will "get" that to optimize the lookup). The changes I posted do exactly that - just cache pointer then everything else is a 1d-lookup.

What matters at the end of the day is performance so all this would have to be measured first.

One less lookup seems tempting, do you have the code with this for queen attacks?

Michael Sherwin · Post by **Michael Sherwin** » Sat Feb 15, 2020 3:09 pm

mar wrote: ↑Sat Feb 15, 2020 2:34 pm
Michael Sherwin wrote: ↑Sat Feb 15, 2020 2:11 pm Thanks Martin! Interesting and really cool stuff. I just have three thoughts. One, can we get it down to just one dimension? Two, there is a caveat about the fact that since the full 8 bit index is used that the & 255 can be done away with. I tried my original union idea and it is faster.
Code: Select all
       case WQUEEN:
        h->moves[id] = qss[fs][occ.b08.rank1][0] 
                     & qss[fs][occ.b08.rank2][1]
                     & qss[fs][occ.b08.rank3][2]
                     & qss[fs][occ.b08.rank4][3]
                     & qss[fs][occ.b08.rank5][4]
                     & qss[fs][occ.b08.rank6][5]
                     & qss[fs][occ.b08.rank7][6]
                     & qss[fs][occ.b08.rank8][7];
        h->attacked |= h->moves[id];
        h->moves[id] &= notMe;
        continue;
That would be at least 8 less instructions, right? And three, the optimised (hopefully) idea requires only 7 lookups instead of the eight I initially settled on. It is strange but extracting the rank first allows only 6 more lookups after that even though one of the 6 may be redundant. Very strange!
For your union idea gcc generates 44 instructions (12 more). Of course counting instructions is not so great, because you have to take into account what instructions are executed and how they're scheduled, I've seen code with "more instructions" actually running faster.
(but if a compiler generates less instructions then that's usually a good sign).

So even if you think you're helping the compiler, it will still use shifts+masking instead of extra 8 x byte-lookups.

I don't think a single array would help, you still need to compute the initial index using from square (and all compilers will "get" that to optimize the lookup). The changes I posted do exactly that - just cache pointer then everything else is a 1d-lookup.

What matters at the end of the day is performance so all this would have to be measured first.

One less lookup seems tempting, do you have the code with this for queen attacks?

Here is the code for 7 lookups.

Code: Select all

u64 QueenAttacks(u08 sq, u64 occ) {
u64 bb = rnk[sq][(occ >> (sq & 56)) & 255];
return bb &
  qss[sq][(occ >>  8) & 255][0] &
  qss[sq][(occ >> 16) & 255][1] &
  qss[sq][(occ >> 24) & 255][2] &
  qss[sq][(occ >> 32) & 255][3] &
  qss[sq][(occ >> 40) & 255][4] &
  qss[sq][(occ >> 48) & 255][5];
}

mar · Post by **mar** » Sat Feb 15, 2020 3:19 pm

Michael Sherwin wrote: ↑Sat Feb 15, 2020 3:09 pm Here is the code for 7 lookups.

Code: Select all

u64 QueenAttacks(u08 sq, u64 occ) {
u64 bb = rnk[sq][(occ >> (sq & 56)) & 255];
return bb &
  qss[sq][(occ >>  8) & 255][0] &
  qss[sq][(occ >> 16) & 255][1] &
  qss[sq][(occ >> 24) & 255][2] &
  qss[sq][(occ >> 32) & 255][3] &
  qss[sq][(occ >> 40) & 255][4] &
  qss[sq][(occ >> 48) & 255][5];
}

Thanks. Could you share the initialization code please?
In the meantime, I can confirm that the queen attacks using 8 lookup do indeed work.
I tried to reuse the queen attacks for bishop and rooks using extra pseudo-attack mask lookup, but that's probably wasteful.
In cheng, I got 12.6% slowdown total compared to fancy magics (I only tried the microsoft compiler and the "optimization" using cached temp I proposed)
However, only 0.6% slowdown vs magics when used for queens only, but that was just a startpos to depth 20.
The problem is probably that fancy magics have half the cost for bishops/rooks compared to queens (if we neglect memory access)

Michael Sherwin · Post by **Michael Sherwin** » Sat Feb 15, 2020 3:48 pm

mar wrote: ↑Sat Feb 15, 2020 3:19 pm
Michael Sherwin wrote: ↑Sat Feb 15, 2020 3:09 pm Here is the code for 7 lookups.
Code: Select all
u64 QueenAttacks(u08 sq, u64 occ) {
u64 bb = rnk[sq][(occ >> (sq & 56)) & 255];
return bb &
  qss[sq][(occ >>  8) & 255][0] &
  qss[sq][(occ >> 16) & 255][1] &
  qss[sq][(occ >> 24) & 255][2] &
  qss[sq][(occ >> 32) & 255][3] &
  qss[sq][(occ >> 40) & 255][4] &
  qss[sq][(occ >> 48) & 255][5];
}
Thanks. Could you share the initialization code please?
In the meantime, I can confirm that the queen attacks using 8 lookup do indeed work.
I tried to reuse the queen attacks for bishop and rooks using extra pseudo-attack mask lookup, but that's probably wasteful.
In cheng, I got 12.6% slowdown total compared to fancy magics (I only tried the microsoft compiler and the "optimization" using cached temp I proposed)
However, only 0.6% slowdown vs magics when used for queens only, but that was just a startpos to depth 20.
The problem is probably that fancy magics have half the cost for bishops/rooks compared to queens (if we neglect memory access)

I have not done the initialization code yet for the rnk array. That is why I settled on the eight lookups just to get something working. I will try to do that initialization today. Thanks for verifying that the eight lookup version works. That means a lot to me that you were willing to do that! A .6% slowdown for the queen leaves a ray of hope,

. Any further optimization at all could easily erase that deficit. Here is the bishop initialization and function.

Code: Select all

void InitializeB() {
  u08 sq, sqr, i, j, k, l;
  s08 x, dx, y, dy;
  u64 b, bb;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    for (i = 0; i < 128; i++) {
      if(i ^ 1) {
        j = i >> 1;
        for (k = 8, l = 0; k <= 48; k += 8, l++) {
          bb = 0;
          b = (u64)i << k;
          for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= ONE << sqr;
            if ((ONE << sqr) & b) break;
          }
          for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= ONE << sqr;
            if ((ONE << sqr) & b) break;
          }
          for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= ONE << sqr;
            if ((ONE << sqr) & b) break;
          }
          for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= ONE << sqr;
            if ((ONE << sqr) & b) break;
          }
          bss[sq][j][l] = bb;
        }
      }
    }
  }
}

      case WBISHOP:
        h->moves[id] = bss[fs][(aPieces >>  9) & 63][0] 
                     & bss[fs][(aPieces >> 17) & 63][1]
                     & bss[fs][(aPieces >> 25) & 63][2]
                     & bss[fs][(aPieces >> 33) & 63][3]
                     & bss[fs][(aPieces >> 41) & 63][4]
                     & bss[fs][(aPieces >> 49) & 63][5];
        h->attacked |= h->moves[id];
        h->moves[id] &= notMe;
        continue;

mar · Post by **mar** » Sat Feb 15, 2020 4:21 pm

Michael Sherwin wrote: ↑Sat Feb 15, 2020 3:48 pm I have not done the initialization code yet for the rnk array. That is why I settled on the eight lookups just to get something working. I will try to do that initialization today. Thanks for verifying that the eight lookup version works. That means a lot to me that you were willing to do that! A .6% slowdown for the queen leaves a ray of hope, . Any further optimization at all could easily erase that deficit. Here is the bishop initialization and function.

Thanks, bishop SISSY works well, now down to 8.6% slower than fancy magics (note: total time, not the attack generator itself)
So it's already on par with what I tried recently (2 bsrs) and it seems very likely that "plain SISSY" without rnk table will already outperform that.
Impressive, looking forward to try rooks and ultimately the final rnk-SISSY

9.1% slower when using mingw, but that's because it makes the magic version itself faster than msc, still not bad at all

HMM I forgot to leave the extra psudeo-bishop masking after lookup, but surprisingly removing this made microsoft compiler generate slower code

so now it's 9.9% msc and 9.0% gcc

Michael Sherwin · Post by **Michael Sherwin** » Sat Feb 15, 2020 5:53 pm

mar wrote: ↑Sat Feb 15, 2020 4:21 pm
Michael Sherwin wrote: ↑Sat Feb 15, 2020 3:48 pm I have not done the initialization code yet for the rnk array. That is why I settled on the eight lookups just to get something working. I will try to do that initialization today. Thanks for verifying that the eight lookup version works. That means a lot to me that you were willing to do that! A .6% slowdown for the queen leaves a ray of hope, . Any further optimization at all could easily erase that deficit. Here is the bishop initialization and function.
Thanks, bishop SISSY works well, now down to 8.6% slower than fancy magics (note: total time, not the attack generator itself)
So it's already on par with what I tried recently (2 bsrs) and it seems very likely that "plain SISSY" without rnk table will already outperform that.
Impressive, looking forward to try rooks and ultimately the final rnk-SISSY
9.1% slower when using mingw, but that's because it makes the magic version itself faster than msc, still not bad at all

HMM I forgot to leave the extra psudeo-bishop masking after lookup, but surprisingly removing this made microsoft compiler generate slower code
so now it's 9.9% msc and 9.0% gcc

I have the rnk queen version working. I have not tested it as thoroughly so I only hope it is working correctly. More testing will tell. I will begin on the rooks. And thanks for the encouragement! Also have not made the qss table smaller yet. I don't know if that will make a difference or not.

Code: Select all

void InitializeRNK() {
  u08 sq, sqr, i, j;
  s08 x, dx, y, dy;
  u64 bb, b;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    for(i = 0; i < 128; i++) {
      if(i ^ 1) {
        j = i >> 1;
        bb = 0;
        b = (u64)i << (sq & 56);
        for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= ONE << sqr;
          bob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= ONE << sqr;
          bob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= ONE << sqr;
          bob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= ONE << sqr;
          bob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dx = -1; x + dx > -1; dx--) {
          sqr = (y << 3) + x + dx;
          bb |= ONE << sqr;
          rob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dx = +1; x + dx < +8; dx++) {
          sqr = (y << 3) + x + dx;
          bb |= ONE << sqr;
          rob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dy = +1; y + dy < +8; dy++) {
          sqr = ((y + dy) << 3) + x;
          bb |= ONE << sqr;
          rob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dy = -1; y + dy > -1; dy--) {
          sqr = ((y + dy) << 3) + x;
          bb |= ONE << sqr;
          rob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        rnk[sq][j] = bb;
      }
    }
  }
}

       case WQUEEN:
        h->moves[id] = rnk[fs][(aPieces >> ((fs & 56) + 1)) & 63] 
                     & qss[fs][(aPieces >>  8) & 255][1]
                     & qss[fs][(aPieces >> 16) & 255][2]
                     & qss[fs][(aPieces >> 24) & 255][3]
                     & qss[fs][(aPieces >> 32) & 255][4]
                     & qss[fs][(aPieces >> 40) & 255][5]
                     & qss[fs][(aPieces >> 48) & 255][6];
        h->attacked |= h->moves[id];
        h->moves[id] &= notMe;
        continue;

Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards