Split Index Super Set Yielding (SISSY) Bitboards

Michael Sherwin · Post by **Michael Sherwin** » Mon Feb 17, 2020 8:07 pm

mar wrote: ↑Mon Feb 17, 2020 7:15 pm
Michael Sherwin wrote: ↑Mon Feb 17, 2020 5:40 pm @Martin - If you haven't given up yet on SISSY then I have the following code I'd like you to try. If you have given up then what is your conclusion? I'm kinda hanging out in the wind here.
I haven't given up Michael, I'll look at it in the coming days, no worries

I'm glad that you have not given up!

About the move eax, eax I do not see right off how it would be useful but I will give it some thought. Thanks!

mar · Post by **mar** » Mon Feb 17, 2020 8:25 pm

Michael Sherwin wrote: ↑Mon Feb 17, 2020 8:07 pm About the move eax, eax I do not see right off how it would be useful but I will give it some thought. Thanks!

Of course, that was a bad example

What I meant was that add eax,edx will clear upper 32 bits of rax, the same applies to mov eax,edx and so on

Michael Sherwin · Post by **Michael Sherwin** » Mon Feb 17, 2020 11:33 pm

What's a Creel? on his YouTube channel featured a benchmark that he wrote to test various assembly language instructions. Bottom line for this thread is avoid shift instructions on intel as much as possible. His base machine was an 11 year old Phenom II which he normalized to 1.0. An i9-9900k running at 5ghz on all 16 threads only did 0.7731 Phenom II 810 quad core with 4 threads.

The benchmark: https://github.com/TheRainDoodle/Phenom-II-Benchmark

mar · Post by **mar** » Thu Feb 20, 2020 7:59 pm

So what's the current status?
I'm not going to try any assembly, I think your last improvement was to reuse part of qss for the rnk table?
Also I think you were talking about reducing the number of lookups for bishops as well?

Michael Sherwin · Post by **Michael Sherwin** » Fri Feb 21, 2020 1:26 am

mar wrote: ↑Thu Feb 20, 2020 7:59 pm So what's the current status?
I'm not going to try any assembly, I think your last improvement was to reuse part of qss for the rnk table?
Also I think you were talking about reducing the number of lookups for bishops as well?

I have been a bit under the weather for the last few days so I have not done much. Was -22 degrees fahrenheit (-30c)here this morning! My thoughts about folding the bishop is to shift down the upper 32 bits by 24 after masking out the non relative bits. That will make 3 ranks of relevant bits instead of 6. Then the initialization will have to continue with merging the upper suppersets with the lower supersets. All that would save a little time but how much idk.

You mentioned cache aligning so I am very interested in those figures combined with the rnk included improvement. That would tell me a lot. Also the Clang timings. I don't think the timings you gave previously included the better Clang code.

Michael Sherwin · Post by **Michael Sherwin** » Fri Feb 21, 2020 6:33 pm

NVM. I have decided to implement SISSY bitboards within a SearchOnePlyQ() function written entirely in x64 assembler. That method of move generation would have several advantages. It allows the efficiency of pseudo legal move generation that nonetheless produces a legal list of moves each of which has a score from the Eval() function. Assembler would allow the use of true jump table code which C/C++ just can't generate. The most time consuming part of the search at the leaves would totally be in one very efficient function. It can be programmed as a DLL (.dll) or an SLL (.lib) file that others could use in their projects. This is what I want to do. This is what makes sense to me at this time.

Mike Sherwin · Post by **Mike Sherwin** » Sat Feb 13, 2021 12:57 pm

I'd like to revisit SISSY bitboards. I think that they can be highly optimised just using C without resorting to assembly. Just considering the bishop for now. Code from the Chess Programming Wiki.

Code: Select all

U64 bss[6][64][64]; // 192 K

U64 bishopAttacks(U64 occ, enumSquare sq) {
  return
    bss[0][sq][(occ >>  9) & 63] &  // 2nd rank
    bss[1][sq][(occ >> 17) & 63] &  
    bss[2][sq][(occ >> 25) & 63] &  
    bss[3][sq][(occ >> 33) & 63] &  
    bss[4][sq][(occ >> 41) & 63] &  
    bss[5][sq][(occ >> 49) & 63] ;  // 7th rank
}

I agree that is a little expensive. HOWEVER, that can be reduced to only two lookups! First shift the upper 32 bits of the occupancy up by 8 bits. Then and it with the lower 32 bits and then shift the 4th rank down by either 8 or 16 bits depending on the from square. That will produce a split index between the 2nd and third ranks. That is a whole lot of saved shifting and anding. The problem I'm having is the initializing everything into two ranks. I have correct working code that initializes for 6 ranks. Now I just need to do a second level of combining those values into 2 ranks.

Code: Select all

void InitializeB() {
  u08 sq, sqr, i, j, k, l;
  s08 x, dx, y, dy;
  u64 b, bb;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    for (i = 0; i < 128; i++) {
      if (i ^ 1) {
        j = i >> 1;
        for (k = 8, l = 0; k <= 48; k += 8, l++) {
          bb = 0;
          b = (u64)i << k;
          for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= ONE << sqr;
            if ((ONE << sqr) & b) break;
          }
          for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= ONE << sqr;
            if ((ONE << sqr) & b) break;
          }
          for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= ONE << sqr;
            if ((ONE << sqr) & b) break;
          }
          for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= ONE << sqr;
            if ((ONE << sqr) & b) break;
          }
          bss[sq][j][l] = bb;
        }
      }
    }
  }
}

And as can be seen the initialisation to 6 ranks is not that long nor very difficult. But further reducing them down to two ranks is giving me too much trouble. Mar showed that SISSY was not that far away from magic to begin with. Isn't it worth further investigation?

The queen is also extremely promising. Here is my code from Bricabrac. Perft numbers are accurate so the accuracy has been verified in Bricabrac and by Mar.

Code: Select all

   case WQ:
    case BQ:
      bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
        & qss[fs][(aPieces >> 8) & 255][1]
        & qss[fs][(aPieces >> 16) & 255][2]
        & qss[fs][(aPieces >> 24) & 255][3]
        & qss[fs][(aPieces >> 32) & 255][4]
        & qss[fs][(aPieces >> 40) & 255][5]
        & qss[fs][(aPieces >> 48) & 255][6]
        & notme;

As can be seen the code has already been optimised a little by embedding a rank lookup into the existing queen table thus reducing the lookups to 7 instead of the wiki's 8. THEN if the rank is removed and the file is shifted to the a file like in the wiki all that is left is the diagonals that can be manipulated into 2 ranks. That reduces the 7 kind of expensive lookups down to only 3 plus the rank lookup similar to Gerd's Kindergarten bitboards. The Queen was already competitive with magic and maybe even better. The rook also seems promising if handled like the wiki suggest.
https://www.chessprogramming.org/SISSY_Bitboards
Can someone explain to me why there has been no interest in seeing what SISSY bitboards could become? I have to say, I just don't understand.

mar · Post by **mar** » Sat Feb 13, 2021 3:42 pm

Hi Michael, that's very interesting indeed. As usual you try to come up with new and fresh ideas.

I guess SISSY didn't spark enough interest because they didn't beat magic (yet).
I think that beating queen attacks isn't so interesting either, because mixing magics for bishops/rooks and SISSY for queen would require even more memory for lookups to already a lot (relatively) for magics. I belive Volker (or someone else?) made an effort to compress the LUTs for magics IIRC so there can be some performance gain as well.

So - if you can make SISSY on par (or faster) than magics, I'm pretty sure everybody will switch to SISSY. If anyone can do it, then it's you.

What I like about SISSY is straightforward initialization with smaller LUT. Unfortunately what matters at the end of the day is performance - the only thing SISSY needs is to catch up to fancy magics. Fingers crossed.

Actually if Dany in the other thread has problems with bitboards (performance) - I guess SISSY might be a good starting point for him?

Mike Sherwin · Post by **Mike Sherwin** » Sat Feb 13, 2021 6:55 pm

mar wrote: ↑Sat Feb 13, 2021 3:42 pm Hi Michael, that's very interesting indeed. As usual you try to come up with new and fresh ideas.

I guess SISSY didn't spark enough interest because they didn't beat magic (yet).
I think that beating queen attacks isn't so interesting either, because mixing magics for bishops/rooks and SISSY for queen would require even more memory for lookups to already a lot (relatively) for magics. I belive Volker (or someone else?) made an effort to compress the LUTs for magics IIRC so there can be some performance gain as well.

So - if you can make SISSY on par (or faster) than magics, I'm pretty sure everybody will switch to SISSY. If anyone can do it, then it's you.

What I like about SISSY is straightforward initialization with smaller LUT. Unfortunately what matters at the end of the day is performance - the only thing SISSY needs is to catch up to fancy magics. Fingers crossed.

Actually if Dany in the other thread has problems with bitboards (performance) - I guess SISSY might be a good starting point for him?

I don't think I have it in me to be able to do the initialization for 2 rank storage. It is not as straightforward and my memory problem is getting worse. I just end up in confusion when I try. But, thanks for the vote of confidence. It makes me want to try harder. I was looking at my notes for the rook and once the rank is removed it becomes trivial to fold the file down to one lookup with only 4 shifts to only 6 bits per square. That reduces the rook to only 2 lookups and 1 &&. All I need is the initialization done. I can still do the rest, hopefully.

I would like it if Danny teamed up with me on this. The last couple of times that I responded to him I did not get a reply. He might be upset with me for some reason, idk. However, whomever does the initialization first whether it is me or someone else does it best and will earn full credit for their work.

Mike Sherwin · Post by **Mike Sherwin** » Sat Feb 13, 2021 7:26 pm

Okay, I just took a short meditative knap. What came to me to go from 6 ranks for the bishop down to three is for every index in rank 2 that has a superset stored find all indexes in rank 5 that also have a superset stored whose bits do not collide with rank 2 bits and combine them with & and store that superset at the new index. Then repeat for ranks (3, 6) and (4, 7). I think that sounds correct? I'll give it a try.

Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards