Split Index Super Set Yielding (SISSY) Bitboards

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by Michael Sherwin »

I forgot to mention that by folding the bishop occupancy down to 24 bits which is possible because the relevant bits would not collide that the lookups could be reduced to 3 instead of 6. But that is a complication I don't want to tackle yet. And once the rank is removed from the rook the rook can be folded down to only 4 total lookups as well.
Last edited by Michael Sherwin on Sat Feb 15, 2020 6:54 pm, edited 1 time in total.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by mar »

I can confirm the rnk version works.
However, it seems a bit slower than before, maybe because it does a lookup from another table?
Maybe once the queen tables are smaller there will be some gain there, I don't know.
Martin Sedlak
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by Michael Sherwin »

mar wrote: Sat Feb 15, 2020 6:49 pm I can confirm the rnk version works.
However, it seems a bit slower than before, maybe because it does a lookup from another table?
Maybe once the queen tables are smaller there will be some gain there, I don't know.
Thank you. I was afraid of that. We crossed post. See my last post above. There is plenty more opportunity for gains! :D
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by mar »

Michael Sherwin wrote: Sat Feb 15, 2020 6:56 pm Thank you. I was afraid of that. We crossed post. See my last post above. There is plenty more opportunity for gains! :D
It seems there's something wrong with how I measure, I seem to get a lot of noise even when I run the same executable (this is promising)
So I can't actually claim which one is faster (reliably), forget my last post. What's important is that your rnk version works well.

Also I forgot to cache-align the tables before so there may be some additional gain there.
Martin Sedlak
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by Michael Sherwin »

mar wrote: Sat Feb 15, 2020 7:00 pm
Michael Sherwin wrote: Sat Feb 15, 2020 6:56 pm Thank you. I was afraid of that. We crossed post. See my last post above. There is plenty more opportunity for gains! :D
It seems there's something wrong with how I measure, I seem to get a lot of noise even when I run the same executable (this is promising)
So I can't actually claim which one is faster (reliably), forget my last post. What's important is that your rnk version works well.

Also I forgot to cache-align the tables before so there may be some additional gain there.
Wow, thank you for letting me know!
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by Gerd Isenberg »

Michael Sherwin wrote: Sat Feb 15, 2020 2:11 pm
Thanks Martin! Interesting and really cool stuff. :D I just have three thoughts. One, can we get it down to just one dimension? Two, there is a caveat about the fact that since the full 8 bit index is used that the & 255 can be done away with. I tried my original union idea and it is faster.

Code: Select all

       case WQUEEN:
        h->moves[id] = qss[fs][occ.b08.rank1][0] 
                     & qss[fs][occ.b08.rank2][1]
                     & qss[fs][occ.b08.rank3][2]
                     & qss[fs][occ.b08.rank4][3]
                     & qss[fs][occ.b08.rank5][4]
                     & qss[fs][occ.b08.rank6][5]
                     & qss[fs][occ.b08.rank7][6]
                     & qss[fs][occ.b08.rank8][7];
        h->attacked |= h->moves[id];
        h->moves[id] &= notMe;
        continue;
That would be at least 8 less instructions, right? And three, the optimised (hopefully) idea requires only 7 lookups instead of the eight I initially settled on. It is strange but extracting the rank first allows only 6 more lookups after that even though one of the 6 may be redundant. Very strange!
If I compare the generated assembly of both versions - that seems to be a domain of assembly programmers.
That stuff seems much to difficult and long for the compilers - so everything is possible :-)
https://godbolt.org/
x86-64 clang (experimental -Clifetime)

Code: Select all

typedef unsigned long long  u64;                  
typedef unsigned char  u08;

union bbu {u64 b64; u08 b08[8];};

u64 qss[64][256][8]; // 1M

u64 queenAttacks(int sq, u64 occ) {                  u64 queenAttacks(int sq, u64 occ) {
                                                       bbu o; o.b64 = occ;
  return                                               return 
    qss[sq][ occ        & 255][0] &                      qss[sq][o.b08[0]][0] &
    qss[sq][(occ >>  8) & 255][1] &                      qss[sq][o.b08[1]][1] &
    qss[sq][(occ >> 16) & 255][2] &                      qss[sq][o.b08[2]][2] &
    qss[sq][(occ >> 24) & 255][3] &                      qss[sq][o.b08[3]][3] &
    qss[sq][(occ >> 32) & 255][4] &                      qss[sq][o.b08[4]][4] &
    qss[sq][(occ >> 40) & 255][5] &                      qss[sq][o.b08[5]][5] &
    qss[sq][(occ >> 48) & 255][6] &                      qss[sq][o.b08[6]][6] &
    qss[sq][(occ >> 56) & 255][7] ;                      qss[sq][o.b08[7]][7];
};                                                   };

queenAttacks(int, unsigned long long):               queenAttacks(int, unsigned long long):             
        push    rbp                                          push    rbp
        mov     rbp, rsp                                     mov     rbp, rsp
        mov     dword ptr [rbp - 4], edi                     mov     dword ptr [rbp - 4], edi
        mov     qword ptr [rbp - 16], rsi                    mov     qword ptr [rbp - 16], rsi
        movsxd  rax, dword ptr [rbp - 4]                     mov     rax, qword ptr [rbp - 16]
        shl     rax, 14                                      mov     qword ptr [rbp - 24], rax
        movabs  rcx, offset qss                              movsxd  rax, dword ptr [rbp - 4]
        mov     rdx, rcx                                     shl     rax, 14
        add     rdx, rax                                     movabs  rcx, offset qss
        mov     rax, qword ptr [rbp - 16]                    mov     rdx, rcx
        and     rax, 255                                     add     rdx, rax
        shl     rax, 6                                       movzx   edi, byte ptr [rbp - 24]
        add     rdx, rax                                     mov     eax, edi
        mov     rax, qword ptr [rdx]                         shl     rax, 6
        movsxd  rdx, dword ptr [rbp - 4]                     add     rdx, rax
        shl     rdx, 14                                      mov     rax, qword ptr [rdx]
        mov     rsi, rcx                                     movsxd  rdx, dword ptr [rbp - 4]
        add     rsi, rdx                                     shl     rdx, 14
        mov     rdx, qword ptr [rbp - 16]                    mov     rsi, rcx
        shr     rdx, 8                                       add     rsi, rdx
        and     rdx, 255                                     movzx   edi, byte ptr [rbp - 23]
        shl     rdx, 6                                       mov     edx, edi
        add     rsi, rdx                                     shl     rdx, 6
        and     rax, qword ptr [rsi + 8]                     add     rsi, rdx
        movsxd  rdx, dword ptr [rbp - 4]                     and     rax, qword ptr [rsi + 8]
        shl     rdx, 14                                      movsxd  rdx, dword ptr [rbp - 4]
        mov     rsi, rcx                                     shl     rdx, 14
        add     rsi, rdx                                     mov     rsi, rcx
        mov     rdx, qword ptr [rbp - 16]                    add     rsi, rdx
        shr     rdx, 16                                      movzx   edi, byte ptr [rbp - 22]
        and     rdx, 255                                     mov     edx, edi
        shl     rdx, 6                                       shl     rdx, 6
        add     rsi, rdx                                     add     rsi, rdx
        and     rax, qword ptr [rsi + 16]                    and     rax, qword ptr [rsi + 16]
        movsxd  rdx, dword ptr [rbp - 4]                     movsxd  rdx, dword ptr [rbp - 4]
        shl     rdx, 14                                      shl     rdx, 14
        mov     rsi, rcx                                     mov     rsi, rcx
        add     rsi, rdx                                     add     rsi, rdx
        mov     rdx, qword ptr [rbp - 16]                    movzx   edi, byte ptr [rbp - 21]
        shr     rdx, 24                                      mov     edx, edi
        and     rdx, 255                                     shl     rdx, 6
        shl     rdx, 6                                       add     rsi, rdx
        add     rsi, rdx                                     and     rax, qword ptr [rsi + 24]
        and     rax, qword ptr [rsi + 24]                    movsxd  rdx, dword ptr [rbp - 4]
        movsxd  rdx, dword ptr [rbp - 4]                     shl     rdx, 14
        shl     rdx, 14                                      mov     rsi, rcx
        mov     rsi, rcx                                     add     rsi, rdx
        add     rsi, rdx                                     movzx   edi, byte ptr [rbp - 20]
        mov     rdx, qword ptr [rbp - 16]                    mov     edx, edi
        shr     rdx, 32                                      shl     rdx, 6
        and     rdx, 255                                     add     rsi, rdx
        shl     rdx, 6                                       and     rax, qword ptr [rsi + 32]
        add     rsi, rdx                                     movsxd  rdx, dword ptr [rbp - 4]
        and     rax, qword ptr [rsi + 32]                    shl     rdx, 14
        movsxd  rdx, dword ptr [rbp - 4]                     mov     rsi, rcx
        shl     rdx, 14                                      add     rsi, rdx
        mov     rsi, rcx                                     movzx   edi, byte ptr [rbp - 19]
        add     rsi, rdx                                     mov     edx, edi
        mov     rdx, qword ptr [rbp - 16]                    shl     rdx, 6
        shr     rdx, 40                                      add     rsi, rdx
        and     rdx, 255                                     and     rax, qword ptr [rsi + 40]
        shl     rdx, 6                                       movsxd  rdx, dword ptr [rbp - 4]
        add     rsi, rdx                                     shl     rdx, 14
        and     rax, qword ptr [rsi + 40]                    mov     rsi, rcx
        movsxd  rdx, dword ptr [rbp - 4]                     add     rsi, rdx
        shl     rdx, 14                                      movzx   edi, byte ptr [rbp - 18]
        mov     rsi, rcx                                     mov     edx, edi
        add     rsi, rdx                                     shl     rdx, 6
        mov     rdx, qword ptr [rbp - 16]                    add     rsi, rdx
        shr     rdx, 48                                      and     rax, qword ptr [rsi + 48]
        and     rdx, 255                                     movsxd  rdx, dword ptr [rbp - 4]
        shl     rdx, 6                                       shl     rdx, 14
        add     rsi, rdx                                     add     rcx, rdx
        and     rax, qword ptr [rsi + 48]                    movzx   edi, byte ptr [rbp - 17]
        movsxd  rdx, dword ptr [rbp - 4]                     mov     edx, edi
        shl     rdx, 14                                      shl     rdx, 6
        add     rcx, rdx                                     add     rcx, rdx
        mov     rdx, qword ptr [rbp - 16]                    and     rax, qword ptr [rcx + 56]
        shr     rdx, 56                                      pop     rbp
        and     rdx, 255                                     ret
        shl     rdx, 6            
        add     rcx, rdx            
        and     rax, qword ptr [rcx + 56]            
        pop     rbp            
        ret            
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by mar »

Gerd Isenberg wrote: Sat Feb 15, 2020 8:13 pm If I compare the generated assembly of both versions - that seems to be a domain of assembly programmers.
That stuff seems much to difficult and long for the compilers - so everything is possible :-)
https://godbolt.org/
x86-64 clang (experimental -Clifetime)

Code: Select all

...snip...
That assembly doesn't look like -O3, I got vastly smaller code.
Martin Sedlak
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by Michael Sherwin »

Hey Martin and Gerd of course there is the possibility that handwritten assembler would be a benefit. There is much work left to be done. A small improvement is to embed the rnk table into qss[64][256][0] like so.

Code: Select all

void InitializeRNK() {
  u08 sq, sqr, i;
  s08 x, dx, y, dy;
  u64 bb, b;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    for(i = 0; i < 128; i++) {
      bb = 0;
      b = (u64)i << (sq & 56);
      for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
        sqr = (((y + dy) << 3) + x + dx);
        bb |= ONE << sqr;
        bob[sq] |= ONE << sqr;
        if ((ONE << sqr) & b) break;
      }
      for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
        sqr = (((y + dy) << 3) + x + dx);
        bb |= ONE << sqr;
        bob[sq] |= ONE << sqr;
        if ((ONE << sqr) & b) break;
      }
      for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
        sqr = (((y + dy) << 3) + x + dx);
        bb |= ONE << sqr;
        bob[sq] |= ONE << sqr;
        if ((ONE << sqr) & b) break;
      }
      for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
        sqr = (((y + dy) << 3) + x + dx);
        bb |= ONE << sqr;
        bob[sq] |= ONE << sqr;
        if ((ONE << sqr) & b) break;
      }
      for (dx = -1; x + dx > -1; dx--) {
        sqr = (y << 3) + x + dx;
        bb |= ONE << sqr;
        rob[sq] |= ONE << sqr;
        if ((ONE << sqr) & b) break;
      }
      for (dx = +1; x + dx < +8; dx++) {
        sqr = (y << 3) + x + dx;
        bb |= ONE << sqr;
        rob[sq] |= ONE << sqr;
        if ((ONE << sqr) & b) break;
      }
      for (dy = +1; y + dy < +8; dy++) {
        sqr = ((y + dy) << 3) + x;
        bb |= ONE << sqr;
        rob[sq] |= ONE << sqr;
        if ((ONE << sqr) & b) break;
      }
      for (dy = -1; y + dy > -1; dy--) {
        sqr = ((y + dy) << 3) + x;
        bb |= ONE << sqr;
        rob[sq] |= ONE << sqr;
        if ((ONE << sqr) & b) break;
      }
      qss[sq][i][0] = bb;
    }
  }
}

      case WROOK:
        h->moves[id] = qss[fs][(aPieces >> (fs & 56)) & 127][0] 
                     & qss[fs][(aPieces >>  8) & 255][1]
                     & qss[fs][(aPieces >> 16) & 255][2]
                     & qss[fs][(aPieces >> 24) & 255][3]
                     & qss[fs][(aPieces >> 32) & 255][4]
                     & qss[fs][(aPieces >> 40) & 255][5]
                     & qss[fs][(aPieces >> 48) & 255][6]
                     & rob[fs];
        h->attacked |= h->moves[id];
        h->moves[id] &= notMe;
        continue;
       case WQUEEN:
        h->moves[id] = qss[fs][(aPieces >> (fs & 56)) & 127][0] 
                     & qss[fs][(aPieces >>  8) & 255][1]
                     & qss[fs][(aPieces >> 16) & 255][2]
                     & qss[fs][(aPieces >> 24) & 255][3]
                     & qss[fs][(aPieces >> 32) & 255][4]
                     & qss[fs][(aPieces >> 40) & 255][5]
                     & qss[fs][(aPieces >> 48) & 255][6];
        h->attacked |= h->moves[id];
        h->moves[id] &= notMe;
        continue;
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by Gerd Isenberg »

mar wrote: Sat Feb 15, 2020 8:22 pm
Gerd Isenberg wrote: Sat Feb 15, 2020 8:13 pm If I compare the generated assembly of both versions - that seems to be a domain of assembly programmers.
That stuff seems much to difficult and long for the compilers - so everything is possible :-)
https://godbolt.org/
x86-64 clang (experimental -Clifetime)

Code: Select all

...snip...
That assembly doesn't look like -O3, I got vastly smaller code.
indeed, I was not that familar with the online compiler, this is the union version, which looks quite nice ...

Code: Select all

queenAttacks(int, unsigned long long):                     # @queenAttacks(int, unsigned long long)
.Lfunc_begin0:
        push    rbx
.Ltmp0:
        mov     rdx, rsi
        mov     rcx, rsi
        mov     r11, rsi
        mov     r10, rsi
        mov     r9, rsi
        mov     r8, rsi
        movzx   ebx, sil
        mov     rax, rsi
.Ltmp1:
        shr     rax, 2
.Ltmp2:
        shr     rdx, 10
        shr     rcx, 18
        shr     r11, 26
        shr     r10, 34
        movsxd  rsi, edi
        shl     rbx, 6
        shl     rsi, 14
        and     eax, 16320
        mov     rax, qword ptr [rsi + rax + qss+8]
        and     rax, qword ptr [rsi + rbx + qss]
        shr     r9, 42
        and     edx, 16320
        and     rax, qword ptr [rsi + rdx + qss+16]
        shr     r8, 50
        and     ecx, 16320
        and     rax, qword ptr [rsi + rcx + qss+24]
        and     r8d, -64
        and     r11d, 16320
        and     rax, qword ptr [rsi + r11 + qss+32]
        and     r10d, 16320
        and     rax, qword ptr [rsi + r10 + qss+40]
        and     r9d, 16320
        and     rax, qword ptr [rsi + r9 + qss+48]
        and     rax, qword ptr [rsi + r8 + qss+56]
        pop     rbx
        ret
The shift right and 255 version takes some register less:

Code: Select all

_queenAttacks(int, unsigned long long):                    # @_queenAttacks(int, unsigned long long)
.Lfunc_begin1:
        movsxd  rcx, edi
        movzx   edx, sil
        shl     rdx, 6
        shl     rcx, 14
        mov     rax, rsi
        shr     rax, 2
        and     eax, 16320
        mov     rax, qword ptr [rcx + rax + qss+8]
        and     rax, qword ptr [rcx + rdx + qss]
        mov     rdx, rsi
        shr     rdx, 10
        and     edx, 16320
        and     rax, qword ptr [rcx + rdx + qss+16]
        mov     rdx, rsi
        shr     rdx, 18
        and     edx, 16320
        and     rax, qword ptr [rcx + rdx + qss+24]
        mov     rdx, rsi
        shr     rdx, 26
        and     edx, 16320
        and     rax, qword ptr [rcx + rdx + qss+32]
        mov     rdx, rsi
        shr     rdx, 34
        and     edx, 16320
        and     rax, qword ptr [rcx + rdx + qss+40]
        mov     rdx, rsi
        shr     rdx, 42
        and     edx, 16320
        and     rax, qword ptr [rcx + rdx + qss+48]
        shr     rsi, 50
.Ltmp4:
        and     esi, -64
        and     rax, qword ptr [rcx + rsi + qss+56]
        ret
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by Gerd Isenberg »

Michael Sherwin wrote: Sat Feb 15, 2020 8:40 pm Hey Martin and Gerd of course there is the possibility that handwritten assembler would be a benefit.
It seems that x86-64 Clang -O3 is hard to beat. The union version prepares many registers with some shift at the beginning to and with 255 aka 16320 = 255 << 6. The explicit shift/and version uses less registers, but has therefore more dependencies.
https://godbolt.org/
Michael Sherwin wrote: There is much work left to be done. A small improvement is to embed the rnk table into qss[64][256][0] like so.
Sure. Your approach is really promising - wiki page is in preparation ...