Split Index Super Set Yielding (SISSY) Bitboards
Moderators: hgm, Rebel, chrisw
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: Split Index Super Set Yielding (SISSY) Bitboards
I forgot to mention that by folding the bishop occupancy down to 24 bits which is possible because the relevant bits would not collide that the lookups could be reduced to 3 instead of 6. But that is a complication I don't want to tackle yet. And once the rank is removed from the rook the rook can be folded down to only 4 total lookups as well.
Last edited by Michael Sherwin on Sat Feb 15, 2020 6:54 pm, edited 1 time in total.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Split Index Super Set Yielding (SISSY) Bitboards
I can confirm the rnk version works.
However, it seems a bit slower than before, maybe because it does a lookup from another table?
Maybe once the queen tables are smaller there will be some gain there, I don't know.
However, it seems a bit slower than before, maybe because it does a lookup from another table?
Maybe once the queen tables are smaller there will be some gain there, I don't know.
Martin Sedlak
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: Split Index Super Set Yielding (SISSY) Bitboards
Thank you. I was afraid of that. We crossed post. See my last post above. There is plenty more opportunity for gains!
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Split Index Super Set Yielding (SISSY) Bitboards
It seems there's something wrong with how I measure, I seem to get a lot of noise even when I run the same executable (this is promising)Michael Sherwin wrote: ↑Sat Feb 15, 2020 6:56 pm Thank you. I was afraid of that. We crossed post. See my last post above. There is plenty more opportunity for gains!
So I can't actually claim which one is faster (reliably), forget my last post. What's important is that your rnk version works well.
Also I forgot to cache-align the tables before so there may be some additional gain there.
Martin Sedlak
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: Split Index Super Set Yielding (SISSY) Bitboards
Wow, thank you for letting me know!mar wrote: ↑Sat Feb 15, 2020 7:00 pmIt seems there's something wrong with how I measure, I seem to get a lot of noise even when I run the same executable (this is promising)Michael Sherwin wrote: ↑Sat Feb 15, 2020 6:56 pm Thank you. I was afraid of that. We crossed post. See my last post above. There is plenty more opportunity for gains!
So I can't actually claim which one is faster (reliably), forget my last post. What's important is that your rnk version works well.
Also I forgot to cache-align the tables before so there may be some additional gain there.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: Split Index Super Set Yielding (SISSY) Bitboards
If I compare the generated assembly of both versions - that seems to be a domain of assembly programmers.Michael Sherwin wrote: ↑Sat Feb 15, 2020 2:11 pm
Thanks Martin! Interesting and really cool stuff. I just have three thoughts. One, can we get it down to just one dimension? Two, there is a caveat about the fact that since the full 8 bit index is used that the & 255 can be done away with. I tried my original union idea and it is faster.That would be at least 8 less instructions, right? And three, the optimised (hopefully) idea requires only 7 lookups instead of the eight I initially settled on. It is strange but extracting the rank first allows only 6 more lookups after that even though one of the 6 may be redundant. Very strange!Code: Select all
case WQUEEN: h->moves[id] = qss[fs][occ.b08.rank1][0] & qss[fs][occ.b08.rank2][1] & qss[fs][occ.b08.rank3][2] & qss[fs][occ.b08.rank4][3] & qss[fs][occ.b08.rank5][4] & qss[fs][occ.b08.rank6][5] & qss[fs][occ.b08.rank7][6] & qss[fs][occ.b08.rank8][7]; h->attacked |= h->moves[id]; h->moves[id] &= notMe; continue;
That stuff seems much to difficult and long for the compilers - so everything is possible
https://godbolt.org/
x86-64 clang (experimental -Clifetime)
Code: Select all
typedef unsigned long long u64;
typedef unsigned char u08;
union bbu {u64 b64; u08 b08[8];};
u64 qss[64][256][8]; // 1M
u64 queenAttacks(int sq, u64 occ) { u64 queenAttacks(int sq, u64 occ) {
bbu o; o.b64 = occ;
return return
qss[sq][ occ & 255][0] & qss[sq][o.b08[0]][0] &
qss[sq][(occ >> 8) & 255][1] & qss[sq][o.b08[1]][1] &
qss[sq][(occ >> 16) & 255][2] & qss[sq][o.b08[2]][2] &
qss[sq][(occ >> 24) & 255][3] & qss[sq][o.b08[3]][3] &
qss[sq][(occ >> 32) & 255][4] & qss[sq][o.b08[4]][4] &
qss[sq][(occ >> 40) & 255][5] & qss[sq][o.b08[5]][5] &
qss[sq][(occ >> 48) & 255][6] & qss[sq][o.b08[6]][6] &
qss[sq][(occ >> 56) & 255][7] ; qss[sq][o.b08[7]][7];
}; };
queenAttacks(int, unsigned long long): queenAttacks(int, unsigned long long):
push rbp push rbp
mov rbp, rsp mov rbp, rsp
mov dword ptr [rbp - 4], edi mov dword ptr [rbp - 4], edi
mov qword ptr [rbp - 16], rsi mov qword ptr [rbp - 16], rsi
movsxd rax, dword ptr [rbp - 4] mov rax, qword ptr [rbp - 16]
shl rax, 14 mov qword ptr [rbp - 24], rax
movabs rcx, offset qss movsxd rax, dword ptr [rbp - 4]
mov rdx, rcx shl rax, 14
add rdx, rax movabs rcx, offset qss
mov rax, qword ptr [rbp - 16] mov rdx, rcx
and rax, 255 add rdx, rax
shl rax, 6 movzx edi, byte ptr [rbp - 24]
add rdx, rax mov eax, edi
mov rax, qword ptr [rdx] shl rax, 6
movsxd rdx, dword ptr [rbp - 4] add rdx, rax
shl rdx, 14 mov rax, qword ptr [rdx]
mov rsi, rcx movsxd rdx, dword ptr [rbp - 4]
add rsi, rdx shl rdx, 14
mov rdx, qword ptr [rbp - 16] mov rsi, rcx
shr rdx, 8 add rsi, rdx
and rdx, 255 movzx edi, byte ptr [rbp - 23]
shl rdx, 6 mov edx, edi
add rsi, rdx shl rdx, 6
and rax, qword ptr [rsi + 8] add rsi, rdx
movsxd rdx, dword ptr [rbp - 4] and rax, qword ptr [rsi + 8]
shl rdx, 14 movsxd rdx, dword ptr [rbp - 4]
mov rsi, rcx shl rdx, 14
add rsi, rdx mov rsi, rcx
mov rdx, qword ptr [rbp - 16] add rsi, rdx
shr rdx, 16 movzx edi, byte ptr [rbp - 22]
and rdx, 255 mov edx, edi
shl rdx, 6 shl rdx, 6
add rsi, rdx add rsi, rdx
and rax, qword ptr [rsi + 16] and rax, qword ptr [rsi + 16]
movsxd rdx, dword ptr [rbp - 4] movsxd rdx, dword ptr [rbp - 4]
shl rdx, 14 shl rdx, 14
mov rsi, rcx mov rsi, rcx
add rsi, rdx add rsi, rdx
mov rdx, qword ptr [rbp - 16] movzx edi, byte ptr [rbp - 21]
shr rdx, 24 mov edx, edi
and rdx, 255 shl rdx, 6
shl rdx, 6 add rsi, rdx
add rsi, rdx and rax, qword ptr [rsi + 24]
and rax, qword ptr [rsi + 24] movsxd rdx, dword ptr [rbp - 4]
movsxd rdx, dword ptr [rbp - 4] shl rdx, 14
shl rdx, 14 mov rsi, rcx
mov rsi, rcx add rsi, rdx
add rsi, rdx movzx edi, byte ptr [rbp - 20]
mov rdx, qword ptr [rbp - 16] mov edx, edi
shr rdx, 32 shl rdx, 6
and rdx, 255 add rsi, rdx
shl rdx, 6 and rax, qword ptr [rsi + 32]
add rsi, rdx movsxd rdx, dword ptr [rbp - 4]
and rax, qword ptr [rsi + 32] shl rdx, 14
movsxd rdx, dword ptr [rbp - 4] mov rsi, rcx
shl rdx, 14 add rsi, rdx
mov rsi, rcx movzx edi, byte ptr [rbp - 19]
add rsi, rdx mov edx, edi
mov rdx, qword ptr [rbp - 16] shl rdx, 6
shr rdx, 40 add rsi, rdx
and rdx, 255 and rax, qword ptr [rsi + 40]
shl rdx, 6 movsxd rdx, dword ptr [rbp - 4]
add rsi, rdx shl rdx, 14
and rax, qword ptr [rsi + 40] mov rsi, rcx
movsxd rdx, dword ptr [rbp - 4] add rsi, rdx
shl rdx, 14 movzx edi, byte ptr [rbp - 18]
mov rsi, rcx mov edx, edi
add rsi, rdx shl rdx, 6
mov rdx, qword ptr [rbp - 16] add rsi, rdx
shr rdx, 48 and rax, qword ptr [rsi + 48]
and rdx, 255 movsxd rdx, dword ptr [rbp - 4]
shl rdx, 6 shl rdx, 14
add rsi, rdx add rcx, rdx
and rax, qword ptr [rsi + 48] movzx edi, byte ptr [rbp - 17]
movsxd rdx, dword ptr [rbp - 4] mov edx, edi
shl rdx, 14 shl rdx, 6
add rcx, rdx add rcx, rdx
mov rdx, qword ptr [rbp - 16] and rax, qword ptr [rcx + 56]
shr rdx, 56 pop rbp
and rdx, 255 ret
shl rdx, 6
add rcx, rdx
and rax, qword ptr [rcx + 56]
pop rbp
ret
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Split Index Super Set Yielding (SISSY) Bitboards
That assembly doesn't look like -O3, I got vastly smaller code.Gerd Isenberg wrote: ↑Sat Feb 15, 2020 8:13 pm If I compare the generated assembly of both versions - that seems to be a domain of assembly programmers.
That stuff seems much to difficult and long for the compilers - so everything is possible
https://godbolt.org/
x86-64 clang (experimental -Clifetime)Code: Select all
...snip...
Martin Sedlak
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: Split Index Super Set Yielding (SISSY) Bitboards
Hey Martin and Gerd of course there is the possibility that handwritten assembler would be a benefit. There is much work left to be done. A small improvement is to embed the rnk table into qss[64][256][0] like so.
Code: Select all
void InitializeRNK() {
u08 sq, sqr, i;
s08 x, dx, y, dy;
u64 bb, b;
for (sq = 0; sq < 64; sq++) {
y = sq >> 3;
x = sq & 7;
for(i = 0; i < 128; i++) {
bb = 0;
b = (u64)i << (sq & 56);
for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
sqr = (((y + dy) << 3) + x + dx);
bb |= ONE << sqr;
bob[sq] |= ONE << sqr;
if ((ONE << sqr) & b) break;
}
for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
sqr = (((y + dy) << 3) + x + dx);
bb |= ONE << sqr;
bob[sq] |= ONE << sqr;
if ((ONE << sqr) & b) break;
}
for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
sqr = (((y + dy) << 3) + x + dx);
bb |= ONE << sqr;
bob[sq] |= ONE << sqr;
if ((ONE << sqr) & b) break;
}
for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
sqr = (((y + dy) << 3) + x + dx);
bb |= ONE << sqr;
bob[sq] |= ONE << sqr;
if ((ONE << sqr) & b) break;
}
for (dx = -1; x + dx > -1; dx--) {
sqr = (y << 3) + x + dx;
bb |= ONE << sqr;
rob[sq] |= ONE << sqr;
if ((ONE << sqr) & b) break;
}
for (dx = +1; x + dx < +8; dx++) {
sqr = (y << 3) + x + dx;
bb |= ONE << sqr;
rob[sq] |= ONE << sqr;
if ((ONE << sqr) & b) break;
}
for (dy = +1; y + dy < +8; dy++) {
sqr = ((y + dy) << 3) + x;
bb |= ONE << sqr;
rob[sq] |= ONE << sqr;
if ((ONE << sqr) & b) break;
}
for (dy = -1; y + dy > -1; dy--) {
sqr = ((y + dy) << 3) + x;
bb |= ONE << sqr;
rob[sq] |= ONE << sqr;
if ((ONE << sqr) & b) break;
}
qss[sq][i][0] = bb;
}
}
}
case WROOK:
h->moves[id] = qss[fs][(aPieces >> (fs & 56)) & 127][0]
& qss[fs][(aPieces >> 8) & 255][1]
& qss[fs][(aPieces >> 16) & 255][2]
& qss[fs][(aPieces >> 24) & 255][3]
& qss[fs][(aPieces >> 32) & 255][4]
& qss[fs][(aPieces >> 40) & 255][5]
& qss[fs][(aPieces >> 48) & 255][6]
& rob[fs];
h->attacked |= h->moves[id];
h->moves[id] &= notMe;
continue;
case WQUEEN:
h->moves[id] = qss[fs][(aPieces >> (fs & 56)) & 127][0]
& qss[fs][(aPieces >> 8) & 255][1]
& qss[fs][(aPieces >> 16) & 255][2]
& qss[fs][(aPieces >> 24) & 255][3]
& qss[fs][(aPieces >> 32) & 255][4]
& qss[fs][(aPieces >> 40) & 255][5]
& qss[fs][(aPieces >> 48) & 255][6];
h->attacked |= h->moves[id];
h->moves[id] &= notMe;
continue;
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: Split Index Super Set Yielding (SISSY) Bitboards
indeed, I was not that familar with the online compiler, this is the union version, which looks quite nice ...mar wrote: ↑Sat Feb 15, 2020 8:22 pmThat assembly doesn't look like -O3, I got vastly smaller code.Gerd Isenberg wrote: ↑Sat Feb 15, 2020 8:13 pm If I compare the generated assembly of both versions - that seems to be a domain of assembly programmers.
That stuff seems much to difficult and long for the compilers - so everything is possible
https://godbolt.org/
x86-64 clang (experimental -Clifetime)Code: Select all
...snip...
Code: Select all
queenAttacks(int, unsigned long long): # @queenAttacks(int, unsigned long long)
.Lfunc_begin0:
push rbx
.Ltmp0:
mov rdx, rsi
mov rcx, rsi
mov r11, rsi
mov r10, rsi
mov r9, rsi
mov r8, rsi
movzx ebx, sil
mov rax, rsi
.Ltmp1:
shr rax, 2
.Ltmp2:
shr rdx, 10
shr rcx, 18
shr r11, 26
shr r10, 34
movsxd rsi, edi
shl rbx, 6
shl rsi, 14
and eax, 16320
mov rax, qword ptr [rsi + rax + qss+8]
and rax, qword ptr [rsi + rbx + qss]
shr r9, 42
and edx, 16320
and rax, qword ptr [rsi + rdx + qss+16]
shr r8, 50
and ecx, 16320
and rax, qword ptr [rsi + rcx + qss+24]
and r8d, -64
and r11d, 16320
and rax, qword ptr [rsi + r11 + qss+32]
and r10d, 16320
and rax, qword ptr [rsi + r10 + qss+40]
and r9d, 16320
and rax, qword ptr [rsi + r9 + qss+48]
and rax, qword ptr [rsi + r8 + qss+56]
pop rbx
ret
Code: Select all
_queenAttacks(int, unsigned long long): # @_queenAttacks(int, unsigned long long)
.Lfunc_begin1:
movsxd rcx, edi
movzx edx, sil
shl rdx, 6
shl rcx, 14
mov rax, rsi
shr rax, 2
and eax, 16320
mov rax, qword ptr [rcx + rax + qss+8]
and rax, qword ptr [rcx + rdx + qss]
mov rdx, rsi
shr rdx, 10
and edx, 16320
and rax, qword ptr [rcx + rdx + qss+16]
mov rdx, rsi
shr rdx, 18
and edx, 16320
and rax, qword ptr [rcx + rdx + qss+24]
mov rdx, rsi
shr rdx, 26
and edx, 16320
and rax, qword ptr [rcx + rdx + qss+32]
mov rdx, rsi
shr rdx, 34
and edx, 16320
and rax, qword ptr [rcx + rdx + qss+40]
mov rdx, rsi
shr rdx, 42
and edx, 16320
and rax, qword ptr [rcx + rdx + qss+48]
shr rsi, 50
.Ltmp4:
and esi, -64
and rax, qword ptr [rcx + rsi + qss+56]
ret
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: Split Index Super Set Yielding (SISSY) Bitboards
It seems that x86-64 Clang -O3 is hard to beat. The union version prepares many registers with some shift at the beginning to and with 255 aka 16320 = 255 << 6. The explicit shift/and version uses less registers, but has therefore more dependencies.Michael Sherwin wrote: ↑Sat Feb 15, 2020 8:40 pm Hey Martin and Gerd of course there is the possibility that handwritten assembler would be a benefit.
https://godbolt.org/
Sure. Your approach is really promising - wiki page is in preparation ...Michael Sherwin wrote: There is much work left to be done. A small improvement is to embed the rnk table into qss[64][256][0] like so.