Split Index Super Set Yielding (SISSY) Bitboards

Michael Sherwin · Post by **Michael Sherwin** » Sat Feb 15, 2020 9:14 pm

Gerd Isenberg wrote: ↑Sat Feb 15, 2020 9:07 pm
Michael Sherwin wrote: ↑Sat Feb 15, 2020 8:40 pm Hey Martin and Gerd of course there is the possibility that handwritten assembler would be a benefit.
It seems that x86-64 Clang -O3 is hard to beat. The union version prepares many registers with some shift at the beginning to and with 255 aka 16320 = 255 << 6. The explicit shift/and version uses less registers, but has therefore more dependencies.
https://godbolt.org/
Michael Sherwin wrote: There is much work left to be done. A small improvement is to embed the rnk table into qss[64][256][0] like so.
Sure. Your approach is really promising - wiki page is in preparation ...

Okay, maybe hand written assembler is immaterial, lol. I never expected a compiler would be that good! Oh wow, a wiki page already? Thanks! I guess it can be updated when important advances are made.

Michael Sherwin · Post by **Michael Sherwin** » Mon Feb 17, 2020 1:19 am

Michael Sherwin wrote: ↑Sat Feb 15, 2020 9:14 pm
Gerd Isenberg wrote: ↑Sat Feb 15, 2020 9:07 pm
Michael Sherwin wrote: ↑Sat Feb 15, 2020 8:40 pm Hey Martin and Gerd of course there is the possibility that handwritten assembler would be a benefit.
It seems that x86-64 Clang -O3 is hard to beat. The union version prepares many registers with some shift at the beginning to and with 255 aka 16320 = 255 << 6. The explicit shift/and version uses less registers, but has therefore more dependencies.
https://godbolt.org/
Michael Sherwin wrote: There is much work left to be done. A small improvement is to embed the rnk table into qss[64][256][0] like so.
Sure. Your approach is really promising - wiki page is in preparation ...
Okay, maybe hand written assembler is immaterial, lol. I never expected a compiler would be that good! Oh wow, a wiki page already? Thanks! I guess it can be updated when important advances are made.

If I were to try handwritten assembler it would look like this. It is a shame that 8 bit index registers are not allowed in 32 bit or 64 bit assembler. The whole reason for the union method was because I thought there were 8 bit index registers.

This code would have flown if those extra 8 moves were not needed. Still there are no dependencies in this code that would prevent both pipelines executing in parallel. That should count for something.

RayAttacks PROC

; rcx = sq
; rdx = address of rss - might as well pass it as an argument
; r8 = occ

shl rcx, 11 ; * 2048
mov r9, r8
add rdx, rcx
shr r9, 32
movsx r10, r8b
movsx r11, r9b
mov rax, [rdx + r10 * 8]
mov rcx, [rdx + r11 * 8 + (4 * 131072)]
shr r8, 8
shr r9, 8
movsx r10, r8b
movsx r11, r9b
and rax, [rdx + r10 * 8 + 131072]
and rcx, [rdx + r11 * 8 + (5 * 131072)]
shr r8, 8
shr r9, 8
movsx r10, r8b
movsx r11, r9b
and rax, [rdx + r10 * 8 + (2 * 131072)]
and rcx, [rdx + r11 * 8 + (6 * 131072)]
shr r8, 8
shr r9, 8
movsx r10, r8b
movsx r11, r9b
and rax, [rdx + r10 * 8 + (3 * 131072)]
and rcx, [rdx + r11 * 8 + (7 * 131072)]
and rax, rcx
ret

RayAttacks ENDP

TEXT ENDS

Michael Sherwin · Post by **Michael Sherwin** » Mon Feb 17, 2020 2:41 am

In above post change movsx to movzx. My bad!

Michael Sherwin · Post by **Michael Sherwin** » Mon Feb 17, 2020 6:06 am

Turns out that a union is of benefit after all.

_DATA SEGMENT

bbs STRUCT
r1 BYTE ?
r2 BYTE ?
r3 BYTE ?
r4 BYTE ?
r5 BYTE ?
r6 BYTE ?
r7 BYTE ?
r8 BYTE ?
bbs ENDS

bbu UNION
bbs<>
b64 QWORD ?
bbu ENDS

occ bbu<>

_DATA ENDS

_TEXT SEGMENT

RayAttacks PROC

; rcx = sq
; rdx = address of rss
; r8 = occ

shl rcx, 11 ; sq * 2048
mov occ.b64, r8
add rdx, rcx
movzx r8, occ.r1
movzx r9, occ.r2
mov rax, [rdx + r8 * 8]
mov rcx, [rdx + r9 * 8 + 131072]
movzx r8, occ.r3
movzx r9, occ.r4
and rax, [rdx + r8 * 8 + (2 * 131072)]
and rcx, [rdx + r9 * 8 + (3 * 131072)]
movzx r8, occ.r5
movzx r9, occ.r6
and rax, [rdx + r8 * 8 + (4 * 131072)]
and rcx, [rdx + r9 * 8 + (5 * 131072)]
movzx r8, occ.r7
movzx r9, occ.r8
and rax, [rdx + r8 * 8 + (6 * 131072)]
and rcx, [rdx + r9 * 8 + (7 * 131072)]
and rax, rcx
ret

RayAttacks ENDP

TEXT ENDS

END

So 20 cheap instructions executing in parallel! This SISSY is working out and getting stronger!!

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Feb 17, 2020 11:32 am

Michael Sherwin wrote: ↑Mon Feb 17, 2020 6:06 am Turns out that a union is of benefit after all.

_DATA SEGMENT

bbs STRUCT
r1 BYTE ?
r2 BYTE ?
r3 BYTE ?
r4 BYTE ?
r5 BYTE ?
r6 BYTE ?
r7 BYTE ?
r8 BYTE ?
bbs ENDS

bbu UNION
bbs<>
b64 QWORD ?
bbu ENDS

occ bbu<>

_DATA ENDS

_TEXT SEGMENT

RayAttacks PROC

; rcx = sq
; rdx = address of rss
; r8 = occ

shl rcx, 11 ; sq * 2048
mov occ.b64, r8
add rdx, rcx
movzx r8, occ.r1
movzx r9, occ.r2
mov rax, [rdx + r8 * 8]
mov rcx, [rdx + r9 * 8 + 131072]
movzx r8, occ.r3
movzx r9, occ.r4
and rax, [rdx + r8 * 8 + (2 * 131072)]
and rcx, [rdx + r9 * 8 + (3 * 131072)]
movzx r8, occ.r5
movzx r9, occ.r6
and rax, [rdx + r8 * 8 + (4 * 131072)]
and rcx, [rdx + r9 * 8 + (5 * 131072)]
movzx r8, occ.r7
movzx r9, occ.r8
and rax, [rdx + r8 * 8 + (6 * 131072)]
and rcx, [rdx + r9 * 8 + (7 * 131072)]
and rax, rcx
ret

RayAttacks ENDP

TEXT ENDS

END

So 20 cheap instructions executing in parallel! This SISSY is working out and getting stronger!!

But there is an additional 64-bit memory store followed by 8 partial byte memory reads from the same memory location. At least in former times that was worth some extra penalty cycles. But as always, empirical evidence is what counts. One may try movzx from ah, al type of byte registers combined with 3 shifts by 16:

Code: Select all

RayAttacks PROC

; rcx = sq
; rdx = address of rss
; r8 = occ

shl rcx, 11 ; sq * 2048
add rdx, rcx
movzx r9, r8l
movzx r10, r8h
shr r8, 16
mov rax, [rdx + r9 * 8]
mov rcx, [rdx + r10 * 8 + 131072]
movzx r9, r8l
movzx r10, r8h
shr r8, 16
and rax, [rdx + r9 * 8 + (2 * 131072)]
and rcx, [rdx + r10 * 8 + (3 * 131072)]
movzx r9, r8l
movzx r10, r8h
shr r8, 16
and rax, [rdx + r9 * 8 + (4 * 131072)]
and rcx, [rdx + r10 * 8 + (5 * 131072)]
movzx r9, r8l
movzx r10, r8h
and rax, [rdx + r9 * 8 + (6 * 131072)]
and rcx, [rdx + r10 * 8 + (7 * 131072)]
and rax, rcx
ret

Michael Sherwin · Post by **Michael Sherwin** » Mon Feb 17, 2020 2:06 pm

I looked online for hours yesterday trying to find information on assembler unions and register usage. I found little tidbits "here and there" but nothing very useful. In the end I just mainly guessed how to write the code. And code analysis did not complain. For registers r8, r9 I did not see that there were r8l ... or r8h ... registers. I only saw r8b ... registers. If there are r8h ... registers then I prefer your code. I will see what code analysis has to say. On a modern processor there should not be any penalty for the one byte memory reads. They should be in the fastest cpu cache and they should be prefetchable and waiting in the pipeline. However, I will eventually test all ideas so keep them coming!

I think maybe al->dl and ah->dh can still be used in the way that you suggest so maybe repurposing some registers would work well. If I'm wrong about there being no r8h please let me know. I will make a note!

P.S. Note: Code analysis says,
Severity Code Description Project File Line Suppression State
Error A2006 undefined symbol : r8l SISSY C:\Users\Mike Sherwin\source\repos\SISSY\Sissy.asm 34
Severity Code Description Project File Line Suppression State
Error A2006 undefined symbol : r8h SISSY C:\Users\Mike Sherwin\source\repos\SISSY\Sissy.asm 35

Note2: The shr reg, 16 adds a dependency that can stall the pipeline as it needs to be inserted by itself at some point as it does not have a counterpart, so it is like adding two instructions instead of just one.

Michael Sherwin · Post by **Michael Sherwin** » Mon Feb 17, 2020 5:40 pm

@Martin - If you haven't given up yet on SISSY then I have the following code I'd like you to try. If you have given up then what is your conclusion? I'm kinda hanging out in the wind here.

Because, I'm going with an assembler function I decided upon a ray array like so.

u64* ray = new u64[131072];

The only difference in the init is the addressing into the array.

*(ray + (l << 14) + (sq << 8) + i) = bb;

Of course you can change that to something else if you wish. And I'm not 100% sure my math is correct.

Here is the initialization.

Code: Select all

void InitializeRay() {
  u08 sq, sqr, k, l;
  s08 x, dx, y, dy;
  s32 i;
  u64 b, bb;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    bob[sq] = 0;
    rob[sq] = 0;
    for (i = 0; i < 256; i++) {
      for (k = 0, l = 0; k <= 56; k += 8, l++) {
        bb = 0;
        b = (u64)i << k;
        for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= ONE << sqr;
          bob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= ONE << sqr;
          bob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= ONE << sqr;
          bob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= ONE << sqr;
          bob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dx = -1; x + dx > -1; dx--) {
          sqr = (y << 3) + x + dx;
          bb |= ONE << sqr;
          rob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dx = +1; x + dx < +8; dx++) {
          sqr = (y << 3) + x + dx;
          bb |= ONE << sqr;
          rob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dy = +1; y + dy < +8; dy++) {
          sqr = ((y + dy) << 3) + x;
          bb |= ONE << sqr;
          rob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        for (dy = -1; y + dy > -1; dy--) {
          sqr = ((y + dy) << 3) + x;
          bb |= ONE << sqr;
          rob[sq] |= ONE << sqr;
          if ((ONE << sqr) & b) break;
        }
        *(ray + (l << 14) + (sq << 8) + i) = bb;

I use different amounts in the assembler code because C works with 8 byte math whereas assembler works with byte math. I'm only saying these obvious things because I'm not 100% sure about them. Please tell me if my thinking is off.

Here is the code for the assembler version.

Code: Select all

_DATA SEGMENT

occs STRUCT
    rank1 BYTE ?
    rank2 BYTE ?
    rank3 BYTE ?
    rank4 BYTE ?
    rank5 BYTE ?
    rank6 BYTE ?
    rank7 BYTE ?
    rank8 BYTE ?
occs ENDS

occu UNION
    occs<>
    b64 QWORD ?
occu ENDS

oc occu <>

_DATA ENDS

_TEXT SEGMENT

RayAttacks PROC

; rcx = sq
; rdx = address of rss
; r8 = occ

shl rcx, 11 
mov oc.b64, r8
add rdx, rcx
movzx r8, oc.rank1
movzx r9, oc.rank2
mov rax, [rdx + r8 * 8]
mov rcx, [rdx + r9 * 8 + 131072]
movzx r8, oc.rank3
movzx r9, oc.rank4
and rax, [rdx + r8 * 8 + (2 * 131072)]
and rcx, [rdx + r9 * 8 + (3 * 131072)]
movzx r8, oc.rank5
movzx r9, oc.rank6
and rax, [rdx + r8 * 8 + (4 * 131072)]
and rcx, [rdx + r9 * 8 + (5 * 131072)]
movzx r8, oc.rank7
movzx r9, oc.rank8
and rax, [rdx + r8 * 8 + (6 * 131072)]
and rcx, [rdx + r9 * 8 + (7 * 131072)]
and rax, rcx
ret

RayAttacks ENDP

_TEXT ENDS

END

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Feb 17, 2020 5:58 pm

You are right, there is no r8h (... r15h), only r8l (Intel style) or r8b (MASM STYLE), so one has to use rax, rbx, rcx, rdx with appropriate l/h byte partial registers for that purpose, and maybe ror reg64 16 instead of shift - or better trust the compiler.
https://software.intel.com/en-us/articl ... 4-assembly

mar · Post by **mar** » Mon Feb 17, 2020 7:15 pm

Michael Sherwin wrote: ↑Mon Feb 17, 2020 5:40 pm @Martin - If you haven't given up yet on SISSY then I have the following code I'd like you to try. If you have given up then what is your conclusion? I'm kinda hanging out in the wind here.

I haven't given up Michael, I'll look at it in the coming days, no worries

mar · Post by **mar** » Mon Feb 17, 2020 7:24 pm

Michael Sherwin wrote: ↑Mon Feb 17, 2020 2:06 pm For registers r8, r9 I did not see that there were r8l ... or r8h ... registers. I only saw r8b ... registers. If there are r8h ... registers then I prefer your code.

Yes, they gave up on xh registers (in 64-bit mode with rex prefix, otherwise old ah, bh etc. still work) but instead you have sil, dil, r8l and so on. Way more useful I think.
I've never seen any compiler generate code that uses ah, bh anyway, maybe except in the old 16-bit days

Also, when you target a 32-bit register then the upper part is automatically zeroed. This is useful when you do indexed lookups, because addressing implies 64-bit registers by default (but operands default to 32-bit without rex prefix).
so for example mov eax,eax in 64-bit mode actually clears 32 upper bits of rax, but then you can do mov edx,[rax*4 + base] without having to worry about anything (and without requiring the rex prefix).
Why I'm talking about this => maybe you could use it to your advantage in the assembly, unless you already knew of course.

Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards