Comparison of all known Sliding lookup algorithms

abulmo2 · Post by **abulmo2** » Thu Feb 17, 2022 1:18 am

abulmo2 wrote: ↑Thu Feb 17, 2022 12:40 am Compilation failed on my Raspberry PI 4. It complains about the builtin_ia32_* functions . They are for Intel-like cpu, not ARM. It
also complains about the cpuid asm code.

I removed the offending algorithms until it compiled and get the following summary:

Code: Select all

Summary:
Name                               Performance [MQueens/s]       Threads             
Kogge-Stone                        420.560041                    12                  
Hyperbola Quintessence o^(o-2r)    375.721998                    10                  
Rotated Bitboards                  348.589842                    12                  
Hyperbola Quintessence Inline      312.154375                    11                  
Leorik                             292.477497                    12                  
Leorik Inline                      291.866313                    12                  
Obstruction Difference Inline      273.708702                    12                  
Obstruction Difference             273.026532                    11                  
SBAMG Inline                       262.724473                    12                  
SBAMG o^(o-3cbn)                   262.329003                    12                  
Dumb7 Fill                         247.719091                    11                  
Classical Bob-Mike                 246.416379                    11                  
Reference (Switch Lookup)          227.260638                    12                  
QBBEngine                          210.567611                    12                  
Black Magic BB - Fixed shift       200.863345                    12                  
Fancy Magic BB - Variable shift    162.179407                    7                   
Exploding Bitboards                158.991506                    12                  
Kindergarten                       73.108192                     1                   
Plain Magic BB                     64.542688                     12                  
SISSY Bitboards                    27.010867                     12

PS: my RPI 4 is overclocked at 1.8Ghz.

dangi12012 · Post by **dangi12012** » Thu Feb 17, 2022 12:10 pm

abulmo2 wrote: ↑Thu Feb 17, 2022 1:18 am

abulmo2 wrote: ↑Thu Feb 17, 2022 12:40 am Compilation failed on my Raspberry PI 4. It complains about the builtin_ia32_* functions . They are for Intel-like cpu, not ARM. It
also complains about the cpuid asm code.

I removed the offending algorithms until it compiled and get the following summary:

Code: Select all

Summary:
Name                               Performance [MQueens/s]       Threads             
Kogge-Stone                        420.560041                    12                  
Hyperbola Quintessence o^(o-2r)    375.721998                    10                  
Rotated Bitboards                  348.589842                    12                  
Hyperbola Quintessence Inline      312.154375                    11                  
Leorik                             292.477497                    12                  
Leorik Inline                      291.866313                    12                  
Obstruction Difference Inline      273.708702                    12                  
Obstruction Difference             273.026532                    11                  
SBAMG Inline                       262.724473                    12                  
SBAMG o^(o-3cbn)                   262.329003                    12                  
Dumb7 Fill                         247.719091                    11                  
Classical Bob-Mike                 246.416379                    11                  
Reference (Switch Lookup)          227.260638                    12                  
QBBEngine                          210.567611                    12                  
Black Magic BB - Fixed shift       200.863345                    12                  
Fancy Magic BB - Variable shift    162.179407                    7                   
Exploding Bitboards                158.991506                    12                  
Kindergarten                       73.108192                     1                   
Plain Magic BB                     64.542688                     12                  
SISSY Bitboards                    27.010867                     12

PS: my RPI 4 is overclocked at 1.8Ghz.

Interesting! - seems the PI is really bad at unpredictable lookups! I will make the other algos run via PI emulation in QEMU. I want my library to be standard C++ enough to make it run anywhere where a C++20 compiler is available. Maybe its possible to backport to C++14 too - then you can even pick your best performing algo in embedded situations.

dangi12012 · Post by **dangi12012** » Wed Mar 02, 2022 9:44 pm

BIG RASPBERRY PI UPDATE!

So I got myself one of those older Original models. Its a 700Mhz 32bit CPU. After 79minutes of compilation with Raspbian clang version 11.0.1-2 this is the result:

Code: Select all

model name      : ARMv6-compatible processor rev 7 (v6l)

Million Lookups/s Random Squares, Random Occupation/s:
Name                               Performance [MQueens/s]       Tablesize           Dependencies             Template  Author                                       Reference
Binary Neural Network              0.020274                      5852    [45kb]      pdep_u64, AVX2           no        Daniel Inf hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79332
Exploding Bitboards                1.967893                      768     [6kb]       imul64                   no        Harald L  en                                  http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=4523&start=80
Reference (Switch Lookup)          4.504871                      0       [0kb]       none                     yes       Daniel Inf hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78235&p=907362&hilit=espresso#p907362
AVX Branchless Shift               1.796773                      0       [0kb]       AVX2                     no        Daniel Inf hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=60
Pext Emulated                      0.868572                      107904  [843kb]     none                     no        Zach Wegner                                  https://randombit.net/bitbashing/posts/haswell_bit_permutations.html
Dumb7 Fill                         1.170400                      0       [0kb]       none                     no        Gunnar Andersson                             https://www.chessprogramming.org/Dumb7Fill
Kogge-Stone                        1.635208                      0       [0kb]       none                     no        Peter M. Kogge, Harold S. Stone              https://www.chessprogramming.org/Kogge-Stone_Algorithm
Rotated Bitboards                  1.506080                      1848    [14kb]      none                     no        Robert Hyatt                                 https://www.chessprogramming.org/Rotated_Bitboards
QBBEngine                          1.403284                      0       [0kb]       countr_zero, countl_zero yes       Fabio Gobbato                                https://www.chessprogramming.org/QBBEngine
Classical Bob-Mike                 2.950869                      1024    [8kb]       countr_zero, countl_zero yes       Robert Hyatt and Michael Sherwin             https://www.chessprogramming.org/Classical_Approach
Leorik                             2.093863                      128     [1kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Leorik Inline                      2.012548                      0       [0kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Obstruction Difference             2.557333                      768     [6kb]       countl_zero              no        Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Obstruction Difference Inline      1.903921                      0       [0kb]       countl_zero              yes       Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Slide Arithmetic                   2.213375                      256     [2kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inf hr              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Slide Arithmetic Inline            1.868535                      0       [0kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inf hr              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
SBAMG o^(o-3cbn)                   4.039879                      576     [4kb]       countl_zero, bswap       yes       Syed Fahad                                   http://www.talkchess.com/forum3/viewtopic.php?t=59845
SBAMG Inline                       1.742707                      0       [0kb]       countl_zero, bswap       yes       Syed Fahad and Daniel Inf hr                 http://www.talkchess.com/forum3/viewtopic.php?t=59845
Hyperbola Quintessence o^(o-2r)    3.791002                      256     [2kb]       bswap                    no        Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Hyperbola Quintessence Inline      1.874561                      0       [0kb]       bswap                    yes       Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Kindergarten                       1.735982                      16640   [130kb]     imul64                   no        Urban Koistinen                              https://www.chessprogramming.org/Kindergarten_Bitboards
SISSY Bitboards                    0.527762                      180416  [1409kb]    none                     no        Michael Sherwin                              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=73083
Fancy Magic BB - Variable shift    1.968951                      93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
Plain Magic BB                     1.270352                      295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       2.211778                      88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     0.858988                      107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
HyperCube                          0.868297                      107680  [841kb]     none                     yes       Daniel Inf hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79004&p=916723&hilit=hypercube#p916723

In the meantime I could also test on my Laptop:

Code: Select all

AMD Ryzen 5 5600H with Radeon Graphics

Million Lookups/s Random Squares, Random Occupation/s:
Name                               Performance [MQueens/s]       Tablesize           Dependencies             Template  Author                                       Reference
Binary Neural Network              35.255627                     5852    [45kb]      pdep_u64, AVX2           no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79332
Exploding Bitboards                45.199613                     768     [6kb]       imul64                   no        Harald Lüßen                                  http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=4523&start=80
Reference (Switch Lookup)          24.208464                     0       [0kb]       none                     yes       Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78235&p=907362&hilit=espresso#p907362
AVX Branchless Shift               121.325236                    0       [0kb]       AVX2                     no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=60
Pext Emulated                      42.514175                     107904  [843kb]     none                     no        Zach Wegner                                  https://randombit.net/bitbashing/posts/haswell_bit_permutations.html
Dumb7 Fill                         48.334085                     0       [0kb]       none                     no        Gunnar Andersson                             https://www.chessprogramming.org/Dumb7Fill
Kogge-Stone                        78.573318                     0       [0kb]       none                     no        Peter M. Kogge, Harold S. Stone              https://www.chessprogramming.org/Kogge-Stone_Algorithm
Rotated Bitboards                  32.100333                     1848    [14kb]      none                     no        Robert Hyatt                                 https://www.chessprogramming.org/Rotated_Bitboards
QBBEngine                          132.850790                    0       [0kb]       countr_zero, countl_zero yes       Fabio Gobbato                                https://www.chessprogramming.org/QBBEngine
Classical Bob-Mike                 156.739607                    1024    [8kb]       countr_zero, countl_zero yes       Robert Hyatt and Michael Sherwin             https://www.chessprogramming.org/Classical_Approach
Leorik                             157.123385                    128     [1kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Leorik Inline                      86.767582                     0       [0kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Obstruction Difference             201.715930                    768     [6kb]       countl_zero              no        Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Obstruction Difference Inline      73.182245                     0       [0kb]       countl_zero              yes       Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Slide Arithmetic                   216.530283                    256     [2kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inführ              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Slide Arithmetic Inline            85.571687                     0       [0kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inführ              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
SBAMG o^(o-3cbn)                   188.533399                    576     [4kb]       countl_zero, bswap       yes       Syed Fahad                                   http://www.talkchess.com/forum3/viewtopic.php?t=59845
SBAMG Inline                       75.937352                     0       [0kb]       countl_zero, bswap       yes       Syed Fahad and Daniel Inführ                 http://www.talkchess.com/forum3/viewtopic.php?t=59845
Hyperbola Quintessence o^(o-2r)    246.740963                    256     [2kb]       bswap                    no        Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Hyperbola Quintessence Inline      75.012611                     0       [0kb]       bswap                    yes       Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Kindergarten                       351.707098                    16640   [130kb]     imul64                   no        Urban Koistinen                              https://www.chessprogramming.org/Kindergarten_Bitboards
SISSY Bitboards                    140.403262                    180416  [1409kb]    none                     no        Michael Sherwin                              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=73083
Fancy Magic BB - Variable shift    323.188796                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
Plain Magic BB                     404.792746                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       452.348821                    88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     643.066139                    107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
HyperCube                          57.103211                     107680  [841kb]     none                     yes       Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79004&p=916723&hilit=hypercube#p916723

People with a DDR5 system please run this repo and share your results:
https://github.com/Gigantua/Chess_Movegen

Convenience code:

Code: Select all

git clone https://github.com/Gigantua/Chess_Movegen.git && cd Chess_Movegen && make && ./chess_movegen

So what do we learn? Compilation takes 5 seconds vs 79minutes and runtime performance is similar weak for something that is a moder machine vs a 23 year old deskop equivalent. The code should now compile on anything from a pdp-8 to a supercomputer (if you have C++20)
Winner 32bit - 700mhz arm: Reference Switch lookup
Winner 64bit - 4ghz ryzen: constexpr PEXT

These winners make me happy because I invented the first winner switch code (with espresso heuristic logic minimizer), and created the inline constexpr version of pext which is around 20% faster than what is shared on the cpw.

Mike Sherwin · Post by **Mike Sherwin** » Thu Mar 03, 2022 9:35 am

Classical Bob-Mike was done wrong.

Being so under the weather for months with shingles coupled with my short term memory disability I couldn't keep it straight in my mind. I'm a lot better now.

Anyway, part of Bob was left out. This code represents Bob's and my method, that is if my mind is now working better. I wrote it in my style just because it is less confusing to me.

Code: Select all

	  occ |= 0x8000000000000001;
	  bb = ray[std::countr_zero(ray[fs].rwsNW & occ)].rayNW
	   | ray[std::countr_zero(ray[fs].rwsNN & occ)].rayNN
	   | ray[std::countr_zero(ray[fs].rwsNE & occ)].rayNE
	   | ray[std::countr_zero(ray[fs].rwsEE & occ)].rayEE
	   | ray[63 - std::countl_zero(ray[fs].rwsSE & occ)].raySE
	   | ray[63 - std::countl_zero(ray[fs].rwsSS & occ)].raySS
	   | ray[63 - std::countl_zero(ray[fs].rwsSW & occ)].raySW
	   | ray[63 - std::countl_zero(ray[fs].rwsWW & occ)].rayWW
	   ^ queen[fs];

bb will have all the outer bits for each direction and then remove those bits from queen[fs] in one XOR operation.
Therefore there is only 1 XOR instead of 8 XORs!
And really no need for separate rook and bishop calls because the design in the first place was to avoid that.

IMHO this will now perform much faster.

Someone please let me know if this is wrong. It is still a bit confusing for me.

dangi12012 · Post by **dangi12012** » Thu Mar 03, 2022 7:02 pm

Mike Sherwin wrote: ↑Thu Mar 03, 2022 9:35 am IMHO this will now perform much faster.

Someone please let me know if this is wrong. It is still a bit confusing for me.

Your code was correct but only 5% faster. But not for the reason you think. Any relevant compiler will figure out that XOR is transitive meaning
XOR XOR XOR = XOR (a | b | c) and that is seen in the assembly. But you need a different extra lookup that slows things down again.
BUT: Clang can do one more optimisation with your code that it did not see with the existing calls.

Proof:
https://godbolt.org/z/KT6znYzeG

Notice how there is vpunpcklqdq where there was none before.

I can also see another optimisation that you missed - if its faster by more than 140% I will add myself to authors of this shared algo

dangi12012 · Post by **dangi12012** » Thu Mar 03, 2022 7:10 pm

What is left to do:
Finding the ultimate algorithm.

Genetic programming:
Have a monte carlo tree search against an abstract syntax tree where each node is a cheap x86 instruction like XOR, AND, OR.
Also some nodes get more expensive instructions like std::countlzero and array lookups into: 8 rays, 4 edges, 4 corner squares.

The score of a tree node is the amount of times the algo produced the correct output for the 2^14 = 16k possible inputs.
It should find algos like SBAMG o^(o-3cbn) in a short amount of time.

Prediction:
There is an uncountable amount of cheap algorithms that were not discovered yet - some of them might be competitive against
Hyperbola Quintessence o^(o-2r) and SBAMG o^(o-3cbn)

Genetic programming is nothing new and many many phd dissertations use this to find a solution algorithm for some problems.
Could be interesting for chess and I will take a look when I have time.

Mike Sherwin · Post by **Mike Sherwin** » Thu Mar 03, 2022 8:00 pm

dangi12012 wrote: ↑Thu Mar 03, 2022 7:02 pm
Mike Sherwin wrote: ↑Thu Mar 03, 2022 9:35 am IMHO this will now perform much faster.

Someone please let me know if this is wrong. It is still a bit confusing for me.
Your code was correct but only 5% faster. But not for the reason you think. Any relevant compiler will figure out that XOR is transitive meaning
XOR XOR XOR = XOR (a | b | c) and that is seen in the assembly. But you need a different extra lookup that slows things down again.
BUT: Clang can do one more optimisation with your code that it did not see with the existing calls.

Proof:
https://godbolt.org/z/KT6znYzeG

Notice how there is vpunpcklqdq where there was none before.

I can also see another optimisation that you missed - if its faster by more than 140% I will add myself to authors of this shared algo

I've been told by Mar and Gerd that the compilers do not understand my C++ code very well however Clang did do the best. If I write the SISSY rook getter in 80486 assembler I could write it in 20 assembler instructions with almost no dependency chains so the CPU could execute 4 instructions at the same time. However I suppose the cache misses would still slow it down.

Turns out that a union is of benefit after all.

_DATA SEGMENT

bbs STRUCT
r1 BYTE ?
r2 BYTE ?
r3 BYTE ?
r4 BYTE ?
r5 BYTE ?
r6 BYTE ?
r7 BYTE ?
r8 BYTE ?
bbs ENDS

bbu UNION
bbs<>
b64 QWORD ?
bbu ENDS

occ bbu<>

_DATA ENDS

_TEXT SEGMENT

RayAttacks PROC

; rcx = sq
; rdx = address of rss
; r8 = occ

shl rcx, 11 ; sq * 2048
mov occ.b64, r8
add rdx, rcx
movzx r8, occ.r1
movzx r9, occ.r2
mov rax, [rdx + r8 * 8]
mov rcx, [rdx + r9 * 8 + 131072]
movzx r8, occ.r3
movzx r9, occ.r4
and rax, [rdx + r8 * 8 + (2 * 131072)]
and rcx, [rdx + r9 * 8 + (3 * 131072)]
movzx r8, occ.r5
movzx r9, occ.r6
and rax, [rdx + r8 * 8 + (4 * 131072)]
and rcx, [rdx + r9 * 8 + (5 * 131072)]
movzx r8, occ.r7
movzx r9, occ.r8
and rax, [rdx + r8 * 8 + (6 * 131072)]
and rcx, [rdx + r9 * 8 + (7 * 131072)]
and rax, rcx
ret

RayAttacks ENDP

TEXT ENDS

END

So 20 cheap instructions executing in parallel! This SISSY is working out and getting stronger!!

Mike Sherwin · Post by **Mike Sherwin** » Thu Mar 03, 2022 9:24 pm

dangi12012 wrote: ↑Thu Mar 03, 2022 7:02 pm
Mike Sherwin wrote: ↑Thu Mar 03, 2022 9:35 am IMHO this will now perform much faster.

Someone please let me know if this is wrong. It is still a bit confusing for me.
Your code was correct but only 5% faster. But not for the reason you think. Any relevant compiler will figure out that XOR is transitive meaning
XOR XOR XOR = XOR (a | b | c) and that is seen in the assembly. But you need a different extra lookup that slows things down again.
BUT: Clang can do one more optimisation with your code that it did not see with the existing calls.

Proof:
https://godbolt.org/z/KT6znYzeG

Notice how there is vpunpcklqdq where there was none before.

I can also see another optimisation that you missed - if its faster by more than 140% I will add myself to authors of this shared algo

I forgot to say thanks for all the work. So thanks! And I'm looking forward to see what you can do. And you are very welcome to add your name.

Mergi · Post by **Mergi** » Thu Mar 03, 2022 9:45 pm

Downloaded Visual Studio just for this since CLion was reporting infinite errors for me. Run on the new intel Alder lake - i5 12600.

Code: Select all

Summary:
Name                               Performance [MQueens/s]       Threads
Pext constexpr                     6554.064203                   12
Black Magic BB - Fixed shift       4738.681462                   12
Plain Magic BB                     3886.230599                   12
Fancy Magic BB - Variable shift    2239.493467                   11
SISSY Bitboards                    2227.502228                   12
Kindergarten                       2049.931199                   20
Obstruction Difference Inline      1841.383247                   5
Hyperbola Quintessence o^(o-2r)    1650.040220                   12
Obstruction Difference             1602.430353                   12
Classical Bob-Mike                 1579.920960                   18
Slide Arithmetic                   1476.564158                   12
Rotated Bitboards                  1455.383403                   12
HyperCube                          1419.521976                   12
Leorik                             1412.433535                   12
SBAMG Inline                       1397.168405                   12
SBAMG o^(o-3cbn)                   1391.312413                   12
Hyperbola Quintessence Inline      1241.529150                   12
Slide Arithmetic Inline            1235.689811                   12
Leorik Inline                      1225.377569                   12
AVX Branchless Shift               1136.519693                   12
QBBEngine                          1053.545113                   20
Exploding Bitboards                988.461768                    12
Reference (Switch Lookup)          766.453603                    12
Kogge-Stone                        700.052358                    12
Pext Emulated                      692.727658                    12
Dumb7 Fill                         440.855452                    16
Binary Neural Network              232.960289                    12

Mike Sherwin · Post by **Mike Sherwin** » Thu Mar 03, 2022 10:02 pm

I don't mean to be a pesterer or a pain but in Bricabrac I do SISSY a little differently.

Code: Select all

union SplitBB {
  struct {
    u08 r1;
    u08 r2;
    u08 r3;
    u08 r4;
    u08 r5;
    u08 r6;
    u08 r7;
    u08 r8;
  };
  struct {
    u32 l32;
    u32 h32;
  };
  u64 bb;
};

void InitializeBSS() {
  u08 sq, sqr, k, l;
  s08 x, dx, y, dy;
  u64 b, bb, i, j, ii, jj;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    for (i = 0; i < 128; i++) {
      for (k = 8, l = 0; k <= 48; k += 8, l++) {
        bb = 0;
        b = ((u64)i << k) & bob[sq];
        for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          if ((one << sqr) & b) break;
        }
        bss[sq][i][l] = bb;
      }
    }
  }

  for (sq = 0; sq <= 63; sq++) {
    for (ii = 0; ii <= 127; ii++) {
      for (jj = 0; jj <= 127; jj++) {
        i = ((ii << 8) & bob[sq]) >> 8;
        j = ((jj << 32) & bob[sq]) >> 32;
        bss[sq][i | j][0] = bss[sq][i][0] & bss[sq][j][3];
        i = ((ii << 16) & bob[sq]) >> 16;
        j = ((jj << 40) & bob[sq]) >> 40;
        bss[sq][i | j][1] = bss[sq][i][1] & bss[sq][j][4];
        i = ((ii << 24) & bob[sq]) >> 24;
        j = ((jj << 48) & bob[sq]) >> 48;
        bss[sq][i | j][2] = bss[sq][i][2] & bss[sq][j][5];
      }
    }
  }
}

void InitializeQSS() {
  u08 sq, sqr, k, l;
  s08 x, y, dx, dy;
  s32 i;
  u64 b, bb;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    bob[sq] = 0;
    rob[sq] = 0;
    for (i = 0; i < 256; i++) {
      for (k = 0, l = 0; k <= 56; k += 8, l++) {
        bb = 0;
        b = (u64)i << k;
        for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          bob[sq] |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          bob[sq] |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          bob[sq] |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          bob[sq] |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = -1; x + dx > -1; dx--) {
          sqr = (y << 3) + x + dx;
          bb |= one << sqr;
          rob[sq] |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = +1; x + dx < +8; dx++) {
          sqr = (y << 3) + x + dx;
          bb |= one << sqr;
          rob[sq] |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dy = +1; y + dy < +8; dy++) {
          sqr = ((y + dy) << 3) + x;
          bb |= one << sqr;
          rob[sq] |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dy = -1; y + dy > -1; dy--) {
          sqr = ((y + dy) << 3) + x;
          bb |= one << sqr;
          rob[sq] |= one << sqr;
          if ((one << sqr) & b) break;
        }
        qss[sq][i][l] = bb;
      }
    }
    b7e[sq] = bob[sq] & 0x007e7e7e7e7e7e00;
  }
}

void InitializeRNK() {
  u08 sq, sqr, i;
  s08 x, y, dx, dy;
  u64 bb, b;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    for (i = 0; i < 128; i++) {
      bb = 0;
      b = (u64)i << (sq & 56);
      for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
        sqr = (((y + dy) << 3) + x + dx);
        bb |= one << sqr;
        bob[sq] |= one << sqr;
        if ((one << sqr) & b) break;
      }
      for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
        sqr = (((y + dy) << 3) + x + dx);
        bb |= one << sqr;
        bob[sq] |= one << sqr;
        if ((one << sqr) & b) break;
      }
      for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
        sqr = (((y + dy) << 3) + x + dx);
        bb |= one << sqr;
        bob[sq] |= one << sqr;
        if ((one << sqr) & b) break;
      }
      for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
        sqr = (((y + dy) << 3) + x + dx);
        bb |= one << sqr;
        bob[sq] |= one << sqr;
        if ((one << sqr) & b) break;
      }
      for (dx = -1; x + dx > -1; dx--) {
        sqr = (y << 3) + x + dx;
        bb |= one << sqr;
        rob[sq] |= one << sqr;
        if ((one << sqr) & b) break;
      }
      for (dx = +1; x + dx < +8; dx++) {
        sqr = (y << 3) + x + dx;
        bb |= one << sqr;
        rob[sq] |= one << sqr;
        if ((one << sqr) & b) break;
      }
      for (dy = +1; y + dy < +8; dy++) {
        sqr = ((y + dy) << 3) + x;
        bb |= one << sqr;
        rob[sq] |= one << sqr;
        if ((one << sqr) & b) break;
      }
      for (dy = -1; y + dy > -1; dy--) {
        sqr = ((y + dy) << 3) + x;
        bb |= one << sqr;
        rob[sq] |= one << sqr;
        if ((one << sqr) & b) break;
      }
      qss[sq][i][0] = bb;
    }
  }
}

    case WB:
    case BB:
      blocks.bb = aPieces & b7e[fs];
      blocks.bb = ((blocks.l32 >> 8) | blocks.h32);
      bb = bss[fs][blocks.r1][0]
        & bss[fs][blocks.r2][1]
        & bss[fs][blocks.r3][2]
        & enemy;
      break;
    case WR:
    case WRC:
    case BR:
    case BRC:
      blocks.bb = aPieces & rob[fs];
      bb = qss[fs][(blocks.bb >> (fs & 56)) & 127][0]
        & qss[fs][blocks.r2][1]
        & qss[fs][blocks.r3][2]
        & qss[fs][blocks.r4][3]
        & qss[fs][blocks.r5][4]
        & qss[fs][blocks.r6][5]
        & qss[fs][blocks.r7][6]
        & rob[fs]
        & enemy;
      break;
    case WQ:
    case BQ:
      blocks.bb = aPieces & (rob[fs] | bob[fs]);
      bb = qss[fs][(blocks.bb >> (fs & 56)) & 127][0]
        & qss[fs][blocks.r2][1]
        & qss[fs][blocks.r3][2]
        & qss[fs][blocks.r4][3]
        & qss[fs][blocks.r5][4]
        & qss[fs][blocks.r6][5]
        & qss[fs][blocks.r7][6]
        & enemy;
      break;

This is just for information. I don't expect you to use up all your time for my code. I would add it myself but I'm not very fast and I must focus on my new project if I'm going to get anywhere with it. It is just that using a union in my test seemed a bit faster so that is the way I left it. Also the bishop code was folded a bit so only 3 lookups is needed rather than 6. But it seems to add some dependencies. With one more fold it can be reduced to only 2 lookups. Not sure if it would be worth it. Also the rook can be folded down in theory to just 2 lookups as well. I just never got around to optimizing the idea any further. It took me many months on and off to figure out how to fold the bishop down to 3 lookups in the initialization. I just do not have that amount of time to spare anymore.

Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms