Comparison of all known Sliding lookup algorithms
Moderator: Ras
- 
				smatovic
 - Posts: 3371
 - Joined: Wed Mar 10, 2010 10:18 pm
 - Location: Hamburg, Germany
 - Full name: Srdja Matovic
 
Re: Comparison of all known Sliding lookup algorithms
double post
			
			
													
					Last edited by smatovic on Sun Jan 16, 2022 6:37 pm, edited 2 times in total.
									
			
						
										
						- 
				smatovic
 - Posts: 3371
 - Joined: Wed Mar 10, 2010 10:18 pm
 - Location: Hamburg, Germany
 - Full name: Srdja Matovic
 
Re: Comparison of all known Sliding lookup algorithms
That is interesting, I assume you run on 256b AVX2, I also had the Intel Xeon Phi with 2x512b vector-units per x86 core in mind during design, to handle 16 bitboards in one run, some current Intel Xeons seem to have 2x512b units, if this descends to the desktop market it might be worth to take a look again...tcusr wrote: ↑Thu Jan 06, 2022 5:18 pm kogge stone vectorized seems promising, see https://zeta-chess.app26.de/post/64-bit ... tor-based/
i implemented kogge stone with GCC's builtin vectors in my CPU engine and while they were much faster than normal kogge stone it only gave a 7% speed up in search. it's not worth it
--
Srdja
- 
				tcusr
 - Posts: 325
 - Joined: Tue Aug 31, 2021 10:32 pm
 - Full name: tcusr
 
Re: Comparison of all known Sliding lookup algorithms
yes, to be competitive -mavx2 is required.smatovic wrote: ↑Sun Jan 16, 2022 6:36 pmThat is interesting, I assume you run on 256b AVX2, I also had the Intel Xeon Phi with 2x512b vector-units per x86 core in mind during design, to handle 16 bitboards in one run, some current Intel Xeons seem to have 2x512b units, if this descends to the desktop market it might be worth to take a look again...tcusr wrote: ↑Thu Jan 06, 2022 5:18 pm kogge stone vectorized seems promising, see https://zeta-chess.app26.de/post/64-bit ... tor-based/
i implemented kogge stone with GCC's builtin vectors in my CPU engine and while they were much faster than normal kogge stone it only gave a 7% speed up in search. it's not worth it
--
Srdja
btw i have no fixed sized vector in my code, i templated the directions so that the functions will use a vector big enough to contain the number of directions, the only downside is that they have to have the same sign because of shifting, which i have not found a solution yet.
- 
				dangi12012
 - Posts: 1062
 - Joined: Tue Apr 28, 2020 10:03 pm
 - Full name: Daniel Infuehr
 
Re: Comparison of all known Sliding lookup algorithms
UPDATE: Zero Table Algorithms added
I saw that many of the 2kb algorithms all take the same ray as inputs.
http://www.talkchess.com/forum3/viewtop ... =7&t=79140
These 64x4 rays can be calculated at runtime to take the tablesize of some algorithms down to ZERO! I called this versions INLINE. They will be slower on the CPU - but for the CUDA thread its important to have them here for comparison:
New 0 byte lookup versions:
Obstruction-difference Inline
SlideArithmetic Inline
HyperbolaQuiesc. Inline
Source: https://github.com/Gigantua/Chess_Movegen
The 0kb versions are garuanteed to never ever have any issues with cache pressure - and I am looking forward to a new cuda comparison for these.
Multithread Results:
I still need Dumb7fill the code never produces correct results for me. (Upside down and if i fix this - its gets wrong results)
If you have a repo where Dumb7fill is imeplemnted let me know!
And as always - please run the sourcecode and post results!
			
			
									
						
							I saw that many of the 2kb algorithms all take the same ray as inputs.
http://www.talkchess.com/forum3/viewtop ... =7&t=79140
These 64x4 rays can be calculated at runtime to take the tablesize of some algorithms down to ZERO! I called this versions INLINE. They will be slower on the CPU - but for the CUDA thread its important to have them here for comparison:
New 0 byte lookup versions:
Obstruction-difference Inline
SlideArithmetic Inline
HyperbolaQuiesc. Inline
Source: https://github.com/Gigantua/Chess_Movegen
The 0kb versions are garuanteed to never ever have any issues with cache pressure - and I am looking forward to a new cuda comparison for these.
Code: Select all
AMD Ryzen 9 5950X 16-Core Processor
NAME            Million LU/s    Tablesize Instructions
Exploading:     136.67MOps      6 kB    Optimal perf: imul64
Reference:      127.53MOps      0 kB    Optimal perf: none
KoggeStone:     177.12MOps      0 kB    Optimal perf: none
RotatedBoard:   109.57MOps      14 kB   Optimal perf: none
QBB Algo:       207.42MOps      0 kB    Optimal perf: countr_zero, countl_zero
BobMike:        318.09MOps      8 kB    Optimal perf: countr_zero, countl_zero
Obstr. Diff:    373.76MOps      6 kB    Optimal perf: countl_zero
Obstr. Inline:  196.72MOps      0 kB    Optimal perf: countl_zero
SlideArithm:    363.12MOps      2 kB    Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline:  217.14MOps      0 kB    Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc:  523.18MOps      2 kB    Optimal perf: bswap
Hyperb.Inline:  264.36MOps      0 kB    Optimal perf: bswap
SISSY BB:       228.16MOps      1409 kB Optimal perf: none
Hash Variable:  598.54MOps      729 kB  Optimal perf: imul64
Hash Plain:     641.27MOps      2306 kB Optimal perf: imul64
Hash Fancy:     792.53MOps      694 kB  Optimal perf: imul64
Pext  :         911.42MOps      843 kB  Optimal perf: pext_u64
HyperCube:      263.97MOps      841 kB  Optimal perf: noneMultithread Results:
Code: Select all
Pext  :         10265.80MOps    30Threads
Hash Fancy:     8878.84MOps     32Threads
Hash Variable:  7768.64MOps     31Threads
Hyperbola Qsc:  6700.89MOps     31Threads
Hash Plain:     6089.48MOps     30Threads
Obstr. Diff:    4973.52MOps     32Threads
SlideArithm:    4834.04MOps     32Threads
HyperCube:      4249.17MOps     31Threads
BobMike:        4193.42MOps     32Threads
Hyperb.Inline:  3296.19MOps     32Threads
SISSY BB:       3022.02MOps     31Threads
QBB Algo:       3001.00MOps     32Threads
SlideA Inline:  2850.78MOps     31Threads
Obstr. Inline:  2670.50MOps     32Threads
Exploading:     2300.39MOps     32Threads
KoggeStone:     2213.04MOps     32Threads
RotatedBoard:   2164.86MOps     31Threads
Reference:      1827.13MOps     31ThreadsI still need Dumb7fill the code never produces correct results for me. (Upside down and if i fix this - its gets wrong results)
If you have a repo where Dumb7fill is imeplemnted let me know!
And as always - please run the sourcecode and post results!
Worlds-fastest-Bitboard-Chess-Movegenerator 
Daniel Inführ - Software Developer
			
						Daniel Inführ - Software Developer
- 
				Desperado
														 - Posts: 879
 - Joined: Mon Dec 15, 2008 11:45 am
 
Re: Comparison of all known Sliding lookup algorithms
Hello Daniel.
What would be really interesting is to know where the algorithms are described or released before translation, just with a link:
Obstr. Diff 373.76MOps 6 kB Optimal perf: countl_zero
There are some names i never heard of before.
			
			
									
						
										
						What would be really interesting is to know where the algorithms are described or released before translation, just with a link:
Obstr. Diff 373.76MOps 6 kB Optimal perf: countl_zero
There are some names i never heard of before.
- 
				Desperado
														 - Posts: 879
 - Joined: Mon Dec 15, 2008 11:45 am
 
Re: Comparison of all known Sliding lookup algorithms
Flawless comile with default project settings under vs 2022 (ISO-Standard C++20 (/std:c++20)). Compile time some seconds.
And another run ...
Poor Pext  
			
			
									
						
										
						Code: Select all
AMD Ryzen Threadripper 2950X 16-Core Processor
Megalooks Random Positions/s:
Exploading:     62.95MOps       6 kB    Optimal perf: imul64
Reference:      36.60MOps       0 kB    Optimal perf: none
KoggeStone:     76.12MOps       0 kB    Optimal perf: none
RotatedBoard:   99.91MOps       14 kB   Optimal perf: none
QBB Algo:       103.67MOps      0 kB    Optimal perf: countr_zero, countl_zero
BobMike:        109.88MOps      8 kB    Optimal perf: countr_zero, countl_zero
Obstr. Diff:    142.53MOps      6 kB    Optimal perf: countl_zero
Obstr. Inline:  78.90MOps       0 kB    Optimal perf: countl_zero
SlideArithm:    127.36MOps      2 kB    Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline:  99.27MOps       0 kB    Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc:  202.67MOps      2 kB    Optimal perf: bswap
Hyperb.Inline:  100.22MOps      0 kB    Optimal perf: bswap
SISSY BB:       173.10MOps      1409 kB Optimal perf: none
Hash Variable:  274.18MOps      729 kB  Optimal perf: imul64
Hash Plain:     389.73MOps      2306 kB Optimal perf: imul64
Hash Fancy:     381.59MOps      694 kB  Optimal perf: imul64
Pext  :         39.51MOps       843 kB  Optimal perf: pext_u64
HyperCube:      201.56MOps      841 kB  Optimal perf: none
Code: Select all
Hash Plain:     3016.46MOps     14Threads
Hash Fancy:     2990.45MOps     28Threads
Hash Variable:  2398.27MOps     30Threads
HyperCube:      2094.42MOps     31Threads
Hyperbola Qsc:  1847.61MOps     29Threads
Obstr. Diff:    1587.66MOps     32Threads
SlideArithm:    1344.93MOps     30Threads
RotatedBoard:   1313.73MOps     30Threads
BobMike:        1228.22MOps     31Threads
Hyperb.Inline:  1094.03MOps     30Threads
QBB Algo:       1077.47MOps     15Threads
SlideA Inline:  1068.18MOps     31Threads
Exploading:     961.22MOps      32Threads
Obstr. Inline:  870.57MOps      31Threads
KoggeStone:     850.21MOps      31Threads
SISSY BB:       841.54MOps      30Threads
Pext  :         635.15MOps      31Threads
Reference:      611.48MOps      31Threads
Code: Select all
AMD Ryzen Threadripper 2950X 16-Core Processor
Megalooks Random Positions/s:
Exploading:     63.08MOps       6 kB    Optimal perf: imul64
Reference:      37.41MOps       0 kB    Optimal perf: none
KoggeStone:     77.15MOps       0 kB    Optimal perf: none
RotatedBoard:   99.90MOps       14 kB   Optimal perf: none
QBB Algo:       102.78MOps      0 kB    Optimal perf: countr_zero, countl_zero
BobMike:        110.04MOps      8 kB    Optimal perf: countr_zero, countl_zero
Obstr. Diff:    143.53MOps      6 kB    Optimal perf: countl_zero
Obstr. Inline:  78.11MOps       0 kB    Optimal perf: countl_zero
SlideArithm:    126.56MOps      2 kB    Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline:  100.76MOps      0 kB    Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc:  205.29MOps      2 kB    Optimal perf: bswap
Hyperb.Inline:  100.29MOps      0 kB    Optimal perf: bswap
SISSY BB:       177.88MOps      1409 kB Optimal perf: none
Hash Variable:  258.25MOps      729 kB  Optimal perf: imul64
Hash Plain:     398.92MOps      2306 kB Optimal perf: imul64
Hash Fancy:     377.25MOps      694 kB  Optimal perf: imul64
Pext  :         39.66MOps       843 kB  Optimal perf: pext_u64
HyperCube:      200.75MOps      841 kB  Optimal perf: none
Code: Select all
Summary:
Hash Plain:     3218.11MOps     15Threads
Hash Fancy:     2929.85MOps     27Threads
Hash Variable:  2471.23MOps     32Threads
HyperCube:      1994.40MOps     29Threads
Hyperbola Qsc:  1922.29MOps     32Threads
Obstr. Diff:    1653.29MOps     32Threads
SlideArithm:    1386.83MOps     31Threads
RotatedBoard:   1356.66MOps     32Threads
BobMike:        1242.58MOps     31Threads
QBB Algo:       1124.42MOps     16Threads
Hyperb.Inline:  1100.71MOps     32Threads
SlideA Inline:  1070.43MOps     31Threads
Exploading:     944.86MOps      31Threads
Obstr. Inline:  874.12MOps      30Threads
SISSY BB:       873.95MOps      15Threads
KoggeStone:     855.34MOps      31Threads
Pext  :         638.44MOps      31Threads
Reference:      608.46MOps      30Threads
- 
				lithander
														 - Posts: 915
 - Joined: Sun Dec 27, 2020 2:40 am
 - Location: Bremen, Germany
 - Full name: Thomas Jahn
 
Re: Comparison of all known Sliding lookup algorithms
The friendly reception of my code snippet in your other thread made me curious how my sliding piece algorithm would do in comparison with what you have collected here. My additions are called Leorik and Leorik Lookup in the list below:
I was happy to find out Leorik tops the list of the zero-table algo's! If you want to include one or both versions on your github let me know!
			
			
									
						
										
						Code: Select all
Leorik:         164.23MOps      0 kB    Optimal perf: countl_zero
Leorik Lookup:  185.21MOps      1 kB    Optimal perf: countl_zero
Exploading:     110.41MOps      6 kB    Optimal perf: imul64
Reference:      70.35MOps       0 kB    Optimal perf: none
KoggeStone:     93.66MOps       0 kB    Optimal perf: none
RotatedBoard:   137.44MOps      14 kB   Optimal perf: none
QBB Algo:       151.66MOps      0 kB    Optimal perf: countr_zero, countl_zero
BobMike:        169.74MOps      8 kB    Optimal perf: countr_zero, countl_zero
Obstr. Diff:    188.54MOps      6 kB    Optimal perf: countl_zero
Obstr. Inline:  120.78MOps      0 kB    Optimal perf: countl_zero
SlideArithm:    205.38MOps      2 kB    Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline:  141.47MOps      0 kB    Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc:  238.90MOps      2 kB    Optimal perf: bswap
Hyperb.Inline:  125.25MOps      0 kB    Optimal perf: bswap
SISSY BB:       228.70MOps      1409 kB Optimal perf: none
Hash Variable:  325.30MOps      729 kB  Optimal perf: imul64
Hash Plain:     452.61MOps      2306 kB Optimal perf: imul64
Hash Fancy:     530.11MOps      694 kB  Optimal perf: imul64
Pext  :         42.32MOps       843 kB  Optimal perf: pext_u64
HyperCube:      223.86MOps      841 kB  Optimal perf: noneCode: Select all
Summary:
Hash Fancy:     2846.70MOps     12Threads
Hash Plain:     2293.42MOps     12Threads
Hash Variable:  1843.59MOps     12Threads
HyperCube:      1553.38MOps     12Threads
Hyperbola Qsc:  1379.56MOps     12Threads
SlideArithm:    1206.84MOps     12Threads
Obstr. Diff:    1142.92MOps     11Threads
BobMike:        1078.42MOps     12Threads
Leorik Lookup:  1074.17MOps     12Threads
Leorik:         967.64MOps      12Threads
QBB Algo:       887.62MOps      12Threads
SlideA Inline:  854.31MOps      12Threads
SISSY BB:       784.96MOps      20Threads
Obstr. Inline:  759.87MOps      18Threads
Hyperb.Inline:  754.54MOps      12Threads
RotatedBoard:   676.31MOps      5Threads
Exploading:     632.57MOps      18Threads
KoggeStone:     550.17MOps      12Threads
Pext  :         366.33MOps      12Threads
Reference:      350.86MOps      5Threads- 
				dangi12012
 - Posts: 1062
 - Joined: Tue Apr 28, 2020 10:03 pm
 - Full name: Daniel Infuehr
 
Re: Comparison of all known Sliding lookup algorithms
Update: 
lithander was nice and sent me some sourcecode. We have to discuss if these algorithms are the same:
https://github.com/Gigantua/Chess_Moveg ... 9c999b84e1
https://github.com/Gigantua/Chess_Moveg ... Arithm.hpp
I am still missing these:
Dumb7Fill (someone please provide me a working solution - I never found a implementation that works)
Kindergarten Boards (https://www.chessprogramming.org/Kindergarten_Bitboards)
SBAMG (https://www.chessprogramming.org/SBAMG)
Bitfoot is in the wiki but is identical to Obstruction Difference: https://www.chessprogramming.org/Bitfoot#ABBitboards - so IMO not really its own algo
I will also write an AVX2 version of either Obstruction Difference, Slider Arithm, or SBAMG. Modern compilers do that by themselves normally so I dont expect a difference.
Todos:
Add above algos,
Print reference URL
Print algo Author
Table formatting and algo naming.
			
			
									
						
							lithander was nice and sent me some sourcecode. We have to discuss if these algorithms are the same:
https://github.com/Gigantua/Chess_Moveg ... 9c999b84e1
https://github.com/Gigantua/Chess_Moveg ... Arithm.hpp
I am still missing these:
Dumb7Fill (someone please provide me a working solution - I never found a implementation that works)
Kindergarten Boards (https://www.chessprogramming.org/Kindergarten_Bitboards)
SBAMG (https://www.chessprogramming.org/SBAMG)
Bitfoot is in the wiki but is identical to Obstruction Difference: https://www.chessprogramming.org/Bitfoot#ABBitboards - so IMO not really its own algo
I will also write an AVX2 version of either Obstruction Difference, Slider Arithm, or SBAMG. Modern compilers do that by themselves normally so I dont expect a difference.
Todos:
Add above algos,
Print reference URL
Print algo Author
Table formatting and algo naming.
Worlds-fastest-Bitboard-Chess-Movegenerator 
Daniel Inführ - Software Developer
			
						Daniel Inführ - Software Developer
- 
				hgm
														 - Posts: 28402
 - Joined: Fri Mar 10, 2006 10:06 am
 - Location: Amsterdam
 - Full name: H G Muller
 
Re: Comparison of all known Sliding lookup algorithms
Still no incrementally updated slider tables orth[64] and diag[64]?
			
			
									
						
										
						- 
				lithander
														 - Posts: 915
 - Joined: Sun Dec 27, 2020 2:40 am
 - Location: Bremen, Germany
 - Full name: Thomas Jahn
 
Re: Comparison of all known Sliding lookup algorithms
I don't see how that's going to work in the testing framework Daniel has set up. If an algorithm can't be used in such a way where it accepts a square and an occupation-bitboard as parameters and returns the correct queen targets as a bitboard he's basically out.