Comparison of all known Sliding lookup algorithms [CUDA]

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Huge news! - New fastest Bitboard CUDA Algo

It's quite compact and pleasing to the eye:

Code: Select all

namespace Bitrotation {
#define BitFunction __inline__ __device__ uint64_t

	template<uint64_t bb>
	BitFunction mask_shift(int ranks) {
		return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
	}

	BitFunction dir_HO(int square) { return 0xFFull << (square & 56); }
	BitFunction dir_VE(int square) { return 0x0101010101010101ull << (square & 7); }
	BitFunction dir_D1(int square) { return mask_shift<0x8040201008040201ull>((square & 7) - (square >> 3)); }
	BitFunction dir_D2(int square) { return mask_shift<0x0102040810204080ull>(7 - (square & 7) - (square >> 3)); }
	BitFunction bit_reverse(uint64_t x) { return __brevll(x); }

	/* Generate attack using the hyperbola quintessence approach */
	BitFunction attack(uint64_t pieces, uint32_t x, uint64_t mask) {
		uint64_t o = pieces & mask;

		return ((o - (1ull << x)) ^ bit_reverse(bit_reverse(o) - (1ull << (x ^ 63)))) & mask;
	}

	BitFunction Queen(int s, uint64_t occ) {
		return attack(occ, s, dir_HO(s))
		     ^ attack(occ, s, dir_VE(s))
		     ^ attack(occ, s, dir_D1(s))
		     ^ attack(occ, s, dir_D2(s));
	}
#undef BitFunction
}
If you are reading this: as of 06.03.2022, this is the fastest slider lookup algorithm ever created!
No x64/x86 algorithm ever came close, and no known ASIC or FPGA implementation has scratched this performance.

We are scratching at the 100 billion lookups/second mark here - which is insane, since the best CPU algorithm currently manages around 10 gigalookups/s on 16 cores.
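For readers without a CUDA toolchain, the kernel above can be exercised on the host. The following is a sketch in portable C++, with a SWAR bit reversal standing in for `__brevll` and the template parameter of `mask_shift` turned into a plain argument; note that the occupancy passed to `Queen` must not include the slider's own square:

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for CUDA's __brevll: reverse all 64 bits via SWAR swaps.
uint64_t bit_reverse(uint64_t x) {
    x = ((x >> 1)  & 0x5555555555555555ull) | ((x & 0x5555555555555555ull) << 1);
    x = ((x >> 2)  & 0x3333333333333333ull) | ((x & 0x3333333333333333ull) << 2);
    x = ((x >> 4)  & 0x0F0F0F0F0F0F0F0Full) | ((x & 0x0F0F0F0F0F0F0F0Full) << 4);
    x = ((x >> 8)  & 0x00FF00FF00FF00FFull) | ((x & 0x00FF00FF00FF00FFull) << 8);
    x = ((x >> 16) & 0x0000FFFF0000FFFFull) | ((x & 0x0000FFFF0000FFFFull) << 16);
    return (x >> 32) | (x << 32);
}

// Shift a line mask up or down by whole ranks.
uint64_t mask_shift(uint64_t bb, int ranks) {
    return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
}

uint64_t dir_HO(int sq) { return 0xFFull << (sq & 56); }               // rank through sq
uint64_t dir_VE(int sq) { return 0x0101010101010101ull << (sq & 7); }  // file through sq
uint64_t dir_D1(int sq) { return mask_shift(0x8040201008040201ull, (sq & 7) - (sq >> 3)); }
uint64_t dir_D2(int sq) { return mask_shift(0x0102040810204080ull, 7 - (sq & 7) - (sq >> 3)); }

// Hyperbola-quintessence line attack; `pieces` must not contain the slider itself.
uint64_t attack(uint64_t pieces, uint32_t x, uint64_t mask) {
    uint64_t o = pieces & mask;
    return ((o - (1ull << x)) ^ bit_reverse(bit_reverse(o) - (1ull << (x ^ 63)))) & mask;
}

uint64_t Queen(int s, uint64_t occ) {
    return attack(occ, s, dir_HO(s)) ^ attack(occ, s, dir_VE(s))
         ^ attack(occ, s, dir_D1(s)) ^ attack(occ, s, dir_D2(s));
}
```

A queen on a1 with an otherwise empty board yields the familiar attack set 0x81412111090503FE, and a blocker on d1 truncates the rank attacks at d1 inclusive.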

Code: Select all

Bitrotation o^(o-2r)     :      91.89 GigaQueens/s
Black Magic - Fixed shift:      7.41 GigaQueens/s
QBB Algo                 :      59.07 GigaQueens/s
Bob Lookup               :      1.63 GigaQueens/s
Kogge Stone              :      40.20 GigaQueens/s
Hyperbola Quiescence     :      17.59 GigaQueens/s
Switch Lookup            :      4.22 GigaQueens/s
Slide Arithm             :      18.39 GigaQueens/s
Pext Lookup              :      16.74 GigaQueens/s
SISSY Lookup             :      8.03 GigaQueens/s
Hypercube Alg            :      1.28 GigaQueens/s
Dumb 7 Fill              :      25.01 GigaQueens/s
Obstruction Difference   :      59.78 GigaQueens/s
Leorik                   :      55.59 GigaQueens/s
SBAMG o^(o-3cbn)         :      58.15 GigaQueens/s
NO HEADACHE              :      27.53 GigaQueens/s
AVX Branchless Shift     :      27.21 GigaQueens/s
Slide Arithmetic Inline  :      59.86 GigaQueens/s
If you see a way to improve the above code - don't hesitate to write.
Greetings - Daniel

Special thanks to:
tcusr wrote: Sun Mar 06, 2022 10:35 am
smatovic wrote: Sun Mar 06, 2022 10:12 am
Joost Buijs wrote: Sun Mar 06, 2022 7:53 am
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Joost Buijs
Posts: 1566
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Joost Buijs »

On my RTX 3090 it surpasses 100 billion lookups/sec.

Code: Select all

Bitrotation o^(o-2r)     :      100.25 GigaQueens/s
Black Magic - Fixed shift:      8.76 GigaQueens/s
QBB Algo                 :      71.57 GigaQueens/s
Bob Lookup               :      1.96 GigaQueens/s
Kogge Stone              :      43.86 GigaQueens/s
Hyperbola Quiescence     :      20.41 GigaQueens/s
Switch Lookup            :      6.67 GigaQueens/s
Slide Arithm             :      21.15 GigaQueens/s
Pext Lookup              :      19.50 GigaQueens/s
SISSY Lookup             :      9.48 GigaQueens/s
Hypercube Alg            :      1.64 GigaQueens/s
Dumb 7 Fill              :      29.94 GigaQueens/s
Obstruction Difference   :      66.57 GigaQueens/s
Leorik                   :      61.62 GigaQueens/s
SBAMG o^(o-3cbn)         :      66.94 GigaQueens/s
NO HEADACHE              :      33.04 GigaQueens/s
AVX Branchless Shift     :      32.21 GigaQueens/s
Slide Arithmetic Inline  :      65.00 GigaQueens/s
tcusr
Posts: 323
Joined: Tue Aug 31, 2021 10:32 pm
Full name: tcusr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by tcusr »

How expensive is a branch inside a GPU? Why don't you try Kogge-Stone or Dumb7Fill just for D1/D2?

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

tcusr wrote: Mon Mar 07, 2022 9:40 pm How expensive is a branch inside a GPU? Why don't you try Kogge-Stone or Dumb7Fill just for D1/D2?
Kogge-Stone and Dumb7Fill are already in the comparison and perform quite well. But I always had the bit-reverse intrinsic in the back of my head.
It turns out that was the correct approach - 100 billion lookups/s achieved.

So what is the surprise with CUDA?
Lookups are expensive - even a constexpr inline array lookup (which is inlined, and not the same as __constant__ memory) is ~8x slower than 10-12 instructions!
So I can confirm here 100% that calculating the knight/king moves from scratch is cheaper than looking up a single element in a 64-slot array. Mailbox algorithms are DOA on the current GPU architecture.

To answer your question: most branches get compiled away into branchless code, so branching is very cheap. What is not cheap is thread divergence, but that's another topic. Also, I am not yet using the advanced CUDA scheduling features like graphs or streams.
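As an illustration of "calculating from scratch beats a tiny lookup table": the textbook branchless knight-attack computation (a standard bitboard idiom, not necessarily the exact code used in this benchmark) is just 10-12 shift/mask/or instructions and touches no memory at all:

```cpp
#include <cassert>
#include <cstdint>

// Branchless knight attacks computed from scratch: shift the knight bit in all
// eight directions, masking off wraps across the A/B and G/H files.
uint64_t knight_attacks(int sq) {
    uint64_t b  = 1ull << sq;
    uint64_t l1 = (b >> 1) & 0x7F7F7F7F7F7F7F7Full;  // one file left,  no H-file wrap
    uint64_t l2 = (b >> 2) & 0x3F3F3F3F3F3F3F3Full;  // two files left
    uint64_t r1 = (b << 1) & 0xFEFEFEFEFEFEFEFEull;  // one file right, no A-file wrap
    uint64_t r2 = (b << 2) & 0xFCFCFCFCFCFCFCFCull;  // two files right
    uint64_t h1 = l1 | r1;                           // +-1 file candidates
    uint64_t h2 = l2 | r2;                           // +-2 file candidates
    return (h1 << 16) | (h1 >> 16) | (h2 << 8) | (h2 >> 8);  // +-2 / +-1 ranks
}
```

A knight on a1 attacks exactly b3 and c2 (0x20400), and an interior square like d4 gives all eight targets.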

What is still to do:
I want to see if my AST generator leads to a new algorithm - http://www.talkchess.com/forum3/viewtop ... 80#p922256
After that I will take a hard look at CUDA cores - not the new popxor neural network first, but whether I can solve sliding pieces with matrix multiplication directly. If you can take the board, multiply it by some matrix, and get the correct result (or a unique offset) directly, that should be fast as well.

I would not mind having all chess code eaten by linear algebra.
Dann Corbit
Posts: 12545
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Dann Corbit »

My 2080 super is looking tired:

Code: Select all

NVIDIA GeForce RTX 2080 SUPER
Bitrotation o^(o-2r)     :      55.99 GigaQueens/s
Black Magic - Fixed shift:      5.41 GigaQueens/s
QBB Algo                 :      42.39 GigaQueens/s
Bob Lookup               :      1.39 GigaQueens/s
Kogge Stone              :      28.52 GigaQueens/s
Hyperbola Quiescence     :      12.61 GigaQueens/s
Switch Lookup            :      0.53 GigaQueens/s
Slide Arithm             :      13.27 GigaQueens/s
Pext Lookup              :      12.00 GigaQueens/s
SISSY Lookup             :      5.66 GigaQueens/s
Hypercube Alg            :      0.90 GigaQueens/s
Dumb 7 Fill              :      18.86 GigaQueens/s
Obstruction Difference   :      46.23 GigaQueens/s
Leorik                   :      42.92 GigaQueens/s
SBAMG o^(o-3cbn)         :      45.85 GigaQueens/s
NO HEADACHE              :      20.46 GigaQueens/s
AVX Branchless Shift     :      20.69 GigaQueens/s
Slide Arithmetic Inline  :      45.06 GigaQueens/s
Bitrotation o^(o-2r)     :      63.09 GigaQueens/s
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Dann Corbit wrote: Fri Mar 11, 2022 1:30 am My 2080 super is looking tired:

Code: Select all

NVIDIA GeForce RTX 2080 SUPER
Bitrotation o^(o-2r)     :      55.99 GigaQueens/s
Black Magic - Fixed shift:      5.41 GigaQueens/s
QBB Algo                 :      42.39 GigaQueens/s
Bob Lookup               :      1.39 GigaQueens/s
Kogge Stone              :      28.52 GigaQueens/s
Hyperbola Quiescence     :      12.61 GigaQueens/s
Switch Lookup            :      0.53 GigaQueens/s
Slide Arithm             :      13.27 GigaQueens/s
Pext Lookup              :      12.00 GigaQueens/s
SISSY Lookup             :      5.66 GigaQueens/s
Hypercube Alg            :      0.90 GigaQueens/s
Dumb 7 Fill              :      18.86 GigaQueens/s
Obstruction Difference   :      46.23 GigaQueens/s
Leorik                   :      42.92 GigaQueens/s
SBAMG o^(o-3cbn)         :      45.85 GigaQueens/s
NO HEADACHE              :      20.46 GigaQueens/s
AVX Branchless Shift     :      20.69 GigaQueens/s
Slide Arithmetic Inline  :      45.06 GigaQueens/s
Bitrotation o^(o-2r)     :      63.09 GigaQueens/s
It's the expected performance - and a great example of why benchmarking is hard. I deliberately made Bitrotation both the first and the last algorithm, so everyone can see the impact of missing warmup code: 55.99 vs. 63.09 GigaQueens/s.
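The warmup effect is not CUDA-specific; any harness should discard an untimed first run. A minimal host-side sketch of the pattern (the helper name `best_seconds` is hypothetical, not taken from the benchmark code):

```cpp
#include <chrono>
#include <cstdint>

// Minimal benchmark helper: one discarded warmup call, then the best of N timed
// runs. The warmup absorbs one-time costs (cold caches, clock ramp-up, and on
// GPUs: context creation and JIT compilation) that would otherwise be billed
// to the first measurement - exactly the 55.99 vs. 63.09 effect above.
template <class F>
double best_seconds(F&& workload, int runs = 5) {
    workload();  // warmup: result and time discarded
    double best = 1e300;
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        workload();
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        if (dt.count() < best) best = dt.count();
    }
    return best;
}
```

Reporting the best of several runs (rather than the first, or the mean) also filters out scheduler noise on a busy machine.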

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

But I would not call that "looking tired" at all.
You probably would not get close to this performance even with a 64-core Ryzen system. The cool thing about "inline calculation" in CUDA is that there is no memory pressure - so this will scale and scale with future architectures and multiple GPUs.

I calculated it today: the 3090 reaches 100 billion lookups/s. I am writing the results to memory to verify them and to force the compiler to emit the full code.
At 100 billion lookups/s we need 800 Gbit/s of memory bandwidth, which is roughly a ninth of the theoretical 936.2 GByte/s. An approach like PEXT or fancy magics would additionally need 2-3x that amount of memory read before the result is written.

In any case, calculation is currently slow enough not to be memory bound, which leaves room for algorithmic improvement. But the point I want to make is that this benchmark writes every result to memory, while in an actual move generator or engine the attack set would only be an intermediate result that gets consumed elsewhere. Embedded in another algorithm, this sliding-piece code is faster than the numbers posted.

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Joost Buijs »

dangi12012 wrote: Fri Mar 11, 2022 1:35 pm It's the expected performance - and a great example of why benchmarking is hard. I deliberately made Bitrotation both the first and the last algorithm, so everyone can see the impact of missing warmup code: 55.99 vs. 63.09 GigaQueens/s.
I already wondered why you ran Bitrotation twice, so I left the second run out of my results. You seem to use CUDA 11.4; I used CUDA 11.6, but this doesn't make much difference.

What strikes me is that 'Switch Lookup' seems to be 10 times slower on RTX 2000-series devices than on RTX 3000-series devices; I wonder why that is.

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Minor update:
Before continuing this work, I will reimplement all of these shaders in GLSL for Vulkan.

That way it runs anywhere, and I just saw that the most important intrinsic is already there: bitfieldReverse
https://www.khronos.org/registry/OpenGL-Refpages/gl4/

The link above is GLSL, which compiles into SPIR-V, the equivalent of PTX.

Vulkan offers a better memory model than CUDA and runs on Android, PlayStation, Xbox, iOS and macOS (Apple M1!), and almost everywhere else.
It will be interesting to see whether the RDNA iGPU of a Ryzen performs faster than the 8 CPU cores of the same system. (Probably yes.)

If a real unified hardware memory model is available, like on the M1 or on consoles, I am certain that hybrid chess engines can easily offload some work to the iGPU, because there the host-device memory latency is nonexistent (still not single function calls, but at least 1M invocations of bulk work to be done in parallel).

All in all, I can see that a shift to Vulkan will increase the boilerplate code to 1000x the current state - but there are ways to hide that code in a common header, and it is the right thing to do for computer chess.

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

One interesting point is whether Vulkan on NVIDIA could be faster than a CUDA implementation.
My gut feeling is that there has to be a backend in the NVIDIA driver that translates any GPGPU language into the native hardware language of the architecture - but there will be implementation-specific differences.
SPIR-V and PTX are too different to produce the same core assembly and the same performance.

Not to mention that GLSL opens the door to exotic hardware that is not a GPU to begin with.
This will be interesting to see - but Vulkan is the better road forward in any case.