Comparison of all known Sliding lookup algorithms [CUDA]

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Joost Buijs
Posts: 1568
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Joost Buijs »

I ran the provided executable on both my RTX-2060 Super and my RTX-3090 (both default clock).

Code: Select all

NVIDIA GeForce RTX 2060 SUPER
FancyHash:      3.85 GigaQueens/s
QBB Algo:       31.06 GigaQueens/s
Bob Lookup:     0.45 GigaQueens/s
Kogge Stone:    20.39 GigaQueens/s
Hyperbola Qsc:  8.85 GigaQueens/s
Switch Lookup:  0.28 GigaQueens/s
Slide Arithm:   9.49 GigaQueens/s
Pext Lookup:    8.53 GigaQueens/s
SISSY Lookup:   4.04 GigaQueens/s
Hypercube Alg:  0.64 GigaQueens/s

Code: Select all

NVIDIA GeForce RTX 3090
FancyHash:      9.00 GigaQueens/s
QBB Algo:       75.24 GigaQueens/s
Bob Lookup:     0.94 GigaQueens/s
Kogge Stone:    44.66 GigaQueens/s
Hyperbola Qsc:  20.43 GigaQueens/s
Switch Lookup:  6.74 GigaQueens/s
Slide Arithm:   21.58 GigaQueens/s
Pext Lookup:    19.71 GigaQueens/s
SISSY Lookup:   9.64 GigaQueens/s
Hypercube Alg:  1.62 GigaQueens/s
The RTX-2060 Super seems to be on par with the RTX-2070, while the RTX-3090 outperforms the RTX-3080 by a small margin.
smatovic
Posts: 2724
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by smatovic »

smatovic wrote: Sun Mar 06, 2022 7:28 am
dangi12012 wrote: Sat Mar 05, 2022 8:17 pm [...]
So the top performance is around 60 Billion Lookups/s...
[...]
NVIDIA GeForce RTX 3080
[...]
May I ask how many threads you run concurrently on the device, multiple waves of Warps/SIMT? Lookups/thread? The RTX 3080 has 8704 Cuda-cores according to Wikipedia. What does the profiler say about the GPU utilization?

--
Srdja
Ah, okay, just looked into kernel.cu, you run a loop of 256 iterations with 4096x256 threads.

--
Srdja
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Daniel Shawul »

smatovic wrote: Sun Jan 16, 2022 5:59 am Perft on gpu is one thing, a chess playing engine another.
That is the main problem with having the search on the GPU. I don't know how you did Zeta, but you probably were not using
one CUDA thread for a standalone search (probably a warp or maybe even a block?).
My attempt at implementing MCTS on the GPU for Hex using one-cuda-thread-per-standalone-search was successful, since there is not
a lot of warp divergence due to the nature of the game. However, for chess there is not even space for storing the generated move list
if you use this approach. I don't recall the details much, but I believe I used some sort of bitfields for that purpose, though I am not sure there isn't register spilling even after that.

Here is my attempt at MCTS search on the GPU for chess some 10 years ago.
https://github.com/dshawul/GpuHex/blob/chess/hex.cu

The best approach is what AlphaGo demonstrated: do the search on the CPU and use the GPU for the heavy lifting of the evaluation, especially now that NNs are the standard evaluation. Batching multiple evaluations avoids the data-transfer latency that is the main bottleneck.

Hence, there is no point in having a fast GPU move generator unless you have the right search algorithm to go with it. I just don't see how you can avoid warp divergence even with something GPU-friendly like MCTS search in chess (Hex was pretty good though). Your approach of using the warp/block for a standalone search is probably the more feasible one in this regard, especially if there are enough vector operations that can engage a warp most of the time.
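The batching idea above can be sketched in plain C++ (the `Position` type, `evaluate_batch`, and `BatchedEvaluator` names are hypothetical illustrations; on a real GPU, `evaluate_batch` would be one host-to-device transfer plus one kernel launch):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical leaf position; a real engine would carry bitboards etc.
struct Position { uint64_t hash; };

// Stand-in for one device round-trip: the whole batch is evaluated in one
// call, so the transfer latency is paid once per batch instead of per leaf.
std::vector<int> evaluate_batch(const std::vector<Position>& batch) {
    std::vector<int> scores;
    scores.reserve(batch.size());
    for (const Position& p : batch)            // would be a single kernel launch
        scores.push_back(static_cast<int>(p.hash % 100));
    return scores;
}

// The CPU-side search collects leaves until the batch is full, then flushes.
struct BatchedEvaluator {
    std::vector<Position> pending;
    std::size_t batch_size;

    explicit BatchedEvaluator(std::size_t n) : batch_size(n) { pending.reserve(n); }

    // Returns the scores when a full batch was flushed, empty otherwise.
    std::vector<int> add(const Position& p) {
        pending.push_back(p);
        if (pending.size() < batch_size) return {};
        std::vector<int> scores = evaluate_batch(pending);
        pending.clear();
        return scores;
    }
};
```

This is only a shape sketch of the AlphaGo/Lc0-style division of labor, not anyone's actual engine code.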
smatovic
Posts: 2724
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by smatovic »

Daniel Shawul wrote: Sun Mar 06, 2022 9:25 am
smatovic wrote: Sun Jan 16, 2022 5:59 am Perft on gpu is one thing, a chess playing engine another.
That is the main problem with having the search on the GPU. I don't know how you did Zeta, but you probably were not using
one CUDA thread for a standalone search (probably a warp or maybe even a block?).
My attempt at implementing MCTS on the GPU for Hex using one-cuda-thread-per-standalone-search was successful, since there is not
a lot of warp divergence due to the nature of the game. However, for chess there is not even space for storing the generated move list
if you use this approach. I don't recall the details much, but I believe I used some sort of bitfields for that purpose, though I am not sure there isn't register spilling even after that.

Here is my attempt at MCTS search on the GPU for chess some 10 years ago.
https://github.com/dshawul/GpuHex/blob/chess/hex.cu

The best approach is what AlphaGo demonstrated: do the search on the CPU and use the GPU for the heavy lifting of the evaluation, especially now that NNs are the standard evaluation. Batching multiple evaluations avoids the data-transfer latency that is the main bottleneck.

Hence, there is no point in having a fast GPU move generator unless you have the right search algorithm to go with it. I just don't see how you can avoid warp divergence even with something GPU-friendly like MCTS search in chess (Hex was pretty good though). Your approach of using the warp/block for a standalone search is probably the more feasible one in this regard, especially if there are enough vector operations that can engage a warp most of the time.
I remember your GPU Hex :)

I tried a 'one thread one board' approach with Best-First-MiniMax-Search first, with thousands of threads in parallel, but the 'one SIMD-unit one board' approach with parallel AlphaBeta performed better for me. Here is the blueprint of the idea:

https://zeta-chess.app26.de/post/how-co ... n-on-gpus/

As soon as I try to sync the work shared across threads of a block/work-group I lose a lot of cycles, so it remains open whether I will be able to add NNUE eval to Zeta and compete with CPUs; my current NPS throughput per block/work-group is pretty lame, to be honest.

Generating moves on GPU can be pretty fast, as Ankan showed, but as you said, what kind of algorithm to feed it with? And Lc0 showed that the GPU as ANN accelerator approach works.

--
Srdja
tcusr
Posts: 323
Joined: Tue Aug 31, 2021 10:32 pm
Full name: tcusr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by tcusr »

Hyperbola Quintessence would work with the reverse bit intrinsic instead of byteswap.
remember we couldn't use byteswap on ranks because they wouldn't change position? now you can use bit reverse and don't need the rank attacks table anymore

Code: Select all

	__device__ uint64_t attack(uint64_t pieces, uint32_t x, uint64_t mask) {
		uint64_t o = pieces & mask;

		return ((o - (1ull << x)) ^ bit_reverse(bit_reverse(o) - bit_reverse(1ull << x))) & mask;
	}
this would work for rooks and bishops the same, you only need the tables for the masks, like the sliding arithmetic algo
(i think that instead of bit_reverse(1ull << x) you can do 1ull << (x ^ 63))
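That `1ull << (x ^ 63)` shortcut holds because reversing a single set bit moves it from index x to index 63 - x, and for 0 <= x < 64, 63 - x equals x ^ 63. A quick host-side check, using a portable bit reversal as a stand-in for CUDA's `__brevll`:

```cpp
#include <cassert>
#include <cstdint>

// Portable 64-bit bit reversal (host-side stand-in for __brevll):
// swap adjacent bits, then pairs, nibbles, bytes, half-words, and words.
uint64_t bit_reverse(uint64_t x) {
    x = ((x & 0x5555555555555555ull) << 1)  | ((x >> 1)  & 0x5555555555555555ull);
    x = ((x & 0x3333333333333333ull) << 2)  | ((x >> 2)  & 0x3333333333333333ull);
    x = ((x & 0x0F0F0F0F0F0F0F0Full) << 4)  | ((x >> 4)  & 0x0F0F0F0F0F0F0F0Full);
    x = ((x & 0x00FF00FF00FF00FFull) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFull);
    x = ((x & 0x0000FFFF0000FFFFull) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFull);
    return (x << 32) | (x >> 32);
}
```

Asserting `bit_reverse(1ull << x) == 1ull << (x ^ 63)` for every x in 0..63 confirms the identity, so the second `bit_reverse` call on the single-bit term can indeed be replaced by the XOR.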
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Daniel Shawul »

smatovic wrote: Sun Mar 06, 2022 10:12 am
Daniel Shawul wrote: Sun Mar 06, 2022 9:25 am
smatovic wrote: Sun Jan 16, 2022 5:59 am Perft on gpu is one thing, a chess playing engine another.
That is the main problem with having the search on the GPU. I don't know how you did Zeta, but you probably were not using
one CUDA thread for a standalone search (probably a warp or maybe even a block?).
My attempt at implementing MCTS on the GPU for Hex using one-cuda-thread-per-standalone-search was successful, since there is not
a lot of warp divergence due to the nature of the game. However, for chess there is not even space for storing the generated move list
if you use this approach. I don't recall the details much, but I believe I used some sort of bitfields for that purpose, though I am not sure there isn't register spilling even after that.

Here is my attempt at MCTS search on the GPU for chess some 10 years ago.
https://github.com/dshawul/GpuHex/blob/chess/hex.cu

The best approach is what AlphaGo demonstrated: do the search on the CPU and use the GPU for the heavy lifting of the evaluation, especially now that NNs are the standard evaluation. Batching multiple evaluations avoids the data-transfer latency that is the main bottleneck.

Hence, there is no point in having a fast GPU move generator unless you have the right search algorithm to go with it. I just don't see how you can avoid warp divergence even with something GPU-friendly like MCTS search in chess (Hex was pretty good though). Your approach of using the warp/block for a standalone search is probably the more feasible one in this regard, especially if there are enough vector operations that can engage a warp most of the time.
I remember your GPU Hex :)

I tried a 'one thread one board' approach with Best-First-MiniMax-Search first, with thousands of threads in parallel, but the 'one SIMD-unit one board' approach with parallel AlphaBeta performed better for me. Here is the blueprint of the idea:

https://zeta-chess.app26.de/post/how-co ... n-on-gpus/

As soon as I try to sync the work shared across threads of a block/work-group I lose a lot of cycles, so it remains open whether I will be able to add NNUE eval to Zeta and compete with CPUs; my current NPS throughput per block/work-group is pretty lame, to be honest.
Agreed! But there is always this one guy that gets excited about the theoretical max FLOPs of GPUs and is hell-bent on getting AB search or something else onto GPUs, then blames it, then maybe tries MCTS, and then disappears :) You remember that guy who was convinced that new C++ features like std::future and async etc. would get a super fast AB search on GPU? I wonder what happened to him...
Generating moves on GPU can be pretty fast, as Ankan showed, but as you said, what kind of algorithm to feed it with? And Lc0 showed that the GPU as ANN accelerator approach works.
Simpler algorithms always win! It is even possible to use AB GPU search with it via ABDADA, though with a tiny NNUE the batch size must be very large, I presume.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

tcusr wrote: Sun Mar 06, 2022 10:35 am Hyperbola Quintessence would work with the reverse bit intrinsic instead of byteswap.
remember we couldn't use byteswap on ranks because they wouldn't change position? now you can use bit reverse and don't need the rank attacks table anymore

Code: Select all

	__device__ uint64_t attack(uint64_t pieces, uint32_t x, uint64_t mask) {
		uint64_t o = pieces & mask;

		return ((o - (1ull << x)) ^ bit_reverse(bit_reverse(o) - bit_reverse(1ull << x))) & mask;
	}
this would work for rooks and bishops the same, you only need the tables for the masks, like the sliding arithmetic algo
(i think that instead of bit_reverse(1ull << x) you can do 1ull << (x ^ 63))
smatovic wrote: Sun Mar 06, 2022 10:12 am
Joost Buijs wrote: Sun Mar 06, 2022 7:53 am
Can a smart person help me (tcusr/smatovic/Joost Buijs)? I have prepared a bitrotation branch and the intrinsics in the C++ project:

Code: Select all

git clone https://github.com/Gigantua/Chess_Movegen.git --branch features/Bitrotate
It's the file Bitrotation.hpp in the root.
Since having covid I have had problems concentrating for the past few months.

I think the SBAMG code o^(o-3cbn) and the Hyperbola code o^(o-2r) both seem like good candidates.

Please help me with this if you find the time :D
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 2724
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by smatovic »

Daniel Shawul wrote: Sun Mar 06, 2022 4:43 pm [...]
Agreed! But there is always this one guy that gets excited about the theoretical max FLOPs of GPUs and is hell-bent on getting AB search or something else onto GPUs, then blames it, then maybe tries MCTS, and then disappears :) You remember that guy who was convinced that new C++ features like std::future and async etc. would get a super fast AB search on GPU? I wonder what happened to him...
[...]
Hehe, I remember, Percival, 'work-proof AlphaBeta search on GPU'; I have not read anything from him since we discussed host-device latencies in the GPU rumors 2020 thread...

--
Srdja
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

smatovic wrote: Sun Mar 06, 2022 5:24 pm
Daniel Shawul wrote: Sun Mar 06, 2022 4:43 pm [...]
Agreed! But there is always this one guy that gets excited about the theoretical max FLOPs of GPUs and is hell-bent on getting AB search or something else onto GPUs, then blames it, then maybe tries MCTS, and then disappears :) You remember that guy who was convinced that new C++ features like std::future and async etc. would get a super fast AB search on GPU? I wonder what happened to him...
[...]
Hehe, I remember, Percival, 'work-proof AlphaBeta search on GPU'; I have not read anything from him since we discussed host-device latencies in the GPU rumors 2020 thread...
Srdja
I don't want to do speculation or other stuff here in this thread. Please help me develop an algorithm with the bit-reverse intrinsics here.
I am lacking the concentration for now :(

So if you are smart enough and have some time - please implement Bitrotation.hpp

Code: Select all

git clone https://github.com/Gigantua/Chess_Movegen.git --branch features/Bitrotate
It might be competitive or it might not be. It seems to be very close to 100 Billion Slider Lookups/s on current hardware - so that seems like a nice number to breach with some help here :D
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
tcusr
Posts: 323
Joined: Tue Aug 31, 2021 10:32 pm
Full name: tcusr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by tcusr »

this is the new hyperbola algo, i took the slide arithmetic init code because it was cleaner

Code: Select all

#pragma once

//(c) Gerd Isenberg, Aleks Peshkov 2007
//Subtracting a Rook from a Blocking Piece - https://www.chessprogramming.org/Subtracting_a_Rook_from_a_Blocking_Piece

#include <stdint.h>
#include <array>
#include <type_traits>
#include "cu_Common.h"

//Cuda Translation by Daniel Inführ - Jan. 2022
//Contact: daniel.infuehr@live.de

namespace HyperbolaQsc {

	constexpr bool safe_coord(int f, int r)
	{
		return (0 <= f && f < 8) && (0 <= r && r < 8);
	}
	constexpr uint64_t init_mask(int s, int df, int dr)
	{
		uint64_t b{}; int f{}, r{};
		f = s & 7; r = s >> 3;
		while (safe_coord(f += df, r += dr))
			b |= 1ull << (f + r * 8);

		return b;
	}
	constexpr std::array<uint64_t, 256> init_array()
	{
		std::array<uint64_t, 256> a{}; int n{};
		for (int s = 0; s < 64; s++)
		{
			a[n++] = init_mask(s, 1, 0) | init_mask(s, -1, 0);
			a[n++] = init_mask(s, 0, 1) | init_mask(s, 0, -1);
			a[n++] = init_mask(s, 1, 1) | init_mask(s, -1, -1);
			a[n++] = init_mask(s, -1, 1) | init_mask(s, 1, -1);
		}
		return a;
	}

	static const std::array<uint64_t, 256> host_rank_mask = init_array();

	__constant__
		static uint64_t rank_mask[256];

	void Init() {
		gpuErrchk(cudaMemcpyToSymbol(rank_mask, host_rank_mask.data(), sizeof(host_rank_mask)));
	}

	__inline__ __device__ uint64_t bit_reverse(uint64_t x) {
		return __brevll(x);
	}

	/* Generate attack using the hyperbola quintessence approach */
	__device__ uint64_t attack(uint64_t pieces, uint32_t x, uint64_t mask) {
		uint64_t o = pieces & mask;

		return ((o - (1ull << x)) ^ bit_reverse(bit_reverse(o) - (1ull << (x ^ 63)))) & mask;
	}

	__device__ uint64_t Queen(int sq, uint64_t occ) {
		const uint64_t* r = rank_mask + 4 * sq;

		return attack(occ, sq, r[0])
			 ^ attack(occ, sq, r[1])
			 ^ attack(occ, sq, r[2])
			 ^ attack(occ, sq, r[3]);
	}
}
just copy and paste it, but please test it before using it
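Taking the "please test it" request literally: here is a host-side rendition of the same attack() routine (a portable bit reversal standing in for `__brevll`), which can be checked against a hand-computed rook line before trusting the device version:

```cpp
#include <cassert>
#include <cstdint>

// Portable 64-bit bit reversal (host-side stand-in for __brevll).
uint64_t bit_reverse(uint64_t x) {
    x = ((x & 0x5555555555555555ull) << 1)  | ((x >> 1)  & 0x5555555555555555ull);
    x = ((x & 0x3333333333333333ull) << 2)  | ((x >> 2)  & 0x3333333333333333ull);
    x = ((x & 0x0F0F0F0F0F0F0F0Full) << 4)  | ((x >> 4)  & 0x0F0F0F0F0F0F0F0Full);
    x = ((x & 0x00FF00FF00FF00FFull) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFull);
    x = ((x & 0x0000FFFF0000FFFFull) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFull);
    return (x << 32) | (x >> 32);
}

// Same o^(o-2r) hyperbola formula as the device code above:
// mask is the line mask for the square (square itself excluded).
uint64_t attack(uint64_t pieces, uint32_t x, uint64_t mask) {
    uint64_t o = pieces & mask;
    return ((o - (1ull << x)) ^ bit_reverse(bit_reverse(o) - (1ull << (x ^ 63)))) & mask;
}
```

For a rook on a1 with the a-file mask (bits a2..a8) and a blocker on a4, the result should be exactly a2|a3|a4, and an empty board should yield the whole line; both checks pass for this host version.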