Think about it. If you have a single-layer network, it's literally the same as the positional score used in Stockfish and others for the last 30 years or so: https://www.chessprogramming.org/Simpli ... are_Tables

diep wrote: ↑Fri Nov 05, 2021 9:20 pm
Neural net - too slow, obviously.

dangi12012 wrote: ↑Fri Nov 05, 2021 9:05 pm
Ummmm... You are talking about an "evaluation function", which is already solved in the form of a neural network that doesn't suffer as badly from the horizon effect. NNUE, like any neural network, is linear algebra (GEMM) with an activation function (ReLU) on the output. A GPU is not just good at that - it is perfect for it, and even has dedicated matrix-multiplication hardware (Tensor Cores) that is two orders of magnitude faster (on a single GPU) than your CPU and 10x faster than a 64-core Threadripper.

diep wrote: ↑Fri Nov 05, 2021 8:39 pm
Yes, quite a lot you miss.

dangi12012 wrote: ↑Fri Nov 05, 2021 4:46 pm
I see that many people talk about GPU latency and dismiss the possibility of using CUDA outright.
Alpha-beta prunes a move if something better has already been found. But what stops an engine from enumerating 100,000 positions and sending them to the GPU for evaluation as one bulk operation?
Of course some positions would be evaluated unnecessarily, but the alpha-beta comparison can also be done in bulk, and with modern memory mapping the CPU already knows which nodes to expand further.
Modern CUDA hardware does an insane 140 teraflops in fp16, and that is even more accurate than the 8-bit CPU models. I find it a pity that everyone copies NNUE and nobody develops something different (except lc0).
Am I missing something?
An evaluation on the GPU is only useful if it removes computational effort from the CPU.
NNUE circumvents that GEMM cost by incrementally updating the first, overparameterized layer and using 8-bit values in between, since AVX2 provides some very cheap 32x8-bit dot products.
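The incremental-update idea can be sketched in a few lines. This is a toy model of the technique, not Stockfish's actual NNUE layout: the dimensions, weight values, and feature indices below are all made up for illustration. The point is that when one piece moves, the first layer's output (the "accumulator") only needs one column subtracted and one added, instead of a full matrix-vector product.

```python
# Toy sketch of NNUE-style incremental first-layer updates.
# A feature index stands for a (piece, square) combination; sizes are hypothetical.
NUM_FEATURES = 768   # e.g. 12 piece types x 64 squares
HIDDEN = 8           # tiny hidden size for illustration

# First-layer weights: one column of HIDDEN values per input feature.
W1 = [[(f * 31 + h * 7) % 5 - 2 for h in range(HIDDEN)] for f in range(NUM_FEATURES)]

def full_forward(active_features):
    """Recompute the accumulator from scratch: a full matrix-vector product."""
    acc = [0] * HIDDEN
    for f in active_features:
        for h in range(HIDDEN):
            acc[h] += W1[f][h]
    return acc

def apply_move(acc, removed, added):
    """Incremental update: subtract the vacated feature's column, add the new one."""
    for h in range(HIDDEN):
        acc[h] += W1[added][h] - W1[removed][h]
    return acc

# A piece on the square behind feature 12 moves to the square behind feature 28.
pos = [12, 100, 305]              # active feature indices before the move
acc = full_forward(pos)
acc = apply_move(acc, removed=12, added=28)
assert acc == full_forward([28, 100, 305])  # matches a from-scratch recompute
```

The incremental path touches 2 x HIDDEN weights per move instead of all active features, which is exactly why the first layer can be huge while staying cheap per node.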
My question is: why not put all leaf nodes in a vector, send that to the GPU, and have all the evaluation done in bulk? Then you minimax, prune dead nodes, and send all new leaf nodes to the GPU again.
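The loop described above can be sketched end to end. Everything here is a pure-Python stand-in under stated assumptions: `evaluate_batch` plays the role of the GPU kernel scoring a whole vector of positions in one call, the game tree is a toy nested structure, and positions are just tuples. None of these names come from any real engine.

```python
# Toy sketch of "collect all leaves, evaluate in bulk, back up scores".
# Internal nodes are ("max"/"min", [children]); leaves are position tuples.
tree = ("max", [
    ("min", [(1, 2), (3, 4)]),
    ("min", [(5, 6), (0, 1)]),
])

def evaluate_batch(positions):
    """Stand-in for a GPU kernel: score every leaf position in one bulk call."""
    return {p: (sum(p) % 5) - 2 for p in positions}   # dummy static eval

def collect_leaves(node, out):
    """Depth-first walk that appends every leaf position to `out`."""
    if isinstance(node, tuple) and node[0] in ("max", "min"):
        for child in node[1]:
            collect_leaves(child, out)
    else:
        out.append(node)

def backup(node, scores):
    """Plain minimax over the precomputed bulk scores (no pruning in this sketch)."""
    if isinstance(node, tuple) and node[0] in ("max", "min"):
        vals = [backup(c, scores) for c in node[1]]
        return max(vals) if node[0] == "max" else min(vals)
    return scores[node]

leaves = []
collect_leaves(tree, leaves)
scores = evaluate_batch(leaves)   # one round trip instead of one call per leaf
best = backup(tree, scores)       # → 0 for this toy tree
```

The trade-off the thread is arguing about is visible here: the bulk call amortizes GPU latency over thousands of leaves, but it gives up alpha-beta's ability to skip leaves that a serial search would never have evaluated.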
diep wrote: ↑Fri Nov 05, 2021 9:20 pm
Classic chess programs would completely annihilate all the neural-net programs if you change the search a little, of course, as I wrote earlier.
It's not easy to test in classic chess programs which changes would need to be made there, because everything is bullet-optimized so far, and fewer NPS in bullet means lower Elo. Things change if you have 64 cores, of course, and already know you're going to search millions of nodes anyway.
A fast AMD CPU now: 64 cores, 5 Tflops double precision.
A fast Nvidia GPU now: 10 Tflops on paper - but nobody manages to get that out - $10k, and the price is going up.
At Nvidia, if you can get 25% of peak out of it, that's considered very good.
In chess it is important to order moves correctly. That is painfully slow on GPUs.
I ran tests here trying to get a prime-number sieve faster in CUDA.
Basically, whether you use the L1 data cache or the register file, you get the same slow speed there.
All the CUDA cores of a single warp need to execute the same code in lockstep.
So anything that 'prunes', or that needs data from a different cache line or register file than another core, causes extra traffic on the GPU, directly slowing it down by at least a factor of 32.
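The factor-32 claim follows from the SIMT execution model, which can be illustrated with a deliberately simplified cost model (my own toy abstraction, not a profile of real hardware): a warp executes each distinct branch path taken by its threads one after another, so a fully divergent 32-thread warp costs 32 serialized passes.

```python
# Toy SIMT cost model: a warp re-executes once per distinct branch path taken
# by its threads; threads not on the current path sit idle (masked off).
WARP_SIZE = 32

def warp_passes(paths_per_thread):
    """Serialized passes needed = number of distinct paths within the warp."""
    return len(set(paths_per_thread))

uniform   = ["same"] * WARP_SIZE                    # all 32 threads take one path
divergent = [f"path{i}" for i in range(WARP_SIZE)]  # every thread takes its own path

assert warp_passes(uniform) == 1     # full throughput
assert warp_passes(divergent) == 32  # fully serialized: the factor-32 slowdown
```

This is why data-dependent pruning, which sends every thread down its own control path, fits GPUs so badly while uniform bulk evaluation fits them well.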
Anything neural-net related is of course too slow to even take seriously for realtime applications.
That means a better solution is possible in realtime on CPUs.
We can even prove this mathematically.
Those neural nets are just a lazy way to get things done.
If you want to parameter-tune something with neural nets in that slow manner, then use it on a CPU in realtime inside a search - now that's an entirely different discussion!
dangi12012 wrote:
You would multiply each piece type's occupancy by its own worth per square! Mathematically identical, except for the activation part (which for positive values is the identity under ReLU anyway).
With more layers you get the advanced parts, so that the network can evaluate forks, pins, etc.
You are unlucky there, because I have an RTX 3080 and get a real end-to-end 140 Tflops for fp16. fp64 is not needed in neural networks at all. The sweet spot seems to be 8-bit int or 16-bit float, depending on the problem domain.
There are no high-end "classical chess programs" anymore - all the top ones use neural networks. Look at TCEC, please!
My question is aimed at an expert with good knowledge of why bulk evaluation was not tried before - or why it failed.
My guess is that there are some reasons that some people on this board can answer.