Why is there no bulk evaluation happening in Alpha Beta?

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Why is there no bulk evaluation happening in Alpha Beta?

Post by dangi12012 »

I see that many people talk about GPU latency and dismiss the possibility of using CUDA outright.

Alpha-beta prunes a move if something better has already been found. But what stops an engine from enumerating 100,000 moves and then sending them to the GPU for evaluation as a bulk operation?
Of course some moves would be evaluated unnecessarily, but the alpha-beta comparison can also be done in bulk, and with modern memory mapping the CPU already knows which nodes to expand further.

Modern CUDA does some insane 140 teraflops with fp16, and that's even more accurate than the CPU 8-bit models. I find it a pity that everyone copies NNUE and nobody develops something different (except lc0).

Am I missing something?
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by j.t. »

There's Zeta by Srdja Matovic, which seems to be able to run on the GPU. I'm not familiar with it myself, but maybe it's close to what you have in mind.
yeni_sekme
Posts: 40
Joined: Mon Mar 01, 2021 7:51 pm
Location: İstanbul, Turkey
Full name: Ömer Faruk Tutkun

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by yeni_sekme »

- Can you explain your idea in detail or write some pseudocode? How do you choose the positions that you will send to the GPU?
- The branching factor of strong chess engines is very low; I think a large proportion of the positions you evaluate will be useless to the search.
odomobo
Posts: 96
Joined: Fri Jul 06, 2018 1:09 am
Location: Chicago, IL
Full name: Josh Odom

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by odomobo »

What you're describing would work fine with a pure minimax approach, which does no pruning and evaluates all possible positions. However, alpha-beta uses the result of every evaluation to determine how to progress further through the search tree. Notably, it will prune huge swaths of the tree when it encounters a beta cutoff.

Without knowing the evaluations while progressing through the search, beta cutoffs can't be performed, which means unnecessary nodes will be searched. This is a huge deal, because the branching factor for minimax is around 30 (that is, the branching factor of chess), whereas the effective branching factor of the best engines is around 2. Compare 2^20 vs. 30^20 to see why even massive performance improvements won't make up for the algorithmic improvement of alpha-beta.
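For a rough sense of scale: 30^20 is about 3.5 × 10^29 while 2^20 is about 1.0 × 10^6, so the pruned tree is smaller by roughly 23 orders of magnitude, a gap that no amount of hardware can close.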
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by diep »

dangi12012 wrote: Fri Nov 05, 2021 4:46 pm I see that many people talk about GPU latency and dismiss the possibility of using CUDA outright.

Alpha-beta prunes a move if something better has already been found. But what stops an engine from enumerating 100,000 moves and then sending them to the GPU for evaluation as a bulk operation?
Of course some moves would be evaluated unnecessarily, but the alpha-beta comparison can also be done in bulk, and with modern memory mapping the CPU already knows which nodes to expand further.

Modern CUDA does some insane 140 teraflops with fp16, and that's even more accurate than the CPU 8-bit models. I find it a pity that everyone copies NNUE and nobody develops something different (except lc0).

Am I missing something?
Yes, you're missing quite a lot.

An evaluation on the GPU would only be useful if it removes computational effort from the CPU.

There are several problems.

The first problem is that Diep's evaluation would never fit onto a GPU, at least not into a single kernel. You can only put a couple of thousand instructions into a GPU kernel.

A GPU executes all CUDA cores of a warp at once; basically it's one vector of at least 32 integers of 32 bits each. That last part is fine, since there is obviously no need for 64 bits in a chess program.

So let's discuss a number of problems. Suppose you have a complex evaluation function that you want to run on a GPU. On paper it's useful to do this on a GPU, yet one weakness of the GPU is that all 32 CUDA cores of a single warp need to execute the same instruction at the same time.

Say our vector of 32 integers is in reality 32 different positions that we want to evaluate.

That means that all the instructions we execute must be exactly the same for all of them.

In Diep that would be a major slowdown of the evaluation function.

Because here is how the evaluation function works right now:

if (pattern) {
    if (patternX && a && b && c) {
        // loop through this and do that
    }
    // ... lots more if/then/else statements and code to compute things
}

If the generic 'pattern' doesn't hold true, you do not have to execute all this code. On a GPU it isn't very efficient to do it this way if our 32 CUDA cores have different outcomes for 'pattern'.

So it's slower in terms of the number of instructions needed, as you can skip NOTHING on the GPU except when all 32 CUDA cores skip a specific generic pattern. A good example of such a pattern is whether the opponent has a queen on the board, or whether our side owns a passed pawn in a specific situation.

And so on.

So this is a linear slowdown.
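To make the divergence point concrete, here is a minimal CUDA sketch. The Position struct and the material()/queen_attack_terms() helpers are made-up stand-ins, not Diep's code: each lane of a warp holds a different position, and when the lanes disagree on the queen test, the warp runs the queen code for the lanes that need it while the other lanes sit idle.

// Made-up stand-ins so the sketch is self-contained; a real evaluation
// would be thousands of instructions.
struct Position { unsigned long long black_queens; /* ...more bitboards... */ };

__device__ int material(const Position& p)           { return 0; /* stand-in */ }
__device__ int queen_attack_terms(const Position& p) { return 0; /* stand-in for lots of code */ }

__global__ void eval_batch(const Position* pos, int* score, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int s = material(pos[i]);            // uniform work: every lane does the same thing

    // Divergence: within one 32-lane warp some positions have an enemy queen
    // and some do not.  The hardware executes the queen code for the lanes
    // that take the branch while the other lanes sit idle, so the "skip"
    // saves nothing unless all 32 lanes agree.
    if (pos[i].black_queens != 0)
        s += queen_attack_terms(pos[i]);

    score[i] = s;
}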


The next problem you are stuck with is that there is a limited amount of code size on every GPU. So a huge evaluation function simply doesn't fit into a GPU kernel.

To some extent there are tricks to avoid this, by launching more kernels that evaluate the same position using other code, yet that's tricky, you know.

But by far the largest problem is that you do not know in advance which positions you are going to evaluate.

Suppose I chop off a queen somewhere on the board: there is no need to search the moves a3, a4, b3, b4, Nc3, Nd2 when we could already play Qd1xg4 and chop off the queen. Yet you are busy with all sorts of slow operations to ship positions to the GPU.

Speaking of which, there is a latency price between the GPU and CPU. That would be problem 4: the latency problem.

You can ship only a limited number of messages per second to the GPU. Sure, they can be megabytes huge, and the GPU can in fact read that data into its own RAM without interrupting the kernels executing code there. Yet there is still a considerable delay between collecting all the positions you want to evaluate (I suppose you want to ship millions of positions at once) and getting back all the evaluation results, whereas half a second ago we already chopped off that queen, so all those millions of positions you shipped could have been skipped by just capturing the queen.

So the last two problems are by far the biggest: you ship positions that are no longer needed, and there is a limited number of messages you can ship per second to and from the GPU.
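For reference, this is roughly what one such "message" looks like in CUDA host code, reusing the eval_batch kernel and Position struct from the sketch above (error checking omitted). Every round trip carries a fixed launch-and-copy overhead on top of the transfer itself, which is why it only makes sense for large batches, and by the time the batch comes back the search may already have refuted most of it.

// One bulk-evaluation round trip.  h_pos/h_score should be pinned
// (cudaHostAlloc) so the async copies really overlap; d_pos/d_score are
// device buffers allocated once with cudaMalloc.
void evaluate_leaves(const Position* h_pos, int* h_score, int n,
                     Position* d_pos, int* d_score, cudaStream_t stream)
{
    cudaMemcpyAsync(d_pos, h_pos, n * sizeof(Position),
                    cudaMemcpyHostToDevice, stream);
    eval_batch<<<(n + 255) / 256, 256, 0, stream>>>(d_pos, d_score, n);
    cudaMemcpyAsync(h_score, d_score, n * sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);   // the CPU sits here until the whole batch is back
}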
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by diep »

Please note that there are additional problems on a GPU, but those are total peanuts compared to how cutoffs work.

One problem is that there is no way to know where or when your kernel gets executed. Say we have a GPU; let's take my Titan-Z.

That's two GPUs on a single card, 30 SIMDs in total (better to call those 'cores' compared to a CPU), each SIMD running a bunch of kernels simultaneously, say up to 20, though I suppose with 8 kernels we already fill it.

So in total we launch a couple of thousand kernels, which all eat from the same block of data that we ship to the GPU at the same time.

There is no way of knowing in WHICH ORDER or WHEN each kernel finishes its task. At some point a kernel will execute, evaluate a bunch of positions and write the result back.

Are you going to wait until ALL the kernels have finished executing? Because on our CPU we really have no clue there. We can have each kernel write into the RAM that's shared between CPU and GPU: "I finished my part of the job".
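The "write a flag into shared RAM" idea does exist in CUDA in the form of zero-copy (mapped, pinned) host memory. A rough sketch of a per-block "I finished my part" flag, under the assumption that block_done (and the scores, if the CPU wants to read them early) live in memory allocated with cudaHostAlloc(..., cudaHostAllocMapped), and reusing the stand-in material() and Position from the earlier sketch:

__global__ void eval_chunk(const Position* pos, int* score,
                           volatile int* block_done, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        score[i] = material(pos[i]);   // stand-in evaluation, as in the earlier sketch

    __syncthreads();                   // wait until every thread of this block is done
    if (threadIdx.x == 0) {
        __threadfence_system();        // make this block's writes visible to the host first
        block_done[blockIdx.x] = 1;    // "I finished my part of the job"
    }
}

The CPU can then poll block_done[] while the grid is still running, but that only tells it which chunks are ready; it does not change the fact that the cutoff information arrives long after the positions were shipped.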

Then within a second all kernels need to have terminated, or the system is going to CRASH; gamer GPUs need to finish each kernel very quickly for that reason.

So that gives another problem: we do not want to pay the huge price of launching a kernel on the GPU for just a few microseconds of work.
We want to launch it for, say, half a second.

So only after 0.5 seconds do we get back all the results for the millions of positions we shipped to the GPU.

There are technical solutions possible here, yet one really must understand how primitive GPUs are.

You ship a kernel that executes some code. The kernel can read some data, do some work with it, and then it finishes its execution.
We launch thousands of such kernels at the same time. They all need "something to do". Then within a second they all finish and that's it.

How do you let something as silly as this cooperate with a CPU program that is completely dependent upon it?

Searching on a GPU is even tougher, as is sorting small arrays: forget it. Not unless you want it to be a factor of 32 slower than what is possible on a CPU.

So if you want to write a chess program on a GPU: it is a total waste of your time, if I may say so.
Sell all your GPUs and buy a 64-core Threadripper with the proceeds!
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by dangi12012 »

diep wrote: Fri Nov 05, 2021 8:39 pm
Yes, you're missing quite a lot.

An evaluation on the GPU would only be useful if it removes computational effort from the CPU.
Ummmm... You are talking about an "evaluation function", which is already solved in the form of a neural network that doesn't suffer from the horizon effect as badly. You know that NNUE, like any neural network, is linear algebra (GEMM) with an activation function (ReLU) on the outputs. A GPU is not just good at that, it is perfect for it, and it even has dedicated matrix multiplication hardware (tensor cores) that is two orders of magnitude faster (on a single GPU) than your CPU and 10x faster than a 64-core Threadripper.

NNUE circumvents that GEMM cost by incrementally updating the first, overparametrized layer and using 8 bits in between, as AVX2 provides some very cheap 32x8-bit dot products.
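For readers who haven't looked at NNUE internals: the first-layer output (the "accumulator") changes by only a few weight columns per move, so instead of redoing the matrix product you add and subtract those columns. A minimal sketch with made-up sizes and names (HIDDEN, Accumulator, update), not the actual Stockfish code:

// Incremental update of the first-layer accumulator.  A real move
// adds/removes several features, and a king move refreshes the whole
// accumulator; real nets typically use int16 accumulators here and
// 8-bit weights in the later layers.
constexpr int HIDDEN = 256;

struct Accumulator { short v[HIDDEN]; };

// weights[feature] is one column of the first-layer matrix.
void update(Accumulator& acc, const short (*weights)[HIDDEN],
            int feature_added, int feature_removed)
{
    for (int i = 0; i < HIDDEN; ++i)
        acc.v[i] += weights[feature_added][i] - weights[feature_removed][i];
}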

My question is: why not put all leaf nodes in a vector, send it to the GPU, and have all the evaluation done in bulk? Then you minimax + prune dead nodes, and send all the new leaf nodes to the GPU again.
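As a sketch, the loop being proposed would look roughly like this; Tree, collect_leaves(), gpu_evaluate_batch() and backup_and_prune() are hypothetical helpers, not an existing engine's API:

#include <vector>

// Hypothetical breadth-first bulk-evaluation loop: expand a frontier on the
// CPU, score it on the GPU in one batch, back the scores up, prune, repeat.
void bulk_search(Tree& tree, int target_depth)
{
    for (int depth = 0; depth < target_depth; ++depth) {
        std::vector<Position> leaves = collect_leaves(tree);    // expand the current frontier
        std::vector<int> scores = gpu_evaluate_batch(leaves);   // one big round trip to the GPU
        backup_and_prune(tree, leaves, scores);                 // minimax up, drop refuted subtrees
    }
}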
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by diep »

Now to move to another problem.

Running kernels with a different codebase at the same time.

AMD GPUs? Forget it. In 2007 the helpdesk wrote a report that this was "technically possible" on AMD GPUs yet "not implemented".

I can show you extensive questions and "answers" from that helpdesk from some years later if you want.

Basically the AMD helpdesk has the habit of indexing your question (meaning putting it in the system) and then, some weeks later, setting it to "resolved".
You start to understand the problem?

Now Nvidia claims that this is both technically possible on their GPUs and actually implemented: you can launch several kernels simultaneously on the same GPU, though they cannot run on the same SIMDs, obviously.

So writing something that "works" on a GPU with the word "chess" in it is not easy.

Much easier is to improve the search of CPU programs, along the lines of the proposal I wrote down at the end of October.

What's the only big difference, from a search viewpoint, between LC0 and, say, my engine Diep?
It's that closer to the root of the search the move ordering in LC0 is much better, so it can use, so to speak, a larger 'reduction' factor there.

This is also a much better idea.

In classical alpha-beta programs the opposite happens: they are much more selective near the leaves than near the root,
simply because larger reduction factors are very tricky there for anything other than nullmove.

You want the combination: larger nullmove reduction factors, which work at '>= beta' nodes (to avoid the lowerbound/upperbound discussion, which half the programmers wouldn't understand and where I've seen too many make mistakes), AND a much better move ordering at '<= alpha' nodes, because then you can use a much larger reduction factor for reductions. The odds that the first move is the best move are what you need to improve in classical searching chess programs. That's not so easy, as I figured out 20 years ago. And by 2007-2012 I was convinced this was the way forward.

Nullmove influences it, repetitions influence it, hashtables influence it, and the quiescence search influences it.

In short, a lot of work is needed there, and quite some overhead to get the sorting better. Less nps, in short, yet a much, much improved search and much better branching factors.
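For context, the '>= beta' case is where null-move reductions apply. A textbook version (not Diep's scheme) with the reduction factor R that is being discussed; Board, Move, evaluate(), generate_moves() and the make/undo helpers are assumed to be the host engine's own:

// Textbook null-move reduction at a node expected to fail high (score >= beta).
int search(Board& b, int depth, int alpha, int beta)
{
    if (depth == 0)
        return evaluate(b);

    const int R = 3;                         // the reduction factor in question
    if (!b.in_check() && depth > R) {
        b.make_null_move();                  // give the opponent a free move
        int score = -search(b, depth - 1 - R, -beta, -beta + 1);  // null-window probe
        b.undo_null_move();
        if (score >= beta)
            return beta;                     // fail high: prune this whole subtree
    }

    for (Move m : generate_moves(b)) {
        b.make_move(m);
        int score = -search(b, depth - 1, -beta, -alpha);
        b.undo_move(m);
        if (score >= beta) return beta;
        if (score > alpha) alpha = score;
    }
    return alpha;
}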

Forget cooperation with GPU's for now. Waste of time.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by diep »

dangi12012 wrote: Fri Nov 05, 2021 9:05 pm My question is: why not put all leaf nodes in a vector, send it to the GPU, and have all the evaluation done in bulk? Then you minimax + prune dead nodes, and send all the new leaf nodes to the GPU again.
Neural nets: too slow, obviously.
Classic chess programs would completely annihilate all the neural net programs if you changed the search a little, of course, as I wrote down.
It's not easy to test in classic chess programs what changes need to be made there, because everything is bullet-optimized so far, and fewer nps in bullet means lower Elo. Things change if you have 64 cores, of course, and already know you're going to search millions of nodes anyway.

Fast AMD CPU now: 64 cores, 5 Tflops double precision.
Fast Nvidia GPU now: 10 Tflops on paper, but no one manages to get that out, $10k, and the price is going up.

At Nvidia, getting 25% out of it is considered very good.

In chess it is important to order moves correctly. This is ugly slow on GPUs.
I did tests here trying to make a prime number sieve faster in CUDA.

Basically, whether you use the L1 data cache or the register file, you get the same slow speed there.
All the CUDA cores of a single warp need to execute the same code atomically.

So anything that 'prunes', or needs data from a different cache line or register file than another core, causes extra bandwidth on the GPU, directly slowing it down by at least a factor of 32.

Anything neural-net related is of course too slow to even take seriously for realtime applications.
It means a better realtime solution is possible on CPUs.

We can even prove this mathematically.
Those neural nets are just a lazy way to get things done.

If you want to parameter-tune something with neural nets in the slow manner and then use it on a CPU in realtime in a search, now that's an entirely different discussion :)
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by diep »


The underlying problem of GPUs is that any complex algorithm that either uses hashtables or needs to order things cannot function on a GPU, because all the CUDA cores need to execute the same code at the same time and need to read each cache line as a whole, at once. If you let individual CUDA cores each read from a different spot, that triggers a lot of internal bandwidth and is simply a factor of 32 slower, or more, in all my tests. And when we tested it on a cluster with some thousands of the latest Teslas, they were even slower there than my old GPU for this sort of instruction.

For sieving, which is dead simple on paper, GPUs seemed fast some years ago. But now we have 64-core Threadrippers, and they completely annihilate the GPUs.

Only utterly stupid, simple algorithms can work on GPUs, like doing full matrix calculations without Fourier-type approximations. That's simple code to execute. Simple. An easy algorithm. That's where GPUs shine.

For all other tasks it's dog food.