Think about it. If you have a single-layer network, it's literally the same as the positional score used in Stockfish and others for the last 30 years or so: https://www.chessprogramming.org/Simpli ... are_Tables

diep wrote: ↑Fri Nov 05, 2021 9:20 pm
Neural net - too slow, obviously.

dangi12012 wrote: ↑Fri Nov 05, 2021 9:05 pm
Ummmm... You are talking about an "evaluation function", which is already solved in the form of a neural network that doesn't suffer as badly from the horizon effect. NNUE, like any neural network, is linear algebra (GEMM) with an activation function (ReLU) on the output. A GPU is not just good at that - it is perfect for it, and even has dedicated matrix-multiplication hardware (Tensor Cores) that is two orders of magnitude faster (on a single GPU) than your CPU and 10x faster than a 64-core Threadripper.

diep wrote: ↑Fri Nov 05, 2021 8:39 pm
Yes, quite a lot you miss.

dangi12012 wrote: ↑Fri Nov 05, 2021 4:46 pm
I see that many people talk about GPU latency and dismiss the possibility of using CUDA outright.
Alpha-beta prunes a move if something better has already been found. But what stops an engine from enumerating 100,000 positions and sending them to the GPU for evaluation as one bulk operation?
Of course some positions would be evaluated unnecessarily, but the alpha-beta comparison can also be done in bulk, and with modern memory mapping the CPU already knows which nodes to expand further.
Modern CUDA hardware does an insane 140 teraflops in fp16, and that is even more accurate than the 8-bit CPU models. I find it a pity that everyone copies NNUE and nobody develops something different (except lc0).
Am I missing something?
An evaluation on the GPU is only useful if it removes computational effort from the CPU.
NNUE circumvents that GEMM cost by incrementally updating the first, overparameterized layer and using 8-bit values in between, since AVX2 provides some very cheap 32x8-bit dot products.
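The incremental-update idea can be sketched in a few lines. This is a toy model of the technique, not Stockfish's actual NNUE layout: the dimensions, weight values, and feature indices below are all made up for illustration. The point is that when one piece moves, the first layer's output (the "accumulator") only needs one column subtracted and one added, instead of a full matrix-vector product.

```python
# Toy sketch of NNUE-style incremental first-layer updates.
# A feature index stands for a (piece, square) combination; sizes are hypothetical.
NUM_FEATURES = 768   # e.g. 12 piece types x 64 squares
HIDDEN = 8           # tiny hidden size for illustration

# First-layer weights: one column of HIDDEN values per input feature.
W1 = [[(f * 31 + h * 7) % 5 - 2 for h in range(HIDDEN)] for f in range(NUM_FEATURES)]

def full_forward(active_features):
    """Recompute the accumulator from scratch: a full matrix-vector product."""
    acc = [0] * HIDDEN
    for f in active_features:
        for h in range(HIDDEN):
            acc[h] += W1[f][h]
    return acc

def apply_move(acc, removed, added):
    """Incremental update: subtract the vacated feature's column, add the new one."""
    for h in range(HIDDEN):
        acc[h] += W1[added][h] - W1[removed][h]
    return acc

# A piece on the square behind feature 12 moves to the square behind feature 28.
pos = [12, 100, 305]              # active feature indices before the move
acc = full_forward(pos)
acc = apply_move(acc, removed=12, added=28)
assert acc == full_forward([28, 100, 305])  # matches a from-scratch recompute
```

The incremental path touches 2 x HIDDEN weights per move instead of all active features, which is exactly why the first layer can be huge while staying cheap per node.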
My question is: why not put all leaf nodes in a vector, send that to the GPU, and have all the evaluation done in bulk? Then you minimax, prune dead nodes, and send all new leaf nodes to the GPU again.
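The loop described above can be sketched end to end. Everything here is a pure-Python stand-in under stated assumptions: `evaluate_batch` plays the role of the GPU kernel scoring a whole vector of positions in one call, the game tree is a toy nested structure, and positions are just tuples. None of these names come from any real engine.

```python
# Toy sketch of "collect all leaves, evaluate in bulk, back up scores".
# Internal nodes are ("max"/"min", [children]); leaves are position tuples.
tree = ("max", [
    ("min", [(1, 2), (3, 4)]),
    ("min", [(5, 6), (0, 1)]),
])

def evaluate_batch(positions):
    """Stand-in for a GPU kernel: score every leaf position in one bulk call."""
    return {p: (sum(p) % 5) - 2 for p in positions}   # dummy static eval

def collect_leaves(node, out):
    """Depth-first walk that appends every leaf position to `out`."""
    if isinstance(node, tuple) and node[0] in ("max", "min"):
        for child in node[1]:
            collect_leaves(child, out)
    else:
        out.append(node)

def backup(node, scores):
    """Plain minimax over the precomputed bulk scores (no pruning in this sketch)."""
    if isinstance(node, tuple) and node[0] in ("max", "min"):
        vals = [backup(c, scores) for c in node[1]]
        return max(vals) if node[0] == "max" else min(vals)
    return scores[node]

leaves = []
collect_leaves(tree, leaves)
scores = evaluate_batch(leaves)   # one round trip instead of one call per leaf
best = backup(tree, scores)       # → 0 for this toy tree
```

The trade-off the thread is arguing about is visible here: the bulk call amortizes GPU latency over thousands of leaves, but it gives up alpha-beta's ability to skip leaves that a serial search would never have evaluated.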
diep wrote: ↑Fri Nov 05, 2021 9:20 pm
Classic chess programs would completely annihilate all the neural-net programs if you change the search a little, of course, as I wrote earlier.
It's not easy to test in classic chess programs which changes would need to be made there, because everything is bullet-optimized so far, and fewer NPS in bullet means lower Elo. Things change if you have 64 cores, of course, and already know you're going to search millions of nodes anyway.
A fast AMD CPU now: 64 cores, 5 Tflops double precision.
A fast Nvidia GPU now: 10 Tflops on paper - but nobody manages to get that out - $10k, and the price is going up.
At Nvidia, if you can get 25% of peak out of it, that's considered very good.
In chess it is important to order moves correctly. That is painfully slow on GPUs.
I ran tests here trying to get a prime-number sieve faster in CUDA.
Basically, whether you use the L1 data cache or the register file, you get the same slow speed there.
All the CUDA cores of a single warp need to execute the same code in lockstep.
So anything that 'prunes', or that needs data from a different cache line or register file than another core, causes extra traffic on the GPU, directly slowing it down by at least a factor of 32.
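The factor-32 claim follows from the SIMT execution model, which can be illustrated with a deliberately simplified cost model (my own toy abstraction, not a profile of real hardware): a warp executes each distinct branch path taken by its threads one after another, so a fully divergent 32-thread warp costs 32 serialized passes.

```python
# Toy SIMT cost model: a warp re-executes once per distinct branch path taken
# by its threads; threads not on the current path sit idle (masked off).
WARP_SIZE = 32

def warp_passes(paths_per_thread):
    """Serialized passes needed = number of distinct paths within the warp."""
    return len(set(paths_per_thread))

uniform   = ["same"] * WARP_SIZE                    # all 32 threads take one path
divergent = [f"path{i}" for i in range(WARP_SIZE)]  # every thread takes its own path

assert warp_passes(uniform) == 1     # full throughput
assert warp_passes(divergent) == 32  # fully serialized: the factor-32 slowdown
```

This is why data-dependent pruning, which sends every thread down its own control path, fits GPUs so badly while uniform bulk evaluation fits them well.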
Anything neural-net related is of course too slow to even take seriously for realtime applications.
That means a better solution is possible in realtime on CPUs.
We can even prove this mathematically.
Those neural nets are just a lazy way to get things done.
If you want to parameter-tune something with neural nets in that slow manner, then use it on a CPU in realtime inside a search - now that's an entirely different discussion!
dangi12012 wrote:
You would multiply each piece type's occupancy by its own worth per square! Mathematically identical, except for the activation part (which for positive values is the identity under ReLU anyway).
With more layers you get the advanced parts, so that the network can evaluate forks, pins, etc.
You are unlucky there, because I have an RTX 3080 and get a real end-to-end 140 Tflops for fp16. fp64 is not needed in neural networks at all. The sweet spot seems to be 8-bit int or 16-bit float, depending on the problem domain.
There are no high-end "classical chess programs" anymore - all the top ones use neural networks. Look at TCEC, please!
My question is aimed at an expert with good knowledge of why bulk evaluation was not tried before - or why it failed.
My guess is that there are some reasons that some people on this board can answer.