WIP, Eta - GPGPU ANN based engine, RFC

Discussion of chess software programming and technical issues.

Moderator: Ras

Hrvoje Horvatic
Posts: 22
Joined: Mon Nov 10, 2014 1:53 pm

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by Hrvoje Horvatic »

smatovic wrote: Wed Feb 06, 2019 4:45 pm RFC, would be interested in any possible design flaws you see...
How do you define design flaw? I am Software Architect, and what you are trying to do is something completely new, so there is no established background theory/knowledge that can be used to judge whether something is a "design flaw" or not...

There are opinions and educated guesses though, and I can give you mine, but no one can claim to be an expert on this and dismiss ideas as "design flaws"...
- parallel BestFirstMiniMax-Search on gpu
* game tree in gpu memory
* best node selected via score + UCT formula (visit count based)
* AlphaBeta Q-Search performed at leafnodes to get a node score
I think UCT doesn't work for chess... the reason Leela and A0 show good results is because they are running on enormously powerful hardware, practically super-computers...

UCT is good for strategic games, but chess is both strategic AND tactical... what UCT does is sampling subspace and estimate "average" value, which is useless in case of combinations...

Best first search in general is not "auto-correcting" like alpha-beta... early mistakes are not corrected, so the search very often gets trapped in one part of the tree... A0 has a constant which balances between exploration and exploitation, but that approach in general doesn't work, it's more of a "patch" to cover a wound, than a real solution to a problem...

If you have decided to go for best-first you should check Stuart Russell's Metareasoning approach... it is mathematically more sound approach with some added benefits... start with the paper "Metareasoning for Monte Carlo Tree Search" (there are a few hundred papers on metareasoning, but this is a relevant place to start)
- multiple small MLP neural networks
* about 4 million weights per network
* 30 networks in total, split by piece count
I don't think MLP will do it... If you consider a 3 layer perceptron, it doesn't have enough of "representational power" to represent complex concepts, and if you add additional layers, learning time grows exponentially... It is also pretty bad in extracting geometrical concepts (which chess is all about)...

There are some domains in which MLP shines brightly, but chess is not one of them... I recall that I have read several papers comparing MLPs with convolutional ANNs, and they loose every time... notice that multi-layer CNNs don't have the same problem of exponential learning time as MLPs, because they "cheat" in multiple ways, and they are also constructed from the beginning for geometric problems...

anyway, just my 2 cents... NOT design flaws but opinions...

I wish you all the best, and don't quit easily... :)
smatovic
Posts: 3359
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

Eta - v0501

Recently the new neural network technique 'NNUE' took off on CPU based chess engines like Stockfish leveraging the vector unit of a CPU for NN inference, replacing HCE (handcrafted evaluation) with neural-networks. Hence with NNUE a hybrid design with BestFirst on CPU and MiniMax-Search with NNUE eval on GPU seems possible and in reach. The CPU-host would store and expand the game tree in memory, similar to Lc0's MCTS, the GPU would perform shallow AlpaBeta-searches (primarily quiscence-search playouts to avoid the horizon effect), similar to Lc0's MCTS-playouts.

Coupling 32 gpu-threads to one worker, assuming 2K clocks per node for move generation and AB-framework, additionally maybe 2K clocks per node for NNUE inference, results in 1.44M gpu-clocks for an 36x10 nodes q-search. In such an design the host-device-latency (aka. kernel-launch-overhead) of maybe 10 microseconds does not affect the overall performance. From entry-level GPUs with 512 cores (16 workers) to high-end-gpus with 5120 cores (160 workers) the throughput of such an parallel BestFirst on CPU and AB-playout+NNUE-eval on GPU design could range from ~11K to ~220K node-playouts/s, more than Lc0's gpu throughput but with a switch from MCTS-PUCT to parallel BestFirstMiniMax-Search and CNN to NNUE evaluation.

I am not into the details of current NNUE implementations for CPUs, therefore the estimated 2K gpu-clocks per node for NNUE inference is the biggest uncertainty.

I have no experience with running 16 to 160 parallel tasks via OpenCL on GPU, not sure if 160 unique command-queues are handable with CPU-GPU interaction.

More on the Eta blog:

https://eta-chess.app26.de

--
Srdja
smatovic
Posts: 3359
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

A BestFirst on CPU and MiniMax-playout with NNUE eval GPU design could utilize the AB-framework of the Zeta v099 gpu-engine. But considering just an ply 1 + quiescence search an alternative implementation as LIFO-stack seems reasonable. This would simplify the iterative implementation of an recursive AB search for GPU architecture. Couple 32 gpu-threads to run on one SIMD unit of the GPU, use these 32 threads for move generation (piece-wise parallelization) and NNUE evaluation inference, store the game tree as an doubly-linked list in VRAM, apply LIFO-stack based processing on game tree with AB pruning, something like this.

This design seems like a natural fit for CPU-GPU arch to me, you can also run multiple SIMT waves of workers to increase VALU utilization, and like a marriage between BestFirst and AB with NNUE, I like the NNUE driven BestFirst component. Another step could be to do Reinforcement Learning on a gpu-cloud-cluster, like Lc0.

I concluded that Eta is too big for me as a hobby project and I won't implement it, maybe someone else finds useful information in here, or alike.

--
Srdja
smatovic
Posts: 3359
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

ups, too much coffee....LIFO would not work with NNUE, but a classic NN ;)

--
Srdja
dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by dkappe »

I’ve been combining ab search with mcts for a while. A few observations:

1. Mcts/nn plays really well at low nodes vs ab. For example, with tree reuse, the tcec network at 100-120 nodes per move is on par with with SF14 d11 (d13 in the endgame).
2. I’ve found that applying ab to the mcts/nn tree works best when boosting policy on frequently visited nodes, I.e. bestmove from a shallow ab search on a node that has just gotten 100 visits, for example, will help the mcts search hone in on forcing moves, etc.
3. The ab q-search is pretty much going to be worse than a well trained resnet’s eval.

There’s still lots of experimentation to be done with combining ab and mcts.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
smatovic
Posts: 3359
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

dkappe wrote: Sun Jul 11, 2021 11:00 pm ...
3. The ab q-search is pretty much going to be worse than a well trained resnet’s eval.
...
To get an idea of this one could run an Lc0 depth 1 (no policy-head, n playouts) vs SF NNUE depth 1 (root+ab-q-search) match.

--
Srdja
Joerg Oster
Posts: 982
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany
Full name: Jörg Oster

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by Joerg Oster »

smatovic wrote: Mon Jul 12, 2021 6:02 am
dkappe wrote: Sun Jul 11, 2021 11:00 pm ...
3. The ab q-search is pretty much going to be worse than a well trained resnet’s eval.
...
To get an idea of this one could run an Lc0 depth 1 (no policy-head, n playouts) vs SF NNUE depth 1 (root+ab-q-search) match.

--
Srdja
Stockfish is weak at these low depths. No doubt.
Is it really possible to switch off the policy head for Lc0?
Jörg Oster
smatovic
Posts: 3359
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

Joerg Oster wrote: Mon Jul 12, 2021 10:30 am
smatovic wrote: Mon Jul 12, 2021 6:02 am
dkappe wrote: Sun Jul 11, 2021 11:00 pm ...
3. The ab q-search is pretty much going to be worse than a well trained resnet’s eval.
...
To get an idea of this one could run an Lc0 depth 1 (no policy-head, n playouts) vs SF NNUE depth 1 (root+ab-q-search) match.

--
Srdja
Stockfish is weak at these low depths. No doubt.
Is it really possible to switch off the policy head for Lc0?
I don't know, maybe you have to touch the source code, maybe SF also does Root-Pruning which should also be disabled for such an run.

--
Srdja
smatovic
Posts: 3359
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

Eta - v0501 - command-queues:

I figured I could run maybe 32 gpu-workers per thread efficient, to let one CPU thread iterate the game tree in memory for 32 workers who perform AB-playouts on GPU, to use one OpenCL command-queue per thread.

Project is in my pipe again, will work on Zeta NNUE first, if it runs, I will use the Zeta AB-NNUE framework for Eta, will take some time.

--
Srdja
MonteCarlo
Posts: 188
Joined: Sun Dec 25, 2016 4:59 pm

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by MonteCarlo »

smatovic wrote: Mon Jul 12, 2021 11:07 am
I don't know, maybe you have to touch the source code, maybe SF also does Root-Pruning which should also be disabled for such an run.

--
Srdja
This was done long ago in Aloril's old sheet. There was a hack by Tilps, I think, to have Leela pick moves based on value head assessment of all moves at root, effectively a d1 search using only value head.

Aloril tested this against an SF depth 1 search, and as I recall the best 64x6 net was a bit weaker than SF9 d1 and the best 128x10 was a bit better, so that size was where the crossover was.

The best 256x20 nets at the time were something like 500-600 points stronger with the d1 value head hack vs SF9 d1. The d1 value head hack with the strongest 256x20 was also much stronger than SF9 at 1k nodes per move, and dropped to something like -150 rating performance against SF9 at 10k nodes.

Of course, this was a very long time ago, early 2018, I think, so much has changed (and my memory's probably not 100% accurate).

Cheers!