WIP, Eta - GPGPU ANN based engine, RFC


Hrvoje Horvatic
Posts: 22
Joined: Mon Nov 10, 2014 1:53 pm

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by Hrvoje Horvatic »

smatovic wrote: Wed Feb 06, 2019 4:45 pm RFC, would be interested in any possible design flaws you see...
How do you define design flaw? I am a Software Architect, and what you are trying to do is something completely new, so there is no established background theory/knowledge that can be used to judge whether something is a "design flaw" or not...

There are opinions and educated guesses though, and I can give you mine, but no one can claim to be an expert on this and dismiss ideas as "design flaws"...
- parallel BestFirstMiniMax-Search on gpu
* game tree in gpu memory
* best node selected via score + UCT formula (visit count based)
* AlphaBeta Q-Search performed at leaf nodes to get a node score
I think UCT doesn't work for chess... the reason Leela and A0 show good results is that they run on enormously powerful hardware, practically supercomputers...

UCT is good for strategic games, but chess is both strategic AND tactical... what UCT does is sample a subspace and estimate an "average" value, which is useless in the case of combinations...

Best-first search in general is not "auto-correcting" like alpha-beta... early mistakes are not corrected, so the search very often gets trapped in one part of the tree... A0 has a constant that balances exploration against exploitation, but that approach in general doesn't work; it's more of a "patch" to cover a wound than a real solution to the problem...
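
For reference, this is the kind of visit-count-based selection value being discussed; a generic UCB1-style sketch, with C_explore standing in for the exploration/exploitation constant (not any engine's tuned value):

Code: Select all

#include <cmath>

// Generic UCT/UCB1 node-selection value: average ("sampled") score plus a
// visit-count-based exploration term weighted by the constant discussed above.
double uct_value(double totalScore, int visits, int parentVisits, double C_explore = 1.4) {
    if (visits == 0) return 1e9;                 // always try unvisited children first
    double exploitation = totalScore / visits;   // the "average value" UCT estimates
    double exploration  = C_explore * std::sqrt(std::log((double)parentVisits) / visits);
    return exploitation + exploration;
}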

If you have decided to go for best-first, you should check Stuart Russell's metareasoning approach... it is a mathematically sounder approach with some added benefits... start with the paper "Metareasoning for Monte Carlo Tree Search" (there are a few hundred papers on metareasoning, but this is a relevant place to start).
- multiple small MLP neural networks
* about 4 million weights per network
* 30 networks in total, split by piece count
I don't think an MLP will do it... A 3-layer perceptron doesn't have enough "representational power" to represent complex concepts, and if you add more layers, learning time grows exponentially... It is also pretty bad at extracting geometric concepts (which chess is all about)...

There are some domains in which MLPs shine brightly, but chess is not one of them... I recall reading several papers comparing MLPs with convolutional ANNs, and the MLPs lose every time... notice that multi-layer CNNs don't have the same problem of exponential learning time as MLPs, because they "cheat" in multiple ways, and they are also built from the ground up for geometric problems...

anyway, just my 2 cents... NOT design flaws but opinions...

I wish you all the best, and don't quit easily... :)
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

Eta - v0501

Recently the new neural-network technique 'NNUE' took off in CPU-based chess engines like Stockfish, leveraging the vector unit of the CPU for NN inference and replacing the HCE (handcrafted evaluation) with a neural network. Hence, with NNUE, a hybrid design with BestFirst search on the CPU and MiniMax search with NNUE eval on the GPU seems possible and within reach. The CPU host would store and expand the game tree in memory, similar to Lc0's MCTS; the GPU would perform shallow AlphaBeta searches (primarily quiescence-search playouts, to avoid the horizon effect), similar to Lc0's MCTS playouts.
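
Something like this generic host-side sketch; the Node type, the tree handling and the gpu_qsearch_playout() wrapper are placeholders stubbed out so it compiles, not Lc0's or Zeta's actual code, and the real GPU call would enqueue the OpenCL playout kernel and batch many workers:

Code: Select all

#include <cmath>
#include <cstdint>
#include <vector>

struct Node {
    uint64_t key    = 0;       // position identifier (stand-in for a real board)
    float    score  = 0.0f;    // accumulated playout scores
    int      visits = 0;
    Node*    parent = nullptr;
    std::vector<Node*> children;
};

// Placeholder for the GPU side: the real version would enqueue an AB q-search
// playout with NNUE eval on the device and read back the score.
float gpu_qsearch_playout(const Node& leaf) {
    return float(leaf.key % 201) / 100.0f - 1.0f;   // dummy score in [-1, 1]
}

// Descend by average score plus a visit-count-based exploration term (UCT-like).
Node* select_best_leaf(Node* n) {
    while (!n->children.empty()) {
        Node* best = nullptr;
        float bestVal = -1e9f;
        for (Node* c : n->children) {
            float q = c->visits ? c->score / c->visits : 0.0f;
            float u = std::sqrt(std::log(float(n->visits + 1)) / float(c->visits + 1));
            if (q + u > bestVal) { bestVal = q + u; best = c; }
        }
        n = best;
    }
    return n;
}

// CPU-side expansion; in the real engine this is the move generator.
void expand(Node* n, int branching = 4) {
    for (int m = 0; m < branching; ++m) {
        Node* c = new Node();
        c->key    = n->key * 31 + m + 1;    // toy child key
        c->parent = n;
        n->children.push_back(c);
    }
}

void backpropagate(Node* n, float score) {
    for (; n != nullptr; n = n->parent) {
        n->visits += 1;
        n->score  += score;
        score = -score;                     // flip sign for the side to move each ply
    }
}

int main() {
    Node root;
    for (int i = 0; i < 1000; ++i) {
        Node* leaf = select_best_leaf(&root);
        if (leaf->visits > 0) { expand(leaf); leaf = leaf->children.front(); }
        backpropagate(leaf, gpu_qsearch_playout(*leaf));
    }
    return 0;   // (tree nodes intentionally leaked in this throwaway sketch)
}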

Coupling 32 gpu-threads into one worker, and assuming 2K clocks per node for move generation and the AB framework plus maybe another 2K clocks per node for NNUE inference, results in 1.44M gpu-clocks for a 36x10-node q-search. In such a design the host-device latency (aka kernel-launch overhead) of maybe 10 microseconds does not affect the overall performance. From entry-level GPUs with 512 cores (16 workers) to high-end GPUs with 5120 cores (160 workers), the throughput of such a parallel-BestFirst-on-CPU and AB-playout+NNUE-eval-on-GPU design could range from ~11K to ~220K node-playouts/s, more than Lc0's GPU throughput, but with a switch from MCTS-PUCT to parallel BestFirstMiniMax-Search and from CNN to NNUE evaluation.
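
A back-of-the-envelope check of those numbers; the GPU clock rates (~1 GHz entry-level, ~2 GHz high-end) are assumptions of mine:

Code: Select all

#include <cstdio>

int main() {
    const double clocksPerNode    = 2000.0 + 2000.0;   // movegen/AB + NNUE inference
    const double nodesPerPlayout  = 36.0 * 10.0;       // the 36x10 q-search
    const double clocksPerPlayout = clocksPerNode * nodesPerPlayout;   // 1.44M gpu-clocks

    const double entryLevel = 1.0e9 / clocksPerPlayout * 16.0;    // 16 workers @ ~1 GHz
    const double highEnd    = 2.0e9 / clocksPerPlayout * 160.0;   // 160 workers @ ~2 GHz
    printf("~%.0fK to ~%.0fK playouts/s\n", entryLevel / 1e3, highEnd / 1e3);   // roughly the ~11K to ~220K range quoted above
    return 0;
}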

I am not familiar with the details of current NNUE implementations for CPUs, so the estimated 2K gpu-clocks per node for NNUE inference is the biggest uncertainty.
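
For what it is worth, a very rough sanity check of that figure, assuming the original Stockfish HalfKP 256x2-32-32 NNUE topology, a cheap incremental accumulator update, and an idealized one MAC per thread per clock across the 32 coupled gpu-threads (all assumptions, not measurements):

Code: Select all

#include <cstdio>

int main() {
    const long accUpdate = 4 * 256;               // ~4 changed feature indices (both perspectives) x 256-wide accumulator
    const long dense     = 512*32 + 32*32 + 32;   // dense layers after the accumulator
    const long macs      = accUpdate + dense;     // ~18.5K MACs per incremental eval
    printf("~%ld clocks/node at 1 MAC per thread per clock on 32 threads\n", macs / 32);   // ~580
    return 0;
}

So roughly 600 idealized clocks per node; int8 packing, clipped ReLU and memory latency would push the real number up, which is why 2K clocks does not look unreasonable as an estimate.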

I have no experience with running 16 to 160 parallel tasks via OpenCL on a GPU, and I am not sure whether 160 unique command-queues are manageable for the CPU-GPU interaction.

More on the Eta blog:

https://eta-chess.app26.de

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

A BestFirst-on-CPU and MiniMax-playout-with-NNUE-eval-on-GPU design could utilize the AB framework of the Zeta v099 gpu-engine. But considering just a ply-1 + quiescence search, an alternative implementation as a LIFO stack seems reasonable. This would simplify the iterative implementation of a recursive AB search for the GPU architecture. Couple 32 gpu-threads to run on one SIMD unit of the GPU, use these 32 threads for move generation (piece-wise parallelization) and NNUE evaluation inference, store the game tree as a doubly-linked list in VRAM, and apply LIFO-stack-based processing on the game tree with AB pruning; something like the sketch below.
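
A minimal single-threaded sketch of the LIFO-stack idea, i.e. a recursive alpha-beta rewritten as an explicit stack loop of the kind that maps onto a GPU kernel; the "game" is a toy stand-in (integer positions, three moves each, an arbitrary leaf evaluation), not the Zeta move generator or NNUE:

Code: Select all

#include <algorithm>
#include <cstdio>

constexpr int MAX_PLY = 16;
constexpr int BRANCH  = 3;
constexpr int INF     = 100000;

// Toy stand-ins: a "position" is just an int, every position has BRANCH moves.
static int evaluate(int pos) {                       // arbitrary deterministic leaf score
    return int((unsigned(pos) * 2654435761u >> 16) % 101) - 50;
}
static int make_move(int pos, int m) { return pos * BRANCH + m + 1; }

struct Frame {
    int pos;          // position at this ply
    int alpha, beta;  // search window
    int best;         // best score found so far at this node
    int nextMove;     // next move index to try
};

// Recursive alpha-beta (negamax) flattened into a LIFO-stack loop.
int search(int rootPos, int depth) {
    Frame stack[MAX_PLY + 1];
    int sp = 0;
    stack[0] = {rootPos, -INF, INF, -INF, 0};

    while (true) {
        Frame& f = stack[sp];
        bool done = (f.nextMove == BRANCH) || (f.alpha >= f.beta);   // all moves tried or cutoff
        if (sp == depth) {                       // horizon: static evaluation
            f.best = evaluate(f.pos);
            done = true;
        }
        if (done) {
            if (sp == 0) return f.best;          // root finished
            int score = -f.best;                 // negamax: flip sign for the parent
            Frame& p = stack[sp - 1];
            p.best  = std::max(p.best, score);
            p.alpha = std::max(p.alpha, score);  // may trigger a cutoff when the parent is on top again
            --sp;                                // pop this frame
            continue;
        }
        int child = make_move(f.pos, f.nextMove++);              // pick the next move...
        stack[++sp] = {child, -f.beta, -f.alpha, -INF, 0};       // ...and push its frame
    }
}

int main() {
    printf("root score: %d\n", search(/*rootPos=*/0, /*depth=*/6));
    return 0;
}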

This design seems like a natural fit for a CPU-GPU architecture to me; you can also run multiple SIMT waves of workers to increase VALU utilization. It is like a marriage between BestFirst and AB with NNUE, and I like the NNUE-driven BestFirst component. Another step could be to do Reinforcement Learning on a gpu-cloud-cluster, like Lc0.

I concluded that Eta is too big for me as a hobby project and I won't implement it; maybe someone else finds useful information in here, or the like.

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

Oops, too much coffee... LIFO would not work with NNUE, but it would with a classic NN ;)

--
Srdja
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by dkappe »

I’ve been combining AB search with MCTS for a while. A few observations:

1. MCTS/NN plays really well at low node counts vs AB. For example, with tree reuse, the TCEC network at 100-120 nodes per move is on par with SF14 d11 (d13 in the endgame).
2. I’ve found that applying AB to the MCTS/NN tree works best when boosting policy on frequently visited nodes, e.g. the best move from a shallow AB search on a node that has just gotten 100 visits will help the MCTS search hone in on forcing moves, etc. (a rough sketch follows below).
3. The AB q-search is pretty much going to be worse than a well-trained resnet’s eval.

There’s still lots of experimentation to be done with combining AB and MCTS.
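
Roughly, the idea from point 2 as a sketch; the Node/Edge types and the shallow_ab_bestmove() call are stand-ins (stubbed so it compiles), and the visit threshold and boost amount are placeholder values:

Code: Select all

#include <vector>

struct Edge { int move; float prior; };          // policy prior per child move
struct Node { int visits = 0; std::vector<Edge> edges; };

// Stub standing in for a shallow (say depth 4-6) AB search from this node;
// here it just pretends AB always likes the first move so the sketch compiles.
int shallow_ab_bestmove(const Node& n, int /*depth*/) { return n.edges.front().move; }

// Once a node has accumulated enough visits, run the shallow AB search and bump
// the policy prior of the move AB prefers, then renormalize the distribution.
void maybe_boost_policy(Node& n, int visitThreshold = 100, float boost = 0.25f) {
    if (n.visits != visitThreshold || n.edges.empty()) return;   // trigger exactly once
    const int best = shallow_ab_bestmove(n, /*depth=*/4);
    const float total = 1.0f + boost;            // priors assumed to sum to 1 beforehand
    for (Edge& e : n.edges) {
        if (e.move == best) e.prior += boost;
        e.prior /= total;
    }
}

int main() {
    Node n;
    n.edges  = {{0, 0.5f}, {1, 0.3f}, {2, 0.2f}};
    n.visits = 100;
    maybe_boost_policy(n);      // boosts move 0's prior and renormalizes
    return 0;
}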
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

dkappe wrote: Sun Jul 11, 2021 11:00 pm ...
3. The AB q-search is pretty much going to be worse than a well-trained resnet’s eval.
...
To get an idea of this, one could run an Lc0 depth-1 (no policy head, n playouts) vs SF NNUE depth-1 (root + AB q-search) match.

--
Srdja
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by Joerg Oster »

smatovic wrote: Mon Jul 12, 2021 6:02 am
dkappe wrote: Sun Jul 11, 2021 11:00 pm ...
3. The AB q-search is pretty much going to be worse than a well-trained resnet’s eval.
...
To get an idea of this, one could run an Lc0 depth-1 (no policy head, n playouts) vs SF NNUE depth-1 (root + AB q-search) match.

--
Srdja
Stockfish is weak at these low depths. No doubt.
Is it really possible to switch off the policy head for Lc0?
Jörg Oster
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

Joerg Oster wrote: Mon Jul 12, 2021 10:30 am
smatovic wrote: Mon Jul 12, 2021 6:02 am
dkappe wrote: Sun Jul 11, 2021 11:00 pm ...
3. The AB q-search is pretty much going to be worse than a well-trained resnet’s eval.
...
To get an idea of this, one could run an Lc0 depth-1 (no policy head, n playouts) vs SF NNUE depth-1 (root + AB q-search) match.

--
Srdja
Stockfish is weak at these low depths. No doubt.
Is it really possible to switch off the policy head for Lc0?
I don't know; maybe you have to touch the source code. Maybe SF also does root pruning, which should also be disabled for such a run.

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by smatovic »

Eta - v0501 - command-queues:

I figured I could run maybe 32 gpu-workers per CPU thread efficiently: one CPU thread iterates the game tree in memory for 32 workers that perform AB-playouts on the GPU, using one OpenCL command-queue per thread (see the sketch below).
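
A bare-bones sketch of the one-command-queue-per-host-thread setup, using the standard OpenCL C API from C++; the kernel launch itself is only indicated in comments since program/kernel/buffer setup is engine-specific, and error handling is omitted:

Code: Select all

#define CL_TARGET_OPENCL_VERSION 120   // use the OpenCL 1.2 API (clCreateCommandQueue)
#include <CL/cl.h>
#include <thread>
#include <vector>

int main() {
    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

    const int numHostThreads = 8;      // e.g. one queue (and its group of gpu-workers) per host thread
    std::vector<std::thread> threads;
    for (int t = 0; t < numHostThreads; ++t) {
        threads.emplace_back([&]() {
            // In-order command queue owned exclusively by this host thread.
            cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);
            // Per playout, this thread would select/expand its leaves on the CPU, then:
            //   clEnqueueWriteBuffer(q, ...);                    // upload positions
            //   clEnqueueNDRangeKernel(q, playoutKernel, ...);   // AB q-search + NNUE eval
            //   clEnqueueReadBuffer(q, ...);                     // read back scores
            //   clFinish(q);                                     // wait for its workers
            clReleaseCommandQueue(q);
        });
    }
    for (auto& th : threads) th.join();
    clReleaseContext(ctx);
    return 0;
}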

The project is in my pipeline again; I will work on Zeta NNUE first, and if that runs, I will use the Zeta AB-NNUE framework for Eta. It will take some time.

--
Srdja
MonteCarlo
Posts: 188
Joined: Sun Dec 25, 2016 4:59 pm

Re: WIP, Eta - GPGPU ANN based engine, RFC

Post by MonteCarlo »

smatovic wrote: Mon Jul 12, 2021 11:07 am
I don't know; maybe you have to touch the source code. Maybe SF also does root pruning, which should also be disabled for such a run.

--
Srdja
This was done long ago in Aloril's old sheet. There was a hack by Tilps, I think, to have Leela pick moves based on the value-head assessment of all moves at root, effectively a d1 search using only the value head.

Aloril tested this against an SF depth 1 search, and as I recall the best 64x6 net was a bit weaker than SF9 d1 and the best 128x10 was a bit better, so that size was where the crossover was.

The best 256x20 nets at the time were something like 500-600 points stronger with the d1 value head hack vs SF9 d1. The d1 value head hack with the strongest 256x20 was also much stronger than SF9 at 1k nodes per move, and dropped to something like -150 rating performance against SF9 at 10k nodes.

Of course, this was a very long time ago, early 2018, I think, so much has changed (and my memory's probably not 100% accurate).

Cheers!