WIP, Eta - GPGPU ANN based engine, RFC

smatovic · Post by **smatovic** » Wed Feb 06, 2019 4:45 pm

Heyho,

inspired by the recent discussions on CCC i want to give a parallel BestFirst-
MiniMax-Search with ANN eval on gpu a try.

Here the design feature list:

- GPGPU device based
* host handles only the IO, search and ANN inference on gpu
* gpu computation will be limited by node count to about 1 second per
repeated iteration, to avoid any system timeouts

- parallel BestFirstMiniMax-Search on gpu
* game tree in gpu memory
* best node selected via score + UCT formula (visit count based)
* AlphaBeta Q-Search performed at leafnodes to get a node score

- multiple small MLP neural networks
* about 4 million weights per network
* 30 networks in total, split by piece count

- trained via TD-leaf by pgn games
* 6/7 men EGTB could be used for training?

- 64 gpu threads are coupled to one worker
* used during move gen, move pick and ANN eval in parallel
* gpu core count depended from 64 workers to 2048 workers in total

Some quick and dirty benchmarks showed that with this design ~1 Knps per worker
is possible.

RFC, would be interested in any possible design flaws you see...

--
Srdja

grahamj · Post by **grahamj** » Wed Feb 06, 2019 5:23 pm

smatovic wrote: ↑Wed Feb 06, 2019 4:45 pm small MLP neural networks
[...]
* game tree in gpu memory
* best node selected via score + UCT formula (visit count based)

Srdja

Small NNs means the evaluation will not be very good, so you need a large search tree. The UCT method requires storing the whole search tree. How will that fit in memory?

smatovic · Post by **smatovic** » Wed Feb 06, 2019 5:46 pm

grahamj wrote: ↑Wed Feb 06, 2019 5:23 pm Small NNs means the evaluation will not be very good, so you need a large search tree.

I hope the Q-Search at leafnodes will do some magic.

grahamj wrote: ↑Wed Feb 06, 2019 5:23 pm The UCT method requires storing the whole search tree. How will that fit in memory?

With the game tree stored in gpu memory propably only blitz time controls will
be possible with that engine.

I need about 40 Bytes per node, i presume about 2 Mnps on an highend gpu,
~10 per cent will be expanded nodes, the rest Q-Search nodes, so 200K
expanded nodes per second need about 8 MB gpu memory.

One big caveat is that in OpenCL you can not use all of gpu memory,
Nvidia offers only 1/4 for allocation,
on AMD site it differs between drivers and hardware.

--
Srdja

grahamj · Post by **grahamj** » Wed Feb 06, 2019 7:43 pm

smatovic wrote: ↑Wed Feb 06, 2019 5:46 pm I hope the Q-Search at leafnodes will do some magic.

If the NN is only called to evaluate quiescent positions, it should mostly, perhaps entirely, be trained on quiescent positions. It ciould be tricky to collect suitable training data, especially If your definition of quiescent is complicated.

People usually hope the NN will do some magic to avoid the need for Q-search.

smatovic wrote: ↑Wed Feb 06, 2019 5:46 pm One big caveat is that in OpenCL you can not use all of gpu memory,
Nvidia offers only 1/4 for allocation,
on AMD site it differs between drivers and hardware.

That's a pain. Do you understand why this is?

smatovic · Post by **smatovic** » Fri Feb 08, 2019 6:27 pm

grahamj wrote: ↑Wed Feb 06, 2019 7:43 pm
smatovic wrote: ↑Wed Feb 06, 2019 5:46 pm I hope the Q-Search at leafnodes will do some magic.
If the NN is only called to evaluate quiescent positions, it should mostly, perhaps entirely, be trained on quiescent positions. It ciould be tricky to collect suitable training data, especially If your definition of quiescent is complicated.

People usually hope the NN will do some magic to avoid the need for Q-search.

Hmm, maybe the other way around,
do you think that LC0 would profit by adding a Quiescence-Search at zero costs?

If yes, then my approach could make sense.
If no, then there is nothing to gain.

grahamj wrote: ↑Wed Feb 06, 2019 7:43 pm
smatovic wrote: ↑Wed Feb 06, 2019 5:46 pm One big caveat is that in OpenCL you can not use all of gpu memory,
Nvidia offers only 1/4 for allocation,
on AMD site it differs between drivers and hardware.
That's a pain. Do you understand why this is?

Nvidia says that's the way they interpret the OpenCL standard.

Afaik, you can allocate more than 1/4 in practice,
but people in the forums say that this could slow things down.

Dunno what the real reason is, maybe some kind of limited address space.

--
Srdja

grahamj · Post by **grahamj** » Sat Feb 09, 2019 9:54 am

smatovic wrote: ↑Fri Feb 08, 2019 6:27 pm
grahamj wrote: ↑Wed Feb 06, 2019 7:43 pm
smatovic wrote: ↑Wed Feb 06, 2019 5:46 pm I hope the Q-Search at leafnodes will do some magic.
If the NN is only called to evaluate quiescent positions, it should mostly, perhaps entirely, be trained on quiescent positions. It ciould be tricky to collect suitable training data, especially If your definition of quiescent is complicated.

People usually hope the NN will do some magic to avoid the need for Q-search.
Hmm, maybe the other way around,
do you think that LC0 would profit by adding a Quiescence-Search at zero costs?

If yes, then my approach could make sense.
If no, then there is nothing to gain.

--
Srdja

I don't know what you mean by 'zero cost'. More interestingly, I don't know exactly what you mean by 'quiescent' and I think you have to be precise.

If you have an evaluation function f, the appropriate definition of quiescent for f is 'positions where f is innaccurate'. It is easy to see that a simple material evaluation will be innaccurate in the middle of a capture sequence. But characterising the positions where LC0's NN is innaccurate may be as hard as making the NN more accurate in those positions.

If you settle on a fixed definition of quiescent, X, you might be able to come up with a variant of TDL where you train on X positions using lookahead with q-search to other X positions. This might make the NN's task simpler so it will be more accurate for X positions.

smatovic · Post by **smatovic** » Sat Feb 09, 2019 12:49 pm

smatovic wrote: ↑Fri Feb 08, 2019 6:27 pm
grahamj wrote: ↑Wed Feb 06, 2019 7:43 pm
smatovic wrote: ↑Wed Feb 06, 2019 5:46 pm I hope the Q-Search at leafnodes will do some magic.
If the NN is only called to evaluate quiescent positions, it should mostly, perhaps entirely, be trained on quiescent positions. It ciould be tricky to collect suitable training data, especially If your definition of quiescent is complicated.

People usually hope the NN will do some magic to avoid the need for Q-search.
Hmm, maybe the other way around,
do you think that LC0 would profit by adding a Quiescence-Search at zero costs?

If yes, then my approach could make sense.
If no, then there is nothing to gain.

--
Srdja

grahamj wrote: ↑Sat Feb 09, 2019 9:54 am I don't know what you mean by 'zero cost'. More interestingly, I don't know exactly what you mean by 'quiescent' and I think you have to be precise.

With 'zero cost' i mean adding a qsearch to LC0 without any additional computation costs, no nps drop or similar, as thought experiment.

I mean the qsearch, as defined on CPW, used to avoid the horizon effect:

https://www.chessprogramming.org/Quiescence_Search

But to be honest, i don't know enough about A0/LC0 to say if a qsearch can be simply added.

grahamj wrote: ↑Sat Feb 09, 2019 9:54 am If you have an evaluation function f, the appropriate definition of quiescent for f is 'positions where f is innaccurate'. It is easy to see that a simple material evaluation will be innaccurate in the middle of a capture sequence. But characterising the positions where LC0's NN is innaccurate may be as hard as making the NN more accurate in those positions.

If you settle on a fixed definition of quiescent, X, you might be able to come up with a variant of TDL where you train on X positions using lookahead with q-search to other X positions. This might make the NN's task simpler so it will be more accurate for X positions.

The plan with a BestFirstMiniMax-Search is not to improve the NN eval,
but to improve the tactical play of BestFirst based search algorithms that use NN for evaluation.

In theory the mentioned Q-Search at leafnodes can be extended to an classic AlphaBeta-Search with depth n,
that is the idea of BestFirstMiniMax, to perform a shallow MiniMax search at leaf nodes to get a node score,
the deeper the search, the more accurate is the node score.

Best-first minimax search, Richard E. Korf and David Maxwell Chickering,
http://maxchickering.com/publications/aij96.pdf

--
Srdja

grahamj · Post by **grahamj** » Sat Feb 09, 2019 3:44 pm

smatovic wrote: ↑Sat Feb 09, 2019 12:49 pm With 'zero cost' i mean adding a qsearch to LC0 without any additional computation costs, no nps drop or similar, as thought experiment.
--
Srdja

I don't know enough to comment on your search algorithm. I can suggest a real experiment which might help you evaluate your q-search idea. In viewtopic.php?f=2&t=69845#p789268 dkappe points out that LC0's value head often disagrees a lot with SF10 100k nodes. Are these positions that would be regarded as non-quiescent by https://www.chessprogramming.org/Quiescence_Search ? If so, your method seems more promising.

smatovic · Post by **smatovic** » Sat Feb 09, 2019 4:25 pm

grahamj wrote: ↑Sat Feb 09, 2019 3:44 pm
smatovic wrote: ↑Sat Feb 09, 2019 12:49 pm With 'zero cost' i mean adding a qsearch to LC0 without any additional computation costs, no nps drop or similar, as thought experiment.
--
Srdja
I don't know enough to comment on your search algorithm. I can suggest a real experiment which might help you evaluate your q-search idea. In viewtopic.php?f=2&t=69845#p789268 dkappe points out that LC0's value head often disagrees a lot with SF10 100k nodes. Are these positions that would be regarded as non-quiescent by https://www.chessprogramming.org/Quiescence_Search ? If so, your method seems more promising.

Thanks for the experiment idea,
i consider the first position as quiet, the other two have capture moves,
maybe i have to analyze some more LC0 blind spots to figure out
if the qsearch idea would work.

***edit***
Ah, dkappe posted some more positions, will take a look, thx.

--
Srdja

smatovic · Post by **smatovic** » Tue Feb 26, 2019 3:36 pm

Okay, some further, not so dirty, benchmarks showed

~240 nps for Nvidia GTX 750 and
~120 nps for AMD Fury X

per worker.

I assume on modern gpus about 200 nps.

While NN cache could be able to double these values,
this is imo a bit too slow for the intended search algorithm,
considering about 36x10 qsearch positions on average per expanded node,
one worker would need about a second to get a node score.

Back to pen n paper.

--
Srdja

WIP, Eta - GPGPU ANN based engine, RFC

WIP, Eta - GPGPU ANN based engine, RFC

Re: WIP, Eta - GPGPU ANN based engine, RFC

Re: WIP, Eta - GPGPU ANN based engine, RFC

Re: WIP, Eta - GPGPU ANN based engine, RFC

Re: WIP, Eta - GPGPU ANN based engine, RFC

Re: WIP, Eta - GPGPU ANN based engine, RFC

Re: WIP, Eta - GPGPU ANN based engine, RFC

Re: WIP, Eta - GPGPU ANN based engine, RFC

Re: WIP, Eta - GPGPU ANN based engine, RFC

Re: WIP, Eta - GPGPU ANN based engine, RFC - benchs