Pytorch NNUE training


gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

The adventure continues! So far, I've managed to validate that we can import a quantized SF NNUE net into the pytorch float32 representation, and export it back without any loss (even the same sha for the nnue net). So the quantization/dequantization code looks solid. That's done with https://github.com/glinscott/nnue-pytor ... ze.py#L128. Also, validated that the evals from the SF NNUE nets imported to pytorch match the evals from the SF code. So the baseline stuff is looking fairly solid.
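
To make the round-trip check concrete, here is a minimal sketch of the "quantize, dequantize, re-quantize, compare" idea. This is not the actual serialize.py code: the scale constant, tensor shape, and function names are placeholders.

Code: Select all

import torch

# Illustrative only: the real exporter uses the net's actual scales and shapes.
SCALE = 127.0

def quantize(weights: torch.Tensor) -> torch.Tensor:
    # float32 -> int16, the direction used when writing a .nnue file
    return torch.round(weights * SCALE).to(torch.int16)

def dequantize(qweights: torch.Tensor) -> torch.Tensor:
    # int16 -> float32, the direction used when importing a .nnue file
    return qweights.to(torch.float32) / SCALE

# A lossless round trip means quantize(dequantize(q)) == q for every tensor,
# which is what makes the re-exported net byte-identical (same sha).
q_original = torch.randint(-1000, 1000, (16, 32), dtype=torch.int16)
assert torch.equal(quantize(dequantize(q_original)), q_original)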

Now, for the bad news! The initial nets trained against 10m d5 games were completely disastrous, about -700 elo or so. For comparison, the nodchip trainer gets to about -350 elo or so on the same dataset. The nodchip trainer also has this concept of a factorizer, which generalizes the positions fed into the net. I will need to look deeper into that.

For next steps, going to actually train on a reasonable amount of games - 500m or so, and then see how we are doing.
OfekShochat
Posts: 50
Joined: Thu Oct 15, 2020 10:19 am
Full name: ghostway

Re: Pytorch NNUE training

Post by OfekShochat »

That is great! Thanks gladius

gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

An initial training run by vondele on 256M positions is looking much more promising (also takes about 10 minutes to do an epoch - a pass through the 256M positions, which is great). Getting closer! Note that all training runs so far have been with lambda = 1.0 (train to the evaluation, not the game result).

Code: Select all

Rank Name      Elo   +/-   Games   Score   Draws
   1 master    253     5   16205   81.1%   30.9%
   2 epoch7     -6     4   16205   49.2%   46.3%
   3 epoch6     -8     4   16205   48.8%   46.7%
   4 epoch3    -29     4   16206   45.9%   46.4%
   5 epoch0   -190     5   16205   25.0%   34.1%
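
For reference, lambda = 1.0 means the training target is built entirely from the search evaluation. A minimal sketch of how such a blended target could be formed is below; the sigmoid scaling constant and function names are assumptions, not the trainer's actual loss code.

Code: Select all

import torch

def blended_target(search_eval_cp: torch.Tensor,
                   game_result: torch.Tensor,
                   lambda_: float) -> torch.Tensor:
    # search_eval_cp: teacher evaluation in centipawns
    # game_result:    1.0 win, 0.5 draw, 0.0 loss from the side to move's view
    # lambda_ = 1.0 -> train purely to the evaluation (as in the runs above)
    # lambda_ = 0.0 -> train purely to the game result
    # The 410 centipawn-to-probability scale is a placeholder value.
    eval_target = torch.sigmoid(search_eval_cp / 410.0)
    return lambda_ * eval_target + (1.0 - lambda_) * game_result

# With lambda_ = 1.0 the game result column is ignored entirely.
print(blended_target(torch.tensor([0.0, 150.0, -300.0]),
                     torch.tensor([0.5, 1.0, 0.0]), lambda_=1.0))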
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: Pytorch NNUE training

Post by elcabesa »

I'm studying your code and Nodchip trainer.

In the nodchip trainer, the position from which the feature list is calculated doesn't seem to be the one stored in the gensfen bin file, but rather the position resulting from a qsearch. I think this helps with resolving recaptures and other non-quiet positions fed to the learner.

Hope this can help
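
To illustrate the idea, here is a minimal sketch of resolving a position with a capture-only search before extracting features. It uses python-chess and a toy material count purely so it runs on its own; the real trainer uses the engine's own qsearch and evaluation.

Code: Select all

import chess

# Toy piece values so the sketch is self-contained.
VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
          chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material(board: chess.Board) -> int:
    # Material balance from the side to move's point of view.
    return sum(v * (len(board.pieces(p, board.turn))
                    - len(board.pieces(p, not board.turn)))
               for p, v in VALUES.items())

def qsearch(board: chess.Board, alpha=-10**9, beta=10**9):
    # Capture-only negamax; returns (score, resolved quiet position).
    best_score, best_board = material(board), board.copy()
    if best_score >= beta:
        return best_score, best_board
    alpha = max(alpha, best_score)
    for move in list(board.generate_legal_captures()):
        board.push(move)
        score, resolved = qsearch(board, -beta, -alpha)
        board.pop()
        if -score > best_score:
            best_score, best_board = -score, resolved
            if best_score >= beta:
                break
            alpha = max(alpha, best_score)
    return best_score, best_board

# The trainer would compute the feature list from `quiet_pos`, not from the
# position as stored in the .bin file.
_, quiet_pos = qsearch(chess.Board(
    "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 0 3"))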
David Carteau
Posts: 121
Joined: Sat May 24, 2014 9:09 am
Location: France
Full name: David Carteau

Re: Pytorch NNUE training

Post by David Carteau »

gladius wrote: Sat Nov 14, 2020 10:30 pm An initial training run by vondele on 256M positions is looking much more promising (also takes about 10 minutes to do an epoch - a pass through the 256M positions, which is great).
Wow, it takes my trainer about... 3 days to perform only one epoch (360M positions) ! Great job !
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

elcabesa wrote: Sun Nov 15, 2020 9:37 am I'm studying your code and Nodchip trainer.

In the nodchip trainer, the position from which the feature list is calculated doesn't seem to be the one stored in the gensfen bin file, but rather the position resulting from a qsearch. I think this helps with resolving recaptures and other non-quiet positions fed to the learner.

Hope this can help
Yes, this is a good point. The data was generated using a new option that Sopel added, called `ensure_quiet`, which writes only quiet positions out to the file. Training on non-quiet positions is thought to be worse (once we get this working super well, I'm curious how much Elo difference it will make, though).
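
As a rough illustration of what "quiet" filtering means, one could skip positions that are in check or that have a free capture available. This is only a hypothetical sketch using python-chess, not the actual ensure_quiet logic, whose exact criteria aren't shown here.

Code: Select all

import chess

def looks_quiet(board: chess.Board) -> bool:
    # Hypothetical filter, not the actual ensure_quiet test: reject positions
    # in check and positions where an undefended piece can be captured.
    if board.is_check():
        return False
    for move in board.generate_legal_captures():
        if not board.attackers(not board.turn, move.to_square):
            return False  # free capture available -> not quiet
    return True

# positions = [pos for pos in positions if looks_quiet(pos)]  # usage sketch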
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Pytorch NNUE training

Post by Rein Halbersma »

Have any people out here tried to directly import Keras/Tensorflow or PyTorch trained models (i.e. graphs + weights, not just weights) into C++?

For Keras/Tensorflow there is model.save() on the Python side and then LoadSavedModel on the C++ side to load a graph + weights and link against TF C++ libs. Then you get a Graph Session on which you can call run() to do a single eval() call. For PyTorch there is TorchScript to do similar things.

I wonder if such a route would be competitive compared to hand-written C++ evals that only import the trained weights but re-implement the network graph. At least for Tensorflow there seems to be some virtual function call overhead.
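
For the PyTorch/TorchScript route, the Python half looks roughly like the sketch below; the toy network is only a stand-in (the real NNUE architecture and sizes differ), and the saved file can then be loaded from C++ with torch::jit::load and run without any Python.

Code: Select all

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    # Toy stand-in for an NNUE-style net; sizes and layout are placeholders.
    def __init__(self, num_features=41024, hidden=256):
        super().__init__()
        self.ft = nn.Linear(num_features, hidden)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, us, them):
        acc = torch.cat([torch.clamp(self.ft(us), 0, 1),
                         torch.clamp(self.ft(them), 0, 1)], dim=1)
        return self.out(acc)

model = TinyNet().eval()
example = (torch.zeros(1, 41024), torch.zeros(1, 41024))
traced = torch.jit.trace(model, example)  # records graph + weights
traced.save("tinynet.pt")                 # torch::jit::load("tinynet.pt") in C++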
AndrewGrant
Posts: 1753
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Pytorch NNUE training

Post by AndrewGrant »

gladius wrote: Sat Nov 14, 2020 1:53 am The adventure continues! So far, I've managed to validate that we can import a quantized SF NNUE net into the pytorch float32 representation, and export it back without any loss (even the same sha for the nnue net). So the quantization/dequantization code looks solid. That's done with https://github.com/glinscott/nnue-pytor ... ze.py#L128. Also, validated that the evals from the SF NNUE nets imported to pytorch match the evals from the SF code. So the baseline stuff is looking fairly solid.

Now, for the bad news! The initial nets trained against 10m d5 games were completely disastrous, about -700 elo or so. For comparison, the nodchip trainer gets to about -350 elo or so on the same dataset. The nodchip trainer also has this concept of a factorizer, which generalizes the positions fed into the net. I will need to look deeper into that.

For next steps, going to actually train on a reasonable amount of games - 500m or so, and then see how we are doing.
I can give you a hint and say that the factorizer is extremely important; I don't think networks can even begin to compete with master networks without it.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Pytorch NNUE training

Post by Daniel Shawul »

Rein Halbersma wrote: Sun Nov 15, 2020 10:04 pm Have any people out here tried to directly import Keras/Tensorflow or PyTorch trained models (i.e. graphs + weights, not just weights) into C++?

For Keras/Tensorflow there is model.save() on the Python side and then LoadSavedModel on the C++ side to load a graph + weights and link against TF C++ libs. Then you get a Graph Session on which you can call run() to do a single eval() call. For PyTorch there is TorchScript to do similar things.

I wonder if such a route would be competitive compared to hand-written C++ evals that only import the trained weights but re-implement the network graph. At least for Tensorflow there seems to be some virtual function call overhead.
It is not competitive at all -- NNUE via TensorFlow C++ inference code is 300x slower than my hand-written eval.
https://github.com/dshawul/Scorpio/comm ... 3a4aaab826
I use this approach for bigger networks like ResNets, and it works fine there even on the CPU.
Usually evaluating the neural network takes much more time than the TensorFlow C++ overhead of 20 ms/call (which is the killer for NNUE).
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

AndrewGrant wrote: Sun Nov 15, 2020 10:31 pm
gladius wrote: Sat Nov 14, 2020 1:53 am The adventure continues! So far, I've managed to validate that we can import a quantized SF NNUE net into the pytorch float32 representation, and export it back without any loss (even the same sha for the nnue net). So the quantization/dequantization code looks solid. That's done with https://github.com/glinscott/nnue-pytor ... ze.py#L128. Also, validated that the evals from the SF NNUE nets imported to pytorch match the evals from the SF code. So the baseline stuff is looking fairly solid.

Now, for the bad news! The initial nets trained against 10m d5 games were completely disastrous, about -700 elo or so. For comparison, the nodchip trainer gets to about -350 elo or so on the same dataset. The nodchip trainer also has this concept of a factorizer, which generalizes the positions fed into the net. I will need to look deeper into that.

For next steps, going to actually train on a reasonable amount of games - 500m or so, and then see how we are doing.
I can give you a hint and say that the factorizer is extremely important; I don't think networks can even begin to compete with master networks without it.
Interesting! One of the experiments I had lined up was disabling the factorizer on the nodchip trainer and seeing how it did. But I’ll take your word for it :). I had already started implementing it; there are some really cool tricks the Shogi folks pulled off: zeroing the initial weights for the factored features, and then just summing them in at the end when quantizing. Very insightful technique!
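
For anyone following along, that trick reads roughly like the sketch below: the factor (generalized) feature weights start at zero during training and get summed into their corresponding real features when the net is quantized for export. The mapping used here is a made-up placeholder; the real feature-to-factor mapping is more involved.

Code: Select all

import torch

def coalesce_for_export(real_weights: torch.Tensor,
                        factor_weights: torch.Tensor,
                        real_to_factors: list) -> torch.Tensor:
    # real_weights:    (num_real_features, hidden) rows of the feature transformer
    # factor_weights:  (num_factor_features, hidden) rows for the generalized features
    # real_to_factors: for each real feature, the factor indices it belongs to
    coalesced = real_weights.clone()
    for real_idx, factor_idxs in enumerate(real_to_factors):
        for f in factor_idxs:
            coalesced[real_idx] += factor_weights[f]
    return coalesced

# Tiny example: 4 real features, 2 factor features, hidden size 8.
real = torch.randn(4, 8)
factors = torch.zeros(2, 8)     # zero-initialized, as described above
mapping = [[0], [0], [1], [1]]  # placeholder mapping
exported = coalesce_for_export(real, factors, mapping)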

Latest experiments have us about -200 elo from master, so a long way to go, but it’s at least going in the right direction.