Pytorch NNUE training

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

We've been trying a bunch of experiments, but without a huge amount of forward progress. We are at about -80 elo from master, so still a long way away. This is all still training on d5 data though, so I'm hopeful that higher-depth data will be a big win (unfortunately, it takes a really long time to generate 2B fens, which is how many we are training on for d5!).

Made a few discoveries - the scaling factor of 600 in the loss function was actually adjusted in the nodchip trainer to be `1.0 / PawnValueEg / 4.0 * 2.302585092994046` (that constant being `std::log(10.0)`), which works out to a scaling factor of roughly 361. We tried training with that, and unfortunately it didn't help much.

We also measured the loss function we were using, by converting the best SF net to pytorch format and then running different lambda values over it. It turned out that with the scale factor at 600, the SF best net had worse (higher) loss overall, which is a bad sign: it means our loss function is not representative of ELO (although it is correlated). So a big open question is whether we can find a loss function that maps well to ELO.
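For concreteness, here is a minimal sketch of the kind of sigmoid-space loss being discussed. The function name, tensor layout, and the exact lambda blend are my assumptions; only the scaling constants (600, and roughly 361 for the nodchip trainer) come from the post:

```python
import torch

def nnue_loss(pred_cp, teacher_cp, game_result, lam=1.0, scale=600.0):
    """MSE loss in sigmoid (win-probability) space.

    pred_cp:     network output, in centipawns
    teacher_cp:  teacher search score, in centipawns
    game_result: recorded outcome from the side to move (1.0/0.5/0.0)
    lam:         1.0 = pure teacher score, 0.0 = pure game outcome
    scale:       600 here; the nodchip trainer effectively uses
                 PawnValueEg * 4 / ln(10), roughly 361
    """
    p = torch.sigmoid(pred_cp / scale)      # predicted win probability
    q = torch.sigmoid(teacher_cp / scale)   # teacher win probability
    target = lam * q + (1.0 - lam) * game_result
    return torch.mean((p - target) ** 2)
```

With lambda = 1.0 the target is purely the teacher score, which is why a bug in the recorded game outcomes only taints runs with lambda != 1.0.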

Just today, noobpwnftw discovered a bug in the game outcomes recorded in the training data, so any training runs with lambda != 1.0 are a bit suspect.

Sopel is adding support for multi-GPU training, so we are hopefully going to be going through the data at lightspeed soon!
nkg114mc
Posts: 74
Joined: Sat Dec 18, 2010 5:19 pm
Location: Tianjin, China
Full name: Chao M.

Re: Pytorch NNUE training

Post by nkg114mc »

Hi Gary,

Thanks for sharing the thoughts and code! I once had the same idea of using pytorch for the NNUE training; glad that you guys have put it into action~ :)

One thing that once blocked me from using pytorch for the training is the inference step. In nodchip's original learner implementation, inference is done by running qsearch over a given sfen example, and then computing the gradient with respect to the leaf position of the qsearch PV (instead of the root sfen position read from the bin file). My understanding of why he did this: probably because he wants to find the real "converged position" reachable from the current sfen position, which is the actual position the evaluation function will be called on during a real game. If we did exactly the same qsearch inference as nodchip's implementation in NNUE training, the python code would have to depend on the SF qsearch implementation (or a python port of it), which would dramatically increase the complexity of the training code and hurt inference speed.

So I am curious how you resolved the "qsearch inference" issue above. Did you re-implement the Stockfish qsearch in python? And suppose we did not use qsearch for inference but directly called NNUE evaluation on the sfen position: do you think that would make a big difference to the learning result?
AndrewGrant
Posts: 1753
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Pytorch NNUE training

Post by AndrewGrant »

So a question, which is not really per this thread's name, but was touched on here.
What efforts can be taken, during training when using floats, to map the weights and biases into a smaller range? I.e., get most of the weights within [-2.0, 2.0], such that quantization can be done with minimal clipping AND so that overflows are avoided more often?
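One straightforward option (a sketch of a generic technique, not something this thread's trainer is confirmed to do) is to clamp the float parameters back into the quantization-friendly range after every optimizer step. The architecture and the [-2.0, 2.0] range below just mirror the question:

```python
import torch
import torch.nn as nn

# Hypothetical small net standing in for the real architecture.
net = nn.Sequential(nn.Linear(256, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def train_step(x, y):
    opt.zero_grad()
    loss = torch.mean((net(x) - y) ** 2)
    loss.backward()
    opt.step()
    # Clamp weights and biases in place so later quantization
    # to a fixed-point range loses as little as possible.
    with torch.no_grad():
        for p in net.parameters():
            p.clamp_(-2.0, 2.0)
    return loss.item()
```

The clamp acts as a hard constraint; an L2 or max-norm penalty on the weights is a softer alternative that pushes the distribution inward rather than clipping it.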

I have a wonky network setup right now, where the input layer is int16, as are the first layer's weights, and then afterwards everything is floats. Dropping the first layer to int8 would be a massive speed gain, and done trivially with the AVX operations. It would also open up an optimization with the 2nd layer, getting native int16_t output.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

AndrewGrant wrote: Wed Nov 25, 2020 1:02 pm So a question, which is not really per this thread's name, but was touched on here.
What efforts can be taken, during training when using floats, to map the weights and biases into a smaller range? I.e., get most of the weights within [-2.0, 2.0], such that quantization can be done with minimal clipping AND so that overflows are avoided more often?

I have a wonky network setup right now, where the input layer is int16, as are the first layer's weights, and then afterwards everything is floats. Dropping the first layer to int8 would be a massive speed gain, and done trivially with the AVX operations. It would also open up an optimization with the 2nd layer, getting native int16_t output.
Quantization aware training is the best way I know to do this. That's where the training process simulates the quantization down to 8 bits, so the loss function can take that into account. https://pytorch.org/tutorials/advanced/ ... e-training is an example from Pytorch.
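The core trick can be sketched without the full PyTorch quantization tooling: a fake-quantize step that rounds weights to the int8 grid in the forward pass but lets gradients through unchanged (a straight-through estimator). The scale of 64 below is an arbitrary assumption for illustration:

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Simulate int8 quantization in the forward pass; pass gradients
    straight through in the backward pass (straight-through estimator)."""

    @staticmethod
    def forward(ctx, w, scale):
        # Round onto the int8 grid, clamp to its range, return to float,
        # so the loss is computed against the quantized weights.
        return torch.clamp(torch.round(w * scale), -128, 127) / scale

    @staticmethod
    def backward(ctx, grad_out):
        # Rounding has zero gradient almost everywhere, so pretend
        # the op was the identity. No gradient for `scale`.
        return grad_out, None
```

In a layer's forward pass one would use `FakeQuant.apply(self.weight, 64.0)` in place of `self.weight`, so training sees exactly the clipping and rounding error that the final int8 net will have.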
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

nkg114mc wrote: Wed Nov 25, 2020 8:05 am So I am curious how did you resolve the "qsearch inference" issue above? Did you re-implement the Stockfish qsearch in python? Let's say suppose we did not use the qsearch to do inference but directly call NNUE evaluation over the sfen position, do you think that will make a big difference on learning result?
Sopel implemented a version of the nodchip gensfen command that does the same as the training qsearch evaluation, and writes those positions out as the training data. But we've been experimenting here, and also just trying skipping non-quiet positions (where the recorded bestmove is a capture), and that seems to be working quite well too. So I think there is a lot of room to experiment here.
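A rough sketch of the capture-skipping filter described above, in plain python. The FEN parsing is deliberately minimal and the helper names are made up; the real trainer works on binary training data, not FEN strings, and a real filter would also handle en passant, promotions, and checks:

```python
def square_piece(fen, square):
    """Return the piece character on `square` (e.g. 'e5'), or None if
    the square is empty. Parses only the board field of the FEN."""
    ranks = fen.split()[0].split('/')
    file = ord(square[0]) - ord('a')
    rank = 8 - int(square[1])          # FEN lists rank 8 first
    col = 0
    for ch in ranks[rank]:
        if ch.isdigit():
            col += int(ch)             # run of empty squares
            if col > file:
                return None
        else:
            if col == file:
                return ch
            col += 1
    return None

def is_quiet(fen, best_move_uci):
    """True if the recorded best move does not land on an occupied
    square. (En passant captures land on an empty square and are
    missed here; promotions and checks are also not considered.)"""
    return square_piece(fen, best_move_uci[2:4]) is None
```

A data loader could then simply drop samples where `is_quiet` returns False before batching.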
jonkr
Posts: 178
Joined: Wed Nov 13, 2019 1:36 am
Full name: Jonathan Kreuzer

Re: Pytorch NNUE training

Post by jonkr »

For my neural net training (somewhat different, not NNUE), I had at some point turned off the qsearch in my pgn->positions step. A few days ago I turned it on again, rebuilt my positions and nets, and tested at +3.2 elo. Not definitive given just 20,000 very fast games, but strength has continued to improve with training and additional testing, so it's unlikely it was just one randomly good result. So I think for small endgame nets, not needing to handle non-quiet positions is a small plus.

I think I originally turned off qsearch when I started training some new nets and noticed the qsearch kept making weird captures: it used the net value to decide whether a capture was good, and once it started making a capture, that got reinforced, since I use the game result as the target value. Now I turn off the nets for the training-position qsearch and use the old eval instead.

This thread has been interesting to read; it shows there was some good work that went into the NNUE trainer if it's not easy to recreate (I suppose the strength increase over the hand-coded evals shows that too). Have you tried any different search depths yet to see how that affects things?
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

Another quick update on some experiments:
- Tried training from scratch using 1B d20 positions, and it was disastrous. ELO was far worse than with the d5 positions. So, interestingly, it appears that depth is just way too deep for the network to learn from.

- vondele validated that you can train a net equal to the master net using d0 data, starting from the master net weights and then training for a while. So that validates that lambda 1.0 is working well, and that the training code is capable of matching master net strength. How to get there from arbitrary data is the open question.

Current line of thought is doing some low LR training of master net, with the d20 data.
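In outline, the "start from master weights, train at low LR" recipe might look like the sketch below. The architecture, file name, optimizer, and learning rate are all placeholders; the real master net is a .nnue file that would first need converting to a pytorch state dict:

```python
import torch
import torch.nn as nn

def make_net():
    # Stand-in architecture; the real NNUE net is a large feature
    # transformer followed by small dense layers.
    return nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

def finetune_from(master_path, lr=1e-5):
    net = make_net()
    # Load the converted master weights (hypothetical file name).
    net.load_state_dict(torch.load(master_path))
    # A low learning rate so training refines the existing net
    # instead of overwriting what it already knows.
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    return net, opt
```

The point of the low LR is that the starting point is already near a good optimum, so large steps on noisy d20 data would mostly destroy it.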
User avatar
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Pytorch NNUE training

Post by xr_a_y »

But the current official/master nets have been built with data of what depth?
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

xr_a_y wrote: Thu Nov 26, 2020 8:55 pm But the current official/master nets have been built with data of what depth?
I believe Sergio used depth 12-14 data, with about 10 billion positions.
tomitank
Posts: 276
Joined: Sat Mar 04, 2017 12:24 pm
Location: Hungary

Re: Pytorch NNUE training

Post by tomitank »

Hi @gladius:

have you tried a method similar to mine?
http://talkchess.com/forum3/viewtopic.p ... 25#p874604

I mean, the target (output) is pre-activated. I am currently trying it with a few samples, but I have already gained a few elo.
Because my hardware is limited, I don't want to waste time unnecessarily.