Pytorch NNUE training

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Raphexon
Posts: 476
Joined: Sun Mar 17, 2019 12:00 pm
Full name: Henk Drost

Re: Pytorch NNUE training

Post by Raphexon »

AndrewGrant wrote: Wed Nov 18, 2020 11:38 am Gary,

I'm trying to output my networks in the SF format. You have a couple of different versions going on in your repo, but do https://github.com/glinscott/nnue-pytor ... rialize.py and https://github.com/glinscott/nnue-pytor ... r/model.py correspond to what would match Nodchip's outputs?

I've trained a network using the same loss as your model, and the loss looks good. However, loading the weights into Stockfish is failing. I can confirm that I am not having an off-by-one issue; I can print the weights/biases as I output them, and they match what is printed as I read them into Stockfish. Out of desperation, I tried all possible variations of transforming the matrices, to no avail. The results of games are -infinite Elo for the updated network, which implies a failure to load or quantize the weights correctly.

Can you confirm that those two .py files produce working Networks? Perhaps I need a fresh set of eyes, but porting to this format should have been trivial.
Aren't they outputting flip nets? (as opposed to 180-degree rotation)
So not compatible with current SF. Noob should know what lines have to be changed for flip nets to work.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

Making good progress! Still not at SF master level, but we are getting closer. Adding the factorizer (even a limited version) did indeed make a huge difference.

[Image: training and validation loss curves, with and without the factorizer]

The orange curve is the loss curve without the factorizer; blue is with it. So the net trains much faster, and to a lower validation loss (validation loss is on depth 10 data, while training is on depth 5 right now). I will do a little write-up on the learnings in a reply to a question about the factorizer below.

The best net we've tested on fishtest so far is at -87 elo to master, which is getting closer :). https://tests.stockfishchess.org/tests/ ... 2301d6ada8
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

AndrewGrant wrote: Wed Nov 18, 2020 11:38 am Gary,

I'm trying to output my networks in the SF format. You have a couple of different versions going on in your repo, but do https://github.com/glinscott/nnue-pytor ... rialize.py and https://github.com/glinscott/nnue-pytor ... r/model.py correspond to what would match Nodchip's outputs?

I've trained a network using the same loss as your model, and the loss looks good. However, loading the weights into Stockfish is failing. I can confirm that I am not having an off-by-one issue; I can print the weights/biases as I output them, and they match what is printed as I read them into Stockfish. Out of desperation, I tried all possible variations of transforming the matrices, to no avail. The results of games are -infinite Elo for the updated network, which implies a failure to load or quantize the weights correctly.

Can you confirm that those two .py files produce working Networks? Perhaps I need a fresh set of eyes, but porting to this format should have been trivial.
Yes, that's correct. The process to export to SF is to train a model using `python train.py` (which uses model.py), and then export the saved checkpoint to SF format using `python serialize.py last.ckpt nn.nnue`.

There were a few gotchas I hit when serializing nets to SF format. The output layer having different quantization weights was one. Making sure to train with clipped ReLU was another (originally I was using unclipped, and it didn't convert at all). The last was making sure the rows/columns line up as expected.
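For reference, here is a minimal sketch of a clipped ReLU in PyTorch (illustrative only, not code from the trainer; the [0, 1] clipping range is an assumption chosen to mirror the float interpretation of SF's 0..127 quantized activations):

Code: Select all

    # Minimal sketch of a clipped ReLU (illustrative; the [0, 1] range is
    # an assumption matching SF's 0..127 quantized activation range).
    import torch

    def clipped_relu(x):
        return torch.clamp(x, 0.0, 1.0)

    x = torch.tensor([-0.5, 0.3, 2.0])
    print(clipped_relu(x))  # tensor([0.0000, 0.3000, 1.0000])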

qmodel.py was an experiment for quantizing using the pytorch quantization process, which I still think has great promise, but it's not a high priority until we are beating master with the SF quantized nets. That will prove the data + training process is super solid, and then we can do a bunch of fun optimizations :).
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

Raphexon wrote: Wed Nov 18, 2020 1:32 pm
AndrewGrant wrote: Wed Nov 18, 2020 11:38 am Gary,

I'm trying to output my networks in the SF format. You have a couple of different versions going on in your repo, but do https://github.com/glinscott/nnue-pytor ... rialize.py and https://github.com/glinscott/nnue-pytor ... r/model.py correspond to what would match Nodchip's outputs?

I've trained a network using the same loss as your model, and the loss looks good. However, loading the weights into Stockfish is failing. I can confirm that I am not having an off-by-one issue; I can print the weights/biases as I output them, and they match what is printed as I read them into Stockfish. Out of desperation, I tried all possible variations of transforming the matrices, to no avail. The results of games are -infinite Elo for the updated network, which implies a failure to load or quantize the weights correctly.

Can you confirm that those two .py files produce working Networks? Perhaps I need a fresh set of eyes, but porting to this format should have been trivial.
Aren't they outputting flip nets? (as opposed to 180-degree rotation)
So not compatible with current SF. Noob should know what lines have to be changed for flip nets to work.
No, the pytorch trainer is using rotate nets to match SF master.
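To make the flip vs. rotate distinction concrete, here is a small illustrative sketch (not code from the trainer) for 0..63 square indices:

Code: Select all

    # Illustrative only: vertical flip vs. 180-degree rotation of a 0..63
    # square index (a1 = 0, h8 = 63).
    def flip_vertical(sq):
        return sq ^ 56   # mirror ranks only: a1 <-> a8

    def rotate_180(sq):
        return sq ^ 63   # same as 63 - sq: a1 <-> h8

    assert flip_vertical(0) == 56  # a1 -> a8
    assert rotate_180(0) == 63     # a1 -> h8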
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

xr_a_y wrote: Tue Nov 17, 2020 4:53 pm
gladius wrote: Tue Nov 17, 2020 4:40 pm It's a method to augment/generalize the data the net is training on. Since halfkp splits out piece positions by king position, the net loses some generalization. The factorizer adds those "simple" features (eg. pure piece position, not dependent on king) during training, and then sums up all the values of the relevant features when exporting the net. The cool part is that you only need them while training - not at inference time, which is a really nice speed win.
Can you please say a little more about how this is done, theoretically, without affecting the net topology being used?
Sure, let's look at the piece factorizer. With halfkp you have 41024 inputs to the net. The piece factorizer adds 64*10 inputs, one for each piece type and each square the piece can be on. So you then train on 41664 inputs to the training net. The trick is that, given an index into the halfkp net, you know which corresponding index in the piece features will be set whenever that halfkp index is set. And since the inputs to our net are always 0 or 1, you can sum those two rows of the weight matrix together when exporting, and it will give the same result as if the two inputs had stayed separate.
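A rough sketch of that summing step, assuming the trained input layer keeps one weight row per feature (illustrative names only; the actual code linked below is more general):

Code: Select all

    # Rough sketch of coalescing the piece-factor rows into the halfkp rows
    # at export time (illustrative, not the actual nnue-pytorch code).
    # Assumes: rows 0..41023 are halfkp features, rows 41024..41663 are the
    # piece-only factored features; piece_index_for(i) is a hypothetical
    # helper returning the piece feature that is always active together
    # with halfkp feature i.
    NUM_HALFKP = 41024
    NUM_PIECE = 10 * 64

    def coalesce(weights, piece_index_for):
        # weights: (NUM_HALFKP + NUM_PIECE, hidden_size) numpy array
        exported = weights[:NUM_HALFKP].copy()
        for i in range(NUM_HALFKP):
            exported[i] += weights[NUM_HALFKP + piece_index_for(i)]
        return exported  # the rows the engine-format net actually stores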

The (very ugly currently) code for this is here, which might be a bit easier to understand than the nodchip version: https://github.com/glinscott/nnue-pytor ... factorizer
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: Pytorch NNUE training

Post by elcabesa »

Thank you for your explanation and code.

Let me recap what I have understood about the factorizer so far:
1) you have 2 different network topologies, one for training (TNNUE) and one for the engine (ENNUE)
2) you have a trick when saving the weights in the trainer so that TNNUE and ENNUE produce the same output for a given input
3) ENNUE is faster to calculate (good for the engine)
4) the additional TNNUE features (64+640) are less sparse
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

elcabesa wrote: Thu Nov 19, 2020 7:36 am Thank you for your explanation and code.

Let me recap what I have understood about the factorizer so far:
1) you have 2 different network topologies, one for training (TNNUE) and one for the engine (ENNUE)
2) you have a trick when saving the weights in the trainer so that TNNUE and ENNUE produce the same output for a given input
3) ENNUE is faster to calculate (good for the engine)
4) the additional TNNUE features (64+640) are less sparse
Yes, that's a great description. There is one trick with the king factorization. Since every piece on the board shares the same associated king square (which maps to the same feature), we end up "overcounting" that feature when distributing the weights. There are a few approaches around this; I took a direct one and divided the weight by 20 (a very rough approximation of the number of pieces on the board). I think a better fix is just including the king in the piece factorization, but that remains to be tested.
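A rough sketch of that division trick, with illustrative names (not the actual exporter code):

Code: Select all

    # Rough sketch of the king-factor overcounting workaround (illustrative,
    # not the actual exporter).  All active halfkp features of a position
    # share the same king square, so the king-factor row would otherwise be
    # added once per piece; dividing by a rough piece count compensates.
    APPROX_PIECE_COUNT = 20

    def add_king_factor(exported, king_weights, king_square_of):
        # exported: (41024, hidden) halfkp rows already coalesced with the
        # piece factor; king_weights: (64, hidden) king-factor rows;
        # king_square_of(i) is a hypothetical helper returning the king
        # square encoded in halfkp index i.
        for i in range(len(exported)):
            exported[i] += king_weights[king_square_of(i)] / APPROX_PIECE_COUNT
        return exported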
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Pytorch NNUE training

Post by Rein Halbersma »

gladius wrote: Sun Nov 08, 2020 9:56 pm I started an implementation of the SF NNUE training in Pytorch: https://github.com/glinscott/nnue-pytorch.
Perhaps a silly question, but I'm having a hard time understanding this line in the step_ function https://github.com/glinscott/nnue-pytor ... del.py#L53

Code: Select all

    q = self(us, them, white, black)
What function is being called here? An __init__ or a __call__ operator? But where are these defined? The pl.LightningModule docs don't show that this class is Callable AFAICS.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

Rein Halbersma wrote: Fri Nov 20, 2020 11:09 am
gladius wrote: Sun Nov 08, 2020 9:56 pm I started an implementation of the SF NNUE training in Pytorch: https://github.com/glinscott/nnue-pytorch.
Perhaps a silly question, but I'm having a hard time understanding this line in the step_ function https://github.com/glinscott/nnue-pytor ... del.py#L53

Code: Select all

    q = self(us, them, white, black)
What function is being called here? An __init__ or a __call__ operator? But where are these defined? The pl.LightningModule docs don't show that this class is Callable AFAICS.
That's from PyTorch's torch.nn.Module: calling the module instance goes through __call__, which calls forward() but also supports running hooks.
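A tiny standalone example (not from the repo) of how that dispatch works:

Code: Select all

    # Tiny standalone illustration: nn.Module defines __call__, so calling
    # the module instance runs forward() plus any registered hooks.
    import torch
    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(4, 1)

        def forward(self, x):
            return self.linear(x)

    m = Toy()
    out = m(torch.randn(2, 4))  # dispatches through __call__ to forward()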
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Pytorch NNUE training

Post by Rein Halbersma »

gladius wrote: Fri Nov 20, 2020 3:26 pm That's from PyTorch's torch.nn.Module: calling the module instance goes through __call__, which calls forward() but also supports running hooks.
Thanks, that's a surprising but useful idiom.