Pytorch NNUE training

Discussion of chess software programming and technical issues.

xr_a_y
Posts: 1338
Joined: Sat Nov 25, 2017 1:28 pm
Location: France

Re: Pytorch NNUE training

Post by xr_a_y » Tue Nov 17, 2020 7:01 am

For now I have no GPU, so I run on CPU only.
With batch_size = 256 and 8 workers, it takes more than 1h for 10M sfens.

Unless "step" in the tensorboard view is not equivalent to sfens ...

So the question is: do I need a GPU, or is there something else going wrong?

David Carteau
Posts: 85
Joined: Sat May 24, 2014 7:09 am
Location: France
Full name: David Carteau
Contact:

Re: Pytorch NNUE training

Post by David Carteau » Tue Nov 17, 2020 8:24 am

xr_a_y wrote:
Tue Nov 17, 2020 7:01 am
For now I have no GPU, so I run on CPU only.
With batch_size = 256 and 8 workers, it takes more than 1h for 10M sfens.

Unless "step" in the tensorboard view is not equivalent to sfens ...

So the question is: do I need a GPU, or is there something else going wrong?
Hi Vivien,

Until now I was training my networks on CPU only, at an average speed of about 3 days to process 360 million positions (i.e. my full set of training data, i.e. one "epoch"). I've just adapted my pytorch script to use the GPU and I'll let you know what the performance gain is. As I understand it, a "step" corresponds to one "batch" (a subset of your training data): the gradients are computed and the weights updated at the end of each batch.
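
If it helps, the GPU adaptation is essentially just moving the model and each batch to the device. Here is a minimal sketch (the layer sizes and names are placeholders for illustration, not my actual Orion network):

Code: Select all

import torch
import torch.nn as nn

# Placeholder NNUE-like model: the layer sizes are illustrative only.
model = nn.Sequential(
    nn.Linear(41024, 256), nn.ReLU(),
    nn.Linear(256, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def training_step(inputs, targets):
    # One "step" = forward pass, backward pass and weight update for one batch.
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    return loss.item()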

Regards,

David

David Carteau
Posts: 85
Joined: Sat May 24, 2014 7:09 am
Location: France
Full name: David Carteau
Contact:

Re: Pytorch NNUE training

Post by David Carteau » Tue Nov 17, 2020 8:33 am

gladius wrote:
Mon Nov 16, 2020 9:25 pm
AndrewGrant wrote:
Mon Nov 16, 2020 8:38 pm
gladius wrote:
Mon Nov 16, 2020 4:45 pm
Yes, this comes from using the GPU (hooray pytorch!). Also, Sopel implemented a super-fast C++ data parser that feeds the inputs to pytorch as sparse tensors, which was a very large speedup (since the inputs to the first layer are super sparse).
What was the batch size for this? That makes a big difference. Batchsize=1, that is damn fast. Batchsize=1M, that is damn slow.
Batch size is 8192.
I recently had success training a NNUE-style network for my Orion NNUE experiment with a batch size of... 256k! Maybe my choice was far from optimal... I found this size by adjusting the 'batch size' and 'learning rate' parameters together until I got decent convergence. I will try other combinations later. There is a lot of randomness (and trial and error) in fixing these values, keeping in mind that you can also choose the optimisation algorithm (SGD, Adam, etc.) :)
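
For what it's worth, these hyper-parameters boil down to only a couple of lines in the script. A rough sketch (the dataset and model objects here are placeholders, and the values are examples rather than a recipe):

Code: Select all

import torch
from torch.utils.data import DataLoader

# 'train_dataset' and 'model' are hypothetical placeholders.
train_loader = DataLoader(train_dataset, batch_size=262144,
                          shuffle=True, num_workers=8)

# Larger batches generally tolerate (and need) a larger learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Alternative optimiser, plain SGD with momentum:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)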

connor_mcmonigle
Posts: 66
Joined: Sun Sep 06, 2020 2:40 am
Full name: Connor McMonigle

Re: Pytorch NNUE training

Post by connor_mcmonigle » Tue Nov 17, 2020 8:42 am

xr_a_y wrote:
Tue Nov 17, 2020 7:01 am
For now I have no GPU, so I run on CPU only.
With batch_size = 256 and 8 workers, it takes more than 1h for 10M sfens.

Unless "step" in the tensorboard view is not equivalent to sfens ...

So the question is: do I need a GPU, or is there something else going wrong?
I've configured tensorboard such that the x axis of the loss graph corresponds to the number of packed FENs observed. My training code is slow, but even with a modest GPU you'd get far superior performance. I would say you need a GPU :D
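
Roughly, the trick is just to pass the running count of positions as the tensorboard step. A minimal sketch (the training loop and helper names here are placeholders, not my actual code):

Code: Select all

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
positions_seen = 0

for inputs, targets in train_loader:        # hypothetical data loader
    loss = training_step(inputs, targets)   # hypothetical helper returning a float
    positions_seen += inputs.shape[0]
    # Log against positions seen rather than the batch index, so the x axis
    # of the loss curve reads as the number of (packed) FENs processed.
    writer.add_scalar("train/loss", loss, global_step=positions_seen)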

elcabesa
Posts: 848
Joined: Sun May 23, 2010 11:32 am
Contact:

Re: Pytorch NNUE training

Post by elcabesa » Tue Nov 17, 2020 12:05 pm

elcabesa wrote:
Mon Nov 16, 2020 6:05 pm
gladius wrote:
Mon Nov 16, 2020 4:47 pm
AndrewGrant wrote:
Mon Nov 16, 2020 7:02 am
gladius wrote:
Mon Nov 16, 2020 6:11 am
Interesting! One of the experiments I had lined up was disabling the factorizer on the nodchip trainer and seeing how it did. But I’ll take your word for it :). I had already started implementing it, there are some really cool tricks the Shogi folks pulled off - zeroing the initial weights for the factored features, and then just summing them at the end when quantizing. Very insightful technique!

Latest experiments have us about -200 elo from master, so a long way to go, but it’s at least going in the right direction.
I implemented a factorizer in what I'm doing. I never looked at SF for it; I used a sort of "clean room design" method where the Seer author told me about the idea and had me draw some diagrams. But at the end of the day, any sort of factorization seems like an easy winner. Maybe multiple factorizers win. Maybe absolute distance wins.

Not sure. At some point I might attempt to find out, as I can "drag and drop" factorizers into my code atm with a few minutes effort.
Cool, thanks for the tip! Well, will be interesting to compare to a "standard" run.

sorry but what is a factorizer?
Let me ask this question again: what is a factorizer? After looking inside the code, it seems to be a "quantizer". Am I wrong?

gladius
Posts: 565
Joined: Tue Dec 12, 2006 9:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius » Tue Nov 17, 2020 3:37 pm

AndrewGrant wrote:
Mon Nov 16, 2020 11:49 pm
gladius wrote:
Mon Nov 16, 2020 9:25 pm
AndrewGrant wrote:
Mon Nov 16, 2020 8:38 pm
gladius wrote:
Mon Nov 16, 2020 4:45 pm
Yes, this comes from using the GPU (hooray pytorch!). Also, Sopel implemented a super-fast C++ data parser that feeds the inputs to pytorch as sparse tensors, which was a very large speedup (since the inputs to the first layer are super sparse).
What was the batch size for this? That makes a big difference. Batchsize=1, that is damn fast. Batchsize=1M, that is damn slow.
Batch size is 8192.
Interesting. Just about the same speed as me, running on the CPU with very many threads in a C program. Can I ask what your system is?
This was on a V100. Very impressive speed for your trainer! The GPU path is not exploiting sparsity for backprop, so we take a hit there.
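
To give a feel for the sparse-input side (a simplified sketch, not Sopel's actual parser): each position only activates a handful of the ~41k first-layer inputs, so a batch can be built as a sparse COO tensor and multiplied against the dense first-layer weights.

Code: Select all

import torch

# Two example positions, each with a short list of active halfkp feature
# indices (real positions have ~30 active features out of ~41k inputs).
active = [[10, 523, 40960], [7, 1024, 2048]]
num_features = 41024

rows = [i for i, feats in enumerate(active) for _ in feats]
cols = [f for feats in active for f in feats]
indices = torch.tensor([rows, cols], dtype=torch.long)
values = torch.ones(len(cols))
batch = torch.sparse_coo_tensor(indices, values, size=(len(active), num_features))

# Dense first-layer weights; the sparse x dense matmul only touches the
# weight rows of active features in the forward pass.
w1 = torch.randn(num_features, 256)
hidden = torch.sparse.mm(batch, w1)    # shape (2, 256)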

gladius
Posts: 565
Joined: Tue Dec 12, 2006 9:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius » Tue Nov 17, 2020 3:40 pm

elcabesa wrote:
Tue Nov 17, 2020 12:05 pm
elcabesa wrote:
Mon Nov 16, 2020 6:05 pm
gladius wrote:
Mon Nov 16, 2020 4:47 pm
AndrewGrant wrote:
Mon Nov 16, 2020 7:02 am
gladius wrote:
Mon Nov 16, 2020 6:11 am
Interesting! One of the experiments I had lined up was disabling the factorizer on the nodchip trainer and seeing how it did. But I’ll take your word for it :). I had already started implementing it, there are some really cool tricks the Shogi folks pulled off - zeroing the initial weights for the factored features, and then just summing them at the end when quantizing. Very insightful technique!

Latest experiments have us about -200 elo from master, so a long way to go, but it’s at least going in the right direction.
I implemented a factorizer in what I'm doing. I never looked at SF for it; I used a sort of "clean room design" method where the Seer author told me about the idea and had me draw some diagrams. But at the end of the day, any sort of factorization seems like an easy winner. Maybe multiple factorizers win. Maybe absolute distance wins.

Not sure. At some point I might attempt to find out, as I can "drag and drop" factorizers into my code atm with a few minutes effort.
Cool, thanks for the tip! Well, will be interesting to compare to a "standard" run.

sorry but what is a factorizer?
Let me ask this question again: what is a factorizer? After looking inside the code, it seems to be a "quantizer". Am I wrong?
It's a method to augment/generalize the data the net is training on. Since halfkp splits out piece positions by king position, the net loses some generalization. The factorizer adds those "simple" features (eg. pure piece position, not dependent on king) during training, and then sums up all the values of the relevant features when exporting the net. The cool part is that you only need them while training - not at inference time, which is a really nice speed win.
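
Roughly, the export-time "summing up" looks like this (a sketch with made-up shapes, not the actual trainer code): the king-independent weights are simply folded into every king bucket of the halfkp table.

Code: Select all

import torch

# Assumed shapes: 64 king squares, N piece-square features per king square.
num_king_sq, num_piece_feats, hidden = 64, 641, 256
w_halfkp = torch.zeros(num_king_sq * num_piece_feats, hidden)   # trained jointly...
w_factored = torch.randn(num_piece_feats, hidden)               # ...with the "simple" features

# At export time, add the factored weights into every king bucket, so the
# engine only needs the plain halfkp table at inference.
w_export = w_halfkp.clone()
for ksq in range(num_king_sq):
    start = ksq * num_piece_feats
    w_export[start:start + num_piece_feats] += w_factored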

xr_a_y
Posts: 1338
Joined: Sat Nov 25, 2017 1:28 pm
Location: France

Re: Pytorch NNUE training

Post by xr_a_y » Tue Nov 17, 2020 3:53 pm

gladius wrote:
Tue Nov 17, 2020 3:40 pm
elcabesa wrote:
Tue Nov 17, 2020 12:05 pm
elcabesa wrote:
Mon Nov 16, 2020 6:05 pm
gladius wrote:
Mon Nov 16, 2020 4:47 pm
AndrewGrant wrote:
Mon Nov 16, 2020 7:02 am
gladius wrote:
Mon Nov 16, 2020 6:11 am
Interesting! One of the experiments I had lined up was disabling the factorizer on the nodchip trainer and seeing how it did. But I’ll take your word for it :). I had already started implementing it, there are some really cool tricks the Shogi folks pulled off - zeroing the initial weights for the factored features, and then just summing them at the end when quantizing. Very insightful technique!

Latest experiments have us about -200 elo from master, so a long way to go, but it’s at least going in the right direction.
I implemented a factorizer in what I'm doing. I never looked at SF for it; I used a sort of "clean room design" method where the Seer author told me about the idea and had me draw some diagrams. But at the end of the day, any sort of factorization seems like an easy winner. Maybe multiple factorizers win. Maybe absolute distance wins.

Not sure. At some point I might attempt to find out, as I can "drag and drop" factorizers into my code atm with a few minutes effort.
Cool, thanks for the tip! Well, will be interesting to compare to a "standard" run.

sorry but what is a factorizer?
Let me ask this question again: what is a factorizer? After looking inside the code, it seems to be a "quantizer". Am I wrong?
It's a method to augment/generalize the data the net is training on. Since halfkp splits out piece positions by king position, the net loses some generalization. The factorizer adds those "simple" features (eg. pure piece position, not dependent on king) during training, and then sums up all the values of the relevant features when exporting the net. The cool part is that you only need them while training - not at inference time, which is a really nice speed win.
Can you say a little more about how this is done in theory, without affecting the net topology being used?

Daniel Shawul
Posts: 4066
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: Pytorch NNUE training

Post by Daniel Shawul » Tue Nov 17, 2020 5:23 pm

David Carteau wrote:
Tue Nov 17, 2020 8:33 am
gladius wrote:
Mon Nov 16, 2020 9:25 pm
AndrewGrant wrote:
Mon Nov 16, 2020 8:38 pm
gladius wrote:
Mon Nov 16, 2020 4:45 pm
Yes, this comes from using the GPU (hooray pytorch!). Also, Sopel implemented a super-fast C++ data parser that feeds the inputs to pytorch as sparse tensors, which was a very large speedup (since the inputs to the first layer are super sparse).
What was the batch size for this? That makes a big difference. Batchsize=1, that is damn fast. Batchsize=1M, that is damn slow.
Batch size is 8192.
I recently had success training a NNUE-style network for my Orion NNUE experiment with a batch size of... 256k! Maybe my choice was far from optimal... I found this size by adjusting the 'batch size' and 'learning rate' parameters together until I got decent convergence. I will try other combinations later. There is a lot of randomness (and trial and error) in fixing these values, keeping in mind that you can also choose the optimisation algorithm (SGD, Adam, etc.) :)
Isn't 256,000 a really big mini-batch size? That almost turns mini-batch gradient descent into full-batch gradient descent.
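
To put some numbers on it (taking David's 360M-position set as an example, and reading 256k as 262,144; the exact value doesn't change the point):

Code: Select all

positions = 360_000_000            # David's full training set, as an example
for batch_size in (256, 8192, 262_144):
    updates = positions // batch_size
    print(f"batch {batch_size}: {updates} optimizer steps per epoch")
# batch 256: 1406250 optimizer steps per epoch
# batch 8192: 43945 optimizer steps per epoch
# batch 262144: 1373 optimizer steps per epoch

So with a 256k batch there are only about 1,400 weight updates per pass over the data, which is why I'd call it close to full-batch gradient descent.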

AndrewGrant
Posts: 876
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant
Contact:

Re: Pytorch NNUE training

Post by AndrewGrant » Wed Nov 18, 2020 10:38 am

Gary,

I'm trying to output my networks in the SF format. You have a couple of different versions going on in your repo, but do https://github.com/glinscott/nnue-pytor ... rialize.py and https://github.com/glinscott/nnue-pytor ... r/model.py correspond to what would match Nodchip's outputs?

I've trained a network using the same loss as your model, and the loss looks good. However, loading the weights into Stockfish is failing. I can confirm that I am not having an off-by-one issue: if I print the weights/biases as I output them, they match what is printed when I read them back into Stockfish. Out of desperation, I tried all possible variations of transforming the matrices, to no avail. The results of games are -infinite Elo for the updated network, which implies a failure to load or quantize the weights correctly.

Can you confirm that those two .py files produce working networks? Perhaps I need a fresh set of eyes, but porting to this format should have been trivial.
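
For context, the kind of round-trip check I've been doing is conceptually like this (a generic sketch with assumed scale factors and dtypes, not a claim about what serialize.py actually does):

Code: Select all

import numpy as np
import torch

def quantize_linear(layer, weight_scale=64.0, bias_scale=8128.0):
    # Scale float weights/biases to integers; the engine must apply the same
    # scales (and the same matrix orientation) when reading them back.
    w = layer.weight.detach().cpu().numpy()
    b = layer.bias.detach().cpu().numpy()
    w_q = np.clip(np.round(w * weight_scale), -127, 127).astype(np.int8)
    b_q = np.round(b * bias_scale).astype(np.int32)
    return w_q, b_q

layer = torch.nn.Linear(256, 32)
w_q, b_q = quantize_linear(layer)

# Dequantize with the same scales and compare against the float layer before
# blaming the engine-side loader.
x = torch.randn(4, 256)
ref = layer(x).detach().numpy()
approx = x.numpy() @ (w_q.astype(np.float32) / 64.0).T + b_q.astype(np.float32) / 8128.0
print(np.abs(ref - approx).max())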
