Pytorch NNUE training

Discussion of chess software programming and technical issues.

xr_a_y
Posts: 1338
Joined: Sat Nov 25, 2017 1:28 pm
Location: France

Re: Pytorch NNUE training

Post by xr_a_y » Tue Nov 17, 2020 7:01 am

For now I have no GPU, so I run on CPU only.
With batch_size = 256 and 8 workers, it takes more than 1h for 10M sfens.

Unless "step" in the tensorboard view is not equivalent to sfens ...

So the question is: do I need a GPU, or is there something else going wrong?

David Carteau
Posts: 85
Joined: Sat May 24, 2014 7:09 am
Location: France
Full name: David Carteau
Contact:

Re: Pytorch NNUE training

Post by David Carteau » Tue Nov 17, 2020 8:24 am

xr_a_y wrote:
Tue Nov 17, 2020 7:01 am
For now I have no GPU, so I run on CPU only.
With batch_size = 256 and 8 workers, it takes more than 1h for 10M sfens.

Unless "step" in the tensorboard view is not equivalent to sfens ...

So the question is: do I need a GPU, or is there something else going wrong?
Hi Vivien,

Until now I was training my networks on CPU only, at an average speed of about 3 days to process 360 million positions (i.e. my full set of training data, i.e. one "epoch"). I've just adapted my pytorch script to use the GPU and I'll let you know what the performance gain is. As I understand it, a "step" corresponds to one "batch" (a subset of your training data): the gradients are computed and the weights updated at the end of each batch.
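
If it helps, the GPU adaptation is essentially just moving the model and each batch to the device. Here is a minimal sketch (the layer sizes and names are placeholders for illustration, not my actual Orion network):

Code: Select all

import torch
import torch.nn as nn

# Placeholder NNUE-like model: the layer sizes are illustrative only.
model = nn.Sequential(
    nn.Linear(41024, 256), nn.ReLU(),
    nn.Linear(256, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def training_step(inputs, targets):
    # One "step" = forward pass, backward pass and weight update for one batch.
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    return loss.item()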

Regards,

David

David Carteau
Posts: 85
Joined: Sat May 24, 2014 7:09 am
Location: France
Full name: David Carteau
Contact:

Re: Pytorch NNUE training

Post by David Carteau » Tue Nov 17, 2020 8:33 am

gladius wrote:
Mon Nov 16, 2020 9:25 pm
AndrewGrant wrote:
Mon Nov 16, 2020 8:38 pm
gladius wrote:
Mon Nov 16, 2020 4:45 pm
Yes, this comes from using the GPU (hooray pytorch!). Also, Sopel implemented a super-fast C++ data parser that feeds the inputs to pytorch as sparse tensors, which was a very large speedup (since the inputs to the first layer are super sparse).
What was the batch size for this? That makes a big difference. Batchsize=1, that is damn fast. Batchsize=1M, that is damn slow.
Batch size is 8192.
I recently had success training a NNUE-style network for my Orion NNUE experiment with a batch size of... 256k! Maybe my choice was far from optimal... I found this size by adjusting the 'batch size' and 'learning rate' parameters together until I got decent convergence. I will try other combinations later. There is a lot of randomness (and trial and error) in fixing these values, keeping in mind that you can also choose the optimisation algorithm (SGD, Adam, etc.) :)
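
For what it's worth, these hyper-parameters boil down to only a couple of lines in the script. A rough sketch (the dataset and model objects here are placeholders, and the values are examples rather than a recipe):

Code: Select all

import torch
from torch.utils.data import DataLoader

# 'train_dataset' and 'model' are hypothetical placeholders.
train_loader = DataLoader(train_dataset, batch_size=262144,
                          shuffle=True, num_workers=8)

# Larger batches generally tolerate (and need) a larger learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Alternative optimiser, plain SGD with momentum:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)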

connor_mcmonigle
Posts: 66
Joined: Sun Sep 06, 2020 2:40 am
Full name: Connor McMonigle

Re: Pytorch NNUE training

Post by connor_mcmonigle » Tue Nov 17, 2020 8:42 am

xr_a_y wrote:
Tue Nov 17, 2020 7:01 am
For now I have no GPU, so I run on CPU only.
With batch_size = 256 and 8 workers, it takes more than 1h for 10M sfens.

Unless "step" in the tensorboard view is not equivalent to sfens ...

So the question is: do I need a GPU, or is there something else going wrong?
I've configured tensorboard such that the x axis of the loss graph corresponds to the number of packed FENs observed. My training code is slow, but even with a modest GPU you'd get far superior performance. I would say you need a GPU :D
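
Roughly, the trick is just to pass the running count of positions as the tensorboard step. A minimal sketch (the training loop and helper names here are placeholders, not my actual code):

Code: Select all

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
positions_seen = 0

for inputs, targets in train_loader:        # hypothetical data loader
    loss = training_step(inputs, targets)   # hypothetical helper returning a float
    positions_seen += inputs.shape[0]
    # Log against positions seen rather than the batch index, so the x axis
    # of the loss curve reads as the number of (packed) FENs processed.
    writer.add_scalar("train/loss", loss, global_step=positions_seen)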

elcabesa
Posts: 848
Joined: Sun May 23, 2010 11:32 am
Contact:

Re: Pytorch NNUE training

Post by elcabesa » Tue Nov 17, 2020 12:05 pm

elcabesa wrote:
Mon Nov 16, 2020 6:05 pm
gladius wrote:
Mon Nov 16, 2020 4:47 pm
AndrewGrant wrote:
Mon Nov 16, 2020 7:02 am
gladius wrote:
Mon Nov 16, 2020 6:11 am
Interesting! One of the experiments I had lined up was disabling the factorizer on the nodchip trainer and seeing how it did. But I’ll take your word for it :). I had already started implementing it, there are some really cool tricks the Shogi folks pulled off - zeroing the initial weights for the factored features, and then just summing them at the end when quantizing. Very insightful technique!

Latest experiments have us about -200 elo from master, so a long way to go, but it’s at least going in the right direction.
I implemented a factorizer in what I'm doing. I never looked at SF for it; I used a sort of "clean room design" method where the Seer author told me about the idea and had me draw some diagrams. But at the end of the day, any sort of factorization seems like an easy winner. Maybe multiple factorizers win. Maybe absolute distance wins.

Not sure. At some point I might attempt to find out, as I can "drag and drop" factorizers into my code atm with a few minutes effort.
Cool, thanks for the tip! Well, will be interesting to compare to a "standard" run.

sorry but what is a factorizer?
Let me ask this question again: what is a factorizer? After looking inside the code, it seems to be a "quantizer". Am I wrong?

gladius
Posts: 565
Joined: Tue Dec 12, 2006 9:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius » Tue Nov 17, 2020 3:37 pm

AndrewGrant wrote:
Mon Nov 16, 2020 11:49 pm
gladius wrote:
Mon Nov 16, 2020 9:25 pm
AndrewGrant wrote:
Mon Nov 16, 2020 8:38 pm
gladius wrote:
Mon Nov 16, 2020 4:45 pm
Yes, this comes from using the GPU (hooray pytorch!). Also, Sopel implemented a super-fast C++ data parser that feeds the inputs to pytorch as sparse tensors, which was a very large speedup (since the inputs to the first layer are super sparse).
What was the batch size for this? That makes a big difference. Batchsize=1, that is damn fast. Batchsize=1M, that is damn slow.
Batch size is 8192.
Interesting. Just about the same speed as me, running on the CPU with very many threads in a C program. Can I ask what your system is?
This was on a V100. Very impressive speed for your trainer! The GPU path is not exploiting sparsity for backprop, so we take a hit there.
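
To give a feel for the sparse-input side (a simplified sketch, not Sopel's actual parser): each position only activates a handful of the ~41k first-layer inputs, so a batch can be built as a sparse COO tensor and multiplied against the dense first-layer weights.

Code: Select all

import torch

# Two example positions, each with a short list of active halfkp feature
# indices (real positions have ~30 active features out of ~41k inputs).
active = [[10, 523, 40960], [7, 1024, 2048]]
num_features = 41024

rows = [i for i, feats in enumerate(active) for _ in feats]
cols = [f for feats in active for f in feats]
indices = torch.tensor([rows, cols], dtype=torch.long)
values = torch.ones(len(cols))
batch = torch.sparse_coo_tensor(indices, values, size=(len(active), num_features))

# Dense first-layer weights; the sparse x dense matmul only touches the
# weight rows of active features in the forward pass.
w1 = torch.randn(num_features, 256)
hidden = torch.sparse.mm(batch, w1)    # shape (2, 256)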

gladius
Posts: 565
Joined: Tue Dec 12, 2006 9:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius » Tue Nov 17, 2020 3:40 pm

elcabesa wrote:
Tue Nov 17, 2020 12:05 pm
elcabesa wrote:
Mon Nov 16, 2020 6:05 pm
gladius wrote:
Mon Nov 16, 2020 4:47 pm
AndrewGrant wrote:
Mon Nov 16, 2020 7:02 am
gladius wrote:
Mon Nov 16, 2020 6:11 am
Interesting! One of the experiments I had lined up was disabling the factorizer on the nodchip trainer and seeing how it did. But I’ll take your word for it :). I had already started implementing it, there are some really cool tricks the Shogi folks pulled off - zeroing the initial weights for the factored features, and then just summing them at the end when quantizing. Very insightful technique!

Latest experiments have us about -200 elo from master, so a long way to go, but it’s at least going in the right direction.
I implemented a factorizer in what I'm doing. I never looked at SF for it; I used a sort of "clean room design" method where the Seer author told me about the idea and had me draw some diagrams. But at the end of the day, any sort of factorization seems like an easy winner. Maybe multiple factorizers win. Maybe absolute distance wins.

Not sure. At some point I might attempt to find out, as I can "drag and drop" factorizers into my code atm with a few minutes effort.
Cool, thanks for the tip! Well, will be interesting to compare to a "standard" run.

sorry but what is a factorizer?
Let me ask this question again: what is a factorizer? After looking inside the code, it seems to be a "quantizer". Am I wrong?
It's a method to augment/generalize the data the net is training on. Since halfkp splits out piece positions by king position, the net loses some generalization. The factorizer adds those "simple" features (eg. pure piece position, not dependent on king) during training, and then sums up all the values of the relevant features when exporting the net. The cool part is that you only need them while training - not at inference time, which is a really nice speed win.
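
Roughly, the export-time "summing up" looks like this (a sketch with made-up shapes, not the actual trainer code): the king-independent weights are simply folded into every king bucket of the halfkp table.

Code: Select all

import torch

# Assumed shapes: 64 king squares, N piece-square features per king square.
num_king_sq, num_piece_feats, hidden = 64, 641, 256
w_halfkp = torch.zeros(num_king_sq * num_piece_feats, hidden)   # trained jointly...
w_factored = torch.randn(num_piece_feats, hidden)               # ...with the "simple" features

# At export time, add the factored weights into every king bucket, so the
# engine only needs the plain halfkp table at inference.
w_export = w_halfkp.clone()
for ksq in range(num_king_sq):
    start = ksq * num_piece_feats
    w_export[start:start + num_piece_feats] += w_factored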

xr_a_y
Posts: 1338
Joined: Sat Nov 25, 2017 1:28 pm
Location: France

Re: Pytorch NNUE training

Post by xr_a_y » Tue Nov 17, 2020 3:53 pm

gladius wrote:
Tue Nov 17, 2020 3:40 pm
elcabesa wrote:
Tue Nov 17, 2020 12:05 pm
elcabesa wrote:
Mon Nov 16, 2020 6:05 pm
gladius wrote:
Mon Nov 16, 2020 4:47 pm
AndrewGrant wrote:
Mon Nov 16, 2020 7:02 am
gladius wrote:
Mon Nov 16, 2020 6:11 am
Interesting! One of the experiments I had lined up was disabling the factorizer on the nodchip trainer and seeing how it did. But I’ll take your word for it :). I had already started implementing it, there are some really cool tricks the Shogi folks pulled off - zeroing the initial weights for the factored features, and then just summing them at the end when quantizing. Very insightful technique!

Latest experiments have us about -200 elo from master, so a long way to go, but it’s at least going in the right direction.
I implemented a factorizer in what I'm doing. I never looked at SF for it; I used a sort of "clean room design" method where the Seer author told me about the idea and had me draw some diagrams. But at the end of the day, any sort of factorization seems like an easy winner. Maybe multiple factorizers win. Maybe absolute distance wins.

Not sure. At some point I might attempt to find out, as I can "drag and drop" factorizers into my code atm with a few minutes effort.
Cool, thanks for the tip! Well, will be interesting to compare to a "standard" run.

sorry but what is a factorizer?
Let me ask this question again: what is a factorizer? After looking inside the code, it seems to be a "quantizer". Am I wrong?
It's a method to augment/generalize the data the net is training on. Since halfkp splits out piece positions by king position, the net loses some generalization. The factorizer adds those "simple" features (eg. pure piece position, not dependent on king) during training, and then sums up all the values of the relevant features when exporting the net. The cool part is that you only need them while training - not at inference time, which is a really nice speed win.
Can you say a little more about how this is done in theory, without affecting the net topology being used?

Daniel Shawul
Posts: 4066
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: Pytorch NNUE training

Post by Daniel Shawul » Tue Nov 17, 2020 5:23 pm

David Carteau wrote:
Tue Nov 17, 2020 8:33 am
gladius wrote:
Mon Nov 16, 2020 9:25 pm
AndrewGrant wrote:
Mon Nov 16, 2020 8:38 pm
gladius wrote:
Mon Nov 16, 2020 4:45 pm
Yes, this comes from using the GPU (hooray pytorch!). Also, Sopel implemented a super-fast C++ data parser that feeds the inputs to pytorch as sparse tensors, which was a very large speedup (since the inputs to the first layer are super sparse).
What was the batch size for this? That makes a big difference. Batchsize=1, that is damn fast. Batchsize=1M, that is damn slow.
Batch size is 8192.
I recently had success training a NNUE-style network for my Orion NNUE experiment with a batch size of... 256k! Maybe my choice was far from optimal... I found this size by adjusting the 'batch size' and 'learning rate' parameters together until I got decent convergence. I will try other combinations later. There is a lot of randomness (and trial and error) in fixing these values, keeping in mind that you can also choose the optimisation algorithm (SGD, Adam, etc.) :)
Isn't 256,000 a really big mini-batch size? That almost turns mini-batch gradient descent into full-batch gradient descent.
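
To put some numbers on it (taking David's 360M-position set as an example, and reading 256k as 262,144; the exact value doesn't change the point):

Code: Select all

positions = 360_000_000            # David's full training set, as an example
for batch_size in (256, 8192, 262_144):
    updates = positions // batch_size
    print(f"batch {batch_size}: {updates} optimizer steps per epoch")
# batch 256: 1406250 optimizer steps per epoch
# batch 8192: 43945 optimizer steps per epoch
# batch 262144: 1373 optimizer steps per epoch

So with a 256k batch there are only about 1,400 weight updates per pass over the data, which is why I'd call it close to full-batch gradient descent.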

AndrewGrant
Posts: 876
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant
Contact:

Re: Pytorch NNUE training

Post by AndrewGrant » Wed Nov 18, 2020 10:38 am

Gary,

I'm trying to output my networks in the SF format. You have a couple of different versions going on in your repo, but do https://github.com/glinscott/nnue-pytor ... rialize.py and https://github.com/glinscott/nnue-pytor ... r/model.py correspond to what would match Nodchip's outputs?

I've trained a network using the same loss as your model, and the loss looks good. However, loading the weights into Stockfish is failing. I can confirm that I am not having an off-by-one issue: if I print the weights/biases as I output them, they match what is printed when I read them back into Stockfish. Out of desperation, I tried all possible variations of transforming the matrices, to no avail. The results of games are -infinite Elo for the updated network, which implies a failure to load or quantize the weights correctly.

Can you confirm that those two .py files produce working networks? Perhaps I need a fresh set of eyes, but porting to this format should have been trivial.
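
For context, the kind of round-trip check I've been doing is conceptually like this (a generic sketch with assumed scale factors and dtypes, not a claim about what serialize.py actually does):

Code: Select all

import numpy as np
import torch

def quantize_linear(layer, weight_scale=64.0, bias_scale=8128.0):
    # Scale float weights/biases to integers; the engine must apply the same
    # scales (and the same matrix orientation) when reading them back.
    w = layer.weight.detach().cpu().numpy()
    b = layer.bias.detach().cpu().numpy()
    w_q = np.clip(np.round(w * weight_scale), -127, 127).astype(np.int8)
    b_q = np.round(b * bias_scale).astype(np.int32)
    return w_q, b_q

layer = torch.nn.Linear(256, 32)
w_q, b_q = quantize_linear(layer)

# Dequantize with the same scales and compare against the float layer before
# blaming the engine-side loader.
x = torch.randn(4, 256)
ref = layer(x).detach().numpy()
approx = x.numpy() @ (w_q.astype(np.float32) / 64.0).T + b_q.astype(np.float32) / 8128.0
print(np.abs(ref - approx).max())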
