Pytorch NNUE training

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

AndrewGrant
Posts: 1750
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Pytorch NNUE training

Post by AndrewGrant »

gladius wrote: Mon Nov 16, 2020 7:11 am Interesting! One of the experiments I had lined up was disabling the factorizer on the nodchip trainer and seeing how it did. But I’ll take your word for it :). I had already started implementing it, there are some really cool tricks the Shogi folks pulled off - zeroing the initial weights for the factored features, and then just summing them at the end when quantizing. Very insightful technique!

Latest experiments have us about -200 elo from master, so a long way to go, but it’s at least going in the right direction.
I implemented a factorizer in what I'm doing. I never looked at SF for it; I used a sort of "clean room design" method where Seer's author told me about the idea and had me draw some diagrams. But at the end of the day, any sort of factorization seems like an easy winner. Maybe multiple factorizers wins. Maybe absolute distance wins.

Not sure. At some point I might attempt to find out, as I can "drag and drop" factorizers into my code atm with a few minutes' effort.
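For anyone wondering what the zero-init-then-fold trick looks like in practice, here's a rough sketch of the idea (illustrative only: made-up names and dimensions, not code from nodchip's trainer or from my tuner):

Code: Select all

import torch
import torch.nn as nn

# Illustrative sizes only, not the real HalfKP dimensions.
NUM_REAL = 40960       # king-relative (king square x piece x square) features
NUM_FACTORED = 640     # shared piece-square factor, king square ignored
HIDDEN = 256

class FactorizedFeatureTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.real = nn.Linear(NUM_REAL, HIDDEN)
        self.factored = nn.Linear(NUM_FACTORED, HIDDEN, bias=False)
        # Zero-init the factored weights so the initial net behaves exactly
        # like an unfactorized one.
        nn.init.zeros_(self.factored.weight)

    def forward(self, x_real, x_factored):
        # During training the two contributions are simply summed.
        return self.real(x_real) + self.factored(x_factored)

def fold_factors_for_export(model, factor_of):
    # At quantization/export time, add each factored column into every real
    # feature column that shares it; the factored table can then be dropped.
    # factor_of[i] = index of the factored feature that real feature i maps to.
    with torch.no_grad():
        model.real.weight += model.factored.weight[:, factor_of]
    return model.real.weight

The zero init matters because the factored table then only learns the shared component, and the summation at export makes the deployed net identical to the one that was trained.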
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Pytorch NNUE training

Post by xr_a_y »

gladius wrote: Sat Nov 14, 2020 10:30 pm An initial training run by vondele on 256M positions is looking much more promising (also takes about 10 minutes to do an epoch - a pass through the 256M positions, which is great). Getting closer! Note that all training runs so far have been with lambda = 1.0 (train to the evaluation, not the game result).

Code: Select all

Rank Name        Elo   +/-   Games   Score   Draws
   1 master      253     5   16205   81.1%   30.9%
   2 epoch7       -6     4   16205   49.2%   46.3%
   3 epoch6       -8     4   16205   48.8%   46.7%
   4 epoch3      -29     4   16206   45.9%   46.4%
   5 epoch0     -190     5   16205   25.0%   34.1%
Where does this speed come from? 256M sfens in 10 minutes seems very fast. On my CPU, I need one day for 100M!
Is that only coming from GPU usage?
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

xr_a_y wrote: Mon Nov 16, 2020 4:08 pm
gladius wrote: Sat Nov 14, 2020 10:30 pm An initial training run by vondele on 256M positions is looking much more promising (also takes about 10 minutes to do an epoch - a pass through the 256M positions, which is great). Getting closer! Note that all training runs so far have been with lambda = 1.0 (train to the evaluation, not the game result).

Code: Select all

Rank Name        Elo   +/-   Games   Score   Draws
   1 master      253     5   16205   81.1%   30.9%
   2 epoch7       -6     4   16205   49.2%   46.3%
   3 epoch6       -8     4   16205   48.8%   46.7%
   4 epoch3      -29     4   16206   45.9%   46.4%
   5 epoch0     -190     5   16205   25.0%   34.1%
Where does this speed come from? 256M sfens in 10 minutes seems very fast. On my CPU, I need one day for 100M!
Is that only coming from GPU usage?
Yes, this comes from using the GPU (hooray pytorch!). As well, Sopel implemented a super fast C++ data parser, which feeds the inputs to pytorch as sparse tensors, which was a very large speedup (since the inputs to the first layer are super sparse).
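For the curious, a minimal sketch of what the sparse-input path looks like on the Python side (not the actual loader, which builds the tensors in C++; names and sizes here are illustrative):

Code: Select all

import torch
import torch.nn as nn

NUM_FEATURES = 41024   # illustrative first-layer input size
HIDDEN = 256

def batch_to_sparse(active_features):
    # active_features: one list of active feature indices per position.
    # Returns a (batch, NUM_FEATURES) sparse 0/1 tensor.
    rows, cols = [], []
    for i, feats in enumerate(active_features):
        rows.extend([i] * len(feats))
        cols.extend(feats)
    indices = torch.tensor([rows, cols], dtype=torch.long)
    values = torch.ones(len(cols))
    return torch.sparse_coo_tensor(indices, values,
                                   (len(active_features), NUM_FEATURES))

first_layer = nn.Linear(NUM_FEATURES, HIDDEN)

def transform(x_sparse):
    # torch.sparse.mm only touches the ~30 active features per position
    # instead of all 41024 inputs, which is where the speedup comes from.
    return torch.sparse.mm(x_sparse, first_layer.weight.t()) + first_layer.bias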
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

AndrewGrant wrote: Mon Nov 16, 2020 8:02 am
gladius wrote: Mon Nov 16, 2020 7:11 am Interesting! One of the experiments I had lined up was disabling the factorizer on the nodchip trainer and seeing how it did. But I’ll take your word for it :). I had already started implementing it, there are some really cool tricks the Shogi folks pulled off - zeroing the initial weights for the factored features, and then just summing them at the end when quantizing. Very insightful technique!

Latest experiments have us about -200 elo from master, so a long way to go, but it’s at least going in the right direction.
I implemented a factorizer in what I'm doing. I never looked at SF for it; I used a sort of "clean room design" method where Seer's author told me about the idea and had me draw some diagrams. But at the end of the day, any sort of factorization seems like an easy winner. Maybe multiple factorizers wins. Maybe absolute distance wins.

Not sure. At some point I might attempt to find out, as I can "drag and drop" factorizers into my code atm with a few minutes' effort.
Cool, thanks for the tip! Well, it will be interesting to compare to a "standard" run.

Right now, I'm training at lambda 0.8 (80% evaluation score, 20% game outcome), and after 11 billion positions it's still getting better - but still a long way from SF master.

Code: Select all

epoch9  - Elo difference: 208.7 +/- 13.4, LOS: 100.0 %, DrawRatio: 31.8 % 
epoch17 - Elo difference: 184.2 +/- 12.9, LOS: 100.0 %, DrawRatio: 34.8 %
epoch26 - Elo difference: 159.1 +/- 12.3, LOS: 100.0 %, DrawRatio: 38.1 %
epoch33 - Elo difference: 151.5 +/- 18.2, LOS: 100.0 %, DrawRatio: 37.5 %
Still, a far cry from the initial -700 elo disasters :).
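For anyone following along, the lambda blend above just interpolates between the two targets inside the loss. A simplified sketch (made-up names and scaling constant; the real trainer's loss may differ in detail):

Code: Select all

import torch
import torch.nn.functional as F

SCALE = 600.0  # illustrative centipawn-to-win-probability scaling

def blended_loss(pred_cp, teacher_cp, game_result, lam=0.8):
    # pred_cp:     network output in centipawns
    # teacher_cp:  search evaluation stored with the position
    # game_result: 0 / 0.5 / 1 outcome of the game the position came from
    # lam = 1.0 trains purely on the evaluation, lam = 0.0 purely on the result
    p = torch.sigmoid(pred_cp / SCALE)             # predicted win probability
    q = torch.sigmoid(teacher_cp / SCALE)          # target implied by the eval
    target = lam * q + (1.0 - lam) * game_result   # 80/20 blend at lam = 0.8
    return F.mse_loss(p, target)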
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: Pytorch NNUE training

Post by elcabesa »

gladius wrote: Mon Nov 16, 2020 5:47 pm
AndrewGrant wrote: Mon Nov 16, 2020 8:02 am
gladius wrote: Mon Nov 16, 2020 7:11 am Interesting! One of the experiments I had lined up was disabling the factorizer on the nodchip trainer and seeing how it did. But I’ll take your word for it :). I had already started implementing it, there are some really cool tricks the Shogi folks pulled off - zeroing the initial weights for the factored features, and then just summing them at the end when quantizing. Very insightful technique!

Latest experiments have us about -200 elo from master, so a long way to go, but it’s at least going in the right direction.
I implemented a factorizer in what I'm doing. I never looked at SF for it; I used a sort of "clean room design" method where Seer's author told me about the idea and had me draw some diagrams. But at the end of the day, any sort of factorization seems like an easy winner. Maybe multiple factorizers wins. Maybe absolute distance wins.

Not sure. At some point I might attempt to find out, as I can "drag and drop" factorizers into my code atm with a few minutes' effort.
Cool, thanks for the tip! Well, it will be interesting to compare to a "standard" run.

Sorry, but what is a factorizer?
AndrewGrant
Posts: 1750
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Pytorch NNUE training

Post by AndrewGrant »

gladius wrote: Mon Nov 16, 2020 5:45 pm
xr_a_y wrote: Mon Nov 16, 2020 4:08 pm
gladius wrote: Sat Nov 14, 2020 10:30 pm An initial training run by vondele on 256M positions is looking much more promising (also takes about 10 minutes to do an epoch - a pass through the 256M positions, which is great). Getting closer! Note that all training runs so far have been with lambda = 1.0 (train to the evaluation, not the game result).

Code: Select all

Rank Name        Elo   +/-   Games   Score   Draws
   1 master      253     5   16205   81.1%   30.9%
   2 epoch7       -6     4   16205   49.2%   46.3%
   3 epoch6       -8     4   16205   48.8%   46.7%
   4 epoch3      -29     4   16206   45.9%   46.4%
   5 epoch0     -190     5   16205   25.0%   34.1%
Where does this speed come from? 256M sfens in 10 minutes seems very fast. On my CPU, I need one day for 100M!
Is that only coming from GPU usage?
Yes, this comes from using the GPU (hooray pytorch!). As well, Sopel implemented a super fast C++ data parser, which feeds the inputs to pytorch as sparse tensors, which was a very large speedup (since the inputs to the first layer are super sparse).
What was the batch size for this? That makes a big difference. Batchsize=1, that is damn fast. Batchsize=1M, that is damn slow.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

AndrewGrant wrote: Mon Nov 16, 2020 9:38 pm
gladius wrote: Mon Nov 16, 2020 5:45 pm Yes, this comes from using the GPU (hooray pytorch!). As well, Sopel implemented a super fast C++ data parser, which feeds the inputs to pytorch as sparse tensors, which was a very large speedup (since the inputs to the first layer are super sparse).
What was the batch size for this? That makes a big difference. Batchsize=1, that is damn fast. Batchsize=1M, that is damn slow.
Batch size is 8192.
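For context, combining that batch size with the 256M-positions-per-10-minute epoch mentioned earlier gives, back-of-envelope:

Code: Select all

positions_per_epoch = 256_000_000
epoch_seconds = 10 * 60
batch_size = 8192

positions_per_second = positions_per_epoch / epoch_seconds  # ~427,000 positions/s
batches_per_second = positions_per_second / batch_size      # ~52 batches/s
batches_per_epoch = positions_per_epoch // batch_size       # 31,250 batches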
AndrewGrant
Posts: 1750
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Pytorch NNUE training

Post by AndrewGrant »

gladius wrote: Mon Nov 16, 2020 10:25 pm
AndrewGrant wrote: Mon Nov 16, 2020 9:38 pm
gladius wrote: Mon Nov 16, 2020 5:45 pm Yes, this comes from using the GPU (hooray pytorch!). As well, Sopel implemented a super fast C++ data parser, which feeds the inputs to pytorch as sparse tensors, which was a very large speedup (since the inputs to the first layer are super sparse).
What was the batch size for this? That makes a big difference. Batchsize=1, that is damn fast. Batchsize=1M, that is damn slow.
Batch size is 8192.
Interesting. Just about the same speed as me, running on the CPU with very many threads in a C program. Can I ask what your system is?
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
connor_mcmonigle
Posts: 530
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Pytorch NNUE training

Post by connor_mcmonigle »

xr_a_y wrote: Mon Nov 16, 2020 4:08 pm Where does this speed come from? 256M sfens in 10 minutes seems very fast. On my CPU, I need one day for 100M!
Is that only coming from GPU usage?
Hmm... With Seer's current training code, when I tested it a while ago, performance on an RTX 2080 TI was somewhere around 80,000-100,000 packed fens per second, though the CPU was proving a bit of a bottleneck even with 24 threads allocated to data loading. On my home system with a GTX 950, I'm getting more like ~15,000 packed fens per second. Therefore, even if you have a quite weak GPU, it should be possible to significantly outperform your past results with Seer's current training code.

In any case, my training code is currently not competitive with Gary's pytorch-nnue training code in terms of performance. This is both because I'm not using sparse tensors and because my data loading code is still written in Python.

I'm currently working on a new version of my training code which integrates more tightly with my engine by exposing a good chunk of the engine to Python as a module through PyBind11. This lets me efficiently load data in C++ and hand it to Python as sparse tensors, use the qsearch leaf mapping approach, and control RL data generation, all from within my Python training scripts. Additionally, I'm working towards making my incremental updates and board state -> input feature mapping completely generic, such that any subset of the set of quadratic board features (piece*piece relations) is supported (I'm inclined to doubt that king*piece relations are close to the optimal subset of quadratic features). As I explore more exotic input features, this will guarantee consistency between Python and my C++ implementation.
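To make the shape of that concrete, the training-script side of such a binding might look something like the sketch below. Everything here is hypothetical: the module name and the DataLoader/batch interface are invented for illustration and are not Seer's actual API.

Code: Select all

import torch
# Hypothetical extension module built with PyBind11; the name and the
# DataLoader/batch attributes below are invented for illustration.
import seer_train_ext as engine

BATCH_SIZE = 8192

loader = engine.DataLoader("data/games.bin", batch_size=BATCH_SIZE, threads=24)

for step in range(1000):
    batch = loader.next_batch()
    # The C++ side returns flat index arrays; Python only wraps them.
    x = torch.sparse_coo_tensor(
        torch.as_tensor(batch.feature_indices),   # shape (2, nnz): [position, feature]
        torch.ones(batch.num_active),
        (BATCH_SIZE, batch.num_features),
    )
    target = torch.as_tensor(batch.scores, dtype=torch.float32)
    # ...forward pass, loss, backward as usual...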

This new version, being less standalone, will, unfortunately, likely be far more difficult to introduce into Minic. My recommendation would be to fork Gary's pytorch-nnue code and get it to export Seer-style networks if you're interested in getting better performance. This shouldn't be too difficult, as Gary's code was originally roughly based around my training code (it's a bit like the Ship of Theseus at this point with all the improvements and general cleanup though :D ).
castlehaven
Posts: 5
Joined: Thu Jun 18, 2020 9:22 pm
Full name: Andrew Metrick

Re: Pytorch NNUE training

Post by castlehaven »

Are we all using “epoch” the same way in this discussion? In the nodchip trainer, the time to train depends (to a first approximation) on the total number of positions. In this case, 256M positions in ten minutes is darn fast, no matter how we break up the batches. The original nodchip syntax would define a full run through the data as batchsize * loop, but in the revised trainer the syntax for a single run through the data would be batchsize * epoch. But in the pytorch implementation, I get the sense that epoch means “full run through 256M positions”, so the nomenclature may be causing confusion (at least for me!).

To summarize my current understanding and ask for help if this is wrong, assume 100M total positions and a batch size of 1M. Then, in the original nodchip trainer, “loop 100” would take you through a full run of the data; in the new trainer, “epoch 100” would go once through the data; and in the new pytorch project, one “epoch” means 100 batches of 1M each, i.e. a full run through the 100M positions.

Is this right? The main reason I ask is that I am trying to figure out the impact on training quality of varying batchsize/loops (old syntax) while keeping the number of positions constant.
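If that reading is right, the bookkeeping works out the same in all three cases; a quick sketch with the numbers from the example above:

Code: Select all

total_positions = 100_000_000
batch_size = 1_000_000

batches_per_pass = total_positions // batch_size   # = 100

# Original nodchip trainer:  loop  = 100  -> one full pass through the data
# Revised nodchip trainer:   epoch = 100  -> one full pass through the data
# pytorch trainer:           1 epoch      -> 100 batches = one full pass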