Pytorch NNUE training

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Pytorch NNUE training

Post by gladius »

Progress is still being made! It's been much tougher than anticipated, but we are only 5 elo from the best SV net now. That SV net was then optimized using SPSA to produce the best master net, so we are still about 15 elo from master. Best results from @vondele (who has been training up a storm):

Code: Select all

   # PLAYER                 :  RATING  ERROR   POINTS  PLAYED   (%)  CFS(%)
   1 master                 :     0.0   ----  55594.5  106111    52     100
   2 SV                     :   -10.0    1.9  20633.5   42472    49      99
   3 run79/run2/scale361    :   -14.5    3.2   7668.5   16000    48     100
There have been a lot of small changes that have been tested extensively to get to this point:
1. A ton of experimentation on the best data to use. The current best data uses generated games with a fair bit of randomness, which are then rescored after the fact with a d9 (depth 9) pass.
2. Switching to the Ranger optimizer: https://github.com/lessw2020/Ranger-Dee ... -Optimizer
3. Visualizing the activations of the net, which revealed that the pytorch-trained nets had numerous dead neurons. That led to further tuning of the optimizer parameters; in particular, the Adam eps value had a large effect on the number of dead neurons (the suspicion is that this comes from the very low loss values we typically see during training). See https://github.com/glinscott/nnue-pytorch/issues/17 and https://github.com/official-stockfish/S ... ssues/3274 for background; a rough sketch of how these training settings fit together follows the list below.

E.g. compare the master net:
[activation image: master net]

vs the old pytorch net:
[activation image: old pytorch net]

4. Switching the scaling of the evaluations from 600 to 361 (I had mentioned this earlier, but it did end up being beneficial).
5. Introducing learning rate drops by 0.3x every 75 epochs.
6. Shuffling the input data. We do multiple passes through the data, so shuffling helped a lot.
7. Only using every Nth position in the data (e.g. N = 7). This helps each batch see a greater variety of games, rather than a run of positions from a single game. Since we make multiple passes through the data, we still see all of it, so this doesn't reduce the effective data size.
8. Reducing the batch size from 16384 to 8192.
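
Not the actual nnue-pytorch trainer, just a minimal PyTorch sketch of how several of the points above (shuffling, every-Nth-position sampling, the eps-sensitive optimizer, the 0.3x learning-rate drops, and the 8192 batch size) fit together. Ranger is replaced by plain Adam here, and the model, data, lr, eps and loss are placeholders rather than the tuned setup.

Code: Select all

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, Subset, TensorDataset

    # Placeholder two-layer net and random data; the real trainer uses the
    # NNUE feature transformer and the binpack data loader from nnue-pytorch.
    model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))
    data = TensorDataset(torch.randn(65536, 512), torch.randn(65536, 1))

    # Point 7: only keep every Nth position so a batch mixes many games.
    data = Subset(data, range(0, len(data), 7))

    # Points 6 and 8: shuffle, batch size reduced from 16384 to 8192.
    loader = DataLoader(data, batch_size=8192, shuffle=True)

    # Points 2 and 3: the real setup uses Ranger; plain Adam stands in here.
    # The eps value had a large effect on dead neurons given the very small
    # losses seen in training -- 1e-7 is a placeholder, not the tuned value.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-7)

    # Point 5: drop the learning rate by 0.3x every 75 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=75, gamma=0.3)

    SCALE = 361.0  # point 4: eval scaling changed from 600 to 361

    @torch.no_grad()
    def dead_neuron_fraction(batch):
        # Point 3: call a first-layer neuron "dead" if its ReLU output is
        # zero for every position in the batch.
        acts = torch.relu(model[0](batch))
        return (acts.max(dim=0).values == 0).float().mean().item()

    for epoch in range(300):
        for x, target_eval in loader:
            optimizer.zero_grad()
            # Compare prediction and target in sigmoid space, squashing the
            # evals by SCALE (an assumption about how the scaling enters the
            # loss, not the exact master loss).
            loss = nn.functional.mse_loss(
                torch.sigmoid(model(x) / SCALE),
                torch.sigmoid(target_eval / SCALE),
            )
            loss.backward()
            optimizer.step()
        scheduler.step()
        print(f"epoch {epoch}: dead neurons {dead_neuron_fraction(x):.1%}")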
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Pytorch NNUE training

Post by xr_a_y »

Did you check that the resulting weights are well distributed and within [-2, 2], so that quantization is OK?
Without L2 regularization, it is quite possible that some weights are too big to be quantized without being clamped.
But too much regularization (decay) will push the weights into a much smaller range than [-2, 2], and thus give poor precision after quantization.
My experiments with quantization in Minic showed me that int8 quantization is not an easy target at all.
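
For reference, a minimal PyTorch sketch of the kind of check being suggested: print each layer's weight range and how much would be clipped by int8 quantization. The weight scale of 64 (giving a representable range of roughly [-128/64, 127/64], i.e. about [-2, 2]) is an assumption based on the usual NNUE scheme; the real factors differ per layer.

Code: Select all

    import torch

    def int8_clip_report(model: torch.nn.Module, scale: float = 64.0) -> None:
        # With int8 weights and a quantization scale of 64, the representable
        # weight range is roughly [-128/scale, 127/scale] ~= [-2, 2].  Report,
        # per layer, the weight statistics and the fraction that would have to
        # be clamped.
        lo, hi = -128.0 / scale, 127.0 / scale
        for name, param in model.named_parameters():
            if not name.endswith("weight"):
                continue
            w = param.detach()
            clipped = ((w < lo) | (w > hi)).float().mean().item()
            print(f"{name}: min={w.min().item():.3f} max={w.max().item():.3f} "
                  f"std={w.std().item():.3f} clipped={clipped:.2%}")

If a noticeable fraction gets clipped, some weight decay or clamping during training is needed; if the weights instead sit in a much narrower range than [-2, 2], most of the int8 resolution goes unused and precision suffers, which is the trade-off described above.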