SF-NNUE - failed to store learned nn.bin

frankp · Post by **frankp** » Sat Aug 01, 2020 1:26 am

Downloaded the precompiled version of SF-NNUE and followed the recipe for producing a net - a small test run of 1M loops for the training data and 100k for the validation.
All seemed to progress fine, but no net was stored in ./evalsave/final when the process finished.
No idea why. The final output from the learning phase is listed below. If anyone else had this problem and can tell what I did wrong, I would be grateful.
(Everything, including the "stockfish....nnue-learn.exe" prog was in a test folder. With evalsave, trainingdata and validationdata as subfolders.)

//-----------------------------------------------------------------------------------------------------------------------
INFO: largest min activation = 0, smallest max activation = 0.210686
PROGRESS: Fri Jul 31 23:00:43 2020, 99000007 sfens, iteration 99, eta = 1, hirate eval = 20 , test_cross_entropy_eval = 0.358933 , test_cross_entropy_win = 0.683994 , test_entropy_eval = 0.220314 , test_entropy_win = 0.184951 , test_cross_entropy = 0.358933 , test_entropy = 0.220314 , norm = 1.17898e+08 , move accuracy = 25.731% , learn_cross_entropy_eval = 0.256236 , learn_cross_entropy_win = 0.712858 , learn_entropy_eval = 0.215104 , learn_entropy_win = 0.188667 , learn_cross_entropy = 0.256236 , learn_entropy = 0.215104
INFO: observed 39071 (out of 43979) features
INFO: (min, max) of pre-activations = -2.77732, 2.18353 (limit = 258.008)
INFO: largest min activation = 0, smallest max activation = 0.5501
INFO: largest min activation = 0.408566, smallest max activation = 0.155009
INFO: largest min activation = 0, smallest max activation = 0.23154

finalize..all threads are joined.
info string SkipLoadingEval set to true, Net not loaded!
Check Sum = 0
save_eval() start. folder = evalsave/final
PS D:\downloads\chess\stockfish-nnue-2020-07-19\test>

dkappe · Post by **dkappe** » Sat Aug 01, 2020 3:13 am

There are some bugs around this.

Search the learner's output for all instances of “evalsave”. The last time it rejects a checkpoint, it’ll restore from the last accepted net. That net will be in evalsave/9/nn.bin (just using 9 as an example).

frankp · Post by **frankp** » Sat Aug 01, 2020 9:57 am

Thanks for replying.
Nothing written anyway, as far as I can tell.
Repeated the process with the same result.
No idea what I am doing. Just following the readme recipe, so perhaps not a surprising result.

Joerg Oster · Post by **Joerg Oster** » Sat Aug 01, 2020 10:27 am

frankp wrote: ↑Sat Aug 01, 2020 9:57 am
Thanks for replying.
Nothing written anyway, as far as I can tell.
Repeated the process with the same result.
No idea what I am doing. Just following the readme recipe, so perhaps not a surprising result.

How large did you choose 'eval_save_interval'?
Did you ever see a message like

save_eval() start. folder = evalsave/4
save_eval() finished. folder = evalsave/4
loss: 0.230231 < best (0.231039), accepted

If after 99 iterations nothing was saved in 'evalsave' folder,
it can only mean this interval was set too large.

frankp · Post by **frankp** » Sat Aug 01, 2020 10:46 am

Just cut-and-pasted the Readme commands - without understanding. See below.
So ... eval_save_interval 250000000
Perhaps then did not have enough data for eval_save to be triggered. Loop=1M and validation 100k as a quick test
"gensfen depth 2 loop 1000000 use_draw_in_training_data_generation 1 eval_limit 32000"

//----------------------------------------------------------------------------
learn targetdir trainingdata loop 100 batchsize 1000000 use_draw_in_training 1 use_draw_in_validation 1 eta 1 lambda 1 eval_limit 32000 nn_batch_size 1000 newbob_decay 0.5 eval_save_interval 250000000 loss_output_interval 1000000 mirror_percentage 50 validation_set_file_name validationdata\val.bin

Joerg Oster · Post by **Joerg Oster** » Sat Aug 01, 2020 12:04 pm

frankp wrote: ↑Sat Aug 01, 2020 10:46 am
Just cut-and-pasted the Readme commands - without understanding. See below.
So ... eval_save_interval 250000000
Perhaps then did not have enough data for eval_save to be triggered. Loop=1M and validation 100k as a quick test
"gensfen depth 2 loop 1000000 use_draw_in_training_data_generation 1 eval_limit 32000"

//----------------------------------------------------------------------------
learn targetdir trainingdata loop 100 batchsize 1000000 use_draw_in_training 1 use_draw_in_validation 1 eta 1 lambda 1 eval_limit 32000 nn_batch_size 1000 newbob_decay 0.5 eval_save_interval 250000000 loss_output_interval 1000000 mirror_percentage 50 validation_set_file_name validationdata\val.bin

Try with eval_save_interval 10000000 instead. This will save a net file every 10th iteration in your case.
Of course, this is a shortcoming of the current code, which simply assumes there is a best net file available in one of the earlier save folders.
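
The arithmetic behind this suggestion can be sketched in a few lines of Python. This is only an illustration, not the learner's actual code: the helper name `checkpoints_written` is made up, and it assumes the learner writes one checkpoint each time the number of processed positions passes another full `eval_save_interval`. The total of 100M positions comes from `loop 100` over the 1M generated positions, which matches the 99000007 sfens shown at iteration 99 in the log.

```python
def checkpoints_written(total_sfens: int, eval_save_interval: int) -> int:
    """Periodic saves in a run, ASSUMING one save per full interval of positions."""
    return total_sfens // eval_save_interval


# Figures from the thread: "loop 100" over 1,000,000 generated positions.
total = 100 * 1_000_000

print(checkpoints_written(total, 250_000_000))  # README value    -> 0
print(checkpoints_written(total, 10_000_000))   # suggested value -> 10
```

Under that assumption the README value never triggers a save on such a small data set, while 10000000 gives ten checkpoints, i.e. one every 10th iteration.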

Another possibility is to use "eval_save_once" which will save the net file only once when the training is finished.
But I never tried this and don't know if this is working (although I think it should).

frankp · Post by **frankp** » Sat Aug 01, 2020 12:19 pm

Yes, reducing the eval_save_interval saves the intermediate nets.
But the final net still does not save.
Progress at least. Guess the files I am using are just too small.

//------------------------------------------------------------------------------------------------------------
setoption name EvalSaveDir value evalsave

learn targetdir trainingdata loop 100 batchsize 1000000 eta 1.0 lambda 0.5 eval_limit 32000 nn_batch_size 1000 newbob_decay 0.5 eval_save_interval 10000 loss_output_interval 1000000 mirror_percentage 50 validation_set_file_name validationdata\val.bin

Check Sum = 0
save_eval() start. folder = evalsave/7
save_eval() finished. folder = evalsave/7
loss: 0.187053 >= best (0.186981), rejected
restoring parameters from evalsave/5
converged

finalize..all threads are joined.
info string SkipLoadingEval set to true, Net not loaded!
Check Sum = 0
save_eval() start. folder = evalsave/final
PS D:\downloads\chess\stockfish-nnue-2020-07-19\test>

dkappe · Post by **dkappe** » Sat Aug 01, 2020 7:57 pm

If you are using the default settings and the run stopped because of two rejects in a row, then the net that would have been written as the final will be a previous checkpoint. So, if your last save was in 13, then your final net will be in evalsave/11/nn.bin

I have a bug report in for this.
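
Until that is fixed, the folder holding the usable net can be read off the learner's output. The following is a rough Python sketch under stated assumptions: the function name `last_accepted_folder` is hypothetical, the regular expressions are modelled only on the log lines quoted in this thread, and a "restoring parameters from ..." line is assumed to name the last accepted checkpoint. Other builds may print different messages.

```python
import re

# Patterns modelled on the log lines quoted in this thread (an assumption;
# other versions of the learner may word these messages differently).
SAVE_RE = re.compile(r"save_eval\(\) finished\. folder = (\S+)")
VERDICT_RE = re.compile(r"loss: \S+ (?:<|>=) best \(\S+\), (accepted|rejected)")
RESTORE_RE = re.compile(r"restoring parameters from (\S+)")


def last_accepted_folder(log_text):
    """Return the folder of the most recently accepted checkpoint, or None."""
    last_saved = None  # folder named by the latest "save_eval() finished" line
    best = None        # folder of the latest accepted (or restored) checkpoint
    for line in log_text.splitlines():
        m = SAVE_RE.search(line)
        if m:
            last_saved = m.group(1)
            continue
        m = VERDICT_RE.search(line)
        if m:
            if m.group(1) == "accepted" and last_saved is not None:
                best = last_saved
            continue
        m = RESTORE_RE.search(line)
        if m:
            # Assumed to point at the last accepted net.
            best = m.group(1)
    return best


log = """save_eval() start. folder = evalsave/7
save_eval() finished. folder = evalsave/7
loss: 0.187053 >= best (0.186981), rejected
restoring parameters from evalsave/5
converged"""

print(last_accepted_folder(log))  # evalsave/5
```

For the run above this points at evalsave/5, so the net to use would be evalsave/5/nn.bin.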
