SF-NNUE - failed to store learned nn.bin


frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

SF-NNUE - failed to store learned nn.bin

Post by frankp »

Downloaded the precompiled version of SF-NNUE and followed the recipe for producing a net - a small test run of 1M loops for the training data and 100k for the validation.
All seemed to progress fine, but no net was stored in ./evalsave/final when the process finished.
No idea why. The final output from the learning phase is listed below. If anyone else had this problem and can tell what I did wrong, I would be grateful.
(Everything, including the "stockfish....nnue-learn.exe" program, was in a test folder, with evalsave, trainingdata and validationdata as subfolders.)

//-----------------------------------------------------------------------------------------------------------------------
INFO: largest min activation = 0, smallest max activation = 0.210686
PROGRESS: Fri Jul 31 23:00:43 2020, 99000007 sfens, iteration 99, eta = 1, hirate eval = 20 , test_cross_entropy_eval = 0.358933 , test_cross_entropy_win = 0.683994 , test_entropy_eval = 0.220314 , test_entropy_win = 0.184951 , test_cross_entropy = 0.358933 , test_entropy = 0.220314 , norm = 1.17898e+08 , move accuracy = 25.731% , learn_cross_entropy_eval = 0.256236 , learn_cross_entropy_win = 0.712858 , learn_entropy_eval = 0.215104 , learn_entropy_win = 0.188667 , learn_cross_entropy = 0.256236 , learn_entropy = 0.215104
INFO: observed 39071 (out of 43979) features
INFO: (min, max) of pre-activations = -2.77732, 2.18353 (limit = 258.008)
INFO: largest min activation = 0, smallest max activation = 0.5501
INFO: largest min activation = 0.408566, smallest max activation = 0.155009
INFO: largest min activation = 0, smallest max activation = 0.23154

finalize..all threads are joined.
info string SkipLoadingEval set to true, Net not loaded!
Check Sum = 0
save_eval() start. folder = evalsave/final
PS D:\downloads\chess\stockfish-nnue-2020-07-19\test>
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: SF-NNUE - failed to store learned nn.bin

Post by dkappe »

There are some bugs around this.

Search for all instances of “evalsave.” The last time it rejects a checkpoint, it’ll restore from the last accepted net. It’ll be in evalsave/9/nn.bin (just using 9 as an example).
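
For anyone hunting for those checkpoints, here is a minimal Python sketch (not part of SF-NNUE) that lists every numbered subfolder of the save directory that contains an nn.bin. It assumes the layout seen in this thread, evalsave/<number>/nn.bin. The function name list_checkpoints is made up for illustration. The highest number is not necessarily the accepted net, since the training output in this thread shows a checkpoint being saved before it is rejected.

```python
from pathlib import Path
from typing import List, Tuple


def list_checkpoints(evalsave_dir: str) -> List[Tuple[int, Path]]:
    """Return (number, path to nn.bin) for each numbered checkpoint folder.

    Assumption: checkpoint folders are named with plain integers and the
    net file inside each one is called nn.bin, as in this thread.
    """
    root = Path(evalsave_dir)
    if not root.is_dir():
        return []
    found = []
    for sub in root.iterdir():
        net = sub / "nn.bin"
        # Skip non-numeric folders such as "final" and folders without a net.
        if sub.is_dir() and sub.name.isdigit() and net.is_file():
            found.append((int(sub.name), net))
    # Sort numerically so that 10 comes after 9, not after 1.
    return sorted(found, key=lambda item: item[0])


if __name__ == "__main__":
    for number, net_path in list_checkpoints("evalsave"):
        print(number, net_path)
```

Which of the listed nets was the last accepted one still has to be read from the "accepted" / "rejected" lines in the training output.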
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

Re: SF-NNUE - failed to store learned nn.bin

Post by frankp »

Thanks for replying.
Nothing written anyway, as far as I can tell.
Repeated the process with the same result.
No idea what I am doing. Just following the readme recipe, so perhaps not a surprising result :)
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: SF-NNUE - failed to store learned nn.bin

Post by Joerg Oster »

What value did you choose for 'eval_save_interval'?
Did you ever see a message like

save_eval() start. folder = evalsave/4
save_eval() finished. folder = evalsave/4
loss: 0.230231 < best (0.231039), accepted
If nothing was saved in the 'evalsave' folder after 99 iterations,
it can only mean this interval was set too large.
Jörg Oster
frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

Re: SF-NNUE - failed to store learned nn.bin

Post by frankp »

Just cut-and-pasted the Readme commands - without understanding. See below.
So ... eval_save_interval 250000000
Perhaps I then did not have enough data for eval_save to be triggered: loop 1M for training and 100k for validation, as a quick test.
"gensfen depth 2 loop 1000000 use_draw_in_training_data_generation 1 eval_limit 32000"

//----------------------------------------------------------------------------
learn targetdir trainingdata loop 100 batchsize 1000000 use_draw_in_training 1 use_draw_in_validation 1 eta 1 lambda 1 eval_limit 32000 nn_batch_size 1000 newbob_decay 0.5 eval_save_interval 250000000 loss_output_interval 1000000 mirror_percentage 50 validation_set_file_name validationdata\val.bin
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: SF-NNUE - failed to store learned nn.bin

Post by Joerg Oster »

Try eval_save_interval 10000000 instead. With your batchsize of 1000000, this will save a net file every 10th iteration.
Of course, this is a shortcoming of the current code, which simply assumes there is a best net file available in one of the earlier save folders.
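
The arithmetic behind that suggestion can be checked with a small Python sketch. It is not taken from the trainer's source. It assumes that a checkpoint is written each time the number of processed positions passes a multiple of eval_save_interval, and that each iteration consumes roughly batchsize positions (the log above reports 99000007 sfens at iteration 99). Both function names are hypothetical.

```python
def iterations_per_save(eval_save_interval: int, batchsize: int) -> float:
    """How many iterations pass between two checkpoint saves.

    Assumption: one iteration processes about `batchsize` positions.
    """
    return eval_save_interval / batchsize


def saves_reached(total_sfens: int, eval_save_interval: int) -> int:
    """How many save points fall within a run of `total_sfens` positions.

    Assumption: a save is triggered at every multiple of the interval.
    """
    return total_sfens // eval_save_interval


if __name__ == "__main__":
    total = 99_000_007  # sfens reported at iteration 99 in the log above
    print(saves_reached(total, 250_000_000))              # 0
    print(saves_reached(total, 10_000_000))               # 9
    print(iterations_per_save(10_000_000, 1_000_000))     # 10.0
```

Under these assumptions the Readme value of 250000000 is never reached in a 100-iteration run with batchsize 1000000, which would explain the empty 'evalsave' folder.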

Another possibility is to use "eval_save_once", which will save the net file only once, when the training is finished.
But I have never tried this and don't know whether it works (although I think it should).
Jörg Oster
frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

Re: SF-NNUE - failed to store learned nn.bin

Post by frankp »

Yes, reducing the eval_save_interval saves the intermediate nets.
But the final net still does not save.
Progress at least. Guess the files I am using are just too small.


//------------------------------------------------------------------------------------------------------------
setoption name EvalSaveDir value evalsave

learn targetdir trainingdata loop 100 batchsize 1000000 eta 1.0 lambda 0.5 eval_limit 32000 nn_batch_size 1000 newbob_decay 0.5 eval_save_interval 10000 loss_output_interval 1000000 mirror_percentage 50 validation_set_file_name validationdata\val.bin


Check Sum = 0
save_eval() start. folder = evalsave/7
save_eval() finished. folder = evalsave/7
loss: 0.187053 >= best (0.186981), rejected
restoring parameters from evalsave/5
converged

finalize..all threads are joined.
info string SkipLoadingEval set to true, Net not loaded!
Check Sum = 0
save_eval() start. folder = evalsave/final
PS D:\downloads\chess\stockfish-nnue-2020-07-19\test>
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: SF-NNUE - failed to store learned nn.bin

Post by dkappe »

If you are using the default settings and the run stopped because of two rejections in a row, then the net that would have been written as the final one is a previous checkpoint. So, if your last save was in 13, then your final net will be in evalsave/11/nn.bin.

I have a bug report in for this.
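
Until that is fixed, the folder holding the best net can be read off the training output. Below is a rough Python sketch that scans the console log for the three kinds of lines quoted in this thread ("save_eval() finished. folder = ...", "loss: ... accepted", and "restoring parameters from ..."). It relies only on those line formats as pasted here, so treat it as an assumption that they stay the same across builds. The function name best_net_folder is made up for illustration.

```python
import re
from typing import Optional

# Line formats copied from the training output quoted in this thread.
SAVE_RE = re.compile(r"save_eval\(\) finished\. folder = (\S+)")
RESTORE_RE = re.compile(r"restoring parameters from (\S+)")


def best_net_folder(log_text: str) -> Optional[str]:
    """Return the folder that should hold the best net, or None if unknown."""
    last_saved = None  # most recent folder a checkpoint was written to
    best = None        # most recent accepted (or restored) folder
    for raw in log_text.splitlines():
        line = raw.strip()
        saved = SAVE_RE.search(line)
        if saved:
            last_saved = saved.group(1)
            continue
        if line.startswith("loss:") and line.endswith("accepted") and last_saved:
            best = last_saved
            continue
        restored = RESTORE_RE.search(line)
        if restored:
            # After a rejection the trainer reports which folder it reloads.
            best = restored.group(1)
    return best


if __name__ == "__main__":
    sample = """
    save_eval() start. folder = evalsave/7
    save_eval() finished. folder = evalsave/7
    loss: 0.187053 >= best (0.186981), rejected
    restoring parameters from evalsave/5
    converged
    save_eval() start. folder = evalsave/final
    """
    print(best_net_folder(sample))  # prints evalsave/5
```

For the run posted above, this points at evalsave/5, so the net to use would be evalsave/5/nn.bin.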