I'm implementing a neural network to replace the evaluation of my engine.
I've built a dataset of 200M FENs, each labeled with the score of a depth-4 search by my engine.
For the optimization algorithm I have used Adagrad and plain gradient descent; the loss function is the sum of squared errors.
Training the net with these methods, I always end up with a large error.
The net trained this way gives a mean error of 200 cp, which is very high.
How can I improve the training?
I have tried shuffling the positions between iterations, but that gives only a very small improvement, and Adagrad gives results similar to plain gradient descent.
Do you have any suggestions to lower the error of the neural network?
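For what it's worth, the setup described above (a net regressing to search scores under a sum-squared-error loss, trained by plain gradient descent) can be sketched as follows. Everything here is a placeholder, not the actual engine: a single linear layer stands in for the network, and synthetic data stands in for the FEN features and depth-4 search scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the described setup: the inputs would be features
# extracted from a FEN, the target the depth-4 search score.  Both
# are synthetic here, and the "net" is a single linear layer.
X = rng.normal(size=(1000, 32))                    # 1000 "positions"
true_w = rng.normal(size=32)
y = X @ true_w + rng.normal(scale=1.0, size=1000)  # noisy "search scores"

w = np.zeros(32)
lr = 0.05

def sse(w):
    """Sum-squared-error loss over the whole set."""
    r = X @ w - y
    return float(r @ r)

losses = [sse(w)]
for epoch in range(50):
    grad = 2.0 * X.T @ (X @ w - y) / len(X)  # mean gradient of the SSE
    w -= lr * grad                           # plain gradient-descent step
    losses.append(sse(w))
```

If a loop like this does not drive the loss down steadily on a small synthetic problem, the bug is in the training code rather than in the data.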
Train a neural network evaluation
Moderators: hgm, Rebel, chrisw
-
- Posts: 217
- Joined: Fri Apr 11, 2014 10:45 am
- Full name: Fabio Gobbato
-
- Posts: 536
- Joined: Thu Mar 09, 2006 3:01 pm
Re: Train a neural network evaluation
More information about the network architecture and hyper-parameters (learning rate, etc) might provide some clues.
Just a quick stab, but depth 4 does not seem very deep relative to what the SF-NNUE Discord posts talk about.
It might not give enough improvement over your current eval.
-
- Posts: 70
- Joined: Tue Dec 31, 2019 2:52 am
- Full name: Kieren Pearson
Re: Train a neural network evaluation
I’m currently doing the same thing so here’s a few pointers.
First, use someone else's dataset before using your own. It's just one more thing that can go wrong, and actually getting quiet positions that are good for training is a fairly complex problem. I recommend the zurichess set with 725K positions; I used it successfully for Texel tuning, and it's working well for training nets at the moment, so I know that set works.
If Adagrad and gradient descent are giving really similar results, stick with GD until you start getting good results. No need to overcomplicate things until you're sure the training code is bug-free.
Are you using a sigmoid transformation (look up Texel tuning if you're unfamiliar)? I think it's probably better because the error isn't then dominated by extreme values.
Most likely, though, the issue is the training data. As mentioned, depth 4 may not be high enough. I would try fewer positions with a deeper search. Unless your network is quite large or deep, I would start simple with far fewer positions.
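The sigmoid transformation mentioned above is, in Texel-style tuning, a squashing of centipawn scores into a 0..1 win probability before computing the error. A small sketch; the scaling constant k = 400 is just a common placeholder, not a value fitted to any particular engine:

```python
import math

def cp_to_winprob(cp, k=400.0):
    """Texel-style squashing: map a centipawn score to a 0..1
    win probability.  k controls the slope and is normally fitted
    to the engine's own data; 400 here is just a placeholder."""
    return 1.0 / (1.0 + 10.0 ** (-cp / k))

# Squared error in probability space: a 600 cp disagreement between
# two already-winning scores (+900 vs +1500) costs almost nothing,
# while the same 600 cp gap around 0 dominates, which is exactly the
# intended weighting.
err_extreme = (cp_to_winprob(1500) - cp_to_winprob(900)) ** 2
err_middle  = (cp_to_winprob(300)  - cp_to_winprob(-300)) ** 2
```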
-
- Posts: 130
- Joined: Fri Jun 17, 2016 4:14 pm
- Location: Colorado, USA
- Full name: John Stanback
Re: Train a neural network evaluation
I am also trying to implement a NN eval for Wasp. I have an EPD file of about 30M positions from Wasp vs other engines, with depth ~15 scores and the game result. I randomly choose positions from this file and then do a 25-ply "playout" where, at each ply, one move is chosen at random and played. At each position I back-propagate the error between the standard Wasp eval and the NN eval to tweak the weights. I think the crazy random positions are necessary since the NN needs to train the weights for things like a white knight on a8, which will rarely happen in a real game. After I train the NN with about 500M of these random positions, it seems to help a bit to train with maybe 20-50M of the real game positions using a lower learning rate. I'm using a sigmoid activation and back-propagate the error as per a tutorial I found. I don't know what learning rate is typical, but I've mostly been using about 5e-4 for the random positions and 1e-4 for the real positions.
My simple NN of two layers with 4 and 2 nodes seems to be stuck a bit over 100 Elo worse than the normal Wasp eval. The inputs to this NN are similar to the features in the normal Wasp eval (i.e. PSTs, mobility counts, counts of attacks near the king, passed pawns on rank N that can/cannot advance), so it seems like it should be able to match the strength of Wasp. It trains really fast, maybe 500K positions/sec, so it takes only about 20 minutes to train a net from scratch, MUCH faster than my terribly slow code for Texel tuning. For some reason it doesn't help at all to increase the network size. It's been a lot of fun to experiment with, but I don't know how much more progress I'll make. It would be nice to find a way to train with something other than the normal Wasp eval as the target.
John
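The playout idea described above can be sketched roughly like this; `legal_moves` and `make_move` are hypothetical stand-ins for the engine's own move generator and make-move routines, not real APIs:

```python
import random

def random_playout(pos, legal_moves, make_move, plies=25, rng=random):
    """From a sampled position, play up to `plies` uniformly random
    moves, yielding each intermediate position.  `legal_moves` and
    `make_move` are stand-ins for the engine's own move generator
    and make-move routines (hypothetical interface)."""
    for _ in range(plies):
        moves = legal_moves(pos)
        if not moves:          # mate/stalemate ends the playout early
            break
        pos = make_move(pos, rng.choice(moves))
        yield pos

# Toy usage: "positions" are just integers and every position has
# two successors, so the playout always runs the full 25 plies.
walk = list(random_playout(0, lambda p: [p + 1, p + 2],
                           lambda p, m: m))
```

Each yielded position would then be scored by the standard eval and used as a training target for the net.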
-
- Posts: 217
- Joined: Fri Apr 11, 2014 10:45 am
- Full name: Fabio Gobbato
Re: Train a neural network evaluation
brianr wrote: ↑Tue Sep 01, 2020 3:21 pm
More information about the network architecture and hyper-parameters (learning rate, etc) might provide some clues.
Just a quick stab, but depth 4 does not seem very deep relative to what the SF-NNUE Discord posts talk about.
It might not be enough improvement from your current eval.
I'm trying a network similar to Stockfish NNUE, with the only difference that I have added the castling rights as an input.
The learning rate is 0.0001; I have tried various values and this seems the best.
I could try a deeper search, but the problem is that the error is quite high, and after 200M positions it doesn't seem to drop much further.
I've also tried different starting weights, with only small differences in the training results.
-
- Posts: 4319
- Joined: Tue Apr 03, 2012 4:28 pm
Re: Train a neural network evaluation
Fabio Gobbato wrote: ↑Tue Sep 01, 2020 2:25 pm
I'm implementing a neural network to replace the evaluation of my engine.
I've built a dataset of 200M FENs, each labeled with the score of a depth-4 search by my engine.
For the optimization algorithm I have used Adagrad and plain gradient descent; the loss function is the sum of squared errors.
Training the net with these methods, I always end up with a large error.
The net trained this way gives a mean error of 200 cp, which is very high.
How can I improve the training?
I have tried shuffling the positions between iterations, but that gives only a very small improvement, and Adagrad gives results similar to plain gradient descent.
Do you have any suggestions to lower the error of the neural network?
It's unclear what range of centipawn values your training and test sets have, but if you were training on win probability (0.0 to 1.0 range), then a completely untrained (random) net is going to come back with an average "error" of about 0.3, which could well translate to your 200 cp. So it's entirely possible the training process is completely broken somewhere, somehow; check first that any training at all is taking place before looking for improvements. Some architectures simply cannot extract any sense from the data presented, or there may just be a bug in the training code. I suggest you substitute in a known-working input format, a known-working layer architecture, a known-working learning rate and so on, and if that works, try substituting in your own way of organising the NN.
Ah! Edit: you're using something similar to NNUE, so disregard the above suggestion. Are you able to measure the "error" of the functional NNUE? It would at least give you some sort of target to aim at.
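The ~0.3 figure for a completely untrained net can be checked numerically: if both the guess and the target are spread uniformly over [0, 1], the expected absolute error is exactly 1/3. A quick Monte Carlo sketch:

```python
import random

random.seed(0)
n = 200_000
# Model an untrained net as a uniform random guess in [0, 1];
# targets are likewise spread over [0, 1].
errs = [abs(random.random() - random.random()) for _ in range(n)]
mean_abs_err = sum(errs) / n   # E|U1 - U2| = 1/3 for independent uniforms
```

An observed error near 0.3 on win-probability targets is therefore consistent with no learning having happened at all.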
-
- Posts: 536
- Joined: Thu Mar 09, 2006 3:01 pm
Re: Train a neural network evaluation
I am very familiar with Leela type networks and far less so with the NNUE nets.
That said, your LR seems extremely low for initial training (although no batch size was mentioned).
I suggest asking in the SF-NNUE Discord but I'm not sure how joining that works.
-
- Posts: 217
- Joined: Fri Apr 11, 2014 10:45 am
- Full name: Fabio Gobbato
Re: Train a neural network evaluation
chrisw wrote: ↑Tue Sep 01, 2020 6:13 pm
It's unclear what range of centipawn values your training and test sets have, but if you were training on win probability (0.0 to 1.0 range), then a completely untrained (random) net is going to come back with an average "error" of about 0.3, which could well translate to your 200 cp. So it's entirely possible the training process is completely broken somewhere, somehow; check first that any training at all is taking place before looking for improvements. Some architectures simply cannot extract any sense from the data presented, or there may just be a bug in the training code. I suggest you substitute in a known-working input format, a known-working layer architecture, a known-working learning rate and so on, and if that works, try substituting in your own way of organising the NN.
Ah! Edit: you're using something similar to NNUE, so disregard the above suggestion. Are you able to measure the "error" of the functional NNUE? It would at least give you some sort of target to aim at.
I don't think there is a bug, because if I train the net on only 10 positions the error goes down to 0. The problem comes when I train on 200M positions.
I have also tried win probability; the error drops to 10%, which is more or less a pawn, and it's difficult to get further improvements.
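For reference, the "10% is about a pawn" reading can be checked by inverting the Texel-style sigmoid. With the common (placeholder) scaling constant k = 400, a 0.10 probability gap around an equal position corresponds to roughly 70 cp; the exact figure depends on the fitted k:

```python
import math

def winprob_to_cp(p, k=400.0):
    """Inverse of the Texel sigmoid: win probability back to
    centipawns.  k=400 is a placeholder; a fitted k may differ."""
    return -k * math.log10(1.0 / p - 1.0)

# A 10% probability error around an equal position:
cp_gap = winprob_to_cp(0.60) - winprob_to_cp(0.50)
```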
-
- Posts: 4319
- Joined: Tue Apr 03, 2012 4:28 pm
Re: Train a neural network evaluation
Fabio Gobbato wrote: ↑Tue Sep 01, 2020 8:29 pm
I don't think there is a bug, because if I train the net on only 10 positions the error goes down to 0. The problem comes when I train on 200M positions.
I have also tried win probability; the error drops to 10%, which is more or less a pawn, and it's difficult to get further improvements.
10% is quite good.
-
- Posts: 217
- Joined: Fri Apr 11, 2014 10:45 am
- Full name: Fabio Gobbato
Re: Train a neural network evaluation
brianr wrote: ↑Tue Sep 01, 2020 6:39 pm
I am very familiar with Leela type networks and far less so with the NNUE nets.
That said, your LR seems extremely low for initial training (although no batch size was mentioned).
I suggest asking in the SF-NNUE Discord but I'm not sure how joining that works.
I update the weights after every sample, so the batch size is 1; I'm not sure whether that could be a problem.
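A batch size of 1 can indeed be part of the problem: averaging the gradient over a mini-batch damps the per-sample noise, which usually allows a larger stable learning rate. A toy sketch on a synthetic least-squares problem; none of these numbers come from the thread:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4096, 16))         # toy features
y = X @ rng.normal(size=16)             # noiseless toy targets

def train(batch_size, lr, epochs=5):
    """Mini-batch gradient descent on a toy least-squares problem."""
    w = np.zeros(16)
    idx = np.arange(len(X))
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            # averaging over the batch damps per-sample gradient noise
            grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return float(np.mean((X @ w - y) ** 2))

mse_batched = train(batch_size=64, lr=0.05)
```

With batch size 1, a step this large would be driven entirely by single-sample gradients and could easily be unstable, which may be why only a very small learning rate appeared to work.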