How do NNUEs self train?


Re: How do NNUEs self train?

Post by eboatwright »

hgm wrote: Thu Feb 22, 2024 7:31 pm Back-propagation is not very mysterious, is it? It is just a trick to speed up the thing that you want: change the weights in proportion to how much a (very small) change in each weight would change the output in the right direction (i.e., in proportion to the sensitivity of the output to that weight). So if the output is 9 times more sensitive to w1 than to w2, w1 will be adjusted 9 times as much as w2 in order to push the output in the desired direction.
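
A minimal sketch of that idea in Python (the names and the learning rate here are just illustrative, not taken from any particular engine): each weight is nudged by a small step proportional to its own sensitivity, i.e. its gradient.

Code: Select all

# Plain gradient-descent step: each weight moves in proportion to how
# sensitive the output/loss is to it, scaled by a small learning rate.
def update_weights(weights, grads, learning_rate=0.01):
    return [w - learning_rate * g for w, g in zip(weights, grads)]

# If the output is 9x more sensitive to w1 than to w2, then grads[0] is
# 9x grads[1], so w1 gets adjusted 9x as much as w2.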

The point is that in a layered NN, a change in a weight of an early layer would have to propagate through many subsequent layers to reach the output. E.g. if you change a weight in layer 1 (for a given input of the NN), this would affect the output of one cell in layer 1, and that would in turn affect the outputs of all cells in layer 2 that get input from that layer-1 cell. When you change another weight connecting to the same cell in layer 1 (but from another input), it would also change the output of that same layer-1 cell (but by a different amount), and you would have to propagate that change again through the entire network. And you would repeat that over and over again.

So it is much faster to calculate first once and for all how a small change in the output of that layer-1 cell would change the NN-output, and then just assume that a twice larger change would also cause twice the effect at the output, and half the change only half the effect, etc. If the changes are small enough, that becomes a good approximation. Then instead of doing the propagation from that output of layer 1 through all the subsequent layers, you just do a single multiplication of that change with the sensitivity for such a change that you calculated before. That saves tons of calculations.
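
A tiny numeric illustration of that approximation (the quadratic "downstream" function is just a stand-in for everything between one layer-1 cell and the output, not anything from a real net): compute the local sensitivity once, then estimate the effect of any small change at that cell with a single multiplication.

Code: Select all

# Stand-in for "everything downstream of one layer-1 cell".
def downstream(cell_output):
    return 3.0 * cell_output ** 2

cell0 = 0.5
sensitivity = 6.0 * cell0              # d(downstream)/d(cell), computed once

for d in (0.01, 0.02, -0.005):         # different small changes at that cell
    exact = downstream(cell0 + d) - downstream(cell0)
    approx = sensitivity * d           # one multiply, no re-propagation
    print(f"{d:+.3f}  exact={exact:+.6f}  approx={approx:+.6f}")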

So you just work your way back from the output of the NN. You look at the last step in the calculation of that output, to see what variables went into it, and determine the sensitivity of the output value to a minuscule change in each of the values it was based on. That (i.e. the ratio of the output change to the input change) is the sensitivity to that variable 'A'. Then you go one step back through the network, and look at how that variable 'A' (and all its brethren on which the output also depended directly) was calculated. And you calculate how sensitive each variable A was to changes in the variables 'B' that it was based on. And you just multiply that with the sensitivity of the overall output to A, to get how sensitive that output is to variable B. And so on, and so on. Once you have calculated how sensitive the output is to some intermediate result of the calculation, you never have to propagate the effect of an upstream change beyond that point anymore.
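
In code, that backward walk looks something like the sketch below: a toy two-layer net in Python/NumPy, where the ReLU activation, squared-error loss and layer sizes are all assumptions for illustration (the description above does not depend on any of them). Each sensitivity is computed exactly once and reused for everything further upstream.

Code: Select all

import numpy as np

# Toy net: x -> layer 1 (ReLU) -> scalar output y.
# Shapes: x (n_in,), W1 (n_hid, n_in), b1 (n_hid,), W2 (1, n_hid), b2 (1,).

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1              # layer-1 pre-activations
    a1 = np.maximum(z1, 0.0)      # layer-1 outputs (ReLU)
    y = W2 @ a1 + b2              # network output
    return z1, a1, y

def backward(x, z1, a1, y, target, W2):
    # Walk back from the output, computing each sensitivity exactly once.
    dy  = 2.0 * (y - target)      # sensitivity of squared error to y
    dW2 = np.outer(dy, a1)        # ... to the last-layer weights
    db2 = dy
    da1 = W2.T @ dy               # ... to the layer-1 outputs ('A')
    dz1 = da1 * (z1 > 0)          # ... pushed back through the ReLU
    dW1 = np.outer(dz1, x)        # ... to the layer-1 weights ('B')
    db1 = dz1
    return dW1, db1, dW2, db2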

At the end of this you will have the sensitivity of the output to all the NN inputs (which is of no importance, since this input was a given), and to all the weights. The latter information you use to tune the weights.
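
Tying it together with the sketch above (the sizes and learning rate are arbitrary, and forward/backward are the toy functions from the previous snippet): the gradients for the weights drive the update, while the sensitivity to the inputs is simply discarded.

Code: Select all

rng = np.random.default_rng(0)
n_in, n_hid, lr = 8, 16, 0.01
W1, b1 = 0.1 * rng.standard_normal((n_hid, n_in)), np.zeros(n_hid)
W2, b2 = 0.1 * rng.standard_normal((1, n_hid)), np.zeros(1)

x, target = rng.standard_normal(n_in), np.array([0.3])   # one training sample

z1, a1, y = forward(x, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(x, z1, a1, y, target, W2)

W1 -= lr * dW1;  b1 -= lr * db1      # tune the weights with their gradients
W2 -= lr * dW2;  b2 -= lr * db2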
Thanks for the explanation! I got it working a couple of days ago. I wrote my own Matrix class instead of using NumPy, though, so I might even just port it to Rust for the extra speed; currently I'm tinkering with the training parameters.
I think I've got something good enough: it's trained on ~7 million self-play positions so far. I'll let it run for a few more days and then try it out :D
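
For reference, training on a set of self-play positions usually boils down to a loop roughly like the sketch below (reusing the toy forward/backward from above; the batch size, learning rate and layout are hypothetical and this is not Maxwell's actual code).

Code: Select all

def train(positions, targets, W1, b1, W2, b2,
          epochs=10, batch_size=16384, lr=0.001):
    # positions: (N, n_in) encoded inputs; targets: (N,) evaluation labels.
    n = len(positions)
    rng = np.random.default_rng(1)
    for epoch in range(epochs):
        order = rng.permutation(n)                   # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            gW1 = np.zeros_like(W1); gb1 = np.zeros_like(b1)
            gW2 = np.zeros_like(W2); gb2 = np.zeros_like(b2)
            for i in batch:                          # accumulate gradients
                z1, a1, y = forward(positions[i], W1, b1, W2, b2)
                dW1, db1, dW2, db2 = backward(positions[i], z1, a1, y,
                                              targets[i:i+1], W2)
                gW1 += dW1; gb1 += db1; gW2 += dW2; gb2 += db2
            scale = lr / len(batch)
            W1 -= scale * gW1; b1 -= scale * gb1     # averaged SGD step
            W2 -= scale * gW2; b2 -= scale * gb2
    return W1, b1, W2, b2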
Creator of Maxwell