How do NNUEs self train?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

op12no2
Posts: 491
Joined: Tue Feb 04, 2014 12:25 pm
Full name: Colin Jenkins

Re: How do NNUEs self train?

Post by op12no2 »

eboatwright wrote: Thu Feb 15, 2024 8:11 pm Although after thinking it through some more, I think I might just end up learning PyTorch to train the network, all this derivative calculating and back-propagation is going waaaayyy over my head :mrgreen:
I found this down-to-earth article very useful:-

https://alexander-schiendorfer.github.i ... kprop.html
User avatar
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: How do NNUEs self train?

Post by lithander »

eboatwright wrote: Thu Feb 15, 2024 5:14 pm I'm doing this to learn, so I'd like to write all the training code myself.
I did this excellent online course once: https://www.coursera.org/learn/machine-learning-course/
Right afterwards I implemented a simple artificial neural network from scratch based on what I learned. It identifies the correct class [0..9] of handwritten digits from a labeled dataset. I tried to keep it as minimal as possible (one file, 300 lines of C# code), which is a great exercise if you want to see whether you really understood a concept. (I also implemented a tiny, slow Bitcoin miner. And MinimalChess was supposed to be like that too but... well... :roll: :lol: )

Anyways, the problem is that 10 years later these projects are just a distant memory. You retain the basic concepts but lose the ability to implement it all from scratch. It's a practically pointless skill anyway, because we have powerful higher-level APIs at our disposal and never need to reinvent that particular wheel.

From the chess programming standpoint the NN architecture is really important and making the inference fast is also important. Chess programmers have spent a lot of thought on their labeled data (generation and filtering methods) and on the specific network architecture they want to use. Now they just need a tool to solve that computationally big but fairly unexciting optimization problem of getting the weights. This is why many engines share the same GPU trainers.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
User avatar
eboatwright
Posts: 41
Joined: Tue Jan 09, 2024 8:38 pm
Full name: E Boatwright

Re: How do NNUEs self train?

Post by eboatwright »

op12no2 wrote: Fri Feb 16, 2024 10:12 am
eboatwright wrote: Thu Feb 15, 2024 8:11 pm Although after thinking it through some more, I think I might just end up learning PyTorch to train the network, all this derivative calculating and back-propagation is going waaaayyy over my head :mrgreen:
I found this down-to-earth article very useful:-

https://alexander-schiendorfer.github.i ... kprop.html
Thanks! I'll check it out
Creator of Maxwell
User avatar
eboatwright
Posts: 41
Joined: Tue Jan 09, 2024 8:38 pm
Full name: E Boatwright

Re: How do NNUEs self train?

Post by eboatwright »

lithander wrote: Fri Feb 16, 2024 11:34 am
eboatwright wrote: Thu Feb 15, 2024 5:14 pm I'm doing this to learn, so I'd like to write all the training code myself.
I did this excellent online course once: https://www.coursera.org/learn/machine-learning-course/
Right afterwards I implemented a simple artificial neural network from scratch based on what I learned. It identifies the correct class [0..9] of handwritten digits from a dataset of handwritten numerals. I tried to keep it as minimal as possible (one file, 300 lines of C# code) which is a great exercise if you want to see if you really understood a concept. (I also implemented a tiny, slow Bitcoin miner. And MinimalChess was also supposed to be like that but... well... :roll: :lol: )

Anyways, the problem is that 10 years later these projects are just a distant memory. You only memorize basic concepts but lose the ability to implement it all from scratch. It's a practically pointless skill anyway because we have powerful higher level APIs at our disposal and never need to reinvent that particular wheel.

From the chess programming standpoint the NN architecture is really important and making the inference fast is also important. Chess programmers have spent a lot of thought on their labeled data (generation and filtering methods) and on the specific network architecture they want to use. Now they just need a tool to solve that computationally big but fairly unexciting optimization problem of getting the weights. This is why many engines share the same GPU trainers.
Yeah, I think I've learned a lot of "objectively useless" skills over the years :lol: but I always end up learning something else related that ends up being useful
E.g. I didn't expect making a Chess engine to be so educational, but here I am!

Although I think now that I understand most of the basic concepts of NN training, I'm ok with using a higher-level framework, although I haven't completely given up! I'm gonna try to use NumPy to write a trainer in Python, before falling back onto PyTorch or something related
Creator of Maxwell
User avatar
eboatwright
Posts: 41
Joined: Tue Jan 09, 2024 8:38 pm
Full name: E Boatwright

Re: How do NNUEs self train?

Post by eboatwright »

I do have another question though... if in self-play you're training the AI to predict a win, loss or draw (1, -1 or 0), how does that then get converted into centipawns?
Especially if you're using an activation function on the output layer
Creator of Maxwell
User avatar
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: How do NNUEs self train?

Post by lithander »

Maybe you shouldn't start with neural networks right away, but with "texel tuning" some piece-square tables from labeled positions. It uses the same general principles, like gradient descent, but without hidden layers you don't need backpropagation. It's simpler and thus easier to understand and write from scratch!

It's also easily and quickly computable on the CPU. Many engines (including Leorik) do that kind of tuning for their HCE. Skipping that step makes it harder to really understand NNUE evals imo.

Generally, you'll want to minimize the error between the prediction of your network/HCE-eval and the labels over the entire training dataset. So you typically convert from the linear eval scale (centipawns) into winning probabilities.

Code: Select all

double MeanSquareError(List<Data> data, float scalingCoefficient)
{
	double squaredErrorSum = 0;
	foreach (Data entry in data)
	{
		var eval = Evaluation(entry.Position);
		float error = entry.Result - Sigmoid(eval, scalingCoefficient);
		squaredErrorSum += error * error;
	}
	return squaredErrorSum / data.Count;
}
You can do that with a sigmoid function like this:

Code: Select all

float Sigmoid(float eval, float scalingCoefficient)
{
	//maps an eval given in centipawns to a win probability in [-1..1]
	return (float)(2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1);
}
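And to go back the other way (from a win probability to centipawns) you just invert that function. A quick sketch in Python (since you mentioned writing the trainer with NumPy/Python) — function names are made up for illustration, and the [-1..1] scaling matches the C# version above:

```python
import math

def sigmoid(eval_cp, scale):
    # centipawn eval -> win probability in [-1..1] (same form as the C# above)
    return 2.0 / (1.0 + math.exp(-eval_cp / scale)) - 1.0

def inverse_sigmoid(p, scale):
    # win probability in (-1..1) -> centipawn eval (solves the sigmoid for eval)
    return -scale * math.log(2.0 / (p + 1.0) - 1.0)
```

So an engine trained on [-1..1] outcomes can still report centipawns by pushing the raw network output through the inverse, or (more commonly) by just scaling the raw pre-activation output.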
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
User avatar
eboatwright
Posts: 41
Joined: Tue Jan 09, 2024 8:38 pm
Full name: E Boatwright

Re: How do NNUEs self train?

Post by eboatwright »

lithander wrote: Sat Feb 17, 2024 12:17 pm Maybe you shouldn't start with neural networks right away but with "texel tuning" some piece-square tables from labeled positions. It uses the same general principles like gradient descent but without hidden layers you don't need backpropagation. It's simpler and thus easier to understand and write from scratch!

It's also easily and quickly computable on the CPU. Many engines (including Leorik) do that kind of tuning for their HCE. Skipping that step makes it harder to really understand NNUE evals imo.

Generally, you'll want to minimize the error between the prediction of your network/HCE-eval and the labels over the entire training dataset. So you typically convert from the linear eval scale (centipawns) into winning probabilities.

Code: Select all

double MeanSquareError(List<Data> data, float scalingCoefficient)
{
	double squaredErrorSum = 0;
	foreach (Data entry in data)
	{
		var eval = Evaluation(entry.Position);
		float error = entry.Result - Sigmoid(eval, scalingCoefficient);
		squaredErrorSum += error * error;
	}
	return squaredErrorSum / data.Count;
}
You can do that with a sigmoid function like this:

Code: Select all

float Sigmoid(float eval, float scalingCoefficient)
{
	//maps an eval given in centipawns to a win probability in [-1..1]
	return (float)(2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1);
}
Yeah, I knew about Texel Tuning, but I just thought it'd be way more fun to tackle NNUE. I understand most of how it all works: I just found this series by "The Coding Train", watched the sections about back-prop, and I'm actually really close to getting it working (he explains things like he's explaining to a 5 year old, and that was just enough to get it into my head lol)

That's interesting....
So then the error is applied along every PST value, just like you apply gradients to weights?

And I might be getting this wrong, here's what I got from this:
during training, the NN's output is fed through sigmoid (but offset so it's between -1 and 1 instead of 0 and 1) and is a win probability, because you're training it off the self-play data, which gives the game's outcome, instead of an individual evaluation

(Simple enough, you're training the network to predict the outcome of the game)

But then in the engine, the sigmoid function is either removed and the output is multiplied by a constant scaling value, or mapped from win probability to centipawns by applying the inverse of this Sigmoid function you gave:

Code: Select all

float Sigmoid(float eval, float scalingCoefficient)
{
	//maps an eval given in centipawns to a win probability in [-1..1]
	return (float)(2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1);
}
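Here's how I picture the PST part concretely: with a linear eval, each PST entry is just a weight, and its gradient is the error times how often that entry applies. A minimal Python sketch, assuming a flat PST vector and positions stored as (feature counts, game result in [-1..1]) — all the names here are made up for illustration:

```python
import math

def sigmoid(eval_cp, scale):
    # centipawn eval -> win probability in [-1..1]
    return 2.0 / (1.0 + math.exp(-eval_cp / scale)) - 1.0

def tune_step(pst, positions, scale, lr):
    # One gradient-descent step on flat PST weights, minimizing the
    # mean squared error between sigmoid(eval) and the game results.
    grad = [0.0] * len(pst)
    for features, result in positions:
        # linear eval: dot product of feature counts and PST weights
        eval_cp = sum(f * w for f, w in zip(features, pst))
        p = sigmoid(eval_cp, scale)
        error = p - result
        # derivative of this sigmoid w.r.t. eval works out to (1 - p^2) / (2 * scale)
        dsig = (1.0 - p * p) / (2.0 * scale)
        for i, f in enumerate(features):
            grad[i] += 2.0 * error * dsig * f  # d(error^2)/d(weight i)
    n = len(positions)
    return [w - lr * g / n for w, g in zip(pst, grad)]
```

So every PST entry gets nudged by its own share of the error, exactly like weights in a network; backprop only becomes necessary once hidden layers sit between the weights and the output.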
Thanks again for all your help! Sorry if that's just completely wrong and I completely misunderstood haha
Creator of Maxwell
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: How do NNUEs self train?

Post by mvanthoor »

lithander wrote: Fri Feb 16, 2024 11:34 am I did this excellent online course once: https://www.coursera.org/learn/machine-learning-course/
Right afterwards I implemented a simple artificial neural network from scratch based on what I learned. It identifies the correct class [0..9] of handwritten digits from a dataset of handwritten numerals.
I'm going to save that link. I know how genetic algorithms work and I know the basics of neural networks, but my knowledge of the latter is sketchy. It's been too long since I studied this at university. If you can implement something after watching this course, it should be a good one.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
User avatar
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: How do NNUEs self train?

Post by lithander »

mvanthoor wrote: Thu Feb 22, 2024 1:19 pm
lithander wrote: Fri Feb 16, 2024 11:34 am I did this excellent online course once: https://www.coursera.org/learn/machine-learning-course/
Right afterwards I implemented a simple artificial neural network from scratch based on what I learned. It identifies the correct class [0..9] of handwritten digits from a dataset of handwritten numerals.
I'm going to save that link. I know how genetic algorithms work and I know the basics of neural networks, but for the last one my knowledge is sketchy. It's too long ago I studied this in university. If you can implement something after watching this course, it should be a good one.
It's a proper university-level course, with assignments and all. But everything was rather high-level, using MATLAB.

My implementation is in C#, no dependencies and ~300 lines of code:
https://github.com/lithander/Minimal-Ne ... Program.cs
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
User avatar
hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How do NNUEs self train?

Post by hgm »

Back-propagation is not very mysterious, is it? It is just a trick to speed up the thing that you want: change the weights in proportion to how much a (very small) change in each weight would change the output in the right direction (i.e., to the sensitivity of the output to that weight). So if the output is 9 times more sensitive to w1 than to w2, w1 will be adjusted 9 times as much as w2 in order to push the output in the desired direction.

The point is that in a layered NN, a change in a weight of an early layer would have to propagate through many subsequent layers to reach the output. E.g. if you change a weight in layer 1, (for a given input of the NN), this would affect the output of one cell in layer 1, and that would again affect the outputs of all cells in layer 2 that get input from the layer-1 cell. When you change another weight connecting to the same cell in layer 1 (but from another input), it would also change the output of that same layer-1 cell (but by a different amount), and you would have to propagate that change again through the entire network. And you would repeat that over and over again.

So it is much faster to calculate first once and for all how a small change in the output of that layer-1 cell would change the NN-output, and then just assume that a twice larger change would also cause twice the effect at the output, and half the change only half the effect, etc. If the changes are small enough, that becomes a good approximation. Then instead of doing the propagation from that output of layer 1 through all the subsequent layers, you just do a single multiplication of that change with the sensitivity for such a change that you calculated before. That saves tons of calculations.

So you just work your way back from the output of the NN. You look at the last step in the calculation of that output, to see what variables went into it, and determine the sensitivity of the output value to a minuscule change in each of the values it was based on. That ratio of output change to input change is the sensitivity to that variable 'A'. Then you go one step back through the network, and look at how that variable 'A' (and all its brethren on which the output also depended directly) was calculated. And you calculate how sensitive each variable A was to changes in the variables 'B' that it was based on. You just multiply that with the sensitivity of the overall output to A, to get how sensitive that output is to variable B. And so on, and so on. Once you have calculated how sensitive the output is to some intermediate result of the calculation, you never have to propagate the effect of an upstream change beyond that point anymore.

At the end of this you will have the sensitivity of the output to all the NN inputs (which is of no importance, since this input was a given), and to all the weights. The latter information you use to tune the weights.
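That back-to-front bookkeeping can be shown on a deliberately tiny net (one hidden cell, made-up numbers): compute the sensitivity of the output to the last step first, then reuse it for everything upstream instead of re-propagating each change.

```python
import math

def forward(x, w1, w2):
    # tiny 2-layer net: hidden h = tanh(w1 * x), output y = w2 * h
    h = math.tanh(w1 * x)
    return h, w2 * h

def sensitivities(x, w1, w2):
    # backward pass: start at the output and work upstream
    h, y = forward(x, w1, w2)
    dy_dh = w2                   # last step: y = w2 * h
    dy_dw2 = h                   # sensitivity of the output to w2
    dh_dw1 = (1.0 - h * h) * x   # tanh'(z) = 1 - tanh(z)^2
    dy_dw1 = dy_dh * dh_dw1      # chain rule: reuse dy_dh, don't re-propagate
    return dy_dw1, dy_dw2
```

Nudging w1 by a tiny eps and re-running the forward pass changes y by roughly dy_dw1 * eps, which is a handy sanity check for any backprop implementation.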