NNUE accessible explanation

fierz · Post by **fierz** » Tue Jul 21, 2020 9:48 pm

I just came across the thread on Stockfish NNUE in the general discussion - sounds interesting! Does anyone know about an accessible description of the algorithm/NN structure? I saw there is a paper in Japanese but deepL is not very good yet at translating Japanese I noticed...

jorose · Post by **jorose** » Thu Jul 23, 2020 4:12 pm

Hi Martin, cool to see you around here again!

It is indeed a bit difficult to find non-Japanese sources for NNUE and it put me off a bit in the passed.

You can find a nice compendium of sources here. Still a lot of them are in Japanese, but I still think its quite useful.

Based on my understanding the main trick is to store the position in the form of KingSquarexPieceTypexSquare for each side. So you have a boolean input for whether there is a black queen on d3 and white king on f5, for example. There are 5 non-king pieces times two for the two sides giving us 10 piecetypes total. This gives us 40'960 inputs per side, of which at most 32 are non-zero. When a piece moves you can incrementally update the output for the input of the first hidden layer.

This is a bit simplified as I know there are more inputs. Notably, my understanding is nodchip also has as input for each square whether a capture occured there.

This calculation is done for each side giving you n outputs for white and n outputs for black. Depending on whether it is white or black to more the outputs are concatenated differently. So either [white outputs, black outputs] or [black outputs, white outputs]. The rest of the network is then just a standard fully connected feedforward network.

This encoding allows for the first layer to be calculated very efficiently while still having a large amount of weights and the most important relation of the pieces (to the two respective kings!) being stored explicitly.

Gerd Isenberg · Post by **Gerd Isenberg** » Thu Jul 23, 2020 5:48 pm

fierz wrote: ↑Tue Jul 21, 2020 9:48 pm I just came across the thread on Stockfish NNUE in the general discussion - sounds interesting! Does anyone know about an accessible description of the algorithm/NN structure? I saw there is a paper in Japanese but deepL is not very good yet at translating Japanese I noticed...

Not sure where this paper was mentioned in Stockfish NNUE discussions, maybe TCEC chat:
Tunable Efficient Unitary Neural Networks (EUNN) and their application toRNNs

fierz · Post by **fierz** » Fri Jul 24, 2020 7:58 pm

Hi Jonathan and Gerd,

thanks for your replies!
@Jonathan: thanks for the explanation with the concatenation of the inputs in different orders; I saw a graph in the general discussion forum which showed that something was different depending on which side was to move, and I didn't understand how this was done, but now it is clear.
One of the things I never could understand is how people choose a NN topology - number of layers, number of nodes per layer, fully connected or convolutional with stride N, whether to include a Relu type layer, and what type (relu or sigmoid etc) etc. It all seems so arbitrary...

cheers
Martin

brianr · Post by **brianr** » Fri Jul 24, 2020 9:07 pm

It is a combination of art and science.

Very little is actually known about what works absolutely best, and things are still in the early stages.
For some years there have been NN engines with nets built around specific chess info like attack maps and such (like Giraffe).
With AZ the more generic net approach was shown to be very good.

The high-level idea is that resnets (a type of net arch) are becoming very good at image processing.
So, think of the chess board like an image. Instead of only 3 colors and lots of image pixels, chess has many "image colors" like for each piece and side, and far fewer image size pixels (8x8). But the magic is that the nets manage to learn quite a lot about chess anyway.

The latest thing is SF-NNUE which I think uses fully connected layers which are somewhat different yet again.
It's all moving very fast and there is more excitement now in comp chess than there has been for many years.

But, which net arch, how many layers, what size, how long to train, what learning rate to use, etc. are all still very much being experimented with. It might be fair to say it is like the medieval age of alchemy more than what chemistry is today.

jorose · Post by **jorose** » Fri Jul 24, 2020 10:22 pm

I think in general a lot of why certain architectures are chosen over others boils down to "because we tried others and this worked the best". That's not to say there is no justification, just that it's not black and white. Perhaps, however, I can give you a bit of my intuition with different aspects.

Convolutional Networks and Stride
Generally, when working in the image space, convolutional neural networks are superior to fully connected networks, because they are easier to train (they have some implicit regularization due to weight re-use) and their modularity makes them more suitable for GPU computation. Since chessboards are very small (8x8 vs 1080x1920 for a 1080p image) it's less clear that this is the case for our game. Convolutional networks usually use fully connected layers at the end as they try to move from the image space to a more general feature space. Stride > 1 is mostly a tool for dimensionality reduction. For chess this is pretty much useless, as there is little reason to want to reduce to 4x4 or smaller. I would much sooner just switch to fully connected layers.

Filter Size
For general problems filter size selection is a slightly more interesting question, but even there 3x3 filter is very popular, so you should have a very good reason if you are doing something else. I actually have experimented with other values for this in Winter, but I'm not in the mood for getting into those details at the moment.

Activation Functions
There is a lot of research trying to find the most optimal activation function. The goal is usually to find better gradients in order to be able to train even larger and more complex neural networks. In recent literature people have been training with hundreds of layers. OpenAI recently released information about their latest GPT-3 natural language model. It has 175 billion parameters, which would take 700GB to store assuming 32 bit precision. This is all to say the research being done is mostly to push boundaries and solve problems you probably do not have. A solid rule of thumb is to use relu for all layers with exception of the output layer. The output layer should be one of linear, sigmoid, softmax, tanh or relu, depending on what the data demands. Relu is a very good activation function as it is very efficient to compute and suffers much less from the vanishing gradient problem when compared to sigmoid.

Layer Size and Neurons Per Layer
The optimal number of layers and number of neurons per layer is going to be very problem specific. More layers increases computational complexity and the number of parameters linearly in the number of layers while simultaneous allowing for the neurons to contain higher level information. The main downside is more layers tend to be quite a bit harder to train. This is more problematic for fully connected networks than for convolutional networks, but is definitely problematic in both cases. More neurons per layer will increase the number of parameters in a quadratic way. If your network is very small, an increase in the number of units may not have too big an impact. For larger networks, I believe computational complexity will increase quadratically as well, but don't quote me on that. Important is that you need enough units to represent the information you want the network to learn.

Overparametrization
As a side note, most state of the art neural networks are heavily overparametrized. Since for most problems, we only care about reducing the horrible training times, there is not as much work on reducing this issue. Thanks to a desire to have neural networks on mobile devices and the progressively larger networks we are able to train this has changed a bit, but nevertheless, for chess this is actually much more important. We care very much about inference time as better inference time implies more nodes per second for our search algorithm. The size of the network and the inference time are related, but its not a one to one relation.

AlphaZero and LC0 Network Architectures
AlphaZero and LC0 are both based on the well known ResNet neural network architecture. AlphaZero introduced a dual headed output for policy and evaluation. The LC0 team extended to use the SE feature and I would imagine many other ideas I missed as I haven't been following too closely. At their heart however, the architectures are quite standard for image recognition.

NNUE Architecture
The NNUE network, in my understanding, is very non-standard and completely designed for being efficient for the task at hand. The input layer is heavily overparametrized which is normally a bad thing, but due to the known sparsity, it is actually very efficient to compute. The number of layers and neurons after that is kept low, in order too much computational burden. This makes it extremely fast to compute relative to the LC0 network while still having a fairly high amount of power.

Winter Architecture
As a final note, the Winter NN has two main parts. The first part is a non-standard convolutional neural network which uses sparsity similarly to the NNUE network. This convolutional network is used to calculate pawn structure features, so the output can be reused very often as it gets stored in a separate hash table with a high hitrate. The second part is a fully connected network which has as input the output of the convolutional network as well as a set of handcrafted features standard to classical engines such as SF. This set of features is mostly a subset of the features from before I added neural networks to Winter.

Hopefully this clarifies a lot of questions. As this actually took some time to type up I might further extend and clean this up to make a separate post for people interested in getting into neural networks for chess-like games.

fierz · Post by **fierz** » Fri Jul 24, 2020 11:14 pm

Dear Jonathan,

thanks for the very detailled explanations, they are much appreciated!
I will contribute to your idea of writing something for newbies to NNs by asking some more stupid newbie questions which you can then answer and make an even better introduction

1) If I want the NN to return an evaluation, which could be used in a traditional alphabeta search engine, how is this achieved most commonly? Do I have a last hidden layer that feeds into a single output neuron which uses e.g. a tanh activation function so -1 would be a loss and +1 would be a win? Or is there some other "standard" approach?

2) Do I code my NN myself in C or do I use libraries for this? If libraries, which ones (BLAS?)? It would seem to me that it is very easy to implement the NN code "by hand" (just a few matrix multiplications) but perhaps it will be inefficient if not using the right libraries?

3) When it comes to training, all that I know so far is Texel's tuning method which I recently used to improve my checkers engine (http://www.fierz.ch/cake186.php) by using logistic regression on win-loss-draw information on a few million (N) positions to improve the weights of my handwritten eval function. So essentially I do some kind of gradient descent for a rather small handful of parameters (a few 100). When training an NN, I read that the relu activation has the advantage that its derivative is easy to compute, but I'm not sure if/how I would need to use that. If I think of Texel's tuning method, I would set up some small NN to start with, and try to do the same as I did there = calculate the output of the NN for all N positions, calculate the error vs the game results; then change the weights of all parameters layer by layer starting from the last one, by a small amount to calculate a gradient, and do the same I did before? Is this totally wrong (because I don't calculate any derivatives here?)?

4) From reading about Stockfish NNUE I get the impression that they are not doing a regression vs game results, but rather vs the evaluation of a Stockfish search to X ply and try to learn that rather than the game result, which is different then the Texel tuning method. Is this distincition of trying to learn search eval vs trying to learn game results actually relevant or are the two +/- equal?

Sorry for the really stupid questions... but perhaps other people have them too...

best regards
Martin-who-didn't-realize-you-were-in-Indiana! Hope you are doing well there!

fabianVDW · Post by **fabianVDW** » Sat Jul 25, 2020 11:25 am

I can chime in on two of the points and maybe provide a bit of a mathematical background, although I am unsure how much of this you know already.

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm
1) If I want the NN to return an evaluation, which could be used in a traditional alphabeta search engine, how is this achieved most commonly? Do I have a last hidden layer that feeds into a single output neuron which uses e.g. a tanh activation function so -1 would be a loss and +1 would be a win? Or is there some other "standard" approach?

There are several ways to go about this, and it all depends on what you are training your net on. What you describe is what can be done during training, you could also do this with a sigmoid which ranges from 0 to 1, and interpret the output as a probability. If you want to use the "Centipawn" score or any other linear score internally for your engine, you would then need to take the inverse sigmoid/tanh of your output, which is the same as doing no activation at all for the last layer. So for inference it might make sense to turn off the activation of the last layer. It is to be noted here that for centipawn, it might make sense to multiply the inverse sigmoid of the output by a factor of K, where K can range from 50-200. Another approach is to have a Win-Draw-Loss output which is done with 3 neurons and a softmax activation(don't quote me on that, not sure), so all three probabilities would be predicted independently, which has some advantages over only having one probability representing the expected game outcome.

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm 3) When it comes to training, all that I know so far is Texel's tuning method which I recently used to improve my checkers engine (http://www.fierz.ch/cake186.php) by using logistic regression on win-loss-draw information on a few million (N) positions to improve the weights of my handwritten eval function. So essentially I do some kind of gradient descent for a rather small handful of parameters (a few 100).

Texel's tuning method is actually very similar to training a dense neural network. You can think of your evaluation as a one-layer neural network with no activation function for the most part, and one output neuron(or maybe two, if you do a phased evaluation). Of course, most engines have some non linear stuff in their evaluation, most typically found in attack evaluation.

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm When training an NN, I read that the relu activation has the advantage that its derivative is easy to compute, but I'm not sure if/how I would need to use that. If I think of Texel's tuning method, I would set up some small NN to start with, and try to do the same as I did there = calculate the output of the NN for all N positions, calculate the error vs the game results; then change the weights of all parameters layer by layer starting from the last one, by a small amount to calculate a gradient, and do the same I did before? Is this totally wrong (because I don't calculate any derivatives here?)?

The gradient is essentially the same as the derivative. From what I learned, by definition the gradient is the transpose of the derivative vector, but I know there are definitions where it is exactly the same. You can think of your NN(with a fixed input) as a function F: R^n -> R, where R are the real numbers, and n is the dimension, so the amount of weights of your network. Then F is differentiable for the most part, and the gradient of F at a vector x in R^n (denoted as (Grad F)(x)) will be showing in the direction of steepest ascent, which is easy to prove from cauchy-schwarz inequality and definitions of a derivative. Similarly, -1 * (Grad F)(x) will show in the direction of the steepest descent. This is what justifies the gradient descent method. Now there are two ways of calculating the gradient.

What you are doing is approximating the gradient by the difference qoutient and the fact that (Grad F)(x) = ( (partial_1 F)(x), ..., (partial_n F)(x))^T.
So the definition of (partial_j F)(x) = lim as s-> 0 of (F(x + s*e_j) - F(x))/ s, where e_j is a unit vector in R^n. What you are doing is essentialy approximating this limit by explicitly calculating (F(x+1 * e_j) - F(x)) / 1, so plugging in s = 1( or other values for s). This works pretty well in practice, but has two disadvantages:
1) This approach scales linearly in dimension n (because for each partial derivative you will have to recalculate F(x+ s*e_j). This is particulary bad for NN, where parameter sizes usually explodes.
2) It is just an approximation, and especially for Neural Networks there may be some numerical instability for doing this for instance in the first layer, while your network has, say 8-9 layers.
3) A big oversight in this approach is the following: "Why is F even differentiable?" If you ask yourself this question, you will find out that with this approach you will not think about this. You can "approximate" the "gradient" of any function like that, even if it does not make sense at all, because not any function is differentiable.

What is done instead is calculating the Gradient of F analytically with derivative rules, so the chain rule, addition rule and so forth, taking advantage of the fact that we can formulate F(x) analytically with sums(or matrix multiply). Using the chain rule, you will find that you will indeed have to take the derivative of the Relu Function, which is easier to calculate than the derivative of sigmoid. You can look up the formulas or try to come up with them yourselves, but if you use a framework like Tensorflow/Keras or pytorch to train a network, you will not be confronted with this, and all your left to do might be implementing the inference in the language of your choice.

jorose · Post by **jorose** » Mon Aug 03, 2020 6:53 am

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm 1) If I want the NN to return an evaluation, which could be used in a traditional alphabeta search engine, how is this achieved most commonly? Do I have a last hidden layer that feeds into a single output neuron which uses e.g. a tanh activation function so -1 would be a loss and +1 would be a win? Or is there some other "standard" approach?

That is pretty much the standard approach for a single output value. For training purposes you can consider using sigmoid instead of tanh and then switching to either tanh or just using the logit without activation for gameplay. In this case you would use cross entropy loss and assign 1/0.5/0 as outputs for WDL.

For multiple output values (eg W/D/L) the standard approach would be to use a softmax activation function paired with cross-entropy loss. In Winter I have two outputs (W/W+D which trivially imply W/D/L) which are activated by sigmoid function with cross entropy loss. Whether this non-standard approach is still worth it at the moment, I cannot say.

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm 2) Do I code my NN myself in C or do I use libraries for this? If libraries, which ones (BLAS?)? It would seem to me that it is very easy to implement the NN code "by hand" (just a few matrix multiplications) but perhaps it will be inefficient if not using the right libraries?

For training I would recommend using some high level library such as TensorFlow 2.0 (which I currently use) or Pytorch. This has a couple of benefits. You can easily experiment with different architectures containing any number of state of the art features you might not want to implement unless the architecture looks like a clear improvement. It is a natural sanity check as you can compare the output of a network in your engine as well as from the library for some fixed input. Finally, you don't need to implement any training algorithms yourself. To me at least, the training algorithms are significantly more complicated than just implementing support for inference, so it has a bigger potential for bugs as well as potentially making your codebase bigger and convoluted.

As for inference, different projects rely on support on different levels.

I don't know any projects which rely on high level libraries directly for neural network support. I think that is a fine route if you don't need the absolute maximum of performance and customization options. The main downside is that something like NNUE, where you are doing some unusual tricks, might be hard or even impossible to do. Furthermore, you might not actually save a lof of work if you are working with a low level language. I find documentation a bit lacking for C/C++ in combination with higher level libraries and most of the functions you have to implement for inference are not actually that complicated anyways.

On GPU, LC0 relies on Cuda and CUDNN or alternatively on OpenCL. If you want to support GPUs, I would suggest you do the same. Support for GPUs is likely not worth it for you if you implement some NNUE based system or just a small fully connected network.

On CPU, LC0 relies on Blas. Afaik Blas backends can be used interchangeably and share the same basic interface.

Allie relies on the LC0 code for its Neural Network implementation. I think this has been a minor point of discussion in terms of debates on cloning, but that is outside of the scope of this thread. I mainly mention it here, because all engines I am mentioning in this post are open source and it might give you another reference point.

I don't think SF-NNUE relies on Blas, instead working directly with optimized code and SIMD CPU instructions.

Giraffe relies on Eigen. I am not really familiar with Eigen aside from knowing that it is a general purpose linear algebra library. If we are lucky, Matthew Lai could pop in and explain its benefits and whether he would recommend using that.

Winter doesn't rely on an external backend for inference and in fact doesn't even explicitly use SIMD instructions. The primary reason being that the last time I checked the assembly code, the compiler seemed to be doing that reasonably well. Somewhere in the middle of my TODO list is to add explicit support for SIMD instructions, so I don't have to trust the compiler anymore.

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm 3) When it comes to training, all that I know so far is Texel's tuning method which I recently used to improve my checkers engine (http://www.fierz.ch/cake186.php) by using logistic regression on win-loss-draw information on a few million (N) positions to improve the weights of my handwritten eval function. So essentially I do some kind of gradient descent for a rather small handful of parameters (a few 100). When training an NN, I read that the relu activation has the advantage that its derivative is easy to compute, but I'm not sure if/how I would need to use that. If I think of Texel's tuning method, I would set up some small NN to start with, and try to do the same as I did there = calculate the output of the NN for all N positions, calculate the error vs the game results; then change the weights of all parameters layer by layer starting from the last one, by a small amount to calculate a gradient, and do the same I did before? Is this totally wrong (because I don't calculate any derivatives here?)?

Based on your writing I think you are close. Perhaps you should read up some more on backpropogation as you should be calculating the derivatives at every layer. Backpropogation is more or less just a clever and optimized use of the chain rule for derivatives.

It might help to write out the neural network in the form of L(A(B(C(x))),y) where L is the loss function, A,B and C are the layers with their respective weight parameters, x is the input and y is the target label. Then think about how you would calculate the derivative to update the parameters in A, B and C respectively.

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm 4) From reading about Stockfish NNUE I get the impression that they are not doing a regression vs game results, but rather vs the evaluation of a Stockfish search to X ply and try to learn that rather than the game result, which is different then the Texel tuning method. Is this distincition of trying to learn search eval vs trying to learn game results actually relevant or are the two +/- equal?

The distinction is somewhat relevant. For more information you can read up on temporal difference learning, which is more generally what SF-NNUE is doing. Giraffe did another form of this and I have previously done temporal difference learning in Winter as well.

The main advantages are that such an approach is less data reliant (you can train with positions where you don't have a game outcome) and your labels could potentially end up being more accurate. E.g. game result alone cannot really quantify if a position is unbalanced, but dynamically equal.

Temporal difference learning and other forms or reinforcement learning have their own issues though and I would recommend at least starting out with pure supervised learning of game outcomes. Once you know that is working, you can try out temporal difference learning or self play for comparison.

At some point the best in Winter was labelling positions as some weighted average between a low depth search and the actual game outcome.

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm Martin-who-didn't-realize-you-were-in-Indiana! Hope you are doing well there!

Indeed I am in Indiana and doing ok. I just moved out of my previous apartment and will be moving into my new one after a brief stay with friends. Hopefully I will be able to visit everyone back home for a month or so around Christmas!

Milos · Post by **Milos** » Mon Aug 03, 2020 8:21 am

jorose wrote: ↑Mon Aug 03, 2020 6:53 am As for inference, different projects rely on support on different levels.

I don't know any projects which rely on high level libraries directly for neural network support. I think that is a fine route if you don't need the absolute maximum of performance and customization options. The main downside is that something like NNUE, where you are doing some unusual tricks, might be hard or even impossible to do. Furthermore, you might not actually save a lof of work if you are working with a low level language. I find documentation a bit lacking for C/C++ in combination with higher level libraries and most of the functions you have to implement for inference are not actually that complicated anyways.

On GPU, LC0 relies on Cuda and CUDNN or alternatively on OpenCL. If you want to support GPUs, I would suggest you do the same. Support for GPUs is likely not worth it for you if you implement some NNUE based system or just a small fully connected network.

On CPU, LC0 relies on Blas. Afaik Blas backends can be used interchangeably and share the same basic interface.

Allie relies on the LC0 code for its Neural Network implementation. I think this has been a minor point of discussion in terms of debates on cloning, but that is outside of the scope of this thread. I mainly mention it here, because all engines I am mentioning in this post are open source and it might give you another reference point.

I don't think SF-NNUE relies on Blas, instead working directly with optimized code and SIMD CPU instructions.

Giraffe relies on Eigen. I am not really familiar with Eigen aside from knowing that it is a general purpose linear algebra library. If we are lucky, Matthew Lai could pop in and explain its benefits and whether he would recommend using that.

Nice write-up Johnatan.
Here are a couple of additional clarifications.
SF-NNUE indeed uses handcrafted SIMD code mainly because it has a lot of sparse update operations. Here is how it actually looks like:
https://github.com/nodchip/Stockfish/bl ... nsformer.h

Eigen is very good for more complex operations (beyond winograd convolutions, matrix-matrix multiply and dot products) on a single thread. On multi-threads nothing much except general matrix-matrix products is implemented.
In general using Blas as backend is a much safer bet.

P.S. Btw. do you by any chance know Celestine Dünner from your ETHZ days?

NNUE accessible explanation

NNUE accessible explanation

Re: NNUE accessible explanation

Re: NNUE accessible explanation

Re: NNUE accessible explanation

Re: NNUE accessible explanation

Re: NNUE accessible explanation

Re: NNUE accessible explanation

Re: NNUE accessible explanation

Re: NNUE accessible explanation

Re: NNUE accessible explanation