How to get started with NNUE

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

arjunbasandrai
Posts: 10
Joined: Thu Oct 26, 2023 8:41 pm
Full name: Arjun Basandrai

How to get started with NNUE

Post by arjunbasandrai »

Hello Chess Enthusiasts!

I hope this post finds you well. I am currently developing the Shuffle Chess Engine and have recently made the decision to integrate NNUE (Efficiently Updatable Neural Network) into my project. I'm facing some challenges in finding the right resources to get started.

I am particularly interested in gaining insights into the required position encoding for NNUE and understanding the architecture-building process. It seems that NNUE operates differently from conventional neural networks, and the concept of horizontally mirrored buckets is still a bit unclear to me.

If any of you have experience with NNUE implementation or can point me towards valuable resources, I would be immensely grateful. Whether it's tutorials, documentation, or insightful forum threads, any information that can aid me in comprehending and effectively implementing NNUE in my Shuffle Chess Engine would be highly appreciated.

Thank you in advance for your time and assistance. I look forward to your valuable insights!

Best regards,
Arjun Basandrai
smatovic
Posts: 2667
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: How to get started with NNUE

Post by smatovic »

I can recommend Neural Networks for Chess by Dominik Klein:

https://github.com/asdfjkl/neural_network_chess

It covers the Lc0 CNN and SF NNUE (chapter 4.6).

Since version 12, Stockfish has gone through several different network architecture iterations.

--
Srdja
Ciekce
Posts: 125
Joined: Sun Oct 30, 2022 5:26 pm
Full name: Conor Anstey

Re: How to get started with NNUE

Post by Ciekce »

I highly recommend asking questions in the Stockfish or Engine Programming Discords - there are plenty of people in both who can provide a lot of help (especially on this specific topic; there have been a lot of new NNUE engines recently), and they're much better sources of up-to-date information than talkchess.

sf: https://discord.gg/GWDRS3kU6R
engine programming: https://discord.gg/YctB2p4
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: How to get started with NNUE

Post by mvanthoor »

smatovic wrote: Fri Jan 12, 2024 5:06 pm I can recommend Neural Networks for Chess by Dominik Klein:

https://github.com/asdfjkl/neural_network_chess

It covers the Lc0 CNN and SF NNUE (chapter 4.6).

Since version 12, Stockfish has gone through several different network architecture iterations.

--
Srdja
Thanks for linking that. I didn't know about it. I saved it as it may be useful in the future :)
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
JacquesRW
Posts: 56
Joined: Sat Jul 30, 2022 12:12 pm
Full name: Jamie Whiting

Re: How to get started with NNUE

Post by JacquesRW »

I think it’s important to point out that if you’re looking for a beginner’s guide to NNUE, SF (even the early network architectures) is not exactly a great place to start. Your first networks have no need to be bucketed, horizontally mirrored, or to have more than one hidden layer.

In fact a large portion of NNUE engines still use the “standard” (768 -> N)x2 -> 1 architecture, no buckets, 1 hidden layer. Input buckets are pretty common, but additional hidden layers are really only seen in top engines.
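
For reference, here is a rough C++ sketch (with made-up names) of what inference with that standard architecture can look like once the two perspective accumulators are maintained; the clipped-ReLU activation and int16 quantisation shown here are common choices, not a fixed requirement:

#include <algorithm>
#include <cstdint>

constexpr int N = 256; // hidden layer size, engine-specific

// One accumulator per perspective (index 0 = white, 1 = black),
// kept incrementally updated as moves are made and unmade.
struct Accumulator {
    int16_t values[2][N];
};

int16_t output_weights[2][N]; // [0] = side to move, [1] = the other side
int32_t output_bias;

int32_t evaluate(const Accumulator& acc, int side_to_move) {
    int32_t sum = output_bias;
    for (int i = 0; i < N; ++i) {
        // clipped ReLU on each accumulator entry, then dot product with the output weights
        int32_t us   = std::clamp<int32_t>(acc.values[side_to_move][i], 0, 255);
        int32_t them = std::clamp<int32_t>(acc.values[side_to_move ^ 1][i], 0, 255);
        sum += us * output_weights[0][i] + them * output_weights[1][i];
    }
    return sum; // rescale according to your quantisation scheme
}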
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: How to get started with NNUE

Post by Sopel »

dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: How to get started with NNUE

Post by lithander »

JacquesRW wrote: Sat Jan 13, 2024 7:55 pm In fact a large portion of NNUE engines still use the “standard” (768 -> N)x2 -> 1 architecture, no buckets, 1 hidden layer.
Let's start with the input: 6 pieces x 2 colors x 64 squares = 768 inputs.

Each potential piece-square combination has its own slot in the input array (or, as you say in math, vector).
If a piece is on a square, you write a 1 in the associated slot of the input vector. This describes the current board in terms that a neural network can reason with, and it means that of the 768 input values at most 32 have the value 1 and all the others have the value 0.
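
A rough sketch of how those 768 slots could be indexed (the colour/piece/square ordering here is just an assumed convention, any consistent one works):

// 768 = 2 colours * 6 piece types * 64 squares
// colour: 0 = white, 1 = black; piece: 0 = pawn .. 5 = king; square: 0 = a1 .. 63 = h8
int feature_index(int colour, int piece, int square) {
    return colour * 384 + piece * 64 + square;
}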

This "sparseness" is what makes it possible to implement the neural network in a way that is efficiently updatable. In theory, running inference on an NN involves big, costly matrix multiplications, but due to the above property of the input vector we can do better: instead of multiplying all of our weights with lots of zeros, we can just take the subset of weights that are going to be multiplied with 1 and ignore the rest. The multiplication of a weight with 1 can also be skipped, so in essence, for each 1 in the input vector we take N weights and add them up in the accumulator.
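
In code, rebuilding (refreshing) the accumulator from scratch then boils down to copying the biases and adding the weight row of every active feature, roughly like this (a sketch; ft_weights and ft_biases are hypothetical names for the first-layer parameters):

#include <cstdint>
#include <vector>

constexpr int N = 256;
int16_t ft_weights[768][N]; // first-layer weights, one row of N values per input feature
int16_t ft_biases[N];

void refresh(int16_t accumulator[N], const std::vector<int>& active_features) {
    for (int i = 0; i < N; ++i)
        accumulator[i] = ft_biases[i];
    for (int feature : active_features)      // at most 32 features are active
        for (int i = 0; i < N; ++i)
            accumulator[i] += ft_weights[feature][i];
}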

Another interesting observation is that when we move a piece and look at what that means for the input vector, we realize only two numbers change. The number associated with that piece on the old square changes from 1 to 0, and the number associated with the piece on the new square changes from 0 to 1. So instead of starting from scratch each time a piece moves, we take the previous accumulator state and adjust it: subtract the set of weights associated with the old piece-square input and add the set of weights associated with the piece being present on the new square.
A similar thing can be done for captures, castling etc.
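
Continuing the sketch above (same hypothetical ft_weights layout), a quiet move then becomes one subtraction and one addition of weight rows, and a capture just subtracts one more (and in practice you do this for both perspective accumulators):

void move_piece(int16_t accumulator[N], int from_feature, int to_feature) {
    for (int i = 0; i < N; ++i)
        accumulator[i] += ft_weights[to_feature][i] - ft_weights[from_feature][i];
}

void capture_move(int16_t accumulator[N], int from_feature, int to_feature, int captured_feature) {
    for (int i = 0; i < N; ++i)
        accumulator[i] += ft_weights[to_feature][i]
                        - ft_weights[from_feature][i]
                        - ft_weights[captured_feature][i];
}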

Implementing these tricks to make the neural network (NN) efficiently updatable (UE) gives massive speed increases over the naive matrix-multiplication approach. How fast the evaluation is also depends on the size of the hidden layer N. This size determines the size of the accumulator and linearly increases the cost of computing the evaluation: twice the hidden layer size makes your eval twice as costly.

N can be as small as 16 and already be competitive with a handcrafted eval (at least in Leorik's case). Yes, just 16 "neurons" outsmart my HCE. ;)

N can also be much larger than 16, but then you rely on the better evaluation to outperform the loss in speed. These large networks require much larger quantities of training data to saturate, so it's a good idea to start small and increase the hidden layer size as you gather more training data.

Another must-have is that you make use of the 256-bit wide hardware registers that all modern CPUs have. In C# you have to vectorize manually, but in C++ you can just write for-loops and expect the compiler to auto-vectorize the code for you. In any case you really want to use SIMD, otherwise your NNUE implementation is going to be too slow to be viable: weights are typically encoded as 16-bit integers, so you can do 16 operations at the same time, which I found to be a ~12x speed increase in practice.
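
For illustration, here is roughly what the explicit AVX2 version of "add one row of weights to the accumulator" looks like (a sketch; with plain loops and suitably aligned arrays a good C++ compiler will usually generate something equivalent on its own):

#include <immintrin.h>
#include <cstdint>

constexpr int N = 256;

void add_row_avx2(int16_t* accumulator, const int16_t* weights) {
    for (int i = 0; i < N; i += 16) { // 16 x int16 per 256-bit register
        __m256i acc = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(accumulator + i));
        __m256i w   = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(weights + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(accumulator + i),
                            _mm256_add_epi16(acc, w));
    }
}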

To answer the OP's question of how to get started, I would suggest the following steps, which have worked well for me:

- Pick a chess engine with a simple NNUE architecture as explained above and read the inference code to understand what's going on.
- Modify your own engine so that it can evaluate positions using the same architecture and network file. It should now produce the same static evaluation result on all positions.
- For didactic reasons you could start with a naive, math-guided implementation based on matrix multiplications. Then you can implement the tricks, verify that they indeed produce the same result, and understand why.
- When you have the borrowed net frankensteined onto your own engine, the next step is to train such a net yourself, ideally based on data you've generated from selfplay games.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: How to get started with NNUE

Post by mvanthoor »

lithander wrote: Thu Jan 18, 2024 2:02 pm Let's start with the input: 6 pieces x 2 colors x 64 squares = 768 inputs.
Thanks. I'll be saving this post. I'll need it in another 3 years.

Good explanation. It seems easy, to some extent. (At least, it seems so, knowing the basics of neural networks and genetic algorithms)

The only thing I regret now is that I didn't go for a 32-core CPU last year to train and test faster, but this 16-core one was expensive enough already :oops:
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
arjunbasandrai
Posts: 10
Joined: Thu Oct 26, 2023 8:41 pm
Full name: Arjun Basandrai

Re: How to get started with NNUE

Post by arjunbasandrai »

lithander wrote: Thu Jan 18, 2024 2:02 pm Let's start with the input: 6 pieces x 2 colors x 64 squares = 768 inputs. [...]
Thanks so much for the explanation! It will definitely come in handy when I start work on the neural net. Currently I am in the process of implementing Texel Tuning in my engine, and once that's done I will start building the NNUE.

As a side note, could you please take a look at this post I made regarding some weird behaviour during Texel Tuning?

viewtopic.php?f=7&t=83196
A4+5
Posts: 1
Joined: Thu Mar 07, 2024 8:59 am
Full name: Tianlu SHEN

Re: How to get started with NNUE

Post by A4+5 »

smatovic wrote: Fri Jan 12, 2024 5:06 pm I can recommend Neural Networks for Chess by Dominik Klein:

https://github.com/asdfjkl/neural_network_chess

It covers the Lc0 CNN and SF NNUE (chapter 4.6).

Since version 12, Stockfish has gone through several different network architecture iterations.

--
Srdja
Thank you so much for sharing this book.