On the design of NNUE and possible ML approaches for evaluation

Discussion of chess software programming and technical issues.

Moderator: Ras

lir
Posts: 3
Joined: Wed Jan 07, 2026 3:25 pm
Full name: Lazar Radu

On the design of NNUE and possible ML approaches for evaluation

Post by lir »

Hello everybody. First, I'd like to preface this by apologizing if this has been discussed before or if I break any of the forum's rules (I read the "please read before posting" thread, but still, point out anything that doesn't conform). Also, sorry if this is not focused on exactly one question.

As far as I know, NNUE works as follows:
- Features, which are (piece, square, king position) tuples, are embedded into a small-ish (I imagine) space R^k; each feature essentially becomes a vector of real numbers
- The vectors are summed together to get the input to the neural net (addition is commutative, so it does not matter in which order you specify the pieces)
- A small, shallow network of fully connected layers does the processing and outputs the scores for black and white.

The backwards pass not only updates the neural net itself, but also the embedding layer, which in the forward pass works just like a lookup table.
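
To make my mental model concrete, here is roughly what I picture the forward pass to be (all names and sizes are made up, not taken from any real engine):

#include <algorithm>
#include <vector>

// made-up sizes: one learned vector of K reals per (piece, square, king position) feature
constexpr int NUM_FEATURES = 64 * 64 * 10;  // roughly: king square x piece square x piece type (exact feature sets vary)
constexpr int K = 256;

float embedding[NUM_FEATURES][K];   // the lookup table learned during training

// sum the vectors of the active features; order does not matter
void accumulate(const std::vector<int>& active_features, float out[K]) {
    std::fill(out, out + K, 0.0f);
    for (int f : active_features)
        for (int i = 0; i < K; ++i)
            out[i] += embedding[f][i];
    // "out" would then go through a small fully connected net to produce the score
}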

That being said, for anybody more inclined towards deep learning, here is what is unclear to me:

1.1) Is the training supervised? That is to say, do we have a dataset of positions and their scores and just train on those? That sounds bad in principle: it cannot become better than the dataset allows.
1.2) I am thinking of the following approach: let the engine (the one using this evaluation) play matches against itself. We could use a loss that minimizes the difference between the score at move N and at move N + 1 (or 2, 3, I don't know), alongside a loss term incorporating the result of the game. Is something like this used? It wouldn't be supervised in the usual sense. (I sketch the loss I have in mind below, after question 2.)

2) While I understand that it is 100% a necessity for the network to be small, for performance reasons, I would assume that it cannot model particularly complex functions. I would guess that if anything of substance is learnt (say, movement of the pieces or patterns of piece placement), it is more courtesy of the embedding than of the small MLP. Representation learning seems important here. Perhaps self-supervised methods would be useful for learning latent representations of the data. The first untested idea that comes to mind is something similar to CLIP (https://arxiv.org/abs/2103.00020), where we have an encoder for positions, and two positions that are one or two moves apart end up close in latent space. I am just rambling at this point, but a small MLP on top of features learnt via a self-supervised task seems like a great idea.
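
Coming back to 1.2, the loss I have in mind would be something like this (lambda is just a made-up weighting factor, and the game result is mapped to, say, -1/0/+1):

loss = sum over n of [ (eval(pos_n) - eval(pos_{n+1}))^2 + lambda * (eval(pos_n) - result)^2 ]

i.e. a temporal-difference-style term pulling consecutive scores towards each other, plus a term pulling every score towards the final outcome.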

I will give my ideas a go after uni exams are over and after I finish a simple engine to put this on top of. For now, I'm asking in the hope that something similar has been done before that I wasn't able to find online.

Thanks in advance!
ovenel
Posts: 11
Joined: Mon Apr 24, 2023 4:11 pm
Full name: Nelson Overboe

Re: On the design of NNUE and possible ML approaches for evaluation

Post by ovenel »

I'm not going to pretend to know enough about NNUE to give an answer to your questions. But Stockfish does have very extensive documentation about their implementation that you may find helpful.

https://official-stockfish.github.io/do ... /nnue.html
lir
Posts: 3
Joined: Wed Jan 07, 2026 3:25 pm
Full name: Lazar Radu

Re: On the design of NNUE and possible ML approaches for evaluation

Post by lir »

That's so in-depth, thanks!
Aleks Peshkov
Posts: 974
Joined: Sun Nov 19, 2006 9:16 pm
Location: Russia
Full name: Aleks Peshkov

Re: On the design of NNUE and possible ML approaches for evaluation

Post by Aleks Peshkov »

A simple but very effective NNUE scheme: 2*6*64 = 768 boolean inputs (at most 32 bits active at a time) fully connected to a single hidden layer. The output is a centipawn static evaluation score from the side-to-move perspective, which is simply the sum of the outputs of all activated hidden-layer neurons.

When you implement the simplest form of NNUE chess evaluation, you will understand that any extension or complication has a cost: inference performance, training time, and the volume of training data needed. Even a single hidden layer of 16-32-64 neurons will surpass any hand-crafted evaluation in both performance and quality. (Minifish, with only 64 neurons, beat the full pre-NNUE Stockfish evaluation.)
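
As a rough sketch of how such a scheme could look in code (illustrative only; the clipped-ReLU activation and the integer types here are my assumptions, not a description of any particular engine):

#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int INPUTS = 2 * 6 * 64;   // color * piece type * square = 768 boolean inputs
constexpr int HIDDEN = 64;

int16_t weight[INPUTS][HIDDEN];      // weights of the single fully connected layer
int16_t bias[HIDDEN];

// feature index for a piece of a given color and type on a given square
int feature_index(int color, int piece_type, int square) {
    return color * 6 * 64 + piece_type * 64 + square;
}

// at most 32 inputs are set, so only their weight rows are summed
int evaluate(const std::vector<int>& active_features) {
    int acc[HIDDEN];
    for (int h = 0; h < HIDDEN; ++h) acc[h] = bias[h];
    for (int f : active_features)
        for (int h = 0; h < HIDDEN; ++h)
            acc[h] += weight[f][h];
    int score = 0;
    for (int h = 0; h < HIDDEN; ++h)
        score += std::clamp(acc[h], 0, 255);   // activate each hidden neuron, then sum
    return score;   // interpreted as centipawns from the side to move (after scaling)
}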
Last edited by Aleks Peshkov on Thu Jan 08, 2026 8:02 pm, edited 1 time in total.
syzygy
Posts: 5826
Joined: Tue Feb 28, 2012 11:56 pm

Re: On the design of NNUE and possible ML approaches for evaluation

Post by syzygy »

lir wrote: Thu Jan 08, 2026 2:27 pm As far as I know, NNUE works as follows:
- Features, which are (piece, square, king position) tuples are embedded into a small-ish (i imagine) space R^k; each feature becomes a vector of real numbers basically
Conceptually, feature n being absent or present corresponds to a vector whose n-th element is a 0 or 1.
The backwards pass not only updates the neural net itself, but also the embedding layer, which in the forward pass works just like a lookup table.
Maybe I misunderstand, but I think evaluating a position involves only forward passes.
1.1) Is the training supervised? That is to say, we have a dataset of positions and their scores and just trained on those? That sounds bad in principle, it cannot become better than the dataset allows.
If you train evaluation function N+1 on the position scores obtained by searching the positions to depth d and evaluating them using evaluation function N, then evaluation function N+1 could well be better than evaluation function N.
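Schematically, with eval_N the current network and d a fixed search depth:
label(pos) = search(pos, depth d, using eval_N at the leaves)
eval_{N+1} = fit to minimize the sum over positions of (eval_{N+1}(pos) - label(pos))^2
The labels come from a d-ply look-ahead on top of eval_N, so they carry more information than eval_N alone, which is why the next net can improve on the previous one.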
1.2) I am thinking on the following approach: Let the engine (that uses this evaluation) play a match against itself. We could use a loss that minimizes the difference between the score at move N and at move N + 1 (or 2, 3 i dont know) alongside a loss term encorporating the result of the game. Is something like this, not supervised us?
I believe that is what DeepMind's AlphaGo/AlphaZero Chess etc. do.

There are currently two approaches (that I am aware of):
- huge nets running on GPUs combined with MCTS (AlphaZero Chess, LC0).
- the NNUE approach, with an alpha-beta search as in conventional chess engines from the 1950s to the 2010s, but with the handcrafted evaluation function replaced by a "small" network that can be evaluated quickly using the CPU's SIMD instructions.
lir
Posts: 3
Joined: Wed Jan 07, 2026 3:25 pm
Full name: Lazar Radu

Re: On the design of NNUE and possible ML approaches for evaluation

Post by lir »

Aleks Peshkov wrote: Thu Jan 08, 2026 7:59 pm Simple NNUE but very effective scheme: 2*6*64 = 768 boolean inputs (only max 32 bits active at a time) fully connected to single hidden layer. Output is centipawn static evaluation score from side to move perspective that is simple sum of output of all activated hidden layer neurons.
syzygy wrote: Thu Jan 08, 2026 8:01 pm Conceptually, feature n being absent or present corresponds to a vector whose n-th element is a 0 or 1.
I see, those features are just the inputs themselves; that's where I was wrong.

That works great for performance: the first layer's matrix multiplication can be computed as fast as if the matrix had 32 columns instead of 768. You just ignore the columns corresponding to 0 entries in the input vector.

If N is the number of pieces and H is the size of the hidden layer, there are H*(N-1) additions to compute that. With 512-bit SIMD instructions (provided we use 8-bit integers), that becomes (H/64)*(N-1) instructions. That's awesome.
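
A rough sketch of what one such column addition could look like with AVX-512 intrinsics (keeping the 8-bit integers from the arithmetic above for illustration and ignoring saturation/overflow, which a real implementation has to handle; all names are made up):

#include <immintrin.h>
#include <cstdint>

constexpr int INPUTS = 768;
constexpr int HIDDEN = 256;                     // a multiple of 64 for this sketch

alignas(64) int8_t weight[INPUTS][HIDDEN];      // first-layer weight columns
alignas(64) int8_t acc[HIDDEN];                 // running sum over active features

// adding one active feature's column costs HIDDEN / 64 vector instructions
void add_feature(int feature) {
    for (int i = 0; i < HIDDEN; i += 64) {
        __m512i a = _mm512_load_si512(&acc[i]);
        __m512i w = _mm512_load_si512(&weight[feature][i]);
        _mm512_store_si512(&acc[i], _mm512_add_epi8(a, w));
    }
}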

Alternatively, one might store the output of this first layer and update it incrementally during make_move() and undo_move().

I might be overthinking this before actually writing any code, though.
sscg13
Posts: 11
Joined: Mon Apr 08, 2024 8:57 am
Full name: Chris Bao

Re: On the design of NNUE and possible ML approaches for evaluation

Post by sscg13 »

lir wrote: Thu Jan 08, 2026 2:27 pm Features, which are (piece, square, king position) tuples
These are standard features, but not necessarily the only ones; see https://github.com/official-stockfish/n ... eature-set
lir wrote: Thu Jan 08, 2026 2:27 pm The vectors are summed together to get the input to the neural net (addition is commutative, it does not matter in witch order you specify the pieces)
Since consecutive positions to be evaluated are usually similar to each other, it is faster to compute the difference in features and use add/subtract, instead of adding every feature together.
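
A rough sketch of that add/subtract update (made-up names, one weight row per feature; the integer types are an assumption):

#include <cstdint>

constexpr int HIDDEN = 256;

int16_t weight[768][HIDDEN];       // first-layer weights, one row per feature
int16_t accumulator[HIDDEN];       // stored output of the first layer

// a quiet move removes one feature (the piece on its old square) and adds one
// (the same piece on its new square); a capture additionally removes the
// captured piece's feature
void apply_move(int removed_feature, int added_feature) {
    for (int h = 0; h < HIDDEN; ++h) {
        accumulator[h] -= weight[removed_feature][h];
        accumulator[h] += weight[added_feature][h];
    }
}
// undo_move() does the mirror image: add the removed feature back, subtract the added one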
lir wrote: Thu Jan 08, 2026 2:27 pm is moreso courtesy of the embedding rather than the small MLP
Typically, one considers the first layer (binary features -> accumulator) as part of the neural network itself, as you do not necessarily need further layers (you can apply an activation function directly to the accumulators and then take a linear combination to get an evaluation).
Aleks Peshkov
Posts: 974
Joined: Sun Nov 19, 2006 9:16 pm
Location: Russia
Full name: Aleks Peshkov

Re: On the design of NNUE and possible ML approaches for evaluation

Post by Aleks Peshkov »

lir wrote: Fri Jan 09, 2026 2:17 am That works great for performance, the matrix multiplication of the first layer can be computed as fast as you would if it had 32 columns, not 768. You just ignore the columns corresponding to 0 entries in the input vector.
Large part of the net is not recomputed each time during game. NNUE store the state of the first hidden layer (called accumulator) and need to update (add or substract) only 2-3 changed features (piece from, piece moved to, captured piece) each move. To make it possible actually two accumulator twins are updated: for side to move perspective and other side. And hidden layer is { accumulator[stm], accumulator[ntm] }