HalfKP Structure in NNUE

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

ndbd
Posts: 9
Joined: Mon Jan 11, 2021 11:15 am
Full name: Roger S.

HalfKP Structure in NNUE

Post by ndbd »

Apologies for cross-posting on different sites. I've already posted this question on chess.stackexchange, but it seems it does not reach the proper audience there (few views, no replies despite bounty).

After some more googling I found this board with lots of discussion relevant to this topic, so there is apparently some expertise. Here are my questions:



Stockfish now uses a neural network for position evaluation. The structure of this neural network was introduced for Shogi by Yu Nasu in the paper dubbed "Efficiently updatable neural network - NNUE". Unfortunately it is in Japanese.

It is a fully connected neural network with four layers. The first layer receives an encoded chess position. The second and third are hidden layers, and then there is a final output layer that gives the evaluation.

The input layer receives a chess position encoded as a "HalfKP" structure, and this is where I am completely lost. A picture is available here.

The encoding seems to work like this. Suppose we have the starting position. We first consider only the white pieces. We encode the position by considering the square of the white king together with each (white non-king piece, square) combination. Each such combination is an input whose value is either '1' or '0'.

Consider the starting position. We have

King on a1, Pawn on a1 -> 0
King on a1, Pawn on a2 -> 0
...
King on e1, Pawn on a2 -> 1
...
King on e1, Pawn on e2 -> 1
...

and so on. This input is fully connected to the next layer. Then we do the same for the black pieces.
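
To make this concrete, here is a tiny sketch of how I imagine such a feature index would be computed (the names, piece-type count and ordering are my own guess, only for illustration, not necessarily the exact dimensions Stockfish uses):

// Sketch of a HalfKP-style feature index as I understand it (made-up names).
// 64 king squares x 10 non-king piece types x 64 piece squares
// = 40960 binary inputs per half in this illustration.
int halfkpIndex(int kingSquare,   // 0..63
                int pieceType,    // 0..9, non-king piece incl. colour
                int pieceSquare)  // 0..63
{
    return (kingSquare * 10 + pieceType) * 64 + pieceSquare;
}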

Question 1.) Is my understanding correct?

Suppose we have the move e2e4. There are only two inputs which change:

King on e1, Pawn on e2 changes from 1 to 0
King on e1, Pawn on e4 changes from 0 to 1

There is apparently some efficiency gain here claimed in the original paper, but I do not see where. After all, the two nodes above are connected to every node in the second layer, so we have to update all second layer nodes.

Question 2.) Why is this efficient?

A much more straightforward encoding would be to simply use bits to indicate the positions of the pieces, similar to bitboards.

King on a1 -> 0
...
King on e1 -> 1
...
Pawn on e2 -> 1
Pawn on e4 -> 0

If we use such an input encoding, we have fewer input nodes. Also, updating for the move e2e4 still changes only two input nodes. Such an input seems to be common in other approaches, like AlphaZero and Lc0.
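
A sketch of the index for this simpler encoding (again just my own illustration):

// Plain piece-square encoding: 12 piece types (6 per colour) x 64 squares
// = 768 binary inputs for the whole board. Names are made up.
int pieceSquareIndex(int pieceType,  // 0..11, piece type incl. colour
                     int square)     // 0..63
{
    return pieceType * 64 + square;
}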

Question 3.) Why does one not use such a much easier encoding? Why do we use combinations of King + Piece? What do we gain here?

Question 4.) The first layer uses 16 bit integers, the next layers 8 bit integers. What is the reason for this choice? Obviously we need to limit the range to operate with fixed-point arithmetic due to memory constraints, but why 16 and 8, and not 16 and 16, or 32bit and 32bit?

The first half of the input layer (for white) is fully connected to the first 256 nodes of the second layer; the second half of the input (for black) is fully connected to the latter 256 nodes of the second layer.

Question 5.) Why do we consider the white and black pieces separately? What is the benefit? What is the so-called (full) KP, and what is the relation to HalfKP?

Question 6.) Is it possible to illustrate the update and efficiency gain with the above e2e4 example?
BrianNeal
Posts: 8
Joined: Sat Dec 26, 2020 5:58 pm
Full name: Brian Neal

Re: HalfKP Structure in NNUE

Post by BrianNeal »

ndbd wrote: Tue Jan 12, 2021 9:45 am Stockfish now uses a neural network for position evaluation. The structure of this neural network was introduced for Shogi by Yu Nasu in the paper dubbed "Efficiently updatable neural network - NNUE". Unfortunately it is in Japanese.
dkl made translations in English and German: http://talkchess.com/forum3/viewtopic.php?f=2&t=76250
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: HalfKP Structure in NNUE

Post by Sesse »

ndbd wrote: Tue Jan 12, 2021 9:45 am Question 1.) Is my understanding correct?
Seems correct to me.
Suppose we have the move e2e4. There are only two inputs which change:

King on e1, Pawn on e2 changes from 1 to 0
King on e1, Pawn on e4 changes from 0 to 1

There is apparently some efficiency gain here claimed in the original paper, but I do not see where. After all, the two nodes above are connected to every node in the second layer, so we have to update all second layer nodes.

Question 2.) Why is this efficient?
Yes, the second layer needs to be updated. But it's small compared to the first layer, and the first layer can be updated incrementally. All the gains are in the first layer.
Question 3.) Why does one not use such a much easier encoding? Why do we use combinations of King + Piece? What do we gain here?
It's a bit of black magic. :-) The idea is: Where can we get the most bang per buck? It turns out that due to the incremental updates, the first layer is much cheaper to evaluate than the other layers. Having a large first layer gives us some gain and costs us little; having a large second layer gives us perhaps slightly more gain, but costs much much more! Obviously, we want to make the first layer larger but keep a smaller second layer. Thus the over-parametrization, which gives us a clever way of getting more input features.

In most traditional neural networks, you should never do this; you should just let the layers figure out any over-parametrization themselves. But in NNUE, it's a win since we need that richness as soon as possible.
Question 4.) The first layer uses 16 bit integers, the next layers 8 bit integers. What is the reason for this choice? Obviously we need to limit the range to operate with fixed-point arithmetic due to memory constraints, but why 16 and 8, and not 16 and 16, or 32bit and 32bit?
I don't know the details, but it's not only about memory constraints; it's also about SIMD efficiency. With AVX2, you have 256-bit registers; those can do 32 parallel adds on 8-bit values or 16 on 16-bit values. Obviously, if your NN works well enough with 8-bit values, that means more speed (which maybe you can trade for more weights).
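
As a rough illustration (not Stockfish's actual code, just the lane counts):

#include <immintrin.h>

// One 256-bit AVX2 register holds 32 x int8 or 16 x int16, so a single
// add instruction processes twice as many 8-bit values as 16-bit ones.
__m256i add32x8 (__m256i a, __m256i b) { return _mm256_add_epi8 (a, b); }  // 32 adds
__m256i add16x16(__m256i a, __m256i b) { return _mm256_add_epi16(a, b); }  // 16 adds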
User avatar
Fabio Gobbato
Posts: 217
Joined: Fri Apr 11, 2014 10:45 am
Full name: Fabio Gobbato

Re: HalfKP Structure in NNUE

Post by Fabio Gobbato »

Here you can find a translation of the original paper:
http://talkchess.com/forum3/viewtopic.php?f=2&t=76250
And here an explanation of how the neural network in Stockfish works:
https://www.chessprogramming.org/Stockfish_NNUE

Question 1: You are correct
Question 2: It's efficient because you have already computed the first layer for the position before that move, and you only have to add and subtract the weights affected by the move.
For example from the starting position you have:
Ke1 Pa2
Ke1 Pb2
...
with a total of 30 weights
Now, since you have already computed the first layer, for the move e2e4 you only have to add "Ke1 Pe4" and subtract "Ke1 Pe2", only 2 operations. If you had not previously computed the position, you would have to do:
Ke1 Pa2
Ke1 Pb2
...
Ke1 Pe4
...
with a total of 30 weights
30 operations for the full calculation, compared to only 2 operations with the incremental update.
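
In code it looks something like this (only a sketch with made-up names and sizes, just to show the 30-vs-2 difference):

#include <cstdint>

// weights[f][n] is the first-layer weight column for input feature f
// (e.g. "Ke1 Pe2"); acc is the accumulated first-layer output.

void refreshAccumulator(int16_t acc[256],
                        const int16_t weights[40960][256],
                        const int activeFeatures[], int count)  // ~30 in the opening
{
    for (int n = 0; n < 256; n++) acc[n] = 0;
    for (int i = 0; i < count; i++)          // 30 columns to add from scratch
        for (int n = 0; n < 256; n++)
            acc[n] += weights[activeFeatures[i]][n];
}

void updateAccumulator(int16_t acc[256],
                       const int16_t weights[40960][256],
                       int removedFeature,   // "Ke1 Pe2"
                       int addedFeature)     // "Ke1 Pe4"
{
    for (int n = 0; n < 256; n++)            // only 2 columns touched
        acc[n] += weights[addedFeature][n] - weights[removedFeature][n];
}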

Question 3: I don't understand what kind of encoding you want to use, since to evaluate the board you have to know all the piece positions anyway.
Question 4: The first layer uses int16 for summing the weights; int8 is used for faster operations: with AVX2 you can do 32 operations per instruction, while with int16 you can do only 16 operations per instruction.
Question 5: When you move a king you only have to recompute half of the first layer, not the whole layer. It's one possible implementation; you could use different ones and maybe get similar results.
Question 6: I have already given an example under Question 2.
AndrewGrant
Posts: 1750
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: HalfKP Structure in NNUE

Post by AndrewGrant »

ndbd wrote: Tue Jan 12, 2021 9:45 am Apologies for cross-posting on different sites. I've already posted this question on chess.stackexchange, but it seems it does not reach the proper audience there (few views, no replies despite bounty).
I took a stab at answering each of your questions:
https://chess.stackexchange.com/questio ... 3736#33736

I'll note that I have not read the Shogi NNUE paper, but I have implemented the NNUE HalfKP paradigm myself, and tackled all of the questions you posed during that process. Particularly, the questions about how the data is represented in memory, which is of great importance and likely understood by very few people at this time.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: HalfKP Structure in NNUE

Post by Rein Halbersma »

AndrewGrant wrote: Tue Jan 12, 2021 11:55 am I took a stab at answering each of your questions:
https://chess.stackexchange.com/questio ... 3736#33736
Upvoted! :-)
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: HalfKP Structure in NNUE

Post by hgm »

ndbd wrote: Tue Jan 12, 2021 9:45 amQuestion 2.) Why is this efficient?
'Efficient' is a relative notion. I suppose it means here that the first layer can be updated incrementally, and since the inputs are 0 or 1, no multiplication is needed. And even the incremental update is relatively cheap, because on a typical move only two inputs change. Only on King moves (which are much rarer) do the inputs for all other pieces change.

And it is not about updating the inputs, but about calculating the inputs to the next layer. Bitboards would be no help there; each bit in the bitboard would have to be multiplied by its own weight. In fact the actual 0/1 inputs only exist conceptually; they are implied by the normal board representation used by the search. When you apply a move to that, you directly apply the change to the inputs of the cells in the next layer of the NN, by adding and subtracting the weights there, before 'running' the rest of the NN. Typically something like:

for (int n = 0; n < 256; n++)
    input[n] += KPST[n][promotedMovedPiece][kingSqr][toSqr]
              - KPST[n][movedPiece][kingSqr][fromSqr]
              - KPST[n][victim][kingSqr][captSqr];
Question 3.) Why does one not use such a much easier encoding? Why do we use combinations of King + Piece? What do we gain here?
This is basically a Shogi thing. Chess engines typically use Piece-Square tables PST[pieceType][sqr] to judge how well the pieces are placed. In Shogi it is more important how pieces are placed relative to the Kings, as most pieces there only move at crawling speed, and the game is a race to mate. Slow pieces far from the King have almost no value at all. So it is more natural to judge the placement of pieces by a table KPST[pieceType][kingSqr][sqr], which can emulate a purely relative scoring DPST[pieceType][sqr - kingSqr].
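
In code terms (just a sketch for a single piece type, with made-up names): a KPST filled from a purely relative table scores exactly the same:

// A KPST built from a king-relative table DPST gives the same score for
// every combination of king square and piece square.
void buildKPSTfromDPST(int kpst[64][64], const int dpst[127])
{
    for (int kingSqr = 0; kingSqr < 64; kingSqr++)
        for (int sqr = 0; sqr < 64; sqr++)
            kpst[kingSqr][sqr] = dpst[sqr - kingSqr + 63]; // offset keeps the index >= 0
}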

For chess this king-relative scoring also appears to work, or at least not hurt; the extra work involved is limited, because King moves are rare.

The NNUE innovation is basically to not use a single PST or KPST, but a large number, and feed all those scores to a relatively small NN that forges a single score out of it. Note that a PST or KPST in principle can also be used to calculate things like game phase, which can be used by the NN non-linearly to select or interpolate between other KPST scores.
Question 4.) The first layer uses 16 bit integers, the next layers 8 bit integers. What is the reason for this choice? Obviously we need to limit the range to operate with fixed-point arithmetic due to memory constraints, but why 16 and 8, and not 16 and 16, or 32bit and 32bit?
I suppose that it was empirically established how many bits are enough. You would want to use the smallest number of bits that still suffices, to speed up the calculation by increasing the parallelism with SIMD instructions.
Question 5.) Why do we consider the white and black pieces separately? What is the benefit? What is the so-called (full) KP, and what is the relation to HalfKP?
The distance of a piece to its own King and to the opponent's King are fundamentally different quantities, which need different scoring. The distance between the white King and a black piece, and that between the black King and a white piece, are quantities related by reflection and color reversal, though. So the whole network consists of two parts that have the same topology and weights, but use color-reversed inputs.
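
In code the color reversal amounts to something like this (a sketch; whether implementations mirror or rotate the board differs, and the piece encoding here is assumed):

// Each half sees the board from its own side, so the same weights can be
// shared between the two halves.
int orientSquare(bool blackPerspective, int sqr)
{
    return blackPerspective ? (sqr ^ 56) : sqr;  // vertical flip; some nets rotate (sqr ^ 63)
}

int orientPiece(bool blackPerspective, int pieceType)  // assumes colour in bit 0
{
    return blackPerspective ? (pieceType ^ 1) : pieceType;
}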
ndbd
Posts: 9
Joined: Mon Jan 11, 2021 11:15 am
Full name: Roger S.

Re: HalfKP Structure in NNUE

Post by ndbd »

Wow, I did not expect such quick and insightful replies, thanks to each of you.

In particular I was not aware what is important in a Shogi position.

I think all my questions are answered, but of course I need some time to digest :)