Thoughts on "Zero" training of NNUEs

johnhamlen65
Posts: 31
Joined: Fri May 12, 2023 10:15 am
Location: Melton Mowbray, England
Full name: John Hamlen

Re: Thoughts on "Zero" training of NNUEs

Post by johnhamlen65 »

Thanks Colin. That diagram and link make it much clearer to me about what is going on.
I'm sure the 2-accumulator system must work better, as it seems to be universally adopted. However, being a bear of very little brain, I still can't see why. It seems to me that, once trained, the weights for the top and bottom halves will be simply (sort of) mirror images of each other, and not bring anything else to the party.

Maybe it is just the advantage of a sort of "enhanced data" for the neural network training, so you get double the training inputs? However, if this were the only advantage, I'd imagine it would be better to simply halve the size of the network, and present the "standard" and "flipped" positions as 2 separate training inputs, the same way they do when training image recognition networks?

How are you getting along with your single-accumulator model?
All the best
John
Engin
Posts: 978
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: Thoughts on "Zero" training of NNUEs

Post by Engin »

johnhamlen65 wrote: Tue Nov 05, 2024 8:47 am
David Carteau wrote: Fri Nov 01, 2024 9:45 am
johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm I'm fascinated by David Carteau's (memberlist.php?mode=viewprofile&u=7470) experiments and positive results with Orion (https://www.orionchess.com/), but would love some thoughts and advice on how I can best approach this.
Hello John,

I'm glad that some people are interested in my experiments! I was myself fascinated by reading the blogs of Jonatan Pettersson (Mediocre) and Thomas Petzke (iCE) a few years ago, and these two guys made me want to develop my own chess engine :)
johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm As simple a network as possible (768->(32x2)->1 ???)
I would say that you can start with a simple network like 2x(768x128)x1: it will give you a decent result and a great speed.

But be aware that there's always a trade-off between speed and accuracy: bigger networks will improve accuracy and reduce speed, but at the same time the search will take advantage of this accuracy and you'll be able to prune more reliably, for example.

Point of attention: in my experiments, networks with two perspectives always gave me better results (and training stability), i.e. prefer 2x(768x128)x1 to 768x256x1 (maybe someone else could share their thoughts on that).
johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm Has anyone considered "tapering" the contributions of the training positions so more weight is given to positions near the end of the game
Yes, that's the big difficulty: how to turn such "poor" labels (game results) into valuable information to train a strong network ;)

Connor McMonigle (Seer author) had a really clever idea about this (it's not really "tapering", but the idea remains to start from endgame positions, where labels are more reliable)!

David
Thanks for all the links and pointers David!

I'm keen to get started on experimenting with training a (very) simple one once I've got my head around things. At the moment I'm still a bit confused about things as basic as how the 2 halves of a 2 x 768 input layer would differ (see my answer to Engin). So maybe I've got rather a long way to go before being "productive"! :D

Thanks for pointing me towards Connor's Seer ideas. I know we're a long way from solving chess - (good!!) - but it reminds me of what Chinook was doing for checkers.

All the best,
John
I am filling in only the half for the side that is on turn; the other half of the 768 inputs I leave at zero, so the net knows which side is active.

The whole input is 2x768, initialised with zeros.

For example: start_index = (turn ? 0 : 768);
Then each piece is set to 1: input[start_index + piece * 64 + square_of_the_piece] = 1;

This is nearly the same as what NNUEs do with king-piece-square inputs, except that I only encode which side is on move, i.e. just 2 halves instead of the 64 squares of the king.

But you could also use only the white perspective with 768 inputs, if the target outcome (0.0, 0.5 or 1.0) is always from White's point of view; for Black you then reverse it.

That is what I am trying at the moment. It's pretty fast :-) but whether it is useful, we will see.
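For illustration, a minimal C++ sketch of this kind of turn-dependent 2x768 encoding (all names, such as encode, are hypothetical; the indexing convention piece*64+square follows the snippet above):

#include <array>

// 2 x 768 inputs: the first 768 slots are used when White is to move,
// the second 768 when Black is to move; the unused half stays zero.
constexpr int INPUTS = 2 * 768;

// piece: 0..11 (6 piece types x 2 colours), square: 0..63
std::array<float, INPUTS> encode(bool white_to_move,
                                 const int pieces[], const int squares[], int count)
{
    std::array<float, INPUTS> input{};          // all zeros
    const int start_index = white_to_move ? 0 : 768;
    for (int i = 0; i < count; ++i)
        input[start_index + pieces[i] * 64 + squares[i]] = 1.0f;
    return input;
}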
Engin
Posts: 978
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: Thoughts on "Zero" training of NNUEs

Post by Engin »

johnhamlen65 wrote: Tue Nov 05, 2024 9:54 am
Engin wrote: Mon Nov 04, 2024 5:22 pm
Does anybody know what the formula for converting winning percentage (WDL) to centipawns, and back again, is?
Hi Engin,
You can reverse the Lichess Accuracy Metric to get this:
centipawns = log((2 / ((win_probability - 0.5) * 2 + 1)) - 1) / -0.00368208
Hope this helps,
John
Ah, that looks pretty good, thanks John!
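For anyone wanting to plug this in directly, a small sketch of both directions of the conversion, assuming the forward direction is the usual Lichess winning-chances curve with the same 0.00368208 constant as in John's formula:

#include <cmath>

// Win probability (0..1) from a centipawn score (Lichess winning-chances curve).
double cp_to_win_probability(double centipawns) {
    return 0.5 + 0.5 * (2.0 / (1.0 + std::exp(-0.00368208 * centipawns)) - 1.0);
}

// Inverse: centipawns from a win probability, i.e. John's formula above.
double win_probability_to_cp(double win_probability) {
    return std::log(2.0 / ((win_probability - 0.5) * 2.0 + 1.0) - 1.0) / -0.00368208;
}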
hgm
Posts: 28216
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Thoughts on "Zero" training of NNUEs

Post by hgm »

johnhamlen65 wrote: Tue Nov 05, 2024 8:32 am Hi Engin,
Great meeting you and playing Tornado in Spain.
"Understanding NNUE is very hard" - Well I can't agree with you more on that one! :D . Just when I thinnk I've got a good grip on things, I realise that I don't!
For instance: I (sort of!) understand why the inputs of the Stockfish NNUE are split into two, because the two halves describe how all the peices on the board relate to 1) the king of the side to move, and 2) the king of the side not to move.
However, if we have 2 x 768, how would the two halves differ? The 768 bits fully describe the position on the board. I could understand 2 x 384 inputs, one for the pieces of side to move, and one for the side not to move.
Or have I got completely the wrong end of the stick? (Almost certainly :D :D ).
Many thanks,
John
If you don't get lost in the details, the basics are actually quite simple:

Evaluating through a single PST does not pay attention to who has the move. Of course you can fix that by awarding a 'side-to-move bonus', but in real life it depends very much on the position how much the right to move first is worth. When both players have a decisive mating attack (as sooner or later is the case), the player not on move really doesn't care much about his own attack, as he will never get to even commence on it before he gets checkmated. His priority is defending against the mating attack the side to move will launch on him. So the optimal placing of his pieces will be entirely different: he will have to group those around his own King, rather than aim them at the opponent's. That is not the same as aiming them at his own King: to effectively defend the pieces should aim at the squares his own King could be checked from. Even when his turn comes up after that, this doesn't change, as he will usually be in check, forcing him to play a defensive move.

So a sensible evaluation needs two different PST, one for the attacker (the side to move), and a different one for the defender. Imagine a chess variant where Kings are not allowed to move, but are fixed on e1/e8. Then you could use ordinary 8x8 PST for awarding good locations for attacking the opponent for the side to move, and awarding good locations to protect your own King for the side not to move. The evaluation from the POV of the side to move would then be the PST sum from the first table for his own pieces, minus the PST sum from the second table for the pieces of his opponent.

This means that the usual incremental update does not work: on the next ply the roles of attacker and defender have switched, and all pieces now use different PST. But after two ply, the same player is on move again. So the evaluation after two ply is very much related to the current one; you just have to update it for the two latest moves. One can thus still use incremental updating of the PST score, but has to keep two incrementalEval variables ('accumulators'). One for use at odd plies, the other at even plies. Each ply you update both in the usual way, except that the second one is updated with the two PST swapped.
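To make the bookkeeping concrete, a toy sketch of that two-accumulator idea with plain PSTs and no neural net (all names hypothetical; captures and promotions omitted for brevity): each move updates both views, with the role of the two tables swapped between them.

// Two tables: one scoring pieces of the side to move ("attacker"),
// one scoring pieces of the side not to move ("defender").
int attackerPST[12][64] = {};   // indexed by piece, square (values to be learned/tuned)
int defenderPST[12][64] = {};

struct Accumulators {
    int whiteToMoveView = 0;   // valid eval (from White's POV) when White is to move
    int blackToMoveView = 0;   // valid eval (from Black's POV) when Black is to move
};

// Incrementally apply a quiet move of 'piece' from 'from' to 'to'.
// Both views are updated every ply; the same piece uses the attacker
// table in one view and the defender table in the other.
void updateOnMove(Accumulators &acc, int piece, int from, int to, bool movingSideIsWhite)
{
    int dAtt = attackerPST[piece][to] - attackerPST[piece][from];
    int dDef = defenderPST[piece][to] - defenderPST[piece][from];
    if (movingSideIsWhite) {
        acc.whiteToMoveView += dAtt;   // own piece, scored as attacker in this view
        acc.blackToMoveView -= dDef;   // same piece is the (subtracted) defender in the other view
    } else {
        acc.whiteToMoveView -= dDef;
        acc.blackToMoveView += dAtt;
    }
}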

That is really all the mystery of the two 'halves', and it does not even involve a Neural Net. (Unless you consider a PST a single-layer NN.) Now in orthodox Chess the Kings of course can move, and do not always stay in the same location. So to make this work you cannot get by with just one pair of PST, but you would need one for each King location. Hence the KPST, which add the King location as an extra dimension to the arrays.

A third ingredient is the introduction of (multi-layer) NN. This is really an independent technique; it could also have been done when using the same PST for both the stm and the opponent. Instead of one PST (or pair of KPST) you use a large number of those (typically 256). And have (2x) 256 different incrementally updated PST sums. These are then fed into a NN, which combines the PST sums in a learned way to make a single score out of them. That is all.

And if the description of a typical game sounds weird and unrealistic to you: of course it would, because it was not a description of a game of orthodox Chess, but of a typical Shogi game. For which NNUE was designed.
chesskobra
Posts: 292
Joined: Thu Jul 21, 2022 12:30 am
Full name: Chesskobra

Re: Thoughts on "Zero" training of NNUEs

Post by chesskobra »

Sorry, I don't follow this fully (but would like to follow one day), so here is a noob question. In some games (especially the so-called impartial games in combinatorial game theory), they use the terms e.g. Alice and Bob to mean that Alice opens the game and Bob responds, and also the terms 'first player' and 'last player' to mean the player whose turn it is to move and the player who made the last move, respectively. This suggests that we could do the same in chess (even though it is not an impartial game) and represent a position by an array of 768 bits always from the first player's perspective (that is, from the pov of the player whose turn it is to move), and when a move is made, we flip the array so as to represent the position from the pov of the player to move. Or we maintain 2 boards, one from each player's perspective, but we could train only one NN and give it as input the board of the 'first player' (player to move), so that the NN will 768 x ... Does this make sense?
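For what it's worth, one way to realise that flip concretely is to swap the colours and mirror the ranks of the 768-entry array. A sketch (the 12x64 indexing convention and the name flip_perspective are made up purely for illustration):

#include <array>

// index = (colour*6 + pieceType)*64 + square, colour 0 = the side to move,
// square 0 = a1 ... 63 = h8 (hypothetical convention).
std::array<float, 768> flip_perspective(const std::array<float, 768> &in)
{
    std::array<float, 768> out{};
    for (int colour = 0; colour < 2; ++colour)
        for (int pieceType = 0; pieceType < 6; ++pieceType)
            for (int sq = 0; sq < 64; ++sq) {
                int from = (colour * 6 + pieceType) * 64 + sq;
                int to   = ((1 - colour) * 6 + pieceType) * 64 + (sq ^ 56); // swap colour, mirror rank
                out[to] = in[from];
            }
    return out;
}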
JacquesRW
Posts: 113
Joined: Sat Jul 30, 2022 12:12 pm
Full name: Jamie Whiting

Re: Thoughts on "Zero" training of NNUEs

Post by JacquesRW »

johnhamlen65 wrote: Tue Nov 05, 2024 5:24 pm Thanks Colin. That diagram and link make it much clearer to me about what is going on.
I'm sure the 2-accumulator system must work better, as it seems to be universally adopted. However, being a bear of very little brain, I still can't see why. It seems to me that, once trained, the weights for the top and bottom halves will be simply (sort of) mirror images of each other, and not bring anything else to the party.
In terms of motivation - if you want STM-relative inputs (which you do, it's worth a hell of a lot of Elo) you'll find yourself needing to keep track of an accumulator from each perspective anyway, so why not use both available accumulators in eval?
It is equivalent to doubling the hidden layer size with a (rather strong) constraint on the second half of the weights in that first layer, namely that they are a vertical mirror of the weights in the first half. This does not mean that the two accumulators will look remotely alike, nor that the weights in the subsequent layers will show the same symmetry.
The question is whether this gains enough accuracy to outweigh the more expensive inference from the two accumulators -> output, rather than just the stm accumulator -> output. The answer to this is a firm yes.
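To make that concrete, a toy sketch of what "using both accumulators in eval" can look like at the output layer: the side-to-move accumulator is paired with the first half of the output weights and the other accumulator with the second half (sizes and names are purely illustrative, and a plain ReLU is assumed as the activation):

#include <algorithm>

constexpr int HIDDEN = 128;

float evaluate(const float stmAcc[HIDDEN], const float nstmAcc[HIDDEN],
               const float outWeights[2 * HIDDEN], float outBias)
{
    float sum = outBias;
    for (int i = 0; i < HIDDEN; ++i) {
        float a = std::max(0.0f, stmAcc[i]);    // ReLU on the side-to-move half
        float b = std::max(0.0f, nstmAcc[i]);   // ReLU on the other perspective's half
        sum += a * outWeights[i] + b * outWeights[HIDDEN + i];
    }
    return sum;                                  // score from the side to move's point of view
}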
hgm wrote: Tue Nov 05, 2024 10:10 pm A third ingredient is the introduction of (multi-layer) NN. This is really an independent technique; it could also have been done when using the same PST for both the stm and the opponent. Instead of one PST (or pair of KPST) you use a large number of those (typically 256).
256 is on the tiny side by today's standards. For SF an L1 size of 256 was only seen in the very first SFNNv1 architecture, and the current size is 3072.
hgm
Posts: 28216
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Thoughts on "Zero" training of NNUEs

Post by hgm »

chesskobra wrote: Wed Nov 06, 2024 12:29 am Sorry, I don't follow this fully (but would like to follow one day), so here is a noob question. In some games (especially the so called impartial games in combinatorial game theory), they use the terms e.g. Alice and Bob to mean that Alice opens the game and Bob responds, and also the terms 'first player' and 'last player' to mean the player whose turn it is to move and the player who made the last move, respectively. This suggests that we could do the same in chess (even though it is not an impartial game) and represent a position by an array of 768 bits always from the first players perspective (that is from the pov of the player whose turn it is to move), and when a move is made, we flip the array so as to represent the position from the pov of the player to move. Or we maintain 2 boards, one from each player's perspective, but we could train only one NN and give it as input the board of the 'first player' (player to move), so that the NN will 768 x ... Does this make sense?
I think this amounts to the same as what I have been saying. The crux is that evaluating the position from the perspective of the first player uses a different PST for his own pieces than for those of the second player. So color-reversing the board would not just flip the sign of the eval, as it would in conventional chess evaluations, where both use the same PST.
hgm
Posts: 28216
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Thoughts on "Zero" training of NNUEs

Post by hgm »

JacquesRW wrote: Wed Nov 06, 2024 3:34 am256 is on the tiny side by today's standards. For SF an L1 size of 256 was only seen in the very first SFNNv1 architecture, and the current size is 3072.
Having a table indexed by the location of two pieces (such as a King and a non-royal piece) is very general; it can accommodate a completely different scoring for a King on e4 and one on d4. The problem is that in practice such a scoring is probably not what you finally want: it stands to reason that the most important aspect of the King dependence of piece placement is the relative position, pieceSqr - kingSqr. The ability to express anything comes at the expense of having to learn that a particular aspect of the evaluation is translation invariant. OTOH, if the translation invariance were enforced by the network structure, the network would not have to learn the same thing in many independent tables, and would have a much larger ability to generalize patterns it recognizes in the training examples to other board locations.

So I would expect that NNUEs which replace at least part of the (8x8)x(8x8) KPST tables by 15x15 tables indexed by the square-number difference (in 0x88 numbering) would lead to more easily trainable networks; the 15x15 relative tables would pick up terms that depend purely on relative positioning, rather than on proximity to the edges, much more quickly.

This could also mean you can do with fewer tables. Training an NN starting from randomized weights can only be successful if some of the initial random patterns correlate by pure chance with what the NN needs to learn. To make it sufficiently likely that there is a table that initially correlates with the desired pattern for every King location requires many more tables than to make it similarly likely that a single set of position-relative tables contains such a table.
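A sketch of how such a relative index could be computed from the 0x88 square difference (purely illustrative; sq = 16*rank + file):

// The difference is offset by 0x77 so it is always non-negative; the file
// and rank offsets each run 0..14, giving an index into a 15*15 = 225 entry table.
int relative_index(int pieceSq0x88, int kingSq0x88)
{
    int diff = pieceSq0x88 - kingSq0x88 + 0x77;   // 0 .. 0xEE, no carry between nibbles
    int fileOffset = diff & 15;                   // 0 .. 14
    int rankOffset = diff >> 4;                   // 0 .. 14
    return rankOffset * 15 + fileOffset;
}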
op12no2
Posts: 525
Joined: Tue Feb 04, 2014 12:25 pm
Location: Gower, Wales
Full name: Colin Jenkins

Re: Thoughts on "Zero" training of NNUEs

Post by op12no2 »

johnhamlen65 wrote: Tue Nov 05, 2024 5:24 pm How are you getting along with your single-accumulator model?
All the best
John
Hi John,

It's a lot of fun and I'm learning a lot. As a Javascript engine, performance is key, and I have been surprised to discover that 128 is viable as a hidden layer size; I'm training 256 as I type. There is lots more to experiment with, like multiple accumulators, quantisation, output buckets, alternative activations, deferred UE, more training data, different optimisers, different sigmoid scaling, etc. I find it far more fun than playing around with an HCE, and it's relatively straightforward to get a basic implementation to work from; you don't need UE to get going, for example.

https://github.com/op12no2/lozza
johnhamlen65
Posts: 31
Joined: Fri May 12, 2023 10:15 am
Location: Melton Mowbray, England
Full name: John Hamlen

Re: Thoughts on "Zero" training of NNUEs

Post by johnhamlen65 »

chesskobra wrote: Wed Nov 06, 2024 12:29 am Sorry, I don't follow this fully (but would like to follow one day), so here is a noob question. In some games (especially the so called impartial games in combinatorial game theory), they use the terms e.g. Alice and Bob to mean that Alice opens the game and Bob responds, and also the terms 'first player' and 'last player' to mean the player whose turn it is to move and the player who made the last move, respectively. This suggests that we could do the same in chess (even though it is not an impartial game) and represent a position by an array of 768 bits always from the first players perspective (that is from the pov of the player whose turn it is to move), and when a move is made, we flip the array so as to represent the position from the pov of the player to move. Or we maintain 2 boards, one from each player's perspective, but we could train only one NN and give it as input the board of the 'first player' (player to move), so that the NN will 768 x ... Does this make sense?
Thanks H.G. for taking the time to document such a comprehensive explanation of what is going on. I feel that your post will be referenced and be of help to future chess programmers.

However, this means that I'm a little embarrassed to have still not "grasped it". (As I said, I'm a bear of very little brain sometimes! :D ).

I get the part about the attacking and defending players in a position having two separate perspectives, but don't understand how this leads to the need for two separate nets or PSQTs. Chess being a zero-sum game, I would expect that a good placement of defending pieces will be exactly as bad for the attacker as it is good for the defender, and vice versa for the attacking piece placement.

Having said that, I do get that being the side to move is (usually) an advantage. (In the past (with Woodpusher) I added some unscientific bonus for this into the evaluation, but never quantified how much difference it made). With NNUE - and I think this is what Chesskobra is suggesting(?) - can we "just" train the network so the evaluation (E) of every position is from the perspective of the STM (or White), and the evaluation from the perspective of the other side (or Black) is always -E?

Of course we'd need to propagate the STM signal through the network layers by using a sort of ResNet architecture, but this seems like a small price to pay to halve the number of inputs (and with them the number of first-layer weights) to train! I'm sure I must have missed something somewhere?!?
Like Chesskobra, I would like to follow one day! :D
Thanks again everybody,
John

[Image - credit: cv-tricks.com, "ResNet, AlexNet, VGGNet, Inception: Understanding various architectures of Convolutional Networks"]