Thoughts on "Zero" training of NNUEs

Discussion of chess software programming and technical issues.

Moderators: hgm, chrisw, Rebel

johnhamlen65
Posts: 31
Joined: Fri May 12, 2023 10:15 am
Location: Melton Mowbray, England
Full name: John Hamlen

Thoughts on "Zero" training of NNUEs

Post by johnhamlen65 »

Dear Computer Chess Connoisseurs,

After getting a complete spanking in Santiago de Compostela with my homage to the original Technology chess programs (https://www.chessprogramming.org/Tech), I've returned home licking my wounds, but also inspired to create a new chess engine focused on performance rather than romantic historical accuracy! Of course this now means implementing an NNUE - rather than another material-only - evaluation function. :P

The idea of initially training the network on positions evaluated by Stockfish, LC0, or any other engine (even my own) doesn't inspire me very much. However, the idea of following in AlphaZero's footsteps and starting tabula rasa really does!

I'm fascinated by David Carteau's experiments (memberlist.php?mode=viewprofile&u=7470) and positive results with Orion (https://www.orionchess.com/), but would love some thoughts and advice on how I can best approach this.

Not having 5000 TPUs to work with, I feel that at least two things are essential:
  • As simple a network as possible (768->(32x2)->1 ???)
  • Higher-quality labels, even working with just 0, 0.5, and 1
For the first, does anyone have a feel for when we hit diminishing returns in terms of network size?
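For concreteness, here is a rough sketch of the kind of 768->(32x2)->1 network I have in mind (a minimal sketch in PyTorch; the class name, the 32-neuron hidden size and the clipped-ReLU/sigmoid choices are purely illustrative, not a recommendation):

import torch
import torch.nn as nn

class TinyNNUE(nn.Module):
    # 768 inputs per perspective: 6 piece types x 2 colours x 64 squares.
    def __init__(self, hidden=32):
        super().__init__()
        self.ft = nn.Linear(768, hidden)      # feature transformer, shared by both perspectives
        self.out = nn.Linear(2 * hidden, 1)   # tiny output layer

    def forward(self, white_feats, black_feats, stm):
        # white_feats / black_feats: 0/1 feature vectors of shape (batch, 768)
        # stm: 1 if white is to move, 0 otherwise, shape (batch,)
        w = torch.clamp(self.ft(white_feats), 0.0, 1.0)   # clipped ReLU
        b = torch.clamp(self.ft(black_feats), 0.0, 1.0)
        # put the side to move's accumulator first
        x = torch.where(stm.bool().unsqueeze(1),
                        torch.cat([w, b], dim=1),
                        torch.cat([b, w], dim=1))
        return torch.sigmoid(self.out(x))     # 0..1, matching win/draw/loss-style labels

At play time the two 32-wide accumulators would be updated incrementally as pieces move, and only the tiny output layer recomputed at every node.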

For the second, it occurs to me that, in the extreme, the starting position adds nothing to the training data, whereas positions near the end of the game add a lot, and positions in the middle game sit somewhere in between. For instance, I'd imagine that the labels for most rook-and-pawn endgames could be of very high quality (mostly 0.5).

Has anyone considered "tapering" the contributions of the training positions so that more weight is given to positions near the end of the game, whether that be the start of a stunning, game-winning combination in the middlegame (label = 1 or -1) or a position heading towards an insufficient-material draw in an endgame?
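As a sketch of what I mean by tapering (just an illustration; the linear ramp, the 0.1 floor and the ply_from_end field are arbitrary placeholders):

def sample_weight(ply_from_end, max_ply=80):
    # Positions at the very end of the game get weight 1.0; positions
    # max_ply or more half-moves from the end get a small floor weight.
    w = 1.0 - min(ply_from_end, max_ply) / max_ply
    return max(w, 0.1)

# Used as a per-position factor in the training loss, e.g.
#   loss = sample_weight(pos.ply_from_end) * (net(pos.features) - pos.result) ** 2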

Many thanks for your thoughts :D
John
Engin
Posts: 978
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: Thoughts on "Zero" training of NNUEs

Post by Engin »

johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm [full post quoted above]
Hi John,
Understanding NNUE is very hard. I myself still don't fully understand how such large inputs (2x10x64x64 = 81920 -> 2x256 -> 32 -> 32 -> 1) can work fast enough inside an alpha-beta search. Yes, I know about SIMD, but I don't really get that speed, so I'm trying something different with my own neural network and smaller sizes, like 2x768 inputs (one set per side) and then just 2x8 to 1 output. The network takes a lot of time to learn anything, and bigger input sizes that include the king square slow things down a lot. I did get about a 10x speedup with OpenMP SIMD in the forward function now; let's see.

Of course you can get a lot more speed by quantizing the model from floating point to integer after the training process, as the NNUE engines do. Training directly with integers is not possible :/
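A rough sketch of that post-training quantization for a single layer (Python/NumPy; the scale of 127 and the int16 target are only examples, real engines choose scales per layer to fit their accumulator width):

import numpy as np

def quantize_layer(weights, bias, scale=127):
    # weights, bias: trained floating-point parameters of one linear layer
    qw = np.clip(np.round(weights * scale), -32768, 32767).astype(np.int16)
    qb = np.clip(np.round(bias * scale), -32768, 32767).astype(np.int16)
    return qw, qb

# At inference the integer result is divided (or shifted) back down by the
# same scale before the next layer's activation.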
David Carteau
Posts: 131
Joined: Sat May 24, 2014 9:09 am
Location: France
Full name: David Carteau

Re: Thoughts on "Zero" training of NNUEs

Post by David Carteau »

johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm I'm fascinated by David Carteau's experiments (memberlist.php?mode=viewprofile&u=7470) and positive results with Orion (https://www.orionchess.com/), but would love some thoughts and advice on how I can best approach this.
Hello John,

I'm glad that some people are interested in my experiments! I was myself fascinated by reading the blogs of Jonatan Pettersson (Mediocre) and Thomas Petzke (iCE) a few years ago, and these two guys made me want to develop my own chess engine :)
johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm As simple a network as possible (768->(32x2)->1 ???)
I would say that you can start with a simple network like 2x(768x128)x1: it will give you a decent result and great speed.

But be aware that there's always a trade-off between speed and accuracy: bigger networks will improve accuracy and reduce speed, but at the same time the search will take advantage of this accuracy and you'll be able to prune more reliably, for example.

Point of attention: in my experiments, networks with two perspectives always gave me better results (and training stability), i.e. prefer 2x(768x128)x1 to 768x256x1 (maybe someone else could share their thoughts on that).
johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm Has anyone considered "tapering" the contributions of the training positions so more weight is given to positions near the end of the game
Yes, that's the big difficulty: how to turn such "poor" labels (game results) into valuable information to train a strong network ;)

Connor McMonigle (Seer author) had a really clever idea about this (it's not really "tapering", but the idea remains to start from endgame positions, where labels are more reliable)!

David
Download the Orion chess engine --- Train your NNUE with the Cerebrum library --- Contribute to the Nostradamus experiment !
hgm
Posts: 28205
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Thoughts on "Zero" training of NNUEs

Post by hgm »

I would expect to get better performance from networks of the same size when you adapt the network to the type of information that we know is important in chess. The networks that are now popular use king-piece-square tables, which are very good for determining king safety. In Shogi, from which the NNUE idea was copied, king safety is of course the overwhelmingly dominant evaluation term.

In chess King Safety is also important, but Pawn structure might be even more important. With only king-piece-square inputs in the first layer, recognizing other piece relations (e.g. passers) must be done indirectly, by the deeper layers of the network. That puts a heavier burden there, and thus probably requires a larger network to do it.

So it might be worth including some tables that depend on the relation between Pawns and other pieces, and between Pawns themselves. Like Kings, Pawns are not very mobile either. So when some of the tables are pawn-piece-square rather than king-piece-square, most moves would still only have to update the input they provide to the next layer for the moved piece, rather than recalculate everything because the Pawn moved.
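To make the indexing concrete, here is a sketch of what I mean (the layouts are simplified, not the exact HalfKP scheme, and the pawn-piece-square table is purely hypothetical, just to illustrate the idea):

# Plain piece-square inputs: 12 piece types x 64 squares = 768 features.
def psq_index(piece, square):              # piece in 0..11, square in 0..63
    return piece * 64 + square

# King-piece-square: one 768-feature block per own-king square, so only a
# king move forces a full rebuild of that perspective's accumulator.
def king_psq_index(king_sq, piece, square):
    return king_sq * 768 + piece * 64 + square

# Hypothetical pawn-piece-square table: indexed by the square of one of the
# (at most eight) own pawns instead of the king. Only a move of that pawn
# forces a full rebuild; all other moves are cheap incremental updates.
def pawn_psq_index(pawn_sq, piece, square):
    return pawn_sq * 768 + piece * 64 + square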
CRoberson
Posts: 2080
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Thoughts on "Zero" training of NNUEs

Post by CRoberson »

johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm
...

For the second, it occurs to me that, in the extreme, the starting position adds nothing to the training data, whereas positions near the end of the game add a lot, and positions in the middle game sit somewhere in between. For instance, I'd imagine that the labels for most rook-and-pawn endgames could be of very high quality (mostly 0.5).

Has anyone considered "tapering" the contributions of the training positions so that more weight is given to positions near the end of the game, whether that be the start of a stunning, game-winning combination in the middlegame (label = 1 or -1) or a position heading towards an insufficient-material draw in an endgame?
Hi John,
Great meeting you and Carol in Spain! You are on the right track with that idea. There are papers on TDL (Temporal Difference Learning) from the late 1980s and early 1990s. The ones to read are by Gerald Tesauro of the IBM Thomas J. Watson Research Center, on Backgammon. He has papers on using neural nets in the normal supervised fashion and on TDL; TDL was better. The idea is based on the fact that, at least in long games, the likelihood that the first several moves are the problem is far smaller than for moves closer to the end of the game.
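For the flavour of it, here is a bare-bones sketch of λ-blended targets computed offline from the positions of one game (a simplification; Tesauro's TD(λ) updates the weights online during play, but the effect on the targets is similar):

def td_lambda_targets(evals, result, lam=0.7):
    # evals:  the net's evaluation (0..1) of each position in the game, in order
    # result: the final game result from the same point of view (0, 0.5 or 1)
    # Backward recursion: each target blends the next position's evaluation
    # with the target after it, so positions near the end of the game lean
    # almost entirely on the actual result.
    targets = [0.0] * len(evals)
    target = result
    for i in range(len(evals) - 1, -1, -1):
        targets[i] = target
        target = (1.0 - lam) * evals[i] + lam * target
    return targets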

Regards,
Charles
Engin
Posts: 978
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: Thoughts on "Zero" training of NNUEs

Post by Engin »

David Carteau wrote: Fri Nov 01, 2024 9:45 am [full post quoted above]
If we make large neural networks they are more accurate, but they slow the engine down enormously; the main problem we have is speed. Reducing the number of neurons gives us speed, but then the network doesn't learn anything and isn't accurate for chess.

So what to do? Even 2x(768x256)x1 is extremely slow without any SIMD instructions and quantization. Isn't it enough to use just 2x(768x8)x1?
towforce
Posts: 11988
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK
Full name: . .

Re: Thoughts on "Zero" training of NNUEs

Post by towforce »

I haven't read the paper about Google's chess program that learned chess from zero knowledge by playing against itself, and I don't understand how it was even possible. How did it know it was getting better? Was it allowed to leave itself in check, and hence lose its king?

To undertake such a project, I would start with a simple game, and maybe use TensorFlow. This tutorial may be a bit too remedial, but it shows you how simple it is to work with the JavaScript version of TensorFlow - link.
The simple reveals itself after the complex has been exhausted.
op12no2
Posts: 525
Joined: Tue Feb 04, 2014 12:25 pm
Location: Gower, Wales
Full name: Colin Jenkins

Re: Thoughts on "Zero" training of NNUEs

Post by op12no2 »

Another way to train an NNUE-style eval from zero is to play games with a random initialisation of the weights and use the results to train a new net; rinse, repeat. In the early stages each new net wins 100% of the games when tested against the previous one; it's fascinating to watch. Common practice is to add a data-gen command to the engine, playing the games with a fixed node count rather than on a clock.
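In rough Python pseudocode the loop looks like this (engine_selfplay, train_net and the numbers are placeholders for whatever data-gen command, trainer and budgets you actually use):

def zero_training_loop(engine_selfplay, train_net, initial_net,
                       generations=20, games_per_gen=100_000, nodes_per_move=5_000):
    # engine_selfplay, train_net and initial_net are supplied by the caller:
    # the engine's data-gen command, your trainer, and a randomly initialised net.
    net = initial_net
    for gen in range(generations):
        # 1. Self-play at a fixed node count, recording (position, game result) pairs.
        data = engine_selfplay(net, games=games_per_gen, nodes=nodes_per_move)
        # 2. Train a fresh net on the newly generated data.
        net = train_net(data)
        # 3. Optionally gate the new net against the previous one before keeping it.
    return net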
chrisw
Posts: 4555
Joined: Tue Apr 03, 2012 4:28 pm
Location: Midi-Pyrénées
Full name: Christopher Whittington

Re: Thoughts on "Zero" training of NNUEs

Post by chrisw »

hgm wrote: Fri Nov 01, 2024 11:33 am So it might be worth including some tables that depend on the relation between Pawns and other pieces, and between Pawns themselves. Like Kings, Pawns are not very mobile either. So when some of the tables are pawn-piece-square rather than king-piece-square, most moves would still only have to update the input they provide to the next layer for the moved piece, rather than recalculate everything because the Pawn moved.
But how to bucket the pawns?
towforce
Posts: 11988
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK
Full name: . .

Re: Thoughts on "Zero" training of NNUEs

Post by towforce »

op12no2 wrote: Fri Nov 01, 2024 11:09 pm In the early stages each new net wins 100% of the games when tested against the previous one; it's fascinating to watch.

How do they win? Is the opponent allowed to leave their king in check? Getting a checkmate would seem to require more skill than a net with random weights would have.
The simple reveals itself after the complex has been exhausted.