Thoughts on "Zero" training of NNUEs

Discussion of chess software programming and technical issues.

Moderators: hgm, chrisw, Rebel

johnhamlen65
Posts: 31
Joined: Fri May 12, 2023 10:15 am
Location: Melton Mowbray, England
Full name: John Hamlen

Thoughts on "Zero" training of NNUEs

Post by johnhamlen65 »

Dear Computer Chess Connoisseurs,

After getting a complete spanking in Santiago de Compostela with my homage to the original Technology chess programs (https://www.chessprogramming.org/Tech), I've returned home licking my wounds, but also inspired to create a new chess engine focused on performance rather than romantic historical accuracy! Of course this now means implementing an NNUE - rather than another material-only - evaluation function. :P

The idea of initially training the network on positions evaluated by Stockfish, LC0, or any other engine (even my own) doesn't inspire me very much. However, the idea of following in AlphaZero's footsteps and starting tabula rasa really does!

I'm fascinated by David Carteau's experiments (memberlist.php?mode=viewprofile&u=7470) and positive results with Orion (https://www.orionchess.com/), but would love some thoughts and advice on how I can best approach this.

Not having 5000 TPUs to work with, I feel that at least two things are essential:
  • As simple a network as possible (768->(32x2)->1 ???)
  • Higher-quality labels, even working with just 0, 0.5, and 1
For the first, does anyone have a feel for when we hit diminishing returns in terms of network size?
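For concreteness, here is a rough sketch of the kind of 768->(32x2)->1 network I have in mind (a minimal sketch in PyTorch; the class name, the 32-neuron hidden size and the clipped-ReLU/sigmoid choices are purely illustrative, not a recommendation):

import torch
import torch.nn as nn

class TinyNNUE(nn.Module):
    # 768 inputs per perspective: 6 piece types x 2 colours x 64 squares.
    def __init__(self, hidden=32):
        super().__init__()
        self.ft = nn.Linear(768, hidden)      # feature transformer, shared by both perspectives
        self.out = nn.Linear(2 * hidden, 1)   # tiny output layer

    def forward(self, white_feats, black_feats, stm):
        # white_feats / black_feats: 0/1 feature vectors of shape (batch, 768)
        # stm: 1 if white is to move, 0 otherwise, shape (batch,)
        w = torch.clamp(self.ft(white_feats), 0.0, 1.0)   # clipped ReLU
        b = torch.clamp(self.ft(black_feats), 0.0, 1.0)
        # put the side to move's accumulator first
        x = torch.where(stm.bool().unsqueeze(1),
                        torch.cat([w, b], dim=1),
                        torch.cat([b, w], dim=1))
        return torch.sigmoid(self.out(x))     # 0..1, matching win/draw/loss-style labels

At play time the two 32-wide accumulators would be updated incrementally as pieces move, and only the tiny output layer recomputed at every node.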

For the second, it occurs to me that, in the extreme, the starting position adds nothing to the training data, whereas positions near the end of the game add a lot, and positions in the middle game sit somewhere in between. For instance, I'd imagine that the labels for most rook-and-pawn endgames could be of very high quality (mostly 0.5).

Has anyone considered "tapering" the contributions of the training positions so that more weight is given to positions near the end of the game, whether that be the start of a stunning, game-winning combination in the middlegame (label = 1 or -1) or a position heading towards an insufficient-material draw in an endgame?
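As a sketch of what I mean by tapering (just an illustration; the linear ramp, the 0.1 floor and the ply_from_end field are arbitrary placeholders):

def sample_weight(ply_from_end, max_ply=80):
    # Positions at the very end of the game get weight 1.0; positions
    # max_ply or more half-moves from the end get a small floor weight.
    w = 1.0 - min(ply_from_end, max_ply) / max_ply
    return max(w, 0.1)

# Used as a per-position factor in the training loss, e.g.
#   loss = sample_weight(pos.ply_from_end) * (net(pos.features) - pos.result) ** 2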

Many thanks for your thoughts :D
John
Engin
Posts: 978
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: Thoughts on "Zero" training of NNUEs

Post by Engin »

johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm [full post quoted above]
Hi John,
Understanding NNUE is very hard. I myself still don't fully understand how such large inputs (2x10x64x64 = 81920 -> 2x256 -> 32 -> 32 -> 1) can work fast enough inside an alpha-beta search. Yes, I know about SIMD, but I don't really get that speed, so I'm trying something different with my own neural network and smaller sizes, like 2x768 inputs (one set per side) and then just 2x8 to 1 output. The network takes a lot of time to learn anything, and bigger input sizes that include the king square slow things down a lot. I did get about a 10x speedup with OpenMP SIMD in the forward function now; let's see.

Of course you can get a lot more speed by quantizing the model from floating point to integer after the training process, as the NNUE engines do. Training directly with integers is not possible :/
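A rough sketch of that post-training quantization for a single layer (Python/NumPy; the scale of 127 and the int16 target are only examples, real engines choose scales per layer to fit their accumulator width):

import numpy as np

def quantize_layer(weights, bias, scale=127):
    # weights, bias: trained floating-point parameters of one linear layer
    qw = np.clip(np.round(weights * scale), -32768, 32767).astype(np.int16)
    qb = np.clip(np.round(bias * scale), -32768, 32767).astype(np.int16)
    return qw, qb

# At inference the integer result is divided (or shifted) back down by the
# same scale before the next layer's activation.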
David Carteau
Posts: 131
Joined: Sat May 24, 2014 9:09 am
Location: France
Full name: David Carteau

Re: Thoughts on "Zero" training of NNUEs

Post by David Carteau »

johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm I'm fascinated by David Carteau's experiments (memberlist.php?mode=viewprofile&u=7470) and positive results with Orion (https://www.orionchess.com/), but would love some thoughts and advice on how I can best approach this.
Hello John,

I'm glad that some people are interested in my experiments! I was myself fascinated by reading the blogs of Jonatan Pettersson (Mediocre) and Thomas Petzke (iCE) a few years ago, and these two guys made me want to develop my own chess engine :)
johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm As simple a network as possible (768->(32x2)->1 ???)
I would say that you can start with a simple network like 2x(768x128)x1: it will give you a decent result and great speed.

But be aware that there's always a trade-off between speed and accuracy: bigger networks will improve accuracy and reduce speed, but at the same time the search will take advantage of this accuracy and you'll be able to prune more reliably, for example.

Point of attention: in my experiments, networks with two perspectives always gave me better results (and training stability), i.e. prefer 2x(768x128)x1 to 768x256x1 (maybe someone else could share their thoughts on that).
johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm Has anyone considered "tapering" the contributions of the training positions so more weight is given to positions near the end of the game
Yes, that's the big difficulty: how to turn such "poor" labels (game results) into valuable information to train a strong network ;)

Connor McMonigle (Seer author) had a really clever idea about this (it's not really "tapering", but the idea remains to start from endgame positions, where labels are more reliable)!

David
Download the Orion chess engine --- Train your NNUE with the Cerebrum library --- Contribute to the Nostradamus experiment !
hgm
Posts: 28205
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Thoughts on "Zero" training of NNUEs

Post by hgm »

I would expect to get better performance from networks of the same size when you adapt the network to the type of information that we know is important in chess. The networks that are now popular use king-piece-square tables, which are very good for determining king safety. In Shogi, from which the NNUE idea was copied, king safety is of course the overwhelmingly dominant evaluation term.

In chess King Safety is also important, but Pawn structure might be even more important. With only king-piece-square inputs in the first layer, recognizing other piece relations (e.g. passers) must be done indirectly, by the deeper layers of the network. That puts a heavier burden there, and thus probably requires a larger network to do it.

So it might be worth including some tables that depend on the relation between Pawns and other pieces, and between Pawns themselves. Like Kings, Pawns are not very mobile either. So when some of the tables are pawn-piece-square rather than king-piece-square, most moves would still only have to update the input they provide to the next layer for the moved piece, rather than recalculate everything because the Pawn moved.
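To make the indexing concrete, here is a sketch of what I mean (the layouts are simplified, not the exact HalfKP scheme, and the pawn-piece-square table is purely hypothetical, just to illustrate the idea):

# Plain piece-square inputs: 12 piece types x 64 squares = 768 features.
def psq_index(piece, square):              # piece in 0..11, square in 0..63
    return piece * 64 + square

# King-piece-square: one 768-feature block per own-king square, so only a
# king move forces a full rebuild of that perspective's accumulator.
def king_psq_index(king_sq, piece, square):
    return king_sq * 768 + piece * 64 + square

# Hypothetical pawn-piece-square table: indexed by the square of one of the
# (at most eight) own pawns instead of the king. Only a move of that pawn
# forces a full rebuild; all other moves are cheap incremental updates.
def pawn_psq_index(pawn_sq, piece, square):
    return pawn_sq * 768 + piece * 64 + square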
CRoberson
Posts: 2080
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Thoughts on "Zero" training of NNUEs

Post by CRoberson »

johnhamlen65 wrote: Thu Oct 31, 2024 6:22 pm
...

For the second, it occurs to me that, in the extreme, the starting position adds nothing to the training data, whereas positions near the end of the game add a lot, and positions in the middle game sit somewhere in between. For instance, I'd imagine that the labels for most rook-and-pawn endgames could be of very high quality (mostly 0.5).

Has anyone considered "tapering" the contributions of the training positions so that more weight is given to positions near the end of the game, whether that be the start of a stunning, game-winning combination in the middlegame (label = 1 or -1) or a position heading towards an insufficient-material draw in an endgame?
Hi John,
Great meeting you and Carol in Spain! You are on the right track with that idea. There are papers on TDL (Temporal Difference Learning) from the late 1980s and early 1990s. The ones to read are by Gerald Tesauro of the IBM Thomas J. Watson Research Center, on Backgammon. He has papers on using neural nets in the normal supervised fashion and on TDL; TDL was better. The idea is based on the fact that, at least in long games, the likelihood that the first several moves are the problem is far smaller than for moves closer to the end of the game.
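For the flavour of it, here is a bare-bones sketch of λ-blended targets computed offline from the positions of one game (a simplification; Tesauro's TD(λ) updates the weights online during play, but the effect on the targets is similar):

def td_lambda_targets(evals, result, lam=0.7):
    # evals:  the net's evaluation (0..1) of each position in the game, in order
    # result: the final game result from the same point of view (0, 0.5 or 1)
    # Backward recursion: each target blends the next position's evaluation
    # with the target after it, so positions near the end of the game lean
    # almost entirely on the actual result.
    targets = [0.0] * len(evals)
    target = result
    for i in range(len(evals) - 1, -1, -1):
        targets[i] = target
        target = (1.0 - lam) * evals[i] + lam * target
    return targets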

Regards,
Charles
Engin
Posts: 978
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: Thoughts on "Zero" training of NNUEs

Post by Engin »

David Carteau wrote: Fri Nov 01, 2024 9:45 am [full post quoted above]
If we make large neural networks they are more accurate, but they slow the engine down enormously; the main problem we have is speed. Reducing the number of neurons gives us speed, but then the network doesn't learn anything and isn't accurate for chess.

So what to do? Even 2x(768x256)x1 is extremely slow without any SIMD instructions and quantization. Isn't it enough to use just 2x(768x8)x1?
towforce
Posts: 11988
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK
Full name: . .

Re: Thoughts on "Zero" training of NNUEs

Post by towforce »

I haven't read the paper about Google's chess program that learned chess from zero knowledge by playing against itself, and I don't understand how it was even possible. How did it know it was getting better? Was it allowed to leave itself in check, and hence lose its king?

To undertake such a project, I would start with a simple game, and maybe use TensorFlow. This tutorial may be a bit too remedial, but it shows you how simple it is to work with the JavaScript version of TensorFlow - link.
The simple reveals itself after the complex has been exhausted.
op12no2
Posts: 525
Joined: Tue Feb 04, 2014 12:25 pm
Location: Gower, Wales
Full name: Colin Jenkins

Re: Thoughts on "Zero" training of NNUEs

Post by op12no2 »

Another way to train an NNUE-style eval from zero is to play games with a random initialisation of the weights and use the results to train a new net; rinse, repeat. In the early stages each new net wins 100% of the games when tested against the previous one; it's fascinating to watch. Common practice is to add a data-gen command to the engine, playing the games with a fixed node count rather than on a clock.
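In rough Python pseudocode the loop looks like this (engine_selfplay, train_net and the numbers are placeholders for whatever data-gen command, trainer and budgets you actually use):

def zero_training_loop(engine_selfplay, train_net, initial_net,
                       generations=20, games_per_gen=100_000, nodes_per_move=5_000):
    # engine_selfplay, train_net and initial_net are supplied by the caller:
    # the engine's data-gen command, your trainer, and a randomly initialised net.
    net = initial_net
    for gen in range(generations):
        # 1. Self-play at a fixed node count, recording (position, game result) pairs.
        data = engine_selfplay(net, games=games_per_gen, nodes=nodes_per_move)
        # 2. Train a fresh net on the newly generated data.
        net = train_net(data)
        # 3. Optionally gate the new net against the previous one before keeping it.
    return net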
chrisw
Posts: 4555
Joined: Tue Apr 03, 2012 4:28 pm
Location: Midi-Pyrénées
Full name: Christopher Whittington

Re: Thoughts on "Zero" training of NNUEs

Post by chrisw »

hgm wrote: Fri Nov 01, 2024 11:33 am So it might be worth including some tables that depend on the relation between Pawns and other pieces, and between Pawns themselves. Like Kings, Pawns are not very mobile either. So when some of the tables are pawn-piece-square rather than king-piece-square, most moves would still only have to update the input they provide to the next layer for the moved piece, rather than recalculate everything because the Pawn moved.
But how to bucket the pawns?
towforce
Posts: 11988
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK
Full name: . .

Re: Thoughts on "Zero" training of NNUEs

Post by towforce »

op12no2 wrote: Fri Nov 01, 2024 11:09 pm In the early stages each new net wins 100% of the games when tested against the previous one; it's fascinating to watch.

How do they win? Is the opponent allowed to leave their king in check? Getting a checkmate would seem to require more skill than a net with random weights would have.
The simple reveals itself after the complex has been exhausted.