Thoughts on "Zero" training of NNUEs

op12no2 · Post by **op12no2** » Sat Nov 02, 2024 10:31 am

towforce wrote: ↑Sat Nov 02, 2024 10:09 am
op12no2 wrote: ↑Fri Nov 01, 2024 11:09 pm Another way to train a NNUE style eval from zero is by playing games with a random init of the weights and using the results to train a new net; rinse, repeat. In the early stages the new nets win 100% of the games when tested against each other; it's fascinating to watch. Common practice is to add a data gen command to the engine, playing games using a node count rather than time.
How do they win? Is the opponent allowed to leave their king in check? Getting a checkmate would seem to require more skill than a net with random weights would have.

Yes, the games played with the random init were all draws as far as I could see. But those results were enough to train a net that could win 100% of the time against the random init net, using adjudication to score wins/losses as well as mates, which as you say are probably few in early iterations. I did this as an experiment, but there are very strong engines that have been booted from a random init, like Ciekce's Stormphrax.

Edit: it didn't cross my mind before, but I guess one could argue that adjudication breaks the zero philosophy. In practice I don't think that would worry me, since one is still developing a net that is independent engine/data-wise.

hgm · Post by **hgm** » Sat Nov 02, 2024 12:23 pm

chrisw wrote: ↑Sat Nov 02, 2024 9:14 am But how to bucket the pawns?

As I understood it, 'bucketing' means that some of the locations of a piece are considered the same, in order to reduce the number of independent weights in the first layer. It is generally applied to the King location in the KPST.

I suppose that for Pawns you should not do that. My feeling is that you can only get away with doing it for Kings because there are some areas on the board a King would never visit while King Safety is still an important evaluation term. So in practice you only need the accurate relative placement of King and pieces that King Safety calculations require when the King is at or near his castled locations.

I would be surprised if bucketing was used in NNUE Shogi engines. In Shogi King Safety is always the most important evaluation term, and Kings are frequently chased all over the board through check drops.
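To make the idea of bucketing concrete, here is a minimal sketch of how a king-bucketed input index could be computed. The four-bucket layout, the function names and the 12 x 64 piece-square encoding are assumptions chosen for illustration; they are not taken from any particular engine.

```python
# Hypothetical king-bucket scheme, for illustration only.
NUM_PIECE_TYPES = 12   # 6 piece types x 2 colours (assumed encoding)
NUM_SQUARES = 64       # 0 = a1 ... 63 = h8 (assumed numbering)
NUM_BUCKETS = 4        # assumed bucket count

def king_bucket(king_sq: int) -> int:
    """Map a king square to a coarse bucket.

    Back rank is split into queenside/kingside, the second rank is one
    bucket, and every other square shares a single bucket.
    """
    rank, file = divmod(king_sq, 8)
    if rank == 0:
        return 0 if file < 4 else 1
    if rank == 1:
        return 2
    return 3

def feature_index(king_sq: int, piece_type: int, piece_sq: int) -> int:
    """Index of the first-layer input for (king bucket, piece, square)."""
    return (king_bucket(king_sq) * NUM_PIECE_TYPES + piece_type) * NUM_SQUARES + piece_sq

def first_layer_inputs(num_buckets: int) -> int:
    """Number of distinct inputs (per perspective) for a given bucket count."""
    return num_buckets * NUM_PIECE_TYPES * NUM_SQUARES
```

With this layout a king on e1 and a king on g1 share the same weights, and the input count drops from first_layer_inputs(64) to first_layer_inputs(4). The price is that positions with the king far from home are all lumped into one bucket.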

hgm · Post by **hgm** » Sat Nov 02, 2024 12:29 pm

op12no2 wrote: ↑Sat Nov 02, 2024 10:31 am
towforce wrote: ↑Sat Nov 02, 2024 10:09 am
op12no2 wrote: ↑Fri Nov 01, 2024 11:09 pm Another way to train a NNUE style eval from zero is by playing games with a random init of the weights and using the results to train a new net; rinse, repeat. In the early stages the new nets win 100% of the games when tested against each other; it's fascinating to watch. Common practice is to add a data gen command to the engine, playing games using a node count rather than time.
How do they win? Is the opponent allowed to leave their king in check? Getting a checkmate would seem to require more skill than a net with random weights would have.
Yes, the games played with the random init were all draws as far as I could see. But those results were enough to train a net that could win 100% of the time against the random init net, using adjudication to score wins/losses as well as mates, which as you say are probably few in early iterations. I did this as an experiment, but there are very strong engines that have been booted from a random init, like Ciekce's Stormphrax.

Edit: it didn't cross my mind before, but I guess one could argue that adjudication breaks the zero philosophy. In practice I don't think that would worry me, since one is still developing a net that is independent engine/data-wise.

As I recall it there still is a fair share of checkmates, in games between random movers. (See e.g. https://wismuth.com/chess/random-games.html : 15%).

op12no2 · Post by **op12no2** » Sat Nov 02, 2024 12:51 pm

hgm wrote: ↑Sat Nov 02, 2024 12:29 pm As I recall it there still is a fair share of checkmates, in games between random movers. (See e.g. https://wismuth.com/chess/random-games.html : 15%).

Interesting. The random init boot net scenario is different I guess in that the engine is searching as usual but with a very strange (but consistent) eval and maybe that's enough to force the games into a draw. But I must confess I didn't look too hard at the results; just quickly scanned through the top few thousand fens and thought 'all draws', but there may well have been some mates in there. While I seeded the games with 10 ply of random moves and 50% black/white to move first, a 6000 soft node search was used thereafter with adjudication at a |score| of 2000.
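For readers who have not written a data gen command, the adjudication step might look roughly like the sketch below. Only the |score| threshold of 2000 comes from the post; the function names, the White-point-of-view score convention and the draw fallback for unadjudicated games are assumptions (a real datagen would use the actual game result, and might require the threshold to hold for several plies).

```python
ADJUDICATION_SCORE = 2000  # |score| threshold quoted in the post

def adjudicate(scores_white_pov):
    """Scan per-ply search scores (White's point of view, an assumption).

    Returns (result, ply): result is 1.0 for a White win, 0.0 for a Black
    win, or None if the threshold was never reached; ply is the index at
    which the game was stopped (or the game length if it never was).
    """
    for ply, score in enumerate(scores_white_pov):
        if abs(score) >= ADJUDICATION_SCORE:
            return (1.0 if score > 0 else 0.0), ply
    return None, len(scores_white_pov)

def label_positions(fens, scores_white_pov, fallback_result=0.5):
    """Attach the game result to every position up to the adjudication ply.

    fallback_result stands in for the real game outcome when the game was
    not adjudicated (hypothetical simplification).
    """
    result, end_ply = adjudicate(scores_white_pov)
    if result is None:
        result = fallback_result
    pairs = list(zip(fens, scores_white_pov))[: end_ply + 1]
    return [(fen, score, result) for fen, score in pairs]
```

Each returned tuple is one training sample: position, search score and game outcome.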

In early iterations you get some dodgy data like a relatively high score and a drawn game result because the nets cannot mate with rook and king for example, but it all seems to sort itself out in later iterations.

The downside of this approach is the time it takes for data generation when the nets become strong; some devs abort and move to Leela data, for example.

hgm · Post by **hgm** » Sat Nov 02, 2024 1:31 pm

I agree that a random evaluation could be different from a random mover, because there is some consistency in the move choice, rather than this being independent.

OTOH, evaluation is not used at all for scoring checkmates. A random mover only has a small chance to pick the mating move if a mate-in-1 opportunity presents itself. An engine that searches 1 ply with random evaluation would always see it. But of course when the engines search two ply, they would recognize the threat, and usually be able to avert it.

It would be interesting to try a random mover that searches for checkmates, and randomly selects a move in other cases.
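The move-selection rule for such a mover is simple enough to sketch. The code below is deliberately independent of any chess library: legal_moves and delivers_mate are hypothetical inputs that the caller would supply from its own move generator.

```python
import random

def pick_move(legal_moves, delivers_mate, rng=random):
    """Play a mate-in-1 if one exists, otherwise a uniformly random move.

    legal_moves: non-empty list of moves in the current position.
    delivers_mate: callable returning True if the move checkmates
                   (hypothetical hook, supplied by the caller).
    rng: anything with a choice() method, e.g. random.Random(seed).
    """
    mating_moves = [m for m in legal_moves if delivers_mate(m)]
    return rng.choice(mating_moves if mating_moves else legal_moves)
```

With python-chess, for instance, delivers_mate could push the move, call board.is_checkmate() and pop it again. That wiring is left out here to keep the sketch dependency-free.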

towforce · Post by **towforce** » Sat Nov 02, 2024 2:27 pm

hgm wrote: ↑Sat Nov 02, 2024 12:29 pm As I recall it there still is a fair share of checkmates, in games between random movers. (See e.g. https://wismuth.com/chess/random-games.html : 15%).

Thanks. I wasn't aware of that: 15% is a lot higher than I would expect.

jdart · Post by **jdart** » Sat Nov 02, 2024 3:05 pm

It is important to have enough training data to train all parts of the network. For example, if your data never has or seldom has a king on f6, that part of the feature transformer is not being trained. At least a couple billion training positions is a good number. In general, larger nets require more training data.
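One way to check for this kind of gap is to count how often each king square occurs in the data before training. The sketch below assumes the training positions are available as FEN strings; the function names are made up for illustration, and a real pipeline would more likely scan its own binary format.

```python
from collections import Counter

def king_squares(fen):
    """Return e.g. {'K': 'e1', 'k': 'e8'} parsed from a FEN's board field."""
    board = fen.split()[0]
    found = {}
    for rank_idx, row in enumerate(board.split("/")):  # FEN starts at rank 8
        file_idx = 0
        for ch in row:
            if ch.isdigit():
                file_idx += int(ch)        # run of empty squares
            else:
                if ch in "Kk":
                    found[ch] = "abcdefgh"[file_idx] + str(8 - rank_idx)
                file_idx += 1
    return found

def king_coverage(fens):
    """Count occurrences of each (king, square) pair across the positions."""
    counts = Counter()
    for fen in fens:
        for piece, square in king_squares(fen).items():
            counts[(piece, square)] += 1
    return counts

def rare_king_squares(fens, min_count=1):
    """List (king, square) pairs seen fewer than min_count times."""
    counts = king_coverage(fens)
    squares = [f + r for r in "12345678" for f in "abcdefgh"]
    return [(p, s) for p in "Kk" for s in squares if counts[(p, s)] < min_count]
```

Running rare_king_squares over a sample of the data with a sensible min_count shows which parts of the feature transformer are getting little or no training signal.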

You should think not just about the network, but also about how you will train it. Stockfish includes a trainer built on PyTorch and Lightning; it is fairly tied to the Stockfish network architecture, although in principle it could be adapted to other architectures. There are several other trainers; a popular one is bullet: https://github.com/jw1912/bullet. See also https://github.com/Luecx/Grapheus and https://github.com/bmdanielsson/nnue-trainer.

You can generate data with selfplay games, but unless you have one or more large machines with many cores, that can be very slow. As noted, some are using LC0 data, available here: https://storage.lczero.org/files/training_data/, but it must be transformed to be useful for training. See https://github.com/linrock/lc0-data-converter, https://github.com/PGG106/Primer (converts binpack to bullet).

towforce · Post by **towforce** » Sun Nov 03, 2024 9:55 am

jdart wrote: ↑Sat Nov 02, 2024 3:05 pm It is important to have enough training data to train all parts of the network. For example, if your data never has or seldom has a king on f6, that part of the feature transformer is not being trained. At least a couple billion training positions is a good number. In general, larger nets require more training data.

Good post - thank you.

This is supportive of my opinion of current chess NNs, which is that they are encoding a large number of surface/simple patterns rather than what you'd want in an ideal world - a small number of deep/complex patterns that would give them a good understanding of the game in a small model that would run quickly.

chrisw · Post by **chrisw** » Sun Nov 03, 2024 4:10 pm

hgm wrote: ↑Sat Nov 02, 2024 12:23 pm
chrisw wrote: ↑Sat Nov 02, 2024 9:14 am But how to bucket the pawns?
As I understood it, 'bucketing' means that some of the locations of a piece are considered the same, in order to reduce the number of independent weights in the first layer. It is generally applied to the King location in the KPST.

I suppose that for Pawns you should not do that. My feeling is that you can only get away with doing it for Kings because there are some areas on the board a King would never visit while King Safety is still an important evaluation term. So in practice you only need the accurate relative placement of King and pieces that King Safety calculations require when the King is at or near his castled locations.

I would be surprised if bucketing was used in NNUE Shogi engines. In Shogi King Safety is always the most important evaluation term, and Kings are frequently chased all over the board through check drops.

You can, and some do, use 64 buckets for the king, one per square, so the term does not necessarily imply grouping some king squares together. It's a lot tougher, though, to imagine how to bucket the pawns (if you used rank of most advanced pawn, that would be 36(?) buckets, for example). I guess you could lump in some king bucketing also, but the increase in weights begins to get scary. As to the rarity of certain positions/structures, SF takes care of that, presumably others too, by some trickery or other.

hgm · Post by **hgm** » Sun Nov 03, 2024 9:51 pm

I see the usual KPST weights as a set of piece-pair tables, which provide a value W[coloredType1][sqr1][coloredType2][sqr2] that is added for all pairs present on the board. (For 32 pieces that would be 32x31/2 = 496 pairs.) Except that the weights are only non-zero if coloredType1 is a King. What I propose is to also use non-zero weights when coloredType1 is a Pawn. I think the standard size for NNUE is to use 256 KPST weight sets; this could be reduced to 128 tables for the Kings, and 16 for the Pawns.

The Pawn tables would of course only need 48 buckets.

The number of square pairs that have to be fed to the input layer is then equal for both, because there are 16 Pawns and only 2 Kings: 16 tables x 16 Pawns = 128 tables x 2 Kings = 256.
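The counting behind this can be checked with a few lines of arithmetic. This is one reading of the proposal, not a statement of how any engine lays out its weights: the helper name is made up, and "lookups" here means table reads per other piece on the board, with all 16 Pawns and both Kings still present.

```python
PIECES_ON_BOARD = 32
KINGS, PAWNS = 2, 16

# All unordered piece pairs on a full board: 32 x 31 / 2.
all_pairs = PIECES_ON_BOARD * (PIECES_ON_BOARD - 1) // 2

def anchored_lookups(num_tables: int, num_anchor_pieces: int) -> int:
    """Table reads per other piece: one per (table, anchor piece) combination."""
    return num_tables * num_anchor_pieces

king_side = anchored_lookups(128, KINGS)  # 128 King tables, 2 Kings
pawn_side = anchored_lookups(16, PAWNS)   # 16 Pawn tables, 16 Pawns

print(all_pairs, king_side, pawn_side)
```

Under these assumptions both halves cost the same number of lookups, and the Pawn half gets cheaper as Pawns leave the board.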
