Best practices for NNUE training data generation?

kelseyde123
Posts: 11
Joined: Fri Oct 06, 2023 1:10 am
Full name: Dan Kelsey

Post by kelseyde123 »

I'm planning to convert my engine Calvin from HCE to NNUE, and I'm wondering what the best practices are for generating training data, as well as what qualifies as 'good' training data.
  • 1. How do people retrieve the data? My current plan was to write a script extracting individual positions from PGN files generated via cutechess-cli. Is this how others go about it?
  • 2. What positions to extract? Is it common to run a q-search to reach quiet positions, as is done for Texel tuners? Or is that not necessary for NNUE data?
  • 3. Is there a rule-of-thumb for how many positions to extract per game played? Presumably you don't want too many from the same game, to ensure a variety of positions/pawn structures etc.
  • 4. How many training positions to use? I'm sure the answer is 'the more the better', but I've seen it vary from only a few million to billions in the case of Stockfish... what would be a reasonable starting point?
Any help would be much appreciated!
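For question 1, here is a minimal sketch of the "script over cutechess-cli PGNs" idea. It only pulls the SAN move and the score/depth comment out of a movetext string. The comment layout it expects ({+0.35/12 0.50s}, {+M3/20 0.10s}, {book}) is an assumption about how cutechess-cli annotates moves, so check it against your own PGNs. The names parse_movetext and MATE_CP are made up for illustration. Turning the moves into actual positions still needs a move generator (your engine's, or a chess library) to replay the game.

```python
import re

# Hypothetical constant: centipawn value used to represent any mate score.
MATE_CP = 32000

# A SAN move, optionally followed by a {...} comment.
TOKEN_RE = re.compile(
    r"(?P<san>(?:[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?|O-O(?:-O)?)[+#]?)"
    r"(?:\s*\{(?P<comment>[^}]*)\})?"
)

# Assumed comment prefix: "<score>/<depth>", score in pawns or "M<n>" for mate.
SCORE_RE = re.compile(r"^(?P<score>[+-]?(?:M\d+|\d+(?:\.\d+)?))/(?P<depth>\d+)")


def parse_movetext(movetext):
    """Return (ply, san, score_cp, depth) tuples for one game's movetext.

    score_cp and depth are None when the comment carries no score
    (for example book moves or moves without a comment).
    """
    rows = []
    for ply, match in enumerate(TOKEN_RE.finditer(movetext)):
        score_cp = depth = None
        scored = SCORE_RE.match(match.group("comment") or "")
        if scored:
            raw = scored.group("score")
            depth = int(scored.group("depth"))
            if "M" in raw:
                score_cp = -MATE_CP if raw.startswith("-") else MATE_CP
            else:
                score_cp = round(float(raw) * 100)  # pawns -> centipawns
        rows.append((ply, match.group("san"), score_cp, depth))
    return rows
```

The score is whatever the engine reported for the move it played, so as far as I know it is from the mover's point of view; flip the sign on alternate plies if your trainer wants white-relative scores.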
mar
Posts: 2592
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Post by mar »

I think 1 can work as well if you manage to extract scores from the PGN - a separate labeling pass can be very slow
it's probably more common (and more efficient) to generate positions directly in some internal packed binary format - e.g. play a couple of random plies, then let the engine play on at a fixed node count (5-10k nodes per move) while writing out positions - still, generating hundreds of millions to billions of positions can take very long
and of course then there's leela data if you don't care about having an original eval...
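To make "internal packed binary format" a bit more concrete, here is a rough sketch of one possible fixed-size record. The layout is made up for illustration and is not my actual format or any trainer's format: a 64-bit occupancy bitboard, 16 bytes of 4-bit piece codes, side to move, a score in centipawns, and the game result. It deliberately ignores castling rights, en passant and move counters. The names pack_position, unpack_position, PIECES and RECORD are hypothetical.

```python
import struct

PIECES = "PNBRQKpnbrqk"  # 4-bit piece code = index into this string

# Little-endian, no padding: occupancy (u64), 32 piece nibbles (16 bytes),
# side to move (u8), score in centipawns (i16), game result (i8).
RECORD = struct.Struct("<Q16sBhb")


def pack_position(fen, score_cp, result):
    """Pack the piece placement and side to move of a FEN into 28 bytes."""
    placement, stm = fen.split()[:2]
    occupancy = 0
    pieces = []
    # FEN lists ranks 8 down to 1; squares are numbered a1 = 0 ... h8 = 63.
    for rank_idx, row in enumerate(placement.split("/")):
        file_idx = 0
        for ch in row:
            if ch.isdigit():
                file_idx += int(ch)  # run of empty squares
            else:
                sq = (7 - rank_idx) * 8 + file_idx
                occupancy |= 1 << sq
                pieces.append((sq, PIECES.index(ch)))
                file_idx += 1
    pieces.sort()  # nibbles are stored in the order of the set occupancy bits
    nibbles = bytearray(16)
    for i, (_, code) in enumerate(pieces):
        nibbles[i // 2] |= code << (4 * (i % 2))
    return RECORD.pack(occupancy, bytes(nibbles), int(stm == "b"), score_cp, result)


def unpack_position(blob):
    """Inverse of pack_position: returns ({square: piece}, stm, score_cp, result)."""
    occupancy, nibbles, stm, score_cp, result = RECORD.unpack(blob)
    board = {}
    i = 0
    for sq in range(64):
        if (occupancy >> sq) & 1:
            board[sq] = PIECES[(nibbles[i // 2] >> (4 * (i % 2))) & 0xF]
            i += 1
    return board, "b" if stm else "w", score_cp, result
```

The point of a fixed-size record is that the generator can append to one big file and the trainer can seek to or shuffle records without parsing text.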

as for 2 it should be enough to skip positions where the side to move (stm) is in check or where the best move is tactical (= a capture), as I was told by Jay, and it worked well for me
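If you go the PGN route, that filter can be approximated from the SAN alone, without a move generator. A small sketch, assuming the PGN writer emits the usual "+" / "#" suffixes and "x" for captures, and that the move played is the engine's best move; quiet_plies is a hypothetical helper name:

```python
def quiet_plies(sans):
    """Return indices i where the position *before* sans[i] is worth keeping.

    A position is kept when the side to move is not in check (the previous
    move did not end in '+' or '#') and the move played from it is not a
    capture (no 'x' in its SAN).
    """
    keep = []
    in_check = False  # assumes the game starts from a position without check
    for i, san in enumerate(sans):
        if not in_check and "x" not in san:
            keep.append(i)
        in_check = san.endswith(("+", "#"))
    return keep
```

For a game starting e4 d5 exd5 Qxd5 Nc3 Qe5+ Be2 Nf6 this keeps the positions before plies 0, 1, 4, 5 and 7: the two captures are dropped, and so is the position where white has to answer the check.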

for 3 I don't do anything special

as for 4: probably depends on the strength of your engine. I initially failed with only 15M positions, so I'd recommend 100M+ (of course the more the better, but you don't need billions for an initial NN version)
op12no2
Posts: 514
Joined: Tue Feb 04, 2014 12:25 pm
Full name: Colin Jenkins

Post by op12no2 »

Re: 1. pgn_extract is fast if you are happy using third party tools.
https://www.cs.kent.ac.uk/people/staff/djb/pgn-extract/
chesskobra
Posts: 235
Joined: Thu Jul 21, 2022 12:30 am
Full name: Chesskobra

Post by chesskobra »

mar wrote: Fri Jun 28, 2024 6:03 am I think 1 might work as well if you manage to extract scores from the pgn - extra labeling pass can be very slow
it's probably common (and more efficient) to generate positions directly in some internal packed binary format - like just play a couple of random plies then let the engine play on (fixed nodes, 5-10k) while generating positions - still generating hundreds of millions to billions of positions can take very long
and of course then there's leela data if you don't care about having an original eval...
Could you please point to the Leela data that you are referring to? For position evaluations, one can get CCRL commented games, where each position has an evaluation. I am interested in the question as well, especially for training a network for the endgame.
mar
Posts: 2592
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Post by mar »

chesskobra wrote: Fri Jun 28, 2024 11:56 am Could you please point to the leela data that you are referring to? For the evaluations of positions, one may get CCRL commented games, where each position has evaluation. I am interested in the question as well, especially for training a network for endgame.
sorry I can't, I use my own data. but a google search might help hopefully?
smatovic
Posts: 2839
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Post by smatovic »

chesskobra wrote: Fri Jun 28, 2024 11:56 am [...]
Could you please point to the leela data that you are referring to? For the evaluations of positions, one may get CCRL commented games, where each position has evaluation. I am interested in the question as well, especially for training a network for endgame.

https://github.com/official-stockfish/n ... g-datasets

https://robotmoon.com/nnue-training-data/

https://storage.lczero.org/files/training_data/

--
Srdja
chesskobra
Posts: 235
Joined: Thu Jul 21, 2022 12:30 am
Full name: Chesskobra

Post by chesskobra »

@smatovic Thanks. These data sets look very interesting.
jorose
Posts: 373
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Post by jorose »

It should be pointed out that, while it is generally fine to use Leela data, competitions such as TCEC and CCC prefer that you train on your own data.

Before NNUE was popular there was a lot of diversity in engines, which came naturally from the development process behind handcrafted evaluation functions. Since most top engines now use similar evaluation functions (read: NNUE), the top events prefer that you don't cut corners and that you come up with your own training data, as that can make your program more unique. The Leela training data is both huge and very high quality, so if events did not state this preference, developers would be pushed to all use it, making it hard for anyone generating their own data to compete.

I think we also don't yet properly understand what makes for good training data. It is easy to assume that higher quality play leads to higher quality training data, but in reality we have direct evidence to the contrary. As far as I know, the only thing we know for certain is that more data tends to be better than less data. As such, it is also very valuable that we have people exploring dataset generation.

In general my recommendation for almost everything in computer chess is to just be transparent. This applies to both data and code. Nobody likes finding out after the fact that you chose to use publicly available tools without disclosing them, but if you use things that are publicly available and give credit to the sources, generally everyone is happy. The events can choose whom to invite, the people making the tools available are happy to see the tools are useful, people might find your project because they were interested in those tools, etc.
-Jonathan