I'm planning to convert my engine Calvin from HCE to NNUE, and I'm wondering what the best practices are for generating training data, as well as what qualifies as 'good' training data.
1. How do people retrieve the data? My current plan was to write a script extracting individual positions from PGN files generated via cutechess-cli (see the sketch after this list). Is this how others go about it?
2. What positions to extract? Is it common to run a q-search to reach quiet positions, as is done for Texel tuners? Or is that not necessary for NNUE data?
3. Is there a rule-of-thumb for how many positions to extract per game played? Presumably you don't want too many from the same game, to ensure a variety of positions/pawn structures etc.
4. How many training positions to use? I'm sure the answer is 'the more the better', but I've seen it vary from only a few million to the billions in case of Stockfish... what would be a reasonable starting point?
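For reference, below is roughly the kind of extraction script I have in mind, using python-chess. It's only a sketch: the cutechess-style eval comments ("{+0.25/12 0.45s}"), the file names and the plain-text output format are assumptions that would need adapting to whatever the tools actually produce.

import re
import chess.pgn

# Matches evals like "+0.25/12" in cutechess-style comments; mate scores (e.g. "+M5/10") are skipped.
EVAL_RE = re.compile(r"([+-]?\d+\.\d+)/\d+")
RESULT_SCORE = {"1-0": 1.0, "0-1": 0.0, "1/2-1/2": 0.5}  # game result from white's point of view

def extract(pgn_path, out_path):
    with open(pgn_path) as pgn, open(out_path, "w") as out:
        while True:
            game = chess.pgn.read_game(pgn)
            if game is None:
                break
            result = RESULT_SCORE.get(game.headers.get("Result", "*"))
            if result is None:
                continue  # skip unfinished games
            board = game.board()
            for node in game.mainline():
                match = EVAL_RE.search(node.comment or "")
                if match is not None:
                    # The comment belongs to the move played from this position,
                    # so the score is from the side to move's perspective (in centipawns).
                    score_cp = int(float(match.group(1)) * 100)
                    out.write(f"{board.fen()} | {score_cp} | {result}\n")
                board.push(node.move)

if __name__ == "__main__":
    extract("games.pgn", "positions.txt")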
I think 1 might work as well if you manage to extract scores from the pgn - extra labeling pass can be very slow
it's probably common (and more efficient) to generate positions directly in some internal packed binary format - like just play a couple of random plies then let the engine play on (fixed nodes, 5-10k) while generating positions - still generating hundreds of millions to billions of positions can take very long
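roughly something along these lines (just a sketch, not necessarily how anyone actually does it: python-chess driving a UCI build of the engine, a plain-text output instead of a real packed binary format, and placeholder engine path / node count):

import random
import chess
import chess.engine

RANDOM_PLIES = 8      # a few random opening plies for variety
NODE_LIMIT = 5000     # fixed nodes per move
RESULT_SCORE = {"1-0": 1.0, "0-1": 0.0, "1/2-1/2": 0.5}

def play_one_game(engine, out):
    board = chess.Board()
    for _ in range(RANDOM_PLIES):
        moves = list(board.legal_moves)
        if not moves:
            return
        board.push(random.choice(moves))
    records = []
    while not board.is_game_over(claim_draw=True):
        played = engine.play(board, chess.engine.Limit(nodes=NODE_LIMIT),
                             info=chess.engine.INFO_SCORE)
        score = played.info.get("score")
        if score is not None and not score.is_mate():
            # centipawn score from the side to move's point of view
            records.append((board.fen(), score.pov(board.turn).score()))
        board.push(played.move)
    outcome = RESULT_SCORE[board.result(claim_draw=True)]
    for fen, cp in records:
        out.write(f"{fen} | {cp} | {outcome}\n")

if __name__ == "__main__":
    engine = chess.engine.SimpleEngine.popen_uci("./calvin")  # engine path is a placeholder
    with open("positions.txt", "w") as out:
        for _ in range(1000):  # games per run; scale this up a lot in practice
            play_one_game(engine, out)
    engine.quit()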
and of course then there's leela data if you don't care about having an original eval...
as for 2, it should be enough to avoid positions where the stm is in check or where the best move is tactical (= a capture), as I was told by Jay, and it worked well for me
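in code terms the filter is basically just this (a sketch; best_move here stands for the move returned by the search that produced the score):

import chess

def keep_position(board: chess.Board, best_move: chess.Move) -> bool:
    if board.is_check():             # skip positions where the side to move is in check
        return False
    if board.is_capture(best_move):  # skip positions where the best move is tactical (a capture)
        return False
    return True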
for 3 I don't do anything special
as for 4: probably depends on the strength of your engine. I initially failed with only 15M positions, so I'd recommend 100M+ (of course the more the better, but you don't need billions for an initial NN version)
mar wrote: ↑Fri Jun 28, 2024 6:03 am
I think 1 might work as well if you manage to extract scores from the pgn - extra labeling pass can be very slow
it's probably common (and more efficient) to generate positions directly in some internal packed binary format - like just play a couple of random plies then let the engine play on (fixed nodes, 5-10k) while generating positions - still generating hundreds of millions to billions of positions can take very long
and of course then there's leela data if you don't care about having an original eval...
Could you please point to the leela data that you are referring to? For the evaluations of positions, one may get CCRL commented games, where each position has an evaluation. I am interested in the question as well, especially for training a network for the endgame.
chesskobra wrote: ↑Fri Jun 28, 2024 11:56 am
Could you please point to the leela data that you are referring to? For the evaluations of positions, one may get CCRL commented games, where each position has an evaluation. I am interested in the question as well, especially for training a network for the endgame.
sorry I can't, I use my own data. but a google search might help hopefully?
chesskobra wrote: ↑Fri Jun 28, 2024 11:56 am
[...]
Could you please point to the leela data that you are referring to? For the evaluations of positions, one may get CCRL commented games, where each position has an evaluation. I am interested in the question as well, especially for training a network for the endgame.
It should be pointed out that, while it is generally fine to use Leela data, competitions such as TCEC and CCC prefer that you train on your own data.
Before NNUE was popular there was a lot of diversity in engines, which came naturally as a result of the development process behind handcrafted evaluation functions. Since most top engines now use similar evaluation functions (read: NNUE), the top events prefer that you don't cut corners and instead come up with your own training data, as that can make your program more unique. The Leela training data is both huge and very high quality, so if original data were not preferred, developers would all be pushed towards using it, and it would be hard for anyone generating their own data to compete.
I think we also don't yet properly understand what makes for good training data. It is easy to assume that higher-quality play leads to higher-quality training data, but in reality we have direct evidence to the contrary. As far as I know, the only thing we know for certain is that more data tends to be better than less data. As such, it is also very valuable that we have people exploring dataset generation.
In general my recommendation for almost everything in computer chess is to just be transparent. This applies to both data and code. Nobody likes finding out after the fact that you chose to use publicly available tools without disclosing them, but if you use things that are publicly available and give credit to the sources, generally everyone is happy. The events can choose whom to invite, the people making the tools available are happy to see the tools are useful, people might find your project because they were interested in those tools, etc.