Generating original training data

Discussion of chess software programming and technical issues.

Moderator: Ras

StuProgrammer
Posts: 1
Joined: Fri Jan 03, 2025 11:06 pm
Full name: Stuart Marshall

Generating original training data

Post by StuProgrammer »

I've been developing various versions of my chess engine for three years, and I have finally introduced an NNUE-style network into my engine, for a ~200 Elo improvement over HCE. However, I have now hit a roadblock.

The problem I am having is that I used the Stockfish evals from lichess (https://database.lichess.org/#evals), which seems a little unoriginal.
The next best option I can see is playing out many games between the old HCE versions of my engine for training data, but my simple calculations show that half an hour of parsing lichess data is equivalent to about 300 hours(!) of running test matches at 50 ms/move. I think that makes generating a good dataset near-impossible.
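A quick back-of-the-envelope check of that gap (all numbers here are illustrative assumptions, not measurements from any particular setup):

```python
# Back-of-the-envelope comparison of self-play datagen vs parsing a dump.
# All constants below are assumed, illustrative values.
MS_PER_MOVE = 50            # self-play search time per move
PLIES_PER_GAME = 100        # assumed average game length in plies
POSITIONS_PER_GAME = 100    # roughly one usable position per ply

selfplay_pos_per_sec = POSITIONS_PER_GAME / (PLIES_PER_GAME * MS_PER_MOVE / 1000)
print(f"self-play: ~{selfplay_pos_per_sec:.0f} positions/s per thread")

# If parsing the lichess eval dump yields, say, 12,000 positions/s,
# the ratio lands on the ~600x gap (300 h vs 0.5 h) described above:
parse_pos_per_sec = 12_000
print(f"ratio: ~{parse_pos_per_sec / selfplay_pos_per_sec:.0f}x")
```

More threads and a lower move time shrink the gap linearly, but the orders of magnitude stay far apart.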

Are there any strategies for collecting training data more quickly, for example
  • using positions reached during the search
  • using human games
I’d love to hear your thoughts, experiences, or suggestions on how to tackle this problem.
Thanks in advance!
Sapling
Posts: 18
Joined: Sun Oct 13, 2024 7:31 pm
Location: UK
Full name: Tim Jones

Re: Generating original training data

Post by Sapling »

Congrats on getting the inference working!

The honest truth is that it does take a long time; that's why many end up using Leela's dataset.

I logged the datagen / training sessions of my engine here
I decided to initialize my network with random weights and then train through self-play with no HCE. The first 5-10 iterations only took a few hours each and gave massive Elo improvements. It was really quite a fun time.

The next 10 iterations were much, much slower, taking days to weeks between networks. On the plus side, as your dataset gets better you can start reusing data.

A few tips:
  • Write your own datagen program that uses direct method calls instead of UCI.
  • Start with 7 or 8 random moves.
  • Adjudicate a win when the eval stays above a huge threshold for a number of consecutive moves.
  • Set a hard node limit for your search, e.g. 5k nodes.
Here is my datagen program in case it's helpful. IIRC it was generating around 4k positions/second per machine, but YMMV.
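As a rough illustration of the tips above, here is a minimal self-play datagen loop in Python. The `Board` and `search` definitions are placeholders standing in for an engine's internals (called directly, not over UCI); the node limit, adjudication threshold, and game length are assumed values:

```python
import random

# Placeholder stand-ins for engine internals; replace with your own types.
class Board:
    def __init__(self): self.ply = 0
    def legal_moves(self): return list(range(5))   # dummy move list
    def push(self, move): self.ply += 1
    def is_game_over(self): return self.ply >= 60  # dummy termination

def search(board, node_limit=5000):
    """Placeholder fixed-node search returning (eval_cp, best_move)."""
    return random.randint(-300, 300), board.legal_moves()[0]

WIN_THRESHOLD_CP = 1000   # adjudicate when |eval| stays this large...
WIN_PLIES = 4             # ...for this many consecutive plies

def play_game():
    board, positions, streak = Board(), [], 0
    # Tip: start with 7 or 8 random moves for opening diversity.
    for _ in range(random.choice((7, 8))):
        board.push(random.choice(board.legal_moves()))
    # Tip: play out with a hard node limit, recording (position, eval).
    while not board.is_game_over():
        score, best = search(board, node_limit=5000)
        positions.append((board.ply, score))
        # Tip: adjudicate a win on a sustained, huge eval.
        streak = streak + 1 if abs(score) >= WIN_THRESHOLD_CP else 0
        if streak >= WIN_PLIES:
            break
        board.push(best)
    return positions

data = play_game()
print(len(data), "positions generated")
```

In a real datagen program you would record the position itself (and usually the game result after the fact) rather than just the ply and score.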
The Grand Chess Tree - Distributed volunteer computing project
Sapling - 3380 ELO [CCRL] UCI chess engine
sovaz1997
Posts: 289
Joined: Sun Nov 13, 2016 10:37 am

Re: Generating original training data

Post by sovaz1997 »

Definitely! Training without HCE is really cool!

I am now at the stage where I have caught up with the HCE level of play after 7-8 iterations with a 768x32 net. I also generated games the same way: the first 10-12 moves are random, and the rest are saved. That way I can generate about 100M positions in a few hours.

I also filter the positions, discarding those where the best move is a capture.
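A tiny sketch of that filtering step; the record fields here are hypothetical (a real pipeline would store the best move during datagen or re-run a shallow search), and the rationale is that tactical positions give noisy eval labels for a network meant to score quiet positions:

```python
# Toy filter: keep only positions whose best move is quiet (not a capture).
# The records and their fields are hypothetical placeholders.
records = [
    {"fen": "pos1", "eval": 35,  "best_is_capture": False},
    {"fen": "pos2", "eval": 210, "best_is_capture": True},   # tactical, noisy label
    {"fen": "pos3", "eval": -12, "best_is_capture": False},
]

quiet = [r for r in records if not r["best_is_capture"]]
print(len(quiet), "of", len(records), "positions kept")
```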
Zevra 2 is my chess engine. Binary, source and description here: https://github.com/sovaz1997/Zevra2
Zevra v2.6 is the latest version: https://github.com/sovaz1997/Zevra2/releases
jdart
Posts: 4391
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Generating original training data

Post by jdart »

I have several large servers I used for generating training games. At one point I was using over 100 cores. It is certainly resource-intensive, much more so than training.
phhnguyen
Posts: 1514
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Generating original training data

Post by phhnguyen »

jdart wrote: Sun Feb 09, 2025 4:01 am I have several large servers I used for generating training games. At one point I was using over 100 cores. It is certainly resource intensive, much more so than training
Interesting information. Could you tell us how large a dataset (number of positions) and how long (days) you ran those 100 cores to generate a "good"/typical dataset? Thanks
https://banksiagui.com
The most feature-rich chess GUI, based on the open-source Banksia chess tournament manager
lithander
Posts: 898
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Generating original training data

Post by lithander »

I'm currently in the process of training a new network from scratch for my engine Leorik and wrote about it here: viewtopic.php?p=975507#p975507

Generally, when you're still at the HCE stage, you can usually improve on that version with a very small network and a very small dataset. I remember that I had a tiny network with only 8 neurons in the hidden layer that was already on par with the HCE version.

Training such a small network doesn't require more than a few million labeled positions. One billion seems about enough to train pretty good nets with a 256-neuron hidden layer. So every hobby programmer can get a big Elo jump from moving to NNUE without relying on an external dataset.
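For a sense of scale, a (768 -> 8 -> 1) net like the tiny one described above fits in a few lines of plain Python. The clipped-ReLU activation is a common NNUE-style choice, and the random weights and feature encoding here are illustrative assumptions, not trained values:

```python
import random

# Minimal (768 -> 8 -> 1) evaluation net; sizes match a tiny starter net.
IN, HIDDEN = 768, 8
random.seed(42)
W1 = [[random.uniform(-0.1, 0.1) for _ in range(IN)] for _ in range(HIDDEN)]
B1 = [0.0] * HIDDEN
W2 = [random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
B2 = 0.0

def crelu(x, cap=1.0):
    return min(max(x, 0.0), cap)   # clipped ReLU, common in NNUE-style nets

def evaluate(features):
    """features: 768 binary (piece, square) one-hot inputs."""
    hidden = [crelu(sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(W1, B1)]
    return sum(w * h for w, h in zip(W2, hidden)) + B2

# A sparse position: ~32 active features out of 768.
features = [0.0] * IN
for i in random.sample(range(IN), 32):
    features[i] = 1.0
print("eval:", evaluate(features))
```

Real NNUE implementations add incremental accumulator updates and quantized integer arithmetic on top of this shape, but the parameter count is the same.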

Every developer has to decide how much money/time they want to invest or whether they are fine with taking shortcuts by using external datasets, which most devs frown upon. If you are not in the business of competing with the top 10 engines you can generate a few thousand positions per second on your home computer while you do other things for a few weeks and you're good.
Sapling wrote: Fri Feb 07, 2025 2:55 pm I logged the datagen / training sessions of my engine here
Interesting! I would have loved to see how much performance each net gains over its predecessor, though!

Personally, I've not yet started lowering the WDL weight below 1.0... I feel like training purely on WDL is purer than assigning a somewhat subjective eval to a position. What's your experience with the WDL ratio?
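For reference, the WDL ratio is usually applied as a blended training target. This sketch assumes a sigmoid eval-to-probability conversion with a 400 cp scaling constant (both common conventions, not anything specific to Leorik or Sapling):

```python
import math

# Blended training target: interpolate between the game result (WDL)
# and the search eval. lambda_ = 1.0 means pure WDL; lower values mix
# in the (somewhat subjective) eval label. SCALE is an assumed constant.
SCALE = 400.0  # centipawns-to-probability scaling

def sigmoid(cp):
    return 1.0 / (1.0 + math.exp(-cp / SCALE))

def target(wdl, eval_cp, lambda_=1.0):
    """wdl: 1.0 win, 0.5 draw, 0.0 loss, from the side to move."""
    return lambda_ * wdl + (1.0 - lambda_) * sigmoid(eval_cp)

print(target(1.0, 150, lambda_=1.0))   # pure WDL target for a won game
print(target(1.0, 150, lambda_=0.7))   # blended: pulled toward the eval
```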
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess