Generating original training data

Discussion of chess software programming and technical issues.

Moderator: Ras

StuProgrammer
Posts: 1
Joined: Fri Jan 03, 2025 11:06 pm
Full name: Stuart Marshall

Generating original training data

Post by StuProgrammer »

I've been developing various versions of my chess engine for three years, and I have finally introduced an NNUE-style network into my engine, for a ~200 Elo improvement over HCE. However, I have now hit a roadblock.

The problem I am having is that I used the Stockfish evals from lichess (https://database.lichess.org/#evals), which seems a little unoriginal.
The next best option I can see is playing out many games between the old HCE versions of my engine for training data, but my simple calculations show that half an hour of parsing lichess data is equivalent to about 300 hours(!) of running test matches at 50 ms/move. I think that makes generating a good dataset near-impossible.
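A quick back-of-the-envelope check of that gap (all numbers here are illustrative assumptions, not measurements from any particular setup):

```python
# Back-of-the-envelope comparison of self-play datagen vs parsing a dump.
# All constants below are assumed, illustrative values.
MS_PER_MOVE = 50            # self-play search time per move
PLIES_PER_GAME = 100        # assumed average game length in plies
POSITIONS_PER_GAME = 100    # roughly one usable position per ply

selfplay_pos_per_sec = POSITIONS_PER_GAME / (PLIES_PER_GAME * MS_PER_MOVE / 1000)
print(f"self-play: ~{selfplay_pos_per_sec:.0f} positions/s per thread")

# If parsing the lichess eval dump yields, say, 12,000 positions/s,
# the ratio lands on the ~600x gap (300 h vs 0.5 h) described above:
parse_pos_per_sec = 12_000
print(f"ratio: ~{parse_pos_per_sec / selfplay_pos_per_sec:.0f}x")
```

More threads and a lower move time shrink the gap linearly, but the orders of magnitude stay far apart.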

Are there any strategies for collecting training data more quickly, for example
  • using positions reached during the search
  • using human games
I’d love to hear your thoughts, experiences, or suggestions on how to tackle this problem.
Thanks in advance!
Sapling
Posts: 18
Joined: Sun Oct 13, 2024 7:31 pm
Location: UK
Full name: Tim Jones

Re: Generating original training data

Post by Sapling »

Congrats on getting the inference working!

The honest truth is that it does take a long time; that's why many end up using Leela's dataset.

I logged the datagen / training sessions of my engine here
I decided to initialize my network with random weights and then train through self-play with no HCE. The first 5-10 iterations only took a few hours each and gave massive Elo improvements. It was really quite a fun time.

The next 10 iterations were much, much slower, taking days to weeks between networks. On the plus side, as your dataset gets better you can start reusing data.

A few tips:
  • Write your own datagen program that uses direct method calls instead of UCI.
  • Start with 7 or 8 random moves.
  • Adjudicate a win when the eval stays above a huge threshold for a number of consecutive moves.
  • Set a hard node limit for your search, e.g. 5k nodes.
Here is my datagen program in case it's helpful. IIRC it was generating around 4k positions/second per machine, but YMMV.
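As a rough illustration of the tips above, here is a minimal self-play datagen loop in Python. The `Board` and `search` definitions are placeholders standing in for an engine's internals (called directly, not over UCI); the node limit, adjudication threshold, and game length are assumed values:

```python
import random

# Placeholder stand-ins for engine internals; replace with your own types.
class Board:
    def __init__(self): self.ply = 0
    def legal_moves(self): return list(range(5))   # dummy move list
    def push(self, move): self.ply += 1
    def is_game_over(self): return self.ply >= 60  # dummy termination

def search(board, node_limit=5000):
    """Placeholder fixed-node search returning (eval_cp, best_move)."""
    return random.randint(-300, 300), board.legal_moves()[0]

WIN_THRESHOLD_CP = 1000   # adjudicate when |eval| stays this large...
WIN_PLIES = 4             # ...for this many consecutive plies

def play_game():
    board, positions, streak = Board(), [], 0
    # Tip: start with 7 or 8 random moves for opening diversity.
    for _ in range(random.choice((7, 8))):
        board.push(random.choice(board.legal_moves()))
    # Tip: play out with a hard node limit, recording (position, eval).
    while not board.is_game_over():
        score, best = search(board, node_limit=5000)
        positions.append((board.ply, score))
        # Tip: adjudicate a win on a sustained, huge eval.
        streak = streak + 1 if abs(score) >= WIN_THRESHOLD_CP else 0
        if streak >= WIN_PLIES:
            break
        board.push(best)
    return positions

data = play_game()
print(len(data), "positions generated")
```

In a real datagen program you would record the position itself (and usually the game result after the fact) rather than just the ply and score.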
The Grand Chess Tree - Distributed volunteer computing project
Sapling - 3380 ELO [CCRL] UCI chess engine
sovaz1997
Posts: 289
Joined: Sun Nov 13, 2016 10:37 am

Re: Generating original training data

Post by sovaz1997 »

Definitely! Training without HCE is really cool!

I am now at the stage where I have caught up with the HCE level of play after 7-8 iterations with a 768x32 net. I also generated games the same way: the first 10-12 moves are random, and the rest are saved. That way I can generate about 100M positions in a few hours.

I also filter the positions, discarding those where the best move is a capture.
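A tiny sketch of that filtering step; the record fields here are hypothetical (a real pipeline would store the best move during datagen or re-run a shallow search), and the rationale is that tactical positions give noisy eval labels for a network meant to score quiet positions:

```python
# Toy filter: keep only positions whose best move is quiet (not a capture).
# The records and their fields are hypothetical placeholders.
records = [
    {"fen": "pos1", "eval": 35,  "best_is_capture": False},
    {"fen": "pos2", "eval": 210, "best_is_capture": True},   # tactical, noisy label
    {"fen": "pos3", "eval": -12, "best_is_capture": False},
]

quiet = [r for r in records if not r["best_is_capture"]]
print(len(quiet), "of", len(records), "positions kept")
```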
Zevra 2 is my chess engine. Binary, source and description here: https://github.com/sovaz1997/Zevra2
Zevra v2.6 is the latest version: https://github.com/sovaz1997/Zevra2/releases
jdart
Posts: 4391
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Generating original training data

Post by jdart »

I have several large servers I used for generating training games. At one point I was using over 100 cores. It is certainly resource-intensive, much more so than training.
phhnguyen
Posts: 1514
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Generating original training data

Post by phhnguyen »

jdart wrote: Sun Feb 09, 2025 4:01 am I have several large servers I used for generating training games. At one point I was using over 100 cores. It is certainly resource intensive, much more so than training
Interesting information. Could you tell us how large a dataset (number of positions) and how long (days) you ran those 100 cores to generate a "good"/typical dataset? Thanks
https://banksiagui.com
The most feature-rich chess GUI, based on the open-source Banksia chess tournament manager
lithander
Posts: 898
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Generating original training data

Post by lithander »

I'm currently in the process of training a new network from scratch for my engine Leorik and wrote about it here: viewtopic.php?p=975507#p975507

Generally, when you're still at the HCE stage, you can usually improve on that version with a very small network and a very small dataset. I remember that I had a tiny network with only 8 neurons in the hidden layer that was already on par with the HCE version.

Training such a small network doesn't require more than a few million labeled positions. One billion seems about enough to train pretty good nets with a 256-neuron hidden layer. So every hobby programmer can get a big Elo jump from moving to NNUE without relying on an external dataset.
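For a sense of scale, a (768 -> 8 -> 1) net like the tiny one described above fits in a few lines of plain Python. The clipped-ReLU activation is a common NNUE-style choice, and the random weights and feature encoding here are illustrative assumptions, not trained values:

```python
import random

# Minimal (768 -> 8 -> 1) evaluation net; sizes match a tiny starter net.
IN, HIDDEN = 768, 8
random.seed(42)
W1 = [[random.uniform(-0.1, 0.1) for _ in range(IN)] for _ in range(HIDDEN)]
B1 = [0.0] * HIDDEN
W2 = [random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
B2 = 0.0

def crelu(x, cap=1.0):
    return min(max(x, 0.0), cap)   # clipped ReLU, common in NNUE-style nets

def evaluate(features):
    """features: 768 binary (piece, square) one-hot inputs."""
    hidden = [crelu(sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(W1, B1)]
    return sum(w * h for w, h in zip(W2, hidden)) + B2

# A sparse position: ~32 active features out of 768.
features = [0.0] * IN
for i in random.sample(range(IN), 32):
    features[i] = 1.0
print("eval:", evaluate(features))
```

Real NNUE implementations add incremental accumulator updates and quantized integer arithmetic on top of this shape, but the parameter count is the same.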

Every developer has to decide how much money/time they want to invest or whether they are fine with taking shortcuts by using external datasets, which most devs frown upon. If you are not in the business of competing with the top 10 engines you can generate a few thousand positions per second on your home computer while you do other things for a few weeks and you're good.
Sapling wrote: Fri Feb 07, 2025 2:55 pm I logged the datagen / training sessions of my engine here
Interesting! I would have loved to see how much performance each net gains over its predecessor, though!

Personally, I've not yet started lowering the WDL weight below 1.0... I feel like training purely on WDL is purer than assigning a somewhat subjective eval to a position. What's your experience with the WDL ratio?
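For reference, the WDL ratio is usually applied as a blended training target. This sketch assumes a sigmoid eval-to-probability conversion with a 400 cp scaling constant (both common conventions, not anything specific to Leorik or Sapling):

```python
import math

# Blended training target: interpolate between the game result (WDL)
# and the search eval. lambda_ = 1.0 means pure WDL; lower values mix
# in the (somewhat subjective) eval label. SCALE is an assumed constant.
SCALE = 400.0  # centipawns-to-probability scaling

def sigmoid(cp):
    return 1.0 / (1.0 + math.exp(-cp / SCALE))

def target(wdl, eval_cp, lambda_=1.0):
    """wdl: 1.0 win, 0.5 draw, 0.0 loss, from the side to move."""
    return lambda_ * wdl + (1.0 - lambda_) * sigmoid(eval_cp)

print(target(1.0, 150, lambda_=1.0))   # pure WDL target for a won game
print(target(1.0, 150, lambda_=0.7))   # blended: pulled toward the eval
```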
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess