I'm currently training a new neural network from scratch, with more powerful hardware and a lot more patience. When I implemented NNUE for Leorik 3.0, I used my large repository of labeled positions, originally accumulated for tuning the HCE, so even the first networks I trained were already quite strong.
But there's something really fascinating about starting with nothing: no means to evaluate a position, no prior knowledge other than the basic rules of chess and a decent tree search algorithm, and then watching the engine improve purely through self-play.
Yeah, that's definitely very interesting! I've had similar thoughts about skipping the HCE "base" entirely, to see clean learning.
I'm also implementing NNUE in my engine, but I haven't beaten the HCE yet, probably because I don't have enough training data.
I'm still training new networks. The best net I got so far (v14) has a hidden layer of 256 neurons and was trained on 1402M positions. I have about 2B positions by now but had to change my process a bit to deal with this large amount of data. The code that is supposed to shuffle all the positions loads everything into memory, shuffles, and then writes the shuffled positions into one big file. This is already tricky in .NET because there's a built-in array size limit of 2GB, so I need to work with multiple buffers.
But now I have so many positions that they don't fit into my system's 64GB of RAM. Today I changed the code to work with slices of the data: I load a position, then skip the next X positions. For the next slice I do the same but skip the first position in each file; next I skip the first two, and so on until offset == X. That way I can split my data into equally sized slices that individually fit into memory.
In bullet I can sequentially load these slices, and each slice contains roughly the same number of positions, equally distributed over the entire set but without duplicates. So now my bottleneck is really only disk space and the time it takes to generate more positions from self-play.
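Read literally, the slicing scheme works out to dealing positions round-robin into slices. A Python toy of the idea (in-memory lists standing in for the position files, function names my own; Leorik itself is C#):

```python
import random

def split_into_slices(positions, num_slices):
    """Deal positions round-robin: slice k gets every position whose index i
    satisfies i % num_slices == k. Each slice samples the whole set evenly,
    and no position appears in more than one slice."""
    slices = [[] for _ in range(num_slices)]
    for i, pos in enumerate(positions):
        slices[i % num_slices].append(pos)
    return slices

def shuffle_slice(slice_, seed=None):
    """Shuffle one slice in memory; each slice is small enough to fit in RAM."""
    rng = random.Random(seed)
    rng.shuffle(slice_)
    return slice_
```

Because the stride visits every index exactly once per offset, concatenating the slices recovers the full set without duplicates.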
Minimal Chess (simple, open source, C#) - Youtube & Github Leorik (competitive, in active development, C#) - Github & Lichess
lithander wrote: ↑Sun Feb 16, 2025 5:41 pm
The code that is supposed to shuffle all the positions loads everything into memory, shuffles and then writes the shuffled positions into one big file. ... That way I can split my data into equally sized slices that individually fit into memory.
The shuffle functionality in bullet-utils will handle this by creating smaller temporary files, shuffling them individually and then interleaving them - this is equivalent to shuffling the whole file.
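A minimal sketch of that chunk-and-interleave idea (a hypothetical in-memory Python stand-in; bullet-utils itself is Rust and uses temporary files where this uses lists):

```python
import random

def external_shuffle(records, chunk_size, seed=0):
    """Split the data into chunks that fit in memory, shuffle each chunk,
    then interleave by repeatedly drawing the next record from a chunk
    chosen with probability proportional to how many records it still
    holds. This never requires the whole set in shuffled form at once."""
    rng = random.Random(seed)
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    for chunk in chunks:
        rng.shuffle(chunk)
    out = []
    while chunks:
        # pick a chunk weighted by its remaining size, take its last record
        total = sum(len(c) for c in chunks)
        r = rng.randrange(total)
        for c in chunks:
            if r < len(c):
                out.append(c.pop())
                break
            r -= len(c)
        chunks = [c for c in chunks if c]  # drop exhausted chunks
    return out
```

Weighting the draw by remaining chunk size is what makes the interleaving equivalent to a uniform shuffle of the whole file.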
lithander wrote: ↑Sun Feb 16, 2025 5:41 pm
I'm still training new networks. The best net I got so far (v14) was trained on 1402M positions. I have about 2B positions by now ...
I'm just guessing: so you take a billion positions, evaluate them using the NN, and compare the evaluation with the result of the game. Then you modify the weights in the NN and run it again in hopes of getting a better correlation between evaluations and results. Is that anywhere close to being correct? Why not overlay ten million higher-quality games into a tree structure on the hard drive, keeping track of wins, draws, and losses? The accumulated data would be smaller and persistent, and adding more games would be easy. Then every training position wouldn't just have one result associated with it but could have thousands of results associated with it. Does any of this make sense? Help me understand.
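The accumulation part of that proposal is easy to sketch. A hypothetical Python toy (the function names and the idea of keying positions by their FEN string are my own, not anyone's actual code):

```python
from collections import defaultdict

def aggregate_wdl(games):
    """games: iterable of (positions, result) pairs, where result is
    1.0 / 0.5 / 0.0 from White's point of view. Returns a mapping
    position -> [wins, draws, losses] accumulated over all games."""
    stats = defaultdict(lambda: [0, 0, 0])
    for positions, result in games:
        idx = 0 if result == 1.0 else (1 if result == 0.5 else 2)
        for pos in positions:
            stats[pos][idx] += 1
    return stats

def empirical_score(wdl):
    """Turn accumulated counts into a single training target in [0, 1]."""
    w, d, l = wdl
    return (w + 0.5 * d) / (w + d + l)
```

A position reached in many games would then carry an empirical score instead of a single game result.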
Mike Sherwin wrote: ↑Mon Feb 17, 2025 10:15 pm
So you take a billion positions ... Why not overlay ten million higher quality games into a tree structure on the hard-drive
10 million games with, say, 80 half-moves per game is 800 million positions... hardly much smaller. EDIT: even if organised into a tree structure, which will not reduce the size that much.
Mike Sherwin wrote: ↑Mon Feb 17, 2025 10:15 pm
And then modify the weights in the NN and run it again in hopes of getting a better correlation between evaluations and results. Is that anywhere close to being correct?
"In hopes" is not a characterisation I would use... gradient descent, and the NN training methods built on top of it, are not some roll of the dice in terms of whether they work.
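To illustrate, here is the smallest possible version of that training loop: one weight, a handful of made-up (eval, result) pairs, and plain gradient descent on the squared error between sigmoid(w * eval) and the result. The loss decreases deterministically; no dice involved.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(data, lr=0.5, steps=200):
    """Fit a single weight w so sigmoid(w * eval) matches the game result,
    by gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for x, y in data:
            p = sigmoid(w * x)
            grad += (p - y) * p * (1.0 - p) * x  # chain rule: d(MSE)/dw for one sample
        w -= lr * grad / len(data)
    loss = sum((sigmoid(w * x) - y) ** 2 for x, y in data) / len(data)
    return w, loss

# made-up (eval, result) pairs: positive evals tend to win, negative to lose
data = [(2.0, 1.0), (0.5, 0.5), (-1.5, 0.0), (3.0, 1.0), (-0.2, 0.5)]
w, loss = train(data)
# the untrained model (w = 0) has loss 0.15 on this data; training drives it lower
```

Real NNUE training does the same thing with millions of weights and minibatches, but the mechanism is just as systematic.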
Mike Sherwin wrote: ↑Mon Feb 17, 2025 10:15 pm
Why not overlay ten million higher quality games into a tree structure on the hard-drive keeping track of wins draws and losses. The accumulated data would be smaller and persistent and adding more games would be easy. Then every training position won't just have one result associated with it but could have thousands of results associated with it.
And how would this be used to evaluate a position that is not in the tree? It is not clear what you are suggesting, other than requiring the unrealistic distribution of a massive amount of data to users of an engine.
Why distribute the tree? It is just used to train the network. And having a thousand results (w, d, l) for a position can help train the weights even faster and better. And lines in the tree can be deleted at a blunder that does not match the result. The tree itself can be trained and then used to train the NN.
Mike Sherwin wrote: ↑Mon Feb 17, 2025 10:59 pm
Why distribute the tree? It is just used to train the network. And having a thousand results (w, d, l) for a position can help train the weights even faster and better. And lines in the tree can be deleted at a blunder that does not match the result. The tree itself can be trained and then used to train the NN.
It seemed to me that you were trying to suggest something distinct, but I misunderstood you. In that case, though, you aren't suggesting anything particularly new other than a more compressed way to store the data: the one billion positions originally cited are not all distinct; there are repeats. And adding new positions to the dataset is trivial. As for deleting lines at a blunder, there already exist binpack formats that store game records much like PGNs do, so this can already be done.
JacquesRW wrote: ↑Mon Feb 17, 2025 1:34 am
The shuffle functionality in bullet-utils will handle this by creating smaller temporary files, shuffling them individually and then interleaving them - this is equivalent to shuffling the whole file.
I'm also running low on disk space, so I'm glad my solution gets away without temporary files. This is because I do the interleaving when I read the input into RAM. The result is not exactly the same as shuffling the whole file, but it's not worse for training: positions from each game are fairly distributed over the slices, and the slices are then shuffled!
Since I started using memory-mapped files it takes 15 minutes to process 66GB of data into 3 slices. The actual shuffling takes 4 minutes; reading the files three times and writing the 3 shuffled slices to the SSD takes the bulk of the time. All in all it seems reasonably fast.
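The memory-mapped approach can be sketched in Python like so (the record layout here is entirely made up for illustration; the point is that the mmap lets the code treat the file as one big byte array, so picking record i is just a slice and the OS handles paging):

```python
import mmap
import os
import random
import struct

# hypothetical fixed-size training record: 32 packed bytes + a 16-bit score
RECORD = struct.Struct("<32sH")

def shuffle_records(path, out_path, seed=0):
    """Memory-map a file of fixed-size records and write them out in a
    shuffled order. Each record read is a plain slice into the mapping."""
    count = os.path.getsize(path) // RECORD.size
    order = list(range(count))
    random.Random(seed).shuffle(order)
    with open(path, "rb") as f, open(out_path, "wb") as out:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for i in order:
                out.write(mm[i * RECORD.size:(i + 1) * RECORD.size])
```

Leorik's actual implementation is C# (`MemoryMappedFile`), but the access pattern is the same idea.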
Over the past weeks I've continued to run self-play games for my training run. I'm currently training with 160GB of labeled positions and will stop when I reach 5B positions from the 13th generation and later; then I'll prepare the release of Leorik 3.1.
The results of training NNUE nets from zero have vastly exceeded my expectations. In the 16th generation, a net of the same size and architecture was already vastly superior to the net Leorik 3 shipped with:
Score of Leorik-3.0.11v16 vs Leorik-3.0.11ref: 909 - 364 - 1201 [0.610] 2474
Elo difference: 77.8 +/- 9.8, LOS: 100.0 %, DrawRatio: 48.5 %
Since then I have started to use larger hidden layers (256 HL -> 640 HL: +30 Elo) and changed the activation function from CReLU to SCReLU (+17 Elo), both of which improved the static evaluation further. But with these changes the new networks are no longer directly comparable to the reference.
I also managed to make a few search improvements and hope that the upcoming release will be significantly stronger than version 3.0, which released over a year ago and has since fallen about 20 places in the CCRL due to the strong competition in active development. I hope the update will let Leorik climb back into the top 100!
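For reference, the two activations as they are commonly defined in the NNUE community: CReLU clamps the input to [0, 1], and SCReLU squares the clamped value, which keeps the output range but changes the gradient shape. A Python sketch (trainers apply these elementwise to the hidden-layer outputs):

```python
def crelu(x):
    """Clipped ReLU: clamp the activation to the range [0, 1]."""
    return min(max(x, 0.0), 1.0)

def screlu(x):
    """Squared clipped ReLU: square the clamped value. Still bounded by
    [0, 1], but smooth at zero, which often trains a bit better."""
    c = min(max(x, 0.0), 1.0)
    return c * c
```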