Devlog of Leorik

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

Over the last few weeks (thanks to the Christmas vacation) I have found more time for chess programming than usual.

A while ago I talked about how Leorik's evaluation is tuned on selfplay data. So far the process was to have two instances of my engine play a match using cutechess. I would then parse the resulting PGN files to extract positions and label each with the outcome of the game.

With my work in the bigdata branch I now have a datagen executable that no longer relies on cutechess and verbose text files. It runs the selfplay games internally and serializes the playouts into a binary file. I adopted the Marlinflow format, which saves me a bit of space compared to a txt log (though a text log is only about 60% larger) and is quicker to parse. I can load millions of labeled positions into main memory, but when I convert them into feature vectors for my tuning the memory requirement per position is a lot higher (~12K floats per position), so I can't tune on all the labeled positions that I have at the same time. I've dealt with this by loading only 4M positions into memory and swapping out a percentage of them after each tuning iteration.

But I still use the entirety of positions to compute the mean squared error (MSE). So you could say that the rest serves as a validation set to prevent over-fitting to the selected positions. I only update my best coefficients when, after a tuning iteration, the MSE on the total set has improved. That means that when I start from scratch, at the beginning all iterations produce an improvement, and after 100 iterations or more only about one in 20 does. But these improvements are always legitimate and not due to over-fitting.
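The swap-and-validate loop described above can be sketched in miniature (an illustrative Python toy with a 1-parameter eval and made-up names, not the actual C# tuner):

```python
import random

def mse(w, positions):
    # mean squared error of a toy 1-parameter "eval" w*x against the labels
    return sum((w * x - y) ** 2 for x, y in positions) / len(positions)

def tune_step(w, subset, lr=0.5):
    # one gradient-descent step computed only on the currently loaded subset
    grad = sum(2.0 * (w * x - y) * x for x, y in subset) / len(subset)
    return w - lr * grad

def tune(all_positions, iterations=100, subset_size=50, swap_fraction=0.25):
    subset = random.sample(all_positions, subset_size)
    w, best_w = 0.0, 0.0
    best_mse = mse(w, all_positions)
    for _ in range(iterations):
        w = tune_step(w, subset)
        m = mse(w, all_positions)      # validate on the *entire* set
        if m < best_mse:               # accept only genuine improvements
            best_mse, best_w = m, w
        k = int(len(subset) * swap_fraction)
        subset[:k] = random.sample(all_positions, k)  # swap out a percentage
    return best_w, best_mse
```

The full set acts as the validation set exactly as described: a step that only fits the loaded subset but worsens the overall MSE is simply not kept.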

Computing the MSE on positions that are not converted to a feature vector means that I have to use Leorik's own evaluation for this. I do this by copying over the coefficients on the fly. A nice side effect is that this provides a more accurate prediction of the "real" evaluation.

The tuning process has become a lot slower because of all this (mostly from generating gigabytes of feature vectors after each iteration), but it consistently produces good results from scratch!

I work with around 50M positions currently, and adding more data just doesn't seem to improve my tuning results any further. I assume my eval approach is saturated. But the fact that I can derive these strong weights from my newly generated data tells me that the new integrated datagen and filtering approach is just as viable as running selfplay games in cutechess was.

In other words I feel ready to transition Leorik to a NNUE based eval. The steps I plan to take here are:

1.) take an existing net and frankenstein it onto Leorik to figure out the inference part
2.) train a small net using my own data based on 100% HCE Leorik selfplay (see above)
3.) use this neural net powered version of Leorik to generate more data and transition to larger architectures

I have already started with step one and can say that NNUE is less complicated than I feared.

I have chosen Stormphrax net010 as a starting point. https://github.com/Ciekce/Stormphrax/tr ... 6/src/eval

If you understand the general concept behind a simple neural network with one hidden layer, and you think about how to adapt that to evaluating chess positions, then a lot of the implementation details seem logical.

Like: Why 768 input features? Well, that's just 6x2x64... we have 6 piece types, 2 colors and 64 squares, and which piece sits on which square is what you have to consider when you want to evaluate a position. Why the accumulator? Because the input vector is sparse (contains lots of zero features), and it lets you skip a lot of pointless multiplications that you know will yield zero. Why shorts instead of floats? The features are all 1 or 0, so floating-point operations are not really necessary. You add weights up, and if the weights fit into shorts you save space and gain performance...
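The 768-input bookkeeping can be sketched like this (illustrative Python; the exact ordering of color, piece and square is an assumption here, engines differ):

```python
def feature_index(color: int, piece: int, square: int) -> int:
    """color: 0=white, 1=black; piece: 0..5 (pawn..king); square: 0..63."""
    return color * 384 + piece * 64 + square  # 2 * 6 * 64 = 768 features

def to_features(pieces):
    """pieces: iterable of (color, piece, square) -> sparse 768-entry vector."""
    v = [0] * 768
    for color, piece, square in pieces:
        v[feature_index(color, piece, square)] = 1
    return v
```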

As an absolute beginner I would have trouble grasping the concepts, but from where I am with Leorik at the moment this evaluation approach feels like the logical next step.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Frank Quisinsky
Posts: 6821
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: Devlog of Leorik

Post by Frank Quisinsky »

Hi Thomas,

Interesting to read, even for non-programmers.
I wish you good luck with neural networks.

Thinking out loud, hope you don't mind ...
It would be great if your old classic-eval does not get lost.
Maybe users can later switch between classic-eval and Neural-Network as a UCI option.

For the stronger HCE engines I think that is important, and after all I saw that your classic version has an interesting playing style. I watched it a bit over the Christmas days against the latest Wasp with classical eval.

Again ...
Good luck!

Best
Frank
User avatar
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

Frank Quisinsky wrote: Wed Jan 10, 2024 3:35 pm Thinking out loud, hope you don't mind ...
It would be great if your old classic-eval does not get lost.
Maybe users can later switch between classic-eval and Neural-Network as a UCI option.

For the stronger HCE engines I think that is important, and after all I saw that your classic version has an interesting playing style. I watched it a bit over the Christmas days against the latest Wasp with classical eval.
Thanks for your interest and kind words, Frank!

The first version of Leorik with NNUE-based eval will be released as version 3.0, so there is the opportunity to further support the HCE-based version of Leorik by releasing a version 2.6 at a later time. Similarly, I also plan to release a version 1.1 in the near future, because I think Leorik 1 had some interesting properties that got lost with version 2.

So basically each major version is distinct enough from its successor that it deserves to be maintained.

What I don't want to do, though, is create an "Eierlegende Wollmilchsau" (an egg-laying wool-milk-sow, i.e. a do-everything hybrid), because keeping the code clean, readable and uncluttered is an important goal of mine. So HCE and NNUE will not coexist in the same branch of the repository. It would mean I'd have to run even more tests, and testing is already one of the biggest bottlenecks...
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Devlog of Leorik

Post by mvanthoor »

lithander wrote: Thu Jan 18, 2024 3:10 pm so there is the opportunity to further support the HCE based version of Leorik by releasing version 2.6
Couldn't you make an interface for the evaluation and then support both with a command-line parameter?

I used traits for my communication module (called IComm), and it has some functions that enable the engine to support both UCI and XBoard without the actual engine knowing about it (except for the fact that it has a UCI handler and an XBoard handler, obviously). The only addition to the engine is that it switches states per the XBoard protocol, but that is nothing more than setting a variable using an enum. The UCI handler doesn't take the state into account; the XBoard handler does.

I'm planning to do the same with search, implementing MCTS (monte-carlo) at some point, and evaluation (NNUE) in the future. That way you could have an XBoard engine running MCTS with NNUE, or a UCI engine running Alpha/Beta with HCE. Effectively 6 engines in one :)
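The interface idea can be sketched like this (illustrative Python with abstract base classes standing in for Rust traits; all names are hypothetical):

```python
from abc import ABC, abstractmethod

class Evaluator(ABC):
    """Hypothetical evaluation interface the engine calls into."""
    @abstractmethod
    def evaluate(self, position: dict) -> int: ...

class HceEval(Evaluator):
    def evaluate(self, position: dict) -> int:
        return position["material"]        # stand-in for a real HCE

class NnueEval(Evaluator):
    def evaluate(self, position: dict) -> int:
        return position["material"] + 10   # stand-in for net inference

def make_evaluator(name: str) -> Evaluator:
    # selected once at startup, e.g. via command-line parameter or UCI option
    return {"hce": HceEval, "nnue": NnueEval}[name]()
```

The search only ever sees the `Evaluator` interface, so it never needs to know which evaluation is active.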

I'm guessing that NNUE is going to keep me stuck for another 3 years (just like Texel) until I unearth the meaning of one particularly arcane (mathematical) expression, and then it's going to be rather easy to implement...
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
User avatar
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

lithander wrote: Mon Jan 08, 2024 5:56 pm I feel ready to transition Leorik to a NNUE based eval. The steps I plan to take here are:

1.) take an existing net and frankenstein it onto Leorik to figure out the inference part
2.) train a small net using my own data based on 100% HCE Leorik selfplay (see above)
3.) use this neural net powered version of Leorik to generate more data and transition to larger architectures

I have already started with step one and can say that NNUE is less complicated than I feared.

I have chosen Stormphrax net010 as a starting point. https://github.com/Ciekce/Stormphrax/tr ... 6/src/eval
Since my last update I pretty much did what I announced above.
The first step was to write all the code necessary to replace Leorik's HCE evaluation with a NNUE one, specifically I wanted to use Stormphrax net010.

The network architecture is pretty simple. Surprisingly simple. The information I had (like this book) described the network architecture used by Stockfish 12, and based on the evolution of the network architectures Stockfish uses, it's not exactly getting any simpler:
Image
:shock: :shock: :shock: :shock: :shock: :roll: :? :?:

But in reality a lot of NNUE engines use something far simpler than the simplest Stockfish architecture. Instead of using HalfKP with a staggering 81920 input features you can get away with only 768: we dedicate one input to each possible piece-square combination. 6 piece types * 2 colors * 64 squares == 768. Piece-square tables, the simplest viable evaluation method that every chess programmer knows, use the same idea.
And "input feature" is an exaggerated term anyway: if you look at the array in practice, all you see is a bunch of 1s in a sea of 0s.

So we can agree that the input part is pretty trivial. But what about the hidden layers? The 'neurons'?
In Stockfish 12 you had three hidden layers with 576 neurons in total. It's important to consider that in a fully connected layer each neuron has a connection (synapse) to each neuron of the previous layer. So while 576 neurons doesn't sound so bad, with 3 hidden layers it's still a lot of synapses and somewhat complicated to train. And as mentioned above, Stockfish never simplified but went beyond that initial design in complexity...

But other engines? They often use just a single hidden layer that is fully connected to the 768 inputs. Even if you put 1024 neurons in that layer you end up with far fewer synapses than Stockfish 12 had. So, is this simple approach actually viable? How much Elo does it add over my previously used HCE?
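A net like that needs surprisingly little code. Here is a plain-Python sketch of the forward pass (single perspective only; real (768->N)x2->1 nets keep one accumulator per side, use int16 weights and a clipped activation, so treat the plain ReLU and the names here as illustrative assumptions):

```python
def forward(active_features, w1, b1, w2, b2):
    """Evaluate a (768 -> H) -> 1 network on a sparse input.

    active_features: indices of the features that are 1 (pieces on the board)
    w1: 768 x H input weights   b1: H hidden biases
    w2: H output weights        b2: output bias
    """
    hidden = len(b1)
    acc = list(b1)                     # the "accumulator"
    for f in active_features:          # only the <= 32 active features matter
        for j in range(hidden):
            acc[j] += w1[f][j]
    out = b2
    for j in range(hidden):
        out += max(0, acc[j]) * w2[j]  # ReLU, then the output layer
    return out
```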

To find out I converted the evaluation to C# (lots of for-loops) and was happy when the evaluation started to spit out the right numbers. But my engine's speed was abysmal: I was used to 5M nodes per second and now I was down to 50K nps. My old eval was 100x faster, and no eval in the world can be good enough to compensate for that (well, maybe Lc0's can^^), so... what did I do wrong?

The first problem was that I had ignored the EU part of NNUE: the efficient updates of the accumulator. When a piece moves, the neural network state doesn't have to be recomputed from scratch but can be created more cheaply from a copy of the previous state: subtract a range of weights from the accumulator for the piece-square feature that is no longer present, then add a range of weights for the new piece-square feature that represents the piece on its new square after the move.
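The incremental update can be sketched like this (illustrative Python with hypothetical names; a quiet move removes one piece-square feature and adds another):

```python
def update_accumulator(acc, w1, removed_feature, added_feature):
    """Incrementally update the hidden-layer accumulator after a quiet move
    instead of recomputing it over all pieces from scratch.

    acc: accumulator (one entry per hidden neuron)
    w1:  768 x H input weight matrix
    """
    new_acc = list(acc)  # engines keep a copy of the previous state per ply
    for j in range(len(new_acc)):
        new_acc[j] -= w1[removed_feature][j]  # piece left its old square
        new_acc[j] += w1[added_feature][j]    # piece arrived on the new one
    return new_acc
```

Captures, castling and promotions just touch a few more feature rows in the same way.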
The other problem was, of course, that the C# compiler translates a for-loop into IL code quite literally, while a C++ compiler recognizes the potential for vectorization and emits AVX2 instructions automatically.

I had already handwritten AVX2-based code for my hand-crafted evaluation (as readers of this devlog may know), so I started to do the compiler's job and told the CPU very explicitly how I thought it should do the necessary computations.

If you read C# here's the code

Code: Select all

https://github.com/lithander/Leorik/blob/nnue/Leorik.Core/Evaluation/NeuralNetEval.cs 
but for the non-programmers I can summarize it like this: you have to do a lot of multiplications and additions, and modern CPUs can perform such a basic instruction on multiple numbers at the same time. They have 256-bit wide registers, and if the numbers you want to add and multiply can be encoded as 16-bit integers, you can handle 16 of them at once. So in theory there's room for a 16x speedup, and in practice I got a 13x speedup out of it, which is pretty amazing to me! (Considering all the overhead and reduced clock rates I expected less.)

How fast is the NNUE eval now? It scales linearly with the number of neurons in the hidden layer. Stormphrax uses (768->768)x2->1, and at that size it's still slower than my HCE, but with just 16 neurons it's even faster.

And how strong is it? When using the Stormphrax net, which was trained on billions of positions, the new Leorik was 300 Elo stronger than the HCE one!! Awesome, considering how simple this evaluation approach is! Yes, simple, I mean it: if you understand tapered PSQTs, already use incremental updates, and know how to find the values with gradient descent, then you have understood all the main concepts needed to understand what powers a NNUE architecture as described above.

So with the inference part solved, the next step was to find my own network weights (the strengths of the synapses connecting the input with the hidden layer and the hidden layer with the output) instead of relying on a 3rd-party network.

My HCE was trained completely from scratch using only selfplay games, so I already had a bunch of training data lying around: unfiltered playouts stored in a binary format based on the Marlinflow format (I wrote about it before).
I decided to use bullet for the training. It's pretty common to use a 3rd-party GPU trainer, because this is a fairly generic machine-learning task and not very engine-specific. Berserk, for example, used its own trainer, then the one from Koivisto, and now Grapheus. And bullet is the trainer that the folks on the Engine Programming Discord use, so that's what I used too.

After solving the pipeline issues (getting my playouts into the right data format) I could train networks of different layer sizes. The smallest one I tried had only 16 neurons and was blazingly fast to train on the 33M labeled positions I had (about 1 minute using bullet). It was also really fast in my engine: I got 5.7M nps, which was better than my HCE. And I already saw +50 Elo compared to my HCE with just 16(!) neurons!

Now I expected that as I increased the layer size I would initially see an increase in playing strength, until my data wasn't plentiful and diverse enough to saturate the rapidly growing number of synapses in any meaningful way. And that is indeed what I saw: 64 and 128 hidden neurons were the peak at +230 Elo, then it got worse again (512 -> +180 Elo).

I know that at least +300 Elo should be possible, because that's what I saw with the "borrowed" network. But I have so many knobs and levers to tweak now, in the data generation, the filtering and finally the tuner, that there's endless potential to fiddle around searching for sweet spots.

Currently I'm in the unfortunate situation that I lost the recipe for my best dataset (27GB in size) because I tried too many things at once and documented them poorly. But generally I'm confident that I can achieve the +300 Elo with 256 hidden neurons.

So version 3.0 of Leorik will be released with a (768->256)x2->1 network. Once I have that and it's performing solidly, I'll also take a pass at the search code. Some eval-based cutoff values need to be retuned, and there are some old ideas of mine that failed back then that I want to try again, like sorting quiet moves with a bad history by their static eval or QSearch results...

So Leorik 3.0 isn't coming soon but it's coming and I expect a big jump in playing strength.

Thanks a lot JacquesRW & Ciekce for helping me figure it all out!
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Frank Quisinsky
Posts: 6821
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: Devlog of Leorik

Post by Frank Quisinsky »

Hi Thomas,

again, a lot of interesting reading.
I like such messages so much ...

But in the end ...
I am fairly sure that the neural network is a bit of an illusion when we claim it makes an engine X Elo stronger. I often read, here and there, that Stockfish is 300-400 Elo stronger with the neural network; from the Dragon developers I also read that the neural network gives more than 300 Elo, and the same from many others. Sorry, but all this is not right.

For blitz time controls the neural network often gives such results, but if you switch to longer time controls the advantage is really much smaller. And most, or I think all, of what is available was measured at blitz or maybe only slightly slower.

You can see this in my 66+6 tournament (250-minute games).
At the start is Stockfish 200731 (the last version without a neural network, from before the first version with a neural network was available).
The difference to the current Dragon 3.3 NN (Komodo) is not more than 100 Elo.
After about 230 games SF has lost only 8 games.

In my 40-in-20 tournament (90-minute games) SF 16 is 5 Elo stronger than Dragon 3.3 NN.
Maybe the current Stockfish is at most 20 Elo stronger than SF 16, so the difference after about 4 years of Stockfish development is roughly 120 Elo at the 2.5-hours-per-game rule. In 3-4 months, with more games and a SF 17 test (if available) against the same group of engines, it should be clearer.

And since we know this, it is quite clear that the difference for 500-minute or 1000-minute games is again much smaller. Possibly it goes down to a 0-50 Elo advantage, I think, while the draw rate goes higher and higher.

The neural network is good, important and nice to have. But it is not more than a performance boost if we look a bit deeper.

I often let Wasp 6.50 NN analyze complicated positions overnight. Wasp 6.50 NN is a program that gets better and better with more time; that is easy to see if you compare its blitz results with its results at longer time controls. I often think the same about the other strong tactical engines. After a long analysis I often find that its final result is much better than the result from Stockfish.

Example:
If the difference from Wasp 6.50 NN to RubiChess 20240124 NN is 210 Elo in blitz, it is no wonder that the difference at 66+6 is around 100-110 Elo. What I want to say is that this does not only have to do with the neural network. But it is very difficult to see that the neural-network advantage melts away in really big steps. And with an engine as strong as Stockfish I think it is very easy to see, because the engine was incredibly strong without a neural network.

Is the neural network a playing-strength illusion?
I think so!

Best
Frank
JacquesRW
Posts: 74
Joined: Sat Jul 30, 2022 12:12 pm
Full name: Jamie Whiting

Re: Devlog of Leorik

Post by JacquesRW »

Frank Quisinsky wrote: Sun Jan 28, 2024 8:06 pm I am very sure that Neural-Network is a bit of an illusion if we write or wish us x Elo stronger. Often i read here and there, with Neural-Network Stockfish is 300-400 Elo stronger, also from Dragon developers I read Neural-Network gives more as 300 Elo and from so many others. Sorry, but all this is not right.
Anyone can pick test conditions that are as drawish as possible and thereby make the Elo difference approach 0. You are then measuring the drawishness of the game, not the strength of neural networks.
Frank Quisinsky
Posts: 6821
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: Devlog of Leorik

Post by Frank Quisinsky »

Hi Jamie,

yes, I think that is true for more than 50-70%, but it does not explain various other things. A very complicated topic that should not be discussed in the Leorik thread, but after reading that nice message by Thomas it was the first thing I thought of.

At the moment I am running some experiments with machines over the network (machines I can use from a school room). I am letting Stockfish 200731 play against Stockfish 200731 at different time controls, 4+2 vs. 66+6, and the same with the Stockfish 240121 NN dev version. I am not sure; maybe the idea is the wrong one to get more clarity.

Best
Frank

PS: Better not to start such experiments, because I will never get a result that resolves the doubts.
User avatar
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

When I say +300 Elo I mean to summarize results like this:

Code: Select all

Score of Leorik-NNUE vs Leorik-2.5.6: 2155 - 124 - 646  [0.847] 2925
...      Leorik-NNUE playing White: 1136 - 38 - 289  [0.875] 1463
...      Leorik-NNUE playing Black: 1019 - 86 - 357  [0.819] 1462
...      White vs Black: 1222 - 1057 - 646  [0.528] 2925
Elo difference: 297.5 +/- 13.2, LOS: 100.0 %, DrawRatio: 22.1 %
Here you can see that the NNUE version is destroying Leorik 2.5.6 at very fast time controls (5s + 0.1s increment).
I measure the quality of a network candidate by how much it beats my HCE (among others). And as you can see, the draw ratio is only 22%, so there's no doubt that the NNUE eval is much stronger than Leorik's HCE.
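For reference, the Elo difference in such a report follows directly from the match score via the usual logistic model (a sketch of the standard conversion, not cutechess's actual source):

```python
import math

def match_score(wins: int, losses: int, draws: int) -> float:
    # score from the perspective of the first engine, draws counting half
    return (wins + 0.5 * draws) / (wins + losses + draws)

def elo_from_score(score: float) -> float:
    # invert the logistic expectation: score = 1 / (1 + 10^(-elo/400))
    return -400.0 * math.log10(1.0 / score - 1.0)

# e.g. 2155 wins, 124 losses, 646 draws -> score ~0.847 -> ~297 Elo
```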

Only when two very strong engines meet do draws from balanced positions become a problem, but I think that's a property of chess: if both sides play near-perfect moves there won't be a winner.

To combat that it has become common to start from unbalanced openings (like UHO), and to make it fair again each opening is played twice so each engine gets to play both white and black. Then you can watch these superhuman chess entities convert tiny advantages into astonishing wins.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess