The S in SCRelu is "squared", so do you mean squared but not clipped?
Devlog of Leorik
Moderator: Ras
-
- Posts: 903
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Devlog of Leorik
-
- Posts: 539
- Joined: Tue Feb 04, 2014 12:25 pm
- Location: Gower, Wales
- Full name: Colin Jenkins
Re: Devlog of Leorik
Yeah, sorry I was using the bullet naming convention; squared but not clipped.
https://github.com/op12no2/lozza/blob/c ... a.js#L1994
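For reference, squared-but-not-clipped (SqrReLU) just squares the plain ReLU output; a hypothetical one-liner in C, not Lozza's actual code:

Code: Select all

/* SqrReLU: square the ReLU output without clamping it to [0, 1] first */
static inline float sqrrelu(float x) { return x > 0.0f ? x * x : 0.0f; }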
-
- Posts: 903
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Devlog of Leorik
When I tried Squared-Clipped-Relu vs Clipped-Relu I couldn't get it to gain at first, because I didn't know how to implement it without losing too much speed. Only when I figured out how to keep using _mm256_madd_epi16 instead of widening to int too early did SCRelu manage to beat CRelu in practice.
So I'm using quantization, which means that instead of floats (32 bit) the values are represented with integers. Everything is multiplied by a quantization factor (e.g. 255) and then rounded to the nearest int. So if you have float values in the range [0..1] and choose a quantization factor of 255, all these values now neatly fit into 8 bits (with a loss of precision, of course).
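As a minimal sketch of that quantization step (the activation factor of 255 is from the example above; the weight factor and the function names below are made-up placeholders, chosen only so the quantized weights land in [-127..127]):

Code: Select all

#include <math.h>
#include <stdint.h>

#define QA 255   /* activation quantization factor (example from the post) */
#define QW 64    /* weight quantization factor: an assumption, not Leorik's value */

/* Map a float activation in [0..1] to an integer in [0..255]. */
static int16_t quantize_activation(float x)
{
    return (int16_t)lroundf(x * QA);
}

/* Map a float weight to an integer in [-127..127]; outliers are clamped
   so that a later activation * weight product cannot overflow an int16. */
static int16_t quantize_weight(float w)
{
    long q = lroundf(w * QW);
    if (q >  127) q =  127;
    if (q < -127) q = -127;
    return (int16_t)q;
}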
When doing NNUE inference you'd normally compute the activation function and then multiply the result with the weight. For SCRelu you do this:
Code: Select all
f(x) = clamp(x, 0, 1)^2 * weight
With quantization, however, it's more efficient to do it like this:
Code: Select all
a = clamp(x, 0, 1)
f(x) = (a * weight) * a
And this works only because the clamped a is known to be in the range [0..255]: if you quantize the weights to be in the range [-127..127], then voilà, the result doesn't overflow a short and you can use _mm256_madd_epi16 for twice the throughput of what you'd get if you really squared the clipped activation first.
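To illustrate the trick, here is a minimal AVX2 sketch in C (not Leorik's actual code; the function name, data layout and hidden-size handling are assumptions): the accumulator values and weights are already quantized 16-bit integers, the clamp keeps a in [0..255], the int16 multiply a * weight cannot overflow because 255 * 127 = 32385 < 32767, and _mm256_madd_epi16 then performs the second multiplication by a while widening to 32 bits and adding adjacent pairs.

Code: Select all

#include <immintrin.h>
#include <stdint.h>

/* SCRelu dot product over n hidden units (n divisible by 16),
   using quantized int16 accumulator values and weights. */
int32_t screlu_dot(const int16_t *acc, const int16_t *weights, int n)
{
    const __m256i zero = _mm256_setzero_si256();
    const __m256i qa   = _mm256_set1_epi16(255);      /* quantized "1.0" */
    __m256i sum = _mm256_setzero_si256();

    for (int i = 0; i < n; i += 16) {
        __m256i a = _mm256_loadu_si256((const __m256i *)(acc + i));
        __m256i w = _mm256_loadu_si256((const __m256i *)(weights + i));

        /* quantized CRelu: clamp(x, 0, 255) */
        a = _mm256_min_epi16(_mm256_max_epi16(a, zero), qa);

        /* a * weight stays inside int16: |a * w| <= 255 * 127 = 32385 */
        __m256i aw = _mm256_mullo_epi16(a, w);

        /* (a * weight) * a, widened to int32 and pairwise summed by madd;
           for typical hidden sizes the int32 lanes do not overflow */
        sum = _mm256_add_epi32(sum, _mm256_madd_epi16(aw, a));
    }

    /* horizontal sum of the 8 int32 lanes */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(sum),
                              _mm256_extracti128_si256(sum, 1));
    s = _mm_add_epi32(s, _mm_srli_si128(s, 8));
    s = _mm_add_epi32(s, _mm_srli_si128(s, 4));
    return _mm_cvtsi128_si32(s);
}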


...in other words: this is complicated stuff not because the math is hard, but because you also need to run these billions of multiply & add operations as fast as possible, or the small precision gain from a slightly superior activation function isn't worth the speed loss.
So, in my case I think SqrRelu won't help me improve, because I can't see a way to implement it as fast as the quantized SCRelu is currently implemented. I need the clipping and quantization to squeeze everything into shorts!