Devlog of Leorik

Discussion of chess software programming and technical issues.

Moderator: Ras

lithander
Posts: 903
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

op12no2 wrote: Mon Mar 31, 2025 9:13 am Nice. Have you tried SqrRelu? I find it better than Screlu.
The S in SCReLU is "squared", so do you mean squared but not clipped?
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
op12no2
Posts: 539
Joined: Tue Feb 04, 2014 12:25 pm
Location: Gower, Wales
Full name: Colin Jenkins

Re: Devlog of Leorik

Post by op12no2 »

Yeah, sorry, I was using the bullet naming convention: squared but not clipped.

https://github.com/op12no2/lozza/blob/c ... a.js#L1994
lithander
Posts: 903
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

When I tried Squared-Clipped-ReLU vs Clipped-ReLU I couldn't get it to gain at first, because I didn't know how to implement it without losing too much speed. Only when I figured out how to keep using _mm256_madd_epi16 instead of widening to int too early did SCReLU manage to beat CReLU in practice.

So I'm using quantization, which means that instead of using 32-bit floats, values are represented as integers. Everything is multiplied by a quantization factor (e.g. 255) and then rounded to the nearest integer. So if you have float values in the range [0..1] and choose a quantization factor of 255, all these values fit neatly into 8 bits (with a loss of precision, of course).
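
To make that concrete, here's a minimal sketch of the quantization step in C#. The helper names are hypothetical, not Leorik's actual code:

Code: Select all

// Hypothetical helpers illustrating the quantization idea:
// a float in [0..1] scaled by the factor 255 fits exactly into a byte.
static byte Quantize(float v) => (byte)MathF.Round(Math.Clamp(v, 0f, 1f) * 255f);
static float Dequantize(byte q) => q / 255f;

For example, Quantize(0.5f) gives 128, and Dequantize(128) gives back roughly 0.502; that small rounding error is the precision you trade for the compact integer representation.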

When doing NNUE inference you'd normally compute the activation function and then multiply the result with the weight. For SCReLU that looks like this:

Code: Select all

f(x) = clamp(x, 0, 1)^2 * weight

With quantization, however, it's more efficient to do it like this:

Code: Select all

a = clamp(x, 0, 1)
f(x) = (a * weight) * a

And this works only because the clamped a is known to be in the range [0..255]: if you quantize the weights to the range [-127..127], then voilà, the intermediate product a * weight doesn't overflow a short (255 * 127 = 32385 < 32767) and you can use _mm256_madd_epi16 for twice the throughput of what you'd get if you really squared the clipped activation first (255² = 65025 doesn't fit in a signed short, so you'd be forced to widen to int). :shock: :idea:
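
To illustrate the trick, here's a minimal C# sketch using the System.Runtime.Intrinsics.X86 wrappers, where Avx2.MultiplyAddAdjacent is the C# name for _mm256_madd_epi16. The function name and data layout are hypothetical, not Leorik's actual inference code:

Code: Select all

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch of the quantized SCReLU step described above:
// acc holds 16 quantized accumulator values, weights the matching weights.
static Vector256<int> ScreluMaddStep(Vector256<short> acc, Vector256<short> weights)
{
    // a = clamp(x, 0, 255): the quantized version of clamp(x, 0, 1)
    Vector256<short> a = Avx2.Min(Avx2.Max(acc, Vector256<short>.Zero),
                                  Vector256.Create((short)255));

    // a * weight is exact in 16 bits: 255 * 127 = 32385 < 32767
    Vector256<short> aw = Avx2.MultiplyLow(a, weights);

    // Avx2.MultiplyAddAdjacent == _mm256_madd_epi16: multiplies (a * weight)
    // by a, widens to int, and adds adjacent pairs, giving pairwise sums
    // of a^2 * weight in a single instruction.
    return Avx2.MultiplyAddAdjacent(aw, a);
}

The caller would accumulate these Vector256<int> partial sums across the layer and apply the output dequantization at the end.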

...in other words: this is complicated stuff, not because the math is hard, but because you also need to run these billions of multiply & add operations as fast as possible, or the small precision gain from a slightly superior activation function isn't worth the speed loss.

So, in my case I think SqrReLU won't help me improve, because I can't see a way to implement it as fast as quantized SCReLU is currently implemented. I need the clipping and quantization to squeeze everything into shorts!
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess