SPCC: New Super 3 Tournament started

Posts: 1819
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: SPCC: New Super 3 Tournament started

Post by AndrewGrant

Late to the party, but going from float32 -> int16, or to int8, is not the most meaningful and profound thing. Of course it is good, but it is not without downsides. If SSSE3, AVX2, ..., were to magically not exist for integers, NNUE would still be good.

Using int16 in the FT can be considered a pretty easy doubling of speed. But using int16 for the first layer would not be, and might even be nearly equal to the speed of floats (!). This is because the int16 multiplies produce int32 results, and there is overhead associated with summing in int32 instead of int16. Floats, on the other hand, have FMA, which gets all of that done quite nicely, even if it only operates on half as many inputs at once.
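To make the widening point concrete, here is a rough scalar sketch, not taken from any particular engine. The function names dot_i16 and dot_f32 are made up for illustration, and the inputs are assumed to have matching, even lengths. The int16 path needs a widening multiply-add (similar in spirit to what _mm256_madd_epi16 does per 32-bit lane) followed by a separate int32 add, while the float path folds the multiply and the add into a single fused step.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: int16 dot product with int32 accumulation.
// Assumes x and w have the same, even length.
inline std::int32_t dot_i16(const std::vector<std::int16_t>& x,
                            const std::vector<std::int16_t>& w) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i + 1 < x.size(); i += 2) {
        // Step 1: two int16 products, widened to int32 and added pairwise.
        const std::int32_t pair = std::int32_t(x[i]) * w[i]
                                + std::int32_t(x[i + 1]) * w[i + 1];
        // Step 2: a separate add into the int32 accumulator.
        acc += pair;
    }
    return acc;
}

// Hypothetical helper: float dot product, one fused multiply-add per input.
inline float dot_f32(const std::vector<float>& x, const std::vector<float>& w) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i)
        acc = std::fma(x[i], w[i], acc);  // multiply and accumulate in one step
    return acc;
}
```

In a vectorized version the integer path handles twice as many inputs per register, but spends two operations where the float path spends one, which is roughly why the two can end up close.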

Using int8/int16 for the later layers of the net has practically no performance benefit. I know this because I have measured it, in Torch for example, going between floats for L2/L3 and the quantized int8 weights. The difference was basically 0.00 Elo. You still opt for the int8 weights, though, to avoid the oddities associated with the order of operations for floats, and how that can change your bench when going between SSSE3, AVX, and AVX512.
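The order-of-operations oddity can be shown with a small sketch. The helper sum_with_lanes is a hypothetical name. It mimics a SIMD loop with a given number of parallel accumulators, which is an assumption about how a wider instruction set ends up regrouping the additions, not a model of any specific engine's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sum the way a SIMD loop with `lanes` parallel
// accumulators would. Element i goes into accumulator i % lanes, and the
// accumulators are added together at the end.
template <typename T>
T sum_with_lanes(const std::vector<T>& v, std::size_t lanes) {
    std::vector<T> acc(lanes, T(0));
    for (std::size_t i = 0; i < v.size(); ++i)
        acc[i % lanes] += v[i];
    T total = T(0);
    for (T a : acc)
        total += a;
    return total;
}
```

With IEEE-754 float32 and no fast-math flags, the input {1e8f, 1.0f, -1e8f, 1.0f} sums to 1.0 with one lane and to 2.0 with two lanes, because the 1.0 is rounded away when added to 1e8. The same values as int32 give 2 for any lane count, which is why integer weights keep the bench identical across instruction sets.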

The benefit of int8 in L1 is also somewhat diminished by how engines tend to implement sparse-affine, which is by grouping together 4 inputs at a time (anything else requires extra overhead, which may or may not be worth it). This means you still end up computing dead neurons whenever they share a group with a live one. Floats get around that because they happen to already be 32 bits, you might say.
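A minimal sketch of the 4-at-a-time grouping, assuming uint8 activations and an input count that is a multiple of 4. The names nonzero_chunks and dead_inputs_computed are hypothetical. The sketch only counts the wasted work and leaves out the actual matrix multiply.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: indices of the 4-input groups (one 32-bit chunk of
// int8 activations) that contain at least one nonzero activation.
inline std::vector<std::size_t> nonzero_chunks(const std::vector<std::uint8_t>& act) {
    std::vector<std::size_t> out;
    for (std::size_t c = 0; c * 4 + 3 < act.size(); ++c) {
        if (act[4 * c] | act[4 * c + 1] | act[4 * c + 2] | act[4 * c + 3])
            out.push_back(c);
    }
    return out;
}

// Hypothetical helper: number of zero activations that would still be
// multiplied, because they sit in a chunk next to a nonzero activation.
inline std::size_t dead_inputs_computed(const std::vector<std::uint8_t>& act) {
    std::size_t dead = 0;
    for (std::size_t c : nonzero_chunks(act))
        for (std::size_t k = 0; k < 4; ++k)
            dead += (act[4 * c + k] == 0);
    return dead;
}
```

For the activations {0,0,0,0, 5,0,0,0, 0,0,0,0, 0,3,7,0}, chunks 1 and 3 are processed, and 5 of the 8 inputs in them are zeros that get computed anyway. With 32-bit floats each input is its own chunk, so that count would be 0.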

Another point of note is that the execution speed of the NNUE might not even be THAT important... at least when you are running with multiple threads. Evals shared via the TT save engines from needing to run the eval again, and save good engines from having to update the accumulators.
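As a rough illustration of the shared-eval idea, here is a toy table that stores a static eval per position key. EvalTable, cached_eval, and the always-replace scheme are assumptions made up for this sketch. A real TT entry holds much more than the eval, and a table shared between threads has to deal with concurrent access, which this single-threaded sketch ignores.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical entry: just the position key and its static eval.
struct EvalEntry {
    std::uint64_t key = 0;
    std::int16_t eval = 0;
    bool valid = false;
};

// Hypothetical table with a simple always-replace policy.
class EvalTable {
public:
    explicit EvalTable(std::size_t size) : entries_(size) {}

    // Returns true and fills `eval` if this key was stored earlier.
    bool probe(std::uint64_t key, std::int16_t& eval) const {
        const EvalEntry& e = entries_[key % entries_.size()];
        if (!e.valid || e.key != key) return false;
        eval = e.eval;
        return true;
    }

    void store(std::uint64_t key, std::int16_t eval) {
        EvalEntry& e = entries_[key % entries_.size()];
        e.key = key;
        e.eval = eval;
        e.valid = true;
    }

private:
    std::vector<EvalEntry> entries_;
};

// Run the network only on a miss. On a hit, neither the forward pass nor
// the accumulator update behind `run_network` is needed.
template <typename EvalFn>
std::int16_t cached_eval(EvalTable& tt, std::uint64_t key, EvalFn&& run_network) {
    std::int16_t eval = 0;
    if (tt.probe(key, eval)) return eval;
    eval = run_network();
    tt.store(key, eval);
    return eval;
}
```

The more threads write into the same table, the more often a probe hits, so a smaller share of nodes pays the inference cost at all.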

A final, super small caveat is machines with AVX support but not AVX2 support. Those machines give you access to 256-bit ops with floats, but only 128-bit ops with integers. Depending on the architecture (true size of the execution ports, number of execution ports for various ops, latency associated with individual instructions), it is potentially quite easy to make the case that int16/int8 are no better than floats.
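The register-width arithmetic behind that caveat, written out as a small sketch. The lanes helper and the constant names are made up for illustration, and this only counts elements per register. It says nothing about port counts or latencies, which matter just as much.

```cpp
#include <cstddef>

// Hypothetical helper: how many elements fit in one SIMD register.
constexpr std::size_t lanes(std::size_t register_bits, std::size_t element_bits) {
    return register_bits / element_bits;
}

// AVX without AVX2: 256-bit float ops, but only 128-bit integer ops.
constexpr std::size_t avx_f32 = lanes(256, 32);   // 8 floats per op
constexpr std::size_t avx_i16 = lanes(128, 16);   // 8 int16 per op
constexpr std::size_t avx_i8  = lanes(128, 8);    // 16 int8 per op

// AVX2: integer ops are 256-bit as well.
constexpr std::size_t avx2_i16 = lanes(256, 16);  // 16 int16 per op
constexpr std::size_t avx2_i8  = lanes(256, 8);   // 32 int8 per op

// On AVX-only hardware int16 has no width advantage over float at all.
static_assert(avx_f32 == avx_i16, "int16 and float32 process 8 elements each");
```

So on AVX-only machines int16 processes exactly as many elements per op as floats, and int8 only twice as many, before counting the widening overhead.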

But for anyone serious... you want to get int16 in the FT ASAP, and work your way towards int16 and then int8 in L1. L2 and L3 do not matter at all. If you plan to hyper-optimize for the AVX instruction sets, then you'll need to be in the integer space, as there is more meat on the bone there.
Friendly reminder that stealing is a crime, is wrong, and makes you a thief.
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )