Joost Buijs wrote: ↑Fri Aug 06, 2021 11:01 pm
Int8 quantization seems not worth the effort. I've written optimized SIMD code for both int16 and int8, and there is hardly any difference in speed between the two, maybe 10%. Considering that an 8-bit quantized network usually has less accuracy, it's not worth it. The problem is the horizontal add, which takes a lot of time in both the 8-bit and the 16-bit SIMD code.
Int16 is the way to go, and maybe in the future BFloat16.
That's interesting, because
1. the overall absolute error we're seeing from int8 quantization in Stockfish is within around 10 cp for reasonable eval magnitudes, and much lower for evals within a pawn
2. int16 quantization requires 2x the loads, 2x the memory footprint, and has half the multiplication density (well, without VNNI it's closer to 1.2x, not counting loads: int8 is 4x maddubs + 2x add + 2x madd + 2x add, while int16 would be 8x madd + 4x add, for the same computation, I think)
3. horizontal adds take a negligible amount of time, as it's 3 hadds per 4 columns, and there are ways to implement it without needing them at all
To be honest, I don't understand it either. I've done many experiments with 8 and 16 bit quantization and the speed difference between the two is negligible.
It could be that I've made a mistake somewhere in my code that makes it slower than it could be; I'm still examining this. Unfortunately, the very handy Intel Intrinsics Guide was broken for many weeks, which made this difficult, but yesterday it came back online.
I have a question about NNUE and multithreading. How is this handled? I would think that if special care weren't taken to prevent multiple threads from forward-propagating at the same time, the current layer would be overwritten before getting to the next one.
Does each position object have its own instance of the net, or is there something else?
niel5946 wrote: ↑Fri Aug 13, 2021 6:18 pm
Does each position object have its own instance of the net, or is there something else?
The net isn't updated at runtime, it is read-only, and it doesn't matter if multiple threads are reading it.
Stockfish maintains a per-thread stack of "accumulator" objects. These are the output of the HalfKP feature transformer. The stack also holds information on the "dirty" state of each position (which parts were changed by the last move), in order to support incremental updates. No other part of the network code needs to be per-thread.
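The per-thread layout described above can be sketched roughly as follows. All names and sizes here are hypothetical (they are not Stockfish's actual types); the point is only that each search thread owns its own accumulator stack indexed by ply, while the weights are shared read-only, so no synchronization is needed during forward propagation.

```cpp
#include <array>
#include <cstdint>

constexpr int kHalfKpOut = 256;  // assumed feature-transformer width
constexpr int kMaxPly = 128;

struct Accumulator {
    // one half per perspective (white king / black king)
    std::array<std::array<std::int16_t, kHalfKpOut>, 2> v;
    bool computed = false;  // "dirty" until incrementally updated or refreshed
};

struct ThreadState {
    // per-thread stack: push on make_move, pop on unmake; no other thread
    // ever touches this thread's accumulators
    std::array<Accumulator, kMaxPly> stack;
    int ply = 0;

    Accumulator& push() {
        stack[ply + 1].computed = false;  // new entry starts dirty
        return stack[++ply];
    }
    void pop() { --ply; }
};
```

On push, the new entry is marked dirty; the evaluation code would then either apply the incremental update from the parent entry or do a full refresh, as the thread walks the search tree.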
jdart wrote: ↑Fri Aug 13, 2021 10:34 pm
The net isn't updated at runtime, it is read-only, and it doesn't matter if multiple threads are reading it.
Stockfish maintains a per-thread stack of "accumulator" objects. These are the output of the HalfKP feature transformer. The stack also holds information on the "dirty" state of each position (which parts were changed by the last move), in order to support incremental updates. No other part of the network code needs to be per-thread.
Ahh, alright, I understand now. I thought that the network object held the neurons' activations as well as the weights and biases, which in turn also made me confused as to why I couldn't find them.
Thank you for the help.