Basic NNUE questions

amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Basic NNUE questions

Post by amanjpro »

I am a noob when it comes to NNs and NNUE. I have found a few articles on how to implement a simple NN from scratch; one implements a trainer and a predictor in GoLang with little to no library use, which on paper sounds like a perfect way for me to start. However, I have a few questions:

- Can I apply what I learn from this article (basically, change the model layout to suit chess, and maybe the way positions are represented) to building an NNUE? Or is NNUE an entirely different species?
- I learnt that what makes NNUE so good is how efficiently it can "evaluate a position" after make/unmake moves. However, looking at Koivisto and Ethereal, I didn't see any special handling for evaluation in make/unmake. Is it the small number of layers in the network that lets the eval be updated efficiently, or is there something else?
- GoLang doesn't have an easy way to use AVX (or SIMD) instructions, as your code compiles to an architecture-independent instruction set. Does that mean I should not even bother with NNUE because it is going to be too slow to be useful?


Lastly, since I am a beginner, I don't have a test dataset to make sure that what I do is sensible. While I eventually want to train the network on my own data, it would be nice to start with some existing data just to make it easier to get going.

Thanks for your answers :)
niel5946
Posts: 174
Joined: Thu Nov 26, 2020 10:06 am
Full name: Niels Abildskov

Re: Basic NNUE questions

Post by niel5946 »

I am by no means an NNUE (or NN) expert, but from what I have read, I think I can answer your questions. If I am wrong, please correct me.
amanjpro wrote: Wed Aug 04, 2021 3:43 pm - Can I apply what I learn from this article (basically, change the model layout to suit chess, and maybe the way positions are represented) to building an NNUE? Or is NNUE an entirely different species?
NNUE isn't any different from other kinds of neural networks. It is just a fully connected multi-layer perceptron with ReLU activations. So yes, you can train it like any normal NN. The only difference between NNUE and regular NNs is its input encoding and the way it is updated.
amanjpro wrote: Wed Aug 04, 2021 3:43 pm - I learnt that what makes NNUE so good is how efficiently it can "evaluate a position" after make/unmake moves. However, looking at Koivisto and Ethereal, I didn't see any special handling for evaluation in make/unmake. Is it the small number of layers in the network that lets the eval be updated efficiently, or is there something else?
NNUE is structured such that it has an input layer with ~40k neurons and much smaller hidden layers (256x32x32 in some networks, I think). This can store a lot of information, but it would be extremely slow to recompute the whole thing every time a move was made/unmade. Therefore, what they do is update the neurons of the first hidden layer incrementally when a piece moves.
When you move a piece, you simply loop over all neurons in the 1st hidden layer, subtracting the weights that connect the feature of the piece on its origin square to each neuron (and adding the corresponding weights for the destination square). This then makes it easy to just apply a ReLU function and compute the small 256x32x32 part of the network when evaluating.
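Something like this rough Go sketch shows the idea (the type and function names are just made up for illustration, not taken from any actual engine):

Code: Select all

package nnue

// HiddenSize is the width of the first hidden layer (256 in many nets).
const HiddenSize = 256

// Accumulator holds the pre-activation values of the first hidden layer.
// It is kept in sync with the board instead of being recomputed from scratch.
type Accumulator struct {
	Values [HiddenSize]float32
}

// Network holds only the first-layer weights for this sketch:
// one column of HiddenSize weights per input feature (piece, square).
type Network struct {
	InputWeights [][HiddenSize]float32 // indexed by feature index
}

// AddFeature is called when a piece appears on a square (e.g. the destination
// square of a move): add that feature's weight column to the accumulator.
func (a *Accumulator) AddFeature(n *Network, feature int) {
	for i := 0; i < HiddenSize; i++ {
		a.Values[i] += n.InputWeights[feature][i]
	}
}

// RemoveFeature is called when a piece leaves a square (e.g. the origin
// square of a move): subtract that feature's weight column.
func (a *Accumulator) RemoveFeature(n *Network, feature int) {
	for i := 0; i < HiddenSize; i++ {
		a.Values[i] -= n.InputWeights[feature][i]
	}
}

A quiet move is then RemoveFeature for the origin square's feature plus AddFeature for the destination square's feature, and a capture additionally removes the captured piece's feature. The small 256x32x32 rest of the network is only computed when you actually need an eval.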
amanjpro wrote: Wed Aug 04, 2021 3:43 pm - GoLang doesn't have an easy way to use AVX (or SIMD) instructions, as your code compiles to an architecture-independent instruction set. Does that mean I should not even bother with NNUE because it is going to be too slow to be useful?
I don't know how fast GoLang is, but considering the incredible strength increase many engines have had from implementing NNUE, I would at least give it a try. Additionally, AVX/SIMD instructions just seem to be an optimization to squeeze as much Elo out of NNUE as possible. I don't think it affects strength by more than +/- 50 Elo (which is minuscule compared to the strength increase from NNUE).
Author of Loki, a C++ work in progress.
Code | Releases | Progress Log |
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Basic NNUE questions

Post by Sopel »

amanjpro wrote: Wed Aug 04, 2021 3:43 pm - Can I apply what I learn from this article (basically, change the model layout to suit chess, and maybe the way positions are represented) to building an NNUE? Or is NNUE an entirely different species?
Yes, as niel said. The only thing that's not common in machine learning is the use of ClippedReLU instead of ReLU. ClippedReLU also clips values above 1. This is not only a requirement for aggressive quantization but also yields better results in my experiments.
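In code the difference is just an extra clamp at 1. A trivial Go sketch (not taken from any particular implementation):

Code: Select all

package nnue

// relu is the usual activation: max(0, x).
func relu(x float32) float32 {
	if x < 0 {
		return 0
	}
	return x
}

// clippedReLU additionally clips values above 1: min(max(0, x), 1).
// Keeping the activations bounded is what makes aggressive integer
// quantization of the following layers workable.
func clippedReLU(x float32) float32 {
	if x < 0 {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}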
amanjpro wrote: Wed Aug 04, 2021 3:43 pm - I learnt that what makes NNUE so good is how efficiently it can "evaluate a position" after make/unmake moves. However, looking at Koivisto and Ethereal, I didn't see any special handling for evaluation in make/unmake. Is it the small number of layers in the network that lets the eval be updated efficiently, or is there something else?
The part that updates the first set of neurons incrementally on each move is not that major. It's irrelevant for the trainer and an optional optimization for the player. The biggest factor that makes NNUE efficient is that the one-hot-encoded input is sparse; generally the density is on the order of 0.1%. This allows a specialized vector-times-matrix implementation that exploits the sparse input. However, it's not a usual case, and typical GPU-accelerated training frameworks don't do this very well, albeit still better than a dense multiplication. If you want to take it seriously you will need to write some specialized code for this part to achieve good training performance. Doing it in the player is trivial, not much different from a typical vector x matrix.
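To make the sparsity point concrete, here is a rough Go sketch of the player-side computation (sizes and names invented for illustration): instead of a dense ~40k x 256 matrix-vector product, you only sum the weight columns of the handful of features that are actually active.

Code: Select all

package nnue

// refreshAccumulator recomputes the first-layer pre-activations from scratch.
// weights[f] is the weight column of input feature f, bias is the first-layer
// bias, and activeFeatures lists the features that are "1" in the one-hot
// encoded input - roughly 30 of them out of ~40k, which is why summing a few
// columns is vastly cheaper than a dense matrix-vector multiplication.
func refreshAccumulator(weights [][256]float32, bias [256]float32, activeFeatures []int) [256]float32 {
	acc := bias
	for _, f := range activeFeatures {
		for i := range acc {
			acc[i] += weights[f][i]
		}
	}
	return acc
}

On the trainer side the same sparsity has to be exploited too (the input layer behaves like a sum of embeddings), and that is the part that usually needs specialized code.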
amanjpro wrote: Wed Aug 04, 2021 3:43 pm - GoLang doesn't have an easy way to use AVX (or SIMD) instructions, as your code compiles to an architecture-independent instruction set. Does that mean I should not even bother with NNUE because it is going to be too slow to be useful?
If you can call C code from Golang then you'll be fine. You might also be OK if your compiler performs autovectorization. If there is no vectorization at all, the results might be underwhelming.

amanjpro wrote: Wed Aug 04, 2021 3:43 pm Lastly, since I am a beginner, I don't have a test dataset to make sure that what I do is sensible. While I eventually want to train the network on my own data, it would be nice to start with some existing data just to make it easier to get going.
There's plenty of publicly available data, for example here https://drive.google.com/drive/folders/ ... QIrgKJsFpl. Usually we store the data in .binpack format because it achieves ~2-3 bytes per position, but there are formats that are easier to parse - .bin (40 bytes per entry, fixed) and .plain (~100 bytes per entry). The formats can be converted between with https://github.com/official-stockfish/S ... tree/tools, https://github.com/official-stockfish/S ... convert.md.

I also recommend checking out https://github.com/glinscott/nnue-pytor ... cs/nnue.md, which should answer most of the questions you might have about NNUE.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Basic NNUE questions

Post by jdart »

Additionally, AVX/SIMD instructions just seem to be an optimization to squeeze as much Elo out of NNUE as possible. I don't think it affects strength by more than +/- 50 Elo
AVX2 in my experience is about 10x faster than non-SIMD code for NNUE eval. So it makes a substantial difference.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Basic NNUE questions

Post by Joost Buijs »

jdart wrote: Fri Aug 06, 2021 5:19 am
Additionally, AVX/SIMD instructions just seem to be an optimization to squeeze as much Elo out of NNUE as possible. I don't think it affects strength by more than +/- 50 Elo
AVX2 in my experience is about 10x faster than non-SIMD code for NNUE eval. So it makes a substantial difference.
I've never seen differences that large. Clang is very good at vectorizing most of the things you need for NNUE; of course it uses SIMD under the hood, but it's not really necessary to write SIMD code yourself. MSVC is a lot worse in this respect, almost a factor of 2, but I've never seen a 10-times difference compared with optimized handwritten SIMD code.

AVX-512 helps a bit; in practice I gain 20 to 25%, but not the 100% all these articles on the internet want you to believe.

Int8 quantization doesn't seem worth the effort. I've written optimized SIMD code for both int16 and int8, and there is hardly any difference in speed between the two, maybe 10%. It's not worth it considering that an 8-bit quantized network usually has less accuracy. The problem is the horizontal add, which takes a lot of time in both the 8- and 16-bit SIMD code.

Int16 is the way to go, and maybe in the future BFloat16.
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Basic NNUE questions

Post by Sopel »

Joost Buijs wrote: Fri Aug 06, 2021 11:01 pm
Int8 quantization doesn't seem worth the effort. I've written optimized SIMD code for both int16 and int8, and there is hardly any difference in speed between the two, maybe 10%. It's not worth it considering that an 8-bit quantized network usually has less accuracy. The problem is the horizontal add, which takes a lot of time in both the 8- and 16-bit SIMD code.

Int16 is the way to go, and maybe in the future BFloat16.
That's interesting, because

1. the overall absolute error we're seeing from int8 quantization in Stockfish is within around 10 cp for reasonable eval magnitudes, and much lower for evals within a pawn
2. int16 quantization requires 2x the loads, 2x the memory footprint, and has 2x lower multiplication density (well, without VNNI it's closer to 1.2x, not counting loads; int8 is 4x maddubs + 2x add + 2x madd + 2x add, while int16 would be 8x madd + 4x add for the same computation, I think)
3. horizontal adds take a negligible amount of time, since it's only 3 hadds per 4 columns, and there are ways to implement it without needing them at all
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Basic NNUE questions

Post by Sopel »

ad 1. We've actually tested a float implementation against the quantized one at fixed nodes and they were about equal with very good confidence.
amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Re: Basic NNUE questions

Post by amanjpro »

niel5946 wrote: Thu Aug 05, 2021 2:29 pm NNUE is structured such that it has an input layer with ~40k neurons and much smaller hidden layers (256x32x32 in some networks, I think). This can store a lot of information, but it would be extremely slow to recompute the whole thing every time a move was made/unmade. Therefore, what they do is update the neurons of the first hidden layer incrementally when a piece moves.
When you move a piece, you simply loop over all neurons in the 1st hidden layer, subtracting the weights that connect the feature of the piece on its origin square to each neuron (and adding the corresponding weights for the destination square). This then makes it easy to just apply a ReLU function and compute the small 256x32x32 part of the network when evaluating.
Oh! I'm not sure I understand everything written here, but I believe that as soon as I have a better idea of how NNs work, I should come back to this ;)
niel5946 wrote: Thu Aug 05, 2021 2:29 pm I don't know how fast GoLang is, but considering the incredible strength increase many engines have had from implementing NNUE, I would at least give it a try. Additionally, AVX/SIMD instructions just seem to be an optimization to squeeze as much Elo out of NNUE as possible. I don't think it affects strength by more than +/- 50 Elo (which is minuscule compared to the strength increase from NNUE).
I don't mind paying some price, but I don't want to spend a lot of time implementing something that is very unoptimized... this is reassuring :)
Sopel wrote: Thu Aug 05, 2021 9:56 pm If you can call C code from Golang then you'll be fine. You might also be OK if your compiler performs autovectorization. If there is no vectorization at all, the results might be underwhelming.
This is actually a no-go. I believe even a naive GoLang version will outperform that, as the overhead of calling C functions is way too high.

Thanks a lot for your answers. I'm still learning, so it will be a while before I have something functional, but you have helped me greatly! I will start the hard work ;)
ZirconiumX
Posts: 1334
Joined: Sun Jul 17, 2011 11:14 am

Re: Basic NNUE questions

Post by ZirconiumX »

Go actually has support for its own strange flavour of assembly, and it uses this for the complicated machine-specific things; so I would suggest you write the performance-intensive section in raw assembly and then have Go call into it. Unlike cgo, assembly calls from Go are fairly cheap if you annotate them as the docs ask.
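For what that looks like in practice, here is a minimal scalar (not vectorized) sketch of the pattern; the package, file names and function are purely hypothetical, so treat it as the shape to copy rather than working engine code. The Go file declares the function without a body:

Code: Select all

// sum.go
//go:build amd64

package nnue

// sumInt32 is implemented in sum_amd64.s.
//go:noescape
func sumInt32(a []int32) int32

and the assembly file provides the body in Go's assembler syntax:

Code: Select all

// sum_amd64.s
#include "textflag.h"

// func sumInt32(a []int32) int32
TEXT ·sumInt32(SB), NOSPLIT, $0-28
	MOVQ a_base+0(FP), SI // pointer to the slice data
	MOVQ a_len+8(FP), CX  // number of elements
	XORL AX, AX           // running sum
loop:
	TESTQ CX, CX
	JEQ   done
	ADDL  (SI), AX        // sum += *element
	ADDQ  $4, SI          // advance to the next int32
	DECQ  CX
	JMP   loop
done:
	MOVL  AX, ret+24(FP)
	RET

The NOSPLIT and //go:noescape annotations are what keep the call overhead low; for the actual NNUE inner loops you'd replace the loop body with AVX2 instructions.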
Some believe in the almighty dollar.

I believe in the almighty printf statement.
amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Re: Basic NNUE questions

Post by amanjpro »

ZirconiumX wrote: Sat Aug 07, 2021 2:22 am Go actually has support for its own strange flavour of assembly, and it uses this for the complicated machine-specific things; so I would suggest you write the performance-intensive section in raw assembly and then have Go call into it. Unlike cgo, assembly calls from Go are fairly cheap if you annotate them as the docs ask.
Yeah, I believe gonum is doing that... I'll go for a non-optimized version first, as I don't want to lose support for other platforms... Then I'll probably start working on that "assembly" thing, if I'm courageous enough.