Basic NNUE questions

amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Basic NNUE questions

Post by amanjpro »

I am a noob when it comes to NNs and NNUE. I have found a few articles on how to implement a simple NN from scratch; one implements a trainer and a predictor in GoLang with little to no library use, which on paper sounds like a perfect way for me to start. However, I have a few questions:

- Can I apply what I learn from this article (basically, change the model layout to suit chess, and maybe the way positions are represented) to building an NNUE? Or is NNUE an entirely different species?
- I learnt that what makes NNUE so good is how efficiently it can "evaluate a position" after make/unmake moves. However, looking at Koivisto and Ethereal, I didn't see any special handling for evaluation in make/unmake. Is it the small number of layers in the network that lets the eval be updated efficiently, or is there something else?
- GoLang doesn't have an easy way to use AVX (or SIMD) instructions, as your code compiles to an architecture-independent instruction set. Does that mean I should not even bother with NNUE because it is going to be too slow to be useful?


Lastly, since I am a beginner, I don't have a test dataset to make sure that what I do is sensible. While I eventually want to train the network on my own data, it would be nice to start with some existing data just to make it easier to get going.

Thanks for your answers :)
niel5946
Posts: 174
Joined: Thu Nov 26, 2020 10:06 am
Full name: Niels Abildskov

Re: Basic NNUE questions

Post by niel5946 »

I am by no means an NNUE (or NN) expert, but from what I have read, I think I can answer your questions. If I am wrong, please correct me.
amanjpro wrote: Wed Aug 04, 2021 3:43 pm - Can I apply what I learn from this article (basically, change the model layout to suit chess, and maybe the way positions are represented) to building an NNUE? Or is NNUE an entirely different species?
NNUE isn't any different from other kinds of neural networks. It is just a fully connected multi-layer perceptron with ReLU activations. So yes, you can train it like any normal NN. The only difference between NNUE and regular NNs is its input encoding and the way it is updated.
amanjpro wrote: Wed Aug 04, 2021 3:43 pm - I learnt that what makes NNUE so good is how efficiently it can "evaluate a position" after make/unmake moves. However, looking at Koivisto and Ethereal, I didn't see any special handling for evaluation in make/unmake. Is it the small number of layers in the network that lets the eval be updated efficiently, or is there something else?
NNUE is structured such that it has an input layer with ~40k neurons and much smaller hidden layers (256x32x32 in some networks, I think). This can store a lot of information, but it would be extremely slow to recompute the whole thing every time a move was made/unmade. Therefore, what they do is update the neurons of the first hidden layer incrementally when a piece moves.
When you move a piece, you simply loop over all neurons in the 1st hidden layer, subtracting the weights that connect the feature of the piece on its origin square to each neuron (and adding the corresponding weights for the destination square). This then makes it easy to just apply a ReLU function and compute the small 256x32x32 part of the network when evaluating.
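Something like this rough Go sketch shows the idea (the type and function names are just made up for illustration, not taken from any actual engine):

Code: Select all

package nnue

// HiddenSize is the width of the first hidden layer (256 in many nets).
const HiddenSize = 256

// Accumulator holds the pre-activation values of the first hidden layer.
// It is kept in sync with the board instead of being recomputed from scratch.
type Accumulator struct {
	Values [HiddenSize]float32
}

// Network holds only the first-layer weights for this sketch:
// one column of HiddenSize weights per input feature (piece, square).
type Network struct {
	InputWeights [][HiddenSize]float32 // indexed by feature index
}

// AddFeature is called when a piece appears on a square (e.g. the destination
// square of a move): add that feature's weight column to the accumulator.
func (a *Accumulator) AddFeature(n *Network, feature int) {
	for i := 0; i < HiddenSize; i++ {
		a.Values[i] += n.InputWeights[feature][i]
	}
}

// RemoveFeature is called when a piece leaves a square (e.g. the origin
// square of a move): subtract that feature's weight column.
func (a *Accumulator) RemoveFeature(n *Network, feature int) {
	for i := 0; i < HiddenSize; i++ {
		a.Values[i] -= n.InputWeights[feature][i]
	}
}

A quiet move is then RemoveFeature for the origin square's feature plus AddFeature for the destination square's feature, and a capture additionally removes the captured piece's feature. The small 256x32x32 rest of the network is only computed when you actually need an eval.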
amanjpro wrote: Wed Aug 04, 2021 3:43 pm - GoLang doesn't have an easy way to use AVX (or SIMD) instructions, as your code compiles to an architecture-independent instruction set. Does that mean I should not even bother with NNUE because it is going to be too slow to be useful?
I don't know how fast GoLang is, but considering the incredible strength increase many engines have had from implementing NNUE, I would at least give it a try. Additionally, AVX/SIMD instructions just seem to be an optimization to squeeze as much Elo out of NNUE as possible. I don't think it affects strength by more than +/- 50 Elo (which is minuscule compared to the strength increase from NNUE).
Author of Loki, a C++ work in progress.
Code | Releases | Progress Log |
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Basic NNUE questions

Post by Sopel »

amanjpro wrote: Wed Aug 04, 2021 3:43 pm - Can I apply what I learn from this article (basically, change the model layout to suit chess, and maybe the way positions are represented) to building an NNUE? Or is NNUE an entirely different species?
Yes, as niel said. The only thing that's not common in machine learning is the use of ClippedReLU instead of ReLU. ClippedReLU also clips values above 1. This is not only a requirement for aggressive quantization but also yields better results in my experiments.
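In code the difference is just an extra clamp at 1. A trivial Go sketch (not taken from any particular implementation):

Code: Select all

package nnue

// relu is the usual activation: max(0, x).
func relu(x float32) float32 {
	if x < 0 {
		return 0
	}
	return x
}

// clippedReLU additionally clips values above 1: min(max(0, x), 1).
// Keeping the activations bounded is what makes aggressive integer
// quantization of the following layers workable.
func clippedReLU(x float32) float32 {
	if x < 0 {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}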
amanjpro wrote: Wed Aug 04, 2021 3:43 pm - I learnt that what makes NNUE so good is how efficiently it can "evaluate a position" after make/unmake moves. However, looking at Koivisto and Ethereal, I didn't see any special handling for evaluation in make/unmake. Is it the small number of layers in the network that lets the eval be updated efficiently, or is there something else?
The part that updates the first set of neurons incrementally on each move is not that major. It's irrelevant for the trainer and an optional optimization for the player. The biggest factor that makes NNUE efficient is that the one-hot-encoded input is sparse; generally the density is on the order of 0.1%. This allows a specialized vector-times-matrix implementation that exploits the sparse input. However, it's not a usual case, and typical GPU-accelerated training frameworks don't do this very well, albeit still better than a dense multiplication. If you want to take it seriously you will need to write some specialized code for this part to achieve good training performance. Doing it in the player is trivial, not much different from a typical vector x matrix.
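To make the sparsity point concrete, here is a rough Go sketch of the player-side computation (sizes and names invented for illustration): instead of a dense ~40k x 256 matrix-vector product, you only sum the weight columns of the handful of features that are actually active.

Code: Select all

package nnue

// refreshAccumulator recomputes the first-layer pre-activations from scratch.
// weights[f] is the weight column of input feature f, bias is the first-layer
// bias, and activeFeatures lists the features that are "1" in the one-hot
// encoded input - roughly 30 of them out of ~40k, which is why summing a few
// columns is vastly cheaper than a dense matrix-vector multiplication.
func refreshAccumulator(weights [][256]float32, bias [256]float32, activeFeatures []int) [256]float32 {
	acc := bias
	for _, f := range activeFeatures {
		for i := range acc {
			acc[i] += weights[f][i]
		}
	}
	return acc
}

On the trainer side the same sparsity has to be exploited too (the input layer behaves like a sum of embeddings), and that is the part that usually needs specialized code.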
amanjpro wrote: Wed Aug 04, 2021 3:43 pm - GoLang doesn't have an easy way to use AVX (or SIMD) instructions, as your code compiles to an architecture-independent instruction set. Does that mean I should not even bother with NNUE because it is going to be too slow to be useful?
If you can call C code from Golang then you'll be fine. You might also be OK if your compiler performs autovectorization. If there is no vectorization at all, the results might be underwhelming.

amanjpro wrote: Wed Aug 04, 2021 3:43 pm Lastly, since I am a beginner, I don't have a test dataset to make sure that what I do is sensible. While I eventually want to train the network on my own data, it would be nice to start with some existing data just to make it easier to get going.
There's plenty of publicly available data, for example here https://drive.google.com/drive/folders/ ... QIrgKJsFpl. Usually we store the data in .binpack format because it achieves ~2-3 bytes per position, but there are formats that are easier to parse - .bin (40 bytes per entry, fixed) and .plain (~100 bytes per entry). The formats can be converted between with https://github.com/official-stockfish/S ... tree/tools, https://github.com/official-stockfish/S ... convert.md.

I also recommend checking out https://github.com/glinscott/nnue-pytor ... cs/nnue.md, which should answer most of the questions you might have about NNUE.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Basic NNUE questions

Post by jdart »

Additionally, AVX/SIMD instructions just seem to be an optimization to squeeze as much Elo out of NNUE as possible. I don't think it affects strength by more than +/- 50 Elo
AVX2 in my experience is about 10x faster than non-SIMD code for NNUE eval. So it makes a substantial difference.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Basic NNUE questions

Post by Joost Buijs »

jdart wrote: Fri Aug 06, 2021 5:19 am
Additionally, AVX/SIMD instructions just seem to be an optimization to squeeze as much Elo out of NNUE as possible. I don't think it affects strength by more than +/- 50 Elo
AVX2 in my experience is about 10x faster than non-SIMD code for NNUE eval. So it makes a substantial difference.
I've never seen differences that large. Clang is very good at vectorizing most of the things you need for NNUE; of course it uses SIMD under the hood, but it's not really necessary to write SIMD code yourself. MSVC is a lot worse in this respect, almost a factor of 2, but I've never seen a 10-times difference compared with optimized handwritten SIMD code.

AVX-512 helps a bit; in practice I gain 20 to 25%, but not the 100% all these articles on the internet want you to believe.

Int8 quantization doesn't seem worth the effort. I've written optimized SIMD code for both int16 and int8, and there is hardly any difference in speed between the two, maybe 10%. It's not worth it considering that an 8-bit quantized network usually has less accuracy. The problem is the horizontal add, which takes a lot of time in both the 8- and 16-bit SIMD code.

Int16 is the way to go, and maybe in the future BFloat16.
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Basic NNUE questions

Post by Sopel »

Joost Buijs wrote: Fri Aug 06, 2021 11:01 pm
Int8 quantization doesn't seem worth the effort. I've written optimized SIMD code for both int16 and int8, and there is hardly any difference in speed between the two, maybe 10%. It's not worth it considering that an 8-bit quantized network usually has less accuracy. The problem is the horizontal add, which takes a lot of time in both the 8- and 16-bit SIMD code.

Int16 is the way to go, and maybe in the future BFloat16.
That's interesting, because

1. the overall absolute error we're seeing from int8 quantization in Stockfish is within around 10 cp for reasonable eval magnitudes, and much lower for evals within a pawn
2. int16 quantization requires 2x the loads, 2x the memory footprint, and has 2x lower multiplication density (well, without VNNI it's closer to 1.2x, not counting loads; int8 is 4x maddubs + 2x add + 2x madd + 2x add, while int16 would be 8x madd + 4x add for the same computation, I think)
3. horizontal adds take a negligible amount of time, since it's only 3 hadds per 4 columns, and there are ways to implement it without needing them at all
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Basic NNUE questions

Post by Sopel »

ad 1. We've actually tested a float implementation against the quantized one at fixed nodes and they were about equal with very good confidence.
amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Re: Basic NNUE questions

Post by amanjpro »

niel5946 wrote: Thu Aug 05, 2021 2:29 pm NNUE is structured such that it has an input layer with ~40k neurons and much smaller hidden layers (256x32x32 in some networks, I think). This can store a lot of information, but it would be extremely slow to recompute the whole thing every time a move was made/unmade. Therefore, what they do is update the neurons of the first hidden layer incrementally when a piece moves.
When you move a piece, you simply loop over all neurons in the 1st hidden layer, subtracting the weights that connect the feature of the piece on its origin square to each neuron (and adding the corresponding weights for the destination square). This then makes it easy to just apply a ReLU function and compute the small 256x32x32 part of the network when evaluating.
Oh! I'm not sure I understand everything written here, but I believe that as soon as I have a better idea of how NNs work, I should come back to this ;)
niel5946 wrote: Thu Aug 05, 2021 2:29 pm I don't know how fast GoLang is, but considering the incredible strength increase many engines have had from implementing NNUE, I would at least give it a try. Additionally, AVX/SIMD instructions just seem to be an optimization to squeeze as much Elo out of NNUE as possible. I don't think it affects strength by more than +/- 50 Elo (which is minuscule compared to the strength increase from NNUE).
I don't mind paying some price, but I don't want to spend a lot of time implementing something that is very unoptimized... this is reassuring :)
Sopel wrote: Thu Aug 05, 2021 9:56 pm If you can call C code from Golang then you'll be fine. You might also be OK if your compiler performs autovectorization. If there is no vectorization at all, the results might be underwhelming.
This is actually a no-go. I believe even a naive GoLang version will outperform that, as the overhead of calling C functions is way too high.

Thanks a lot for your answers. I'm still learning, so it will be a while before I have something functional, but you have helped me greatly! I will start the hard work ;)
ZirconiumX
Posts: 1334
Joined: Sun Jul 17, 2011 11:14 am

Re: Basic NNUE questions

Post by ZirconiumX »

Go actually has support for its own strange flavour of assembly, and it uses this for the complicated machine-specific things; so I would suggest you write the performance-intensive section in raw assembly and then have Go call into it. Unlike cgo, assembly calls from Go are fairly cheap if you annotate them as the docs ask.
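For what that looks like in practice, here is a minimal scalar (not vectorized) sketch of the pattern; the package, file names and function are purely hypothetical, so treat it as the shape to copy rather than working engine code. The Go file declares the function without a body:

Code: Select all

// sum.go
//go:build amd64

package nnue

// sumInt32 is implemented in sum_amd64.s.
//go:noescape
func sumInt32(a []int32) int32

and the assembly file provides the body in Go's assembler syntax:

Code: Select all

// sum_amd64.s
#include "textflag.h"

// func sumInt32(a []int32) int32
TEXT ·sumInt32(SB), NOSPLIT, $0-28
	MOVQ a_base+0(FP), SI // pointer to the slice data
	MOVQ a_len+8(FP), CX  // number of elements
	XORL AX, AX           // running sum
loop:
	TESTQ CX, CX
	JEQ   done
	ADDL  (SI), AX        // sum += *element
	ADDQ  $4, SI          // advance to the next int32
	DECQ  CX
	JMP   loop
done:
	MOVL  AX, ret+24(FP)
	RET

The NOSPLIT and //go:noescape annotations are what keep the call overhead low; for the actual NNUE inner loops you'd replace the loop body with AVX2 instructions.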
Some believe in the almighty dollar.

I believe in the almighty printf statement.
amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Re: Basic NNUE questions

Post by amanjpro »

ZirconiumX wrote: Sat Aug 07, 2021 2:22 am Go actually has support for its own strange flavour of assembly, and it uses this for the complicated machine-specific things; so I would suggest you write the performance-intensive section in raw assembly and then have Go call into it. Unlike cgo, assembly calls from Go are fairly cheap if you annotate them as the docs ask.
Yeah, I believe gonum is doing that... I'll go for a non-optimized version first, as I don't want to lose support for other platforms... Then I'll probably start working on that "assembly" thing, if I'm courageous enough.