Basic NNUE questions

I am a noob when it comes to NNs and NNUE. I have found a few articles on how to implement a simple NN from scratch; one implements a trainer and a predictor in GoLang with little to no library use, which on paper sounds like a perfect way for me to start. However, I have a few questions:
- Can I apply what I learn from this article (basically, change the model layout to suit chess, and maybe the way positions are represented) to building an NNUE? Or is NNUE an entirely different species?
- I learnt that what makes NNUE so good is how efficient it is to "evaluate a position" after make/unmake moves. However, looking at Koivisto and Ethereal, I didn't see any special handling for evaluation in make/unmake. Is it the small number of layers in the network that lets the eval be updated efficiently, or is there something else?
- GoLang doesn't have an easy way to use AVX (or SIMD) instructions, as your code compiles to an architecture-independent instruction set. Does that mean I should not even bother with NNUE, because it is going to be too slow to be useful?
Last, since I am a beginner, I don't have a test dataset to make sure that what I do is sensible. I want to train the network on my own data eventually, but it would be nice to start with some existing data just to make it easier to get going.
Thanks for your answers
- Posts: 883
- Joined: Sat Mar 13, 2021 1:47 am
- Full name: Amanj Sherwany
- Posts: 174
- Joined: Thu Nov 26, 2020 10:06 am
- Full name: Niels Abildskov
Re: Basic NNUE questions
I am by no means an NNUE (or NN) expert, but from what I have read, I think I can answer your questions. If I am wrong, please correct me.
NNUE isn't any different from other kinds of neural networks. It is just a fully connected multi-layer perceptron with ReLU activations. So, yes, you can train it like any normal NN. The only difference between NNUE and regular NNs is its input configuration and the incremental updates.
amanjpro wrote: ↑Wed Aug 04, 2021 3:43 pm - I learnt that what makes NNUE so good is how efficient it is to "evaluate a position" after make/unmake moves. However, looking at Koivisto and Ethereal, I didn't see any special handling for evaluation in make/unmake. Is it the small number of layers in the network that lets the eval be updated efficiently, or is there something else?
NNUE is structured such that it has an input layer with ~40k neurons and much smaller hidden layers (something like 256x32x32, I think). This can store a lot of information, but it would be extremely slow to recompute entirely each time a move was made/unmade. Therefore, when a piece moves, only the neurons of the first hidden layer are updated.
When you move a piece, you simply loop over all neurons in the 1st hidden layer, subtracting the weight that connects the input feature for the piece on its origin square to that neuron (and adding the corresponding weight for the destination square). This makes it cheap: when evaluating, you just apply the activation function and compute the small 256x32x32 part of the network.
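To make that concrete, here is a rough sketch of such an incremental update in Go. The Accumulator type, the 768-feature piece-on-square encoding, and the featureIndex helper are made up for illustration and not taken from any particular engine:

```go
package nnue

// Accumulator holds the first hidden layer (before activation) for one side.
// Sizes and types are illustrative; engines commonly use int16 here.
type Accumulator struct {
	values [256]int16
}

// One row of first-layer weights per input feature. numFeatures and the
// indexing scheme are assumptions for this sketch.
const numFeatures = 768 // 12 piece types x 64 squares in the simplest scheme

var weights [numFeatures][256]int16

// featureIndex maps a (piece, square) pair to an input feature index.
// This is a placeholder for whatever input encoding you choose.
func featureIndex(piece, square int) int {
	return piece*64 + square
}

// MovePiece updates the accumulator incrementally when a piece moves:
// subtract the weight row of the vacated (piece, from) feature and add the
// row of the new (piece, to) feature, instead of recomputing the whole
// first layer from scratch.
func (a *Accumulator) MovePiece(piece, from, to int) {
	sub := &weights[featureIndex(piece, from)]
	add := &weights[featureIndex(piece, to)]
	for i := range a.values {
		a.values[i] += add[i] - sub[i]
	}
}
```

Captures and unmaking a move follow the same pattern, just with an extra subtraction or with the add/subtract roles swapped.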
I don't know how fast GoLang is, but considering the incredible strength increase many engines have had from implementing NNUE, I would at least give it a try. Additionally, AVX/SIMD instructions just seem to be an optimization to squeeze as much elo out of NNUE as possible. I don't think it impacts performance by more than +/- 50 elo (which is minuscule compared to the strength increase from NNUE).
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Basic NNUE questions
Yes, as niel said. The only thing that's not common in machine learning is the use of ClippedReLU instead of ReLU. ClippedReLU also clips values above 1. This is not only a requirement for aggressive quantization but also yielded better results in my experiments.
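For reference, a clipped ReLU is just a clamp. A minimal float fragment in Go (quantized engines apply the same idea on integers, e.g. clamping to [0, 127]):

```go
// clippedReLU clamps the activation to [0, 1] instead of only cutting off
// negative values as plain ReLU does. The hard upper bound is what makes
// later integer quantization of the activations safe.
func clippedReLU(x float32) float32 {
	if x < 0 {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}
```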
amanjpro wrote: ↑Wed Aug 04, 2021 3:43 pm - I learnt that what makes NNUE so good is how efficient it is to "evaluate a position" after make/unmake moves. However, looking at Koivisto and Ethereal, I didn't see any special handling for evaluation in make/unmake. Is it the small number of layers in the network that lets the eval be updated efficiently, or is there something else?
The part that updates the first set of neurons incrementally on each move is not that major. It's irrelevant for the trainer and an optional optimization for the player. The biggest factor that makes NNUE efficient is that the one-hot-encoded input is sparse; generally the density is on the order of 0.1%. This allows a specialized vector-times-matrix routine that exploits the sparse input. However, this is not a usual case, and typical GPU-accelerated training frameworks don't handle it very well, albeit still better than a dense multiplication. If you want to take it seriously you will need to write some specialized code for this part to achieve good training performance. Doing it in the player is trivial, not much different from a typical vector x matrix.
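On the player side, "exploiting the sparsity" just means iterating over the handful of active input features rather than all ~40k inputs. A rough Go fragment, with sizes and names as illustrative placeholders rather than any engine's real layout:

```go
const (
	inputFeatures = 40960
	hiddenSize    = 256
)

var (
	firstLayerWeights [inputFeatures][hiddenSize]int16
	firstLayerBiases  [hiddenSize]int16
)

// refreshAccumulator recomputes the first hidden layer from scratch for a
// position. Because the input is one-hot and extremely sparse, only the
// active features (roughly one per piece on the board) contribute, so this
// is a sum of a few dozen weight rows plus the biases rather than a dense
// 40k x 256 matrix-vector product.
func refreshAccumulator(activeFeatures []int) [hiddenSize]int16 {
	acc := firstLayerBiases
	for _, f := range activeFeatures {
		row := &firstLayerWeights[f]
		for i := range acc {
			acc[i] += row[i]
		}
	}
	return acc
}
```

A typical middlegame position has around 30 active features, which is why this step is cheap even without SIMD.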
If you can run C code from Golang then you'll be fine. You might also be OK if your compiler performs autovectorization. If there is no vectorization at all, the results might be underwhelming.
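Calling C from Go is done through cgo. A minimal sketch, where eval_stub is an invented stand-in for whatever C-side evaluator you would actually link in:

```go
package main

/*
// A stand-in for a C-side evaluator; in a real project this would be your
// engine's C/C++ NNUE code, compiled and linked by cgo.
static int eval_stub(const int* features, int n) {
	int sum = 0;
	for (int i = 0; i < n; i++) {
		sum += features[i];
	}
	return sum;
}
*/
import "C"

import "fmt"

func main() {
	// Pass a small feature vector to the C function through cgo.
	features := []C.int{1, 2, 3}
	score := C.eval_stub(&features[0], C.int(len(features)))
	fmt.Println("eval:", int(score))
}
```

Each cgo call has a fixed overhead, so whether this pays off depends on how much work is done per call.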
There's plenty of publicly available data, for example here https://drive.google.com/drive/folders/ ... QIrgKJsFpl. Usually we store the data in .binpack format because it achieves ~2-3 bytes per position, but there are formats that are easier to parse: .bin (40 bytes per entry, fixed) and .plain (~100 bytes per entry). The formats can be converted between each other with the tools at https://github.com/official-stockfish/S ... tree/tools, see https://github.com/official-stockfish/S ... convert.md.
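If I remember the .plain layout correctly (verify against the convert.md linked above), each entry is a handful of "key value" text lines terminated by a line containing just "e". A rough Go reader under that assumption, with the field names therefore worth double-checking:

```go
package data

import (
	"bufio"
	"io"
	"strconv"
	"strings"
)

// PlainEntry is one training position in the text-based .plain format.
// Field names follow my recollection of the format; verify against the docs.
type PlainEntry struct {
	FEN    string
	Move   string
	Score  int
	Ply    int
	Result int
}

// ReadPlain parses entries of the form "fen ...", "move ...", "score ...",
// "ply ...", "result ...", each entry terminated by a line containing "e".
// Conversion errors are ignored for brevity.
func ReadPlain(r io.Reader) ([]PlainEntry, error) {
	var entries []PlainEntry
	var cur PlainEntry
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "e":
			entries = append(entries, cur)
			cur = PlainEntry{}
		case strings.HasPrefix(line, "fen "):
			cur.FEN = strings.TrimPrefix(line, "fen ")
		case strings.HasPrefix(line, "move "):
			cur.Move = strings.TrimPrefix(line, "move ")
		case strings.HasPrefix(line, "score "):
			cur.Score, _ = strconv.Atoi(strings.TrimPrefix(line, "score "))
		case strings.HasPrefix(line, "ply "):
			cur.Ply, _ = strconv.Atoi(strings.TrimPrefix(line, "ply "))
		case strings.HasPrefix(line, "result "):
			cur.Result, _ = strconv.Atoi(strings.TrimPrefix(line, "result "))
		}
	}
	return entries, sc.Err()
}
```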
I also recommend checking out https://github.com/glinscott/nnue-pytor ... cs/nnue.md, which should answer most of the questions you can have about NNUE.
- Posts: 4366
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: Basic NNUE questions
niel5946 wrote: ↑Thu Aug 05, 2021 2:29 pm Additionally, AVX/SIMD instructions just seem to be an optimization to squeeze as much elo out of NNUE as possible. I don't think it impacts performance by more than +/- 50 elo
AVX2 in my experience is about 10x faster than non-SIMD code for NNUE eval. So it makes a substantial difference.
- Posts: 1563
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: Basic NNUE questions
jdart wrote: ↑Fri Aug 06, 2021 5:19 am AVX2 in my experience is about 10x faster than non-SIMD code for NNUE eval. So it makes a substantial difference.
I've never seen differences that large. Clang is very good at vectorizing most of the things you need for NNUE; of course it uses SIMD under the hood, but it's not really necessary to write SIMD code yourself. MSVC is a lot worse in this respect, almost a factor of 2, but I've never seen a 10 times difference with optimized handwritten SIMD code.
AVX-512 helps a bit, in practice I gain 20 to 25% but not the 100% all these articles on the internet want you to believe.
Int8 quantization seems not worth the effort. I've written optimized SIMD code for both int16 and int8, and there is hardly any difference in speed between the two, maybe 10%. It's not worth the effort, taking into consideration that an 8 bit quantized network usually has less accuracy. The problem is the horizontal add, which takes a lot of time in both the 8 and 16 bit SIMD code.
Int16 is the way to go, and maybe in the future BFloat16.
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Basic NNUE questions
Joost Buijs wrote: ↑Fri Aug 06, 2021 11:01 pm Int8 quantization seems not worth the effort. I've written optimized SIMD code for both int16 and int8, and there is hardly any difference in speed between the two, maybe 10%. It's not worth the effort, taking into consideration that an 8 bit quantized network usually has less accuracy. The problem is the horizontal add, which takes a lot of time in both the 8 and 16 bit SIMD code.
Int16 is the way to go, and maybe in the future BFloat16.
That's interesting, because:
1. the overall absolute error we're seeing from int8 quantization in Stockfish is within around 10 cp for reasonable eval magnitudes, and much lower for evals within a pawn
2. int16 quantization requires 2x the loads, 2x memory footprint, and has 2x smaller multiplication density (well, without VNNI it's closer to 1.2x, not counting loads. int8 is 4x maddubs + 2x add + 2x madd + 2x add, int16 would be 8x madd + 4x add, for the same computation, I think)
3. horizontal adds take a negligible amount of time as it's 3 hadds per 4 columns, and there are ways to implement it without needing them at all
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Basic NNUE questions
ad 1. We've actually tested a float implementation against the quantized one at fixed nodes and they were about equal with very good confidence.
- Posts: 883
- Joined: Sat Mar 13, 2021 1:47 am
- Full name: Amanj Sherwany
Re: Basic NNUE questions
niel5946 wrote: ↑Thu Aug 05, 2021 2:29 pm NNUE is structured such that it has an input layer with ~40k neurons and much smaller hidden layers (something like 256x32x32, I think). This can store a lot of information, but it would be extremely slow to recompute entirely each time a move was made/unmade. Therefore, when a piece moves, only the neurons of the first hidden layer are updated.
When you move a piece, you simply loop over all neurons in the 1st hidden layer, subtracting the weight that connects the input feature for the piece on its origin square to that neuron (and adding the corresponding weight for the destination square). This makes it cheap: when evaluating, you just apply the activation function and compute the small 256x32x32 part of the network.
Oh! I'm not sure I understand everything written here, but I believe that as soon as I have a better idea of how NNs work, I should come back to this.
niel5946 wrote: ↑Thu Aug 05, 2021 2:29 pm I don't know how fast GoLang is, but considering the incredible strength increase many engines have had from implementing NNUE, I would at least give it a try. Additionally, AVX/SIMD instructions just seem to be an optimization to squeeze as much elo out of NNUE as possible. I don't think it impacts performance by more than +/- 50 elo (which is minuscule compared to the strength increase from NNUE).
I don't mind paying some price, but I don't want to spend a lot of time implementing something that is very unoptimized... this is reassuring.
As for running C code from Go: this is actually a no-go. I believe even a naive GoLang version will outperform that, as the overhead of calling C functions is way too high.
Thanks a lot for your answers. I'm still learning, so it will be a while before I have something functional, but you have helped me greatly! I'll start the hard work.
- Posts: 1334
- Joined: Sun Jul 17, 2011 11:14 am
Re: Basic NNUE questions
Go actually has support for its own strange flavour of assembly, and it uses this for the complicated machine-specific things; so I would suggest you write the performance-intensive section in raw assembly and then have Go call into it. Unlike cgo, assembly calls from Go are fairly cheap if you annotate them as the docs ask.
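To illustrate the general shape of that mechanism: a Go file declares the function without a body, and a hand-written assembly file (Plan 9 syntax, amd64 here) provides the implementation. The names are invented for this example, and a real NNUE inner loop would of course do vector work rather than a single add:

```go
// dot.go — the Go side: a declaration with no body; the implementation
// lives in the accompanying assembly file.
package nnue

// addInt32 is implemented in add_amd64.s.
func addInt32(a, b int32) int32
```

```
// add_amd64.s — Plan 9 style assembly for GOARCH=amd64.
#include "textflag.h"

// func addInt32(a, b int32) int32
TEXT ·addInt32(SB), NOSPLIT, $0-12
	MOVL a+0(FP), AX   // load the first argument from the frame
	MOVL b+4(FP), BX   // load the second argument
	ADDL BX, AX        // AX = a + b
	MOVL AX, ret+8(FP) // store the result in its return slot
	RET
```

Annotations like //go:noescape become relevant once you pass slices to the assembly routine, which is what the "annotate them as the docs ask" remark above refers to.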
- Posts: 883
- Joined: Sat Mar 13, 2021 1:47 am
- Full name: Amanj Sherwany
Re: Basic NNUE questions
ZirconiumX wrote: ↑Sat Aug 07, 2021 2:22 am Go actually has support for its own strange flavour of assembly, and it uses this for the complicated machine-specific things; so I would suggest you write the performance-intensive section in raw assembly and then have Go call into it. Unlike cgo, assembly calls from Go are fairly cheap if you annotate them as the docs ask.
Yeah, I believe gonum is doing that... I'll go for a non-optimized version first, as I don't want to lose support for other platforms... Then I'll probably start working on that "assembly" thing, if I'm courageous enough.