Thanks, this takes my confusion away a little bit :)
hgm wrote: ↑Thu Oct 15, 2020 8:20 pm
Note that I just fixed a few typos in the code (well, actually copy-paste errors, where I forgot to make the intended modifications to the copied code). The weights of all layers of course had to be different, and the last layer only needs a 1-d array of weights, as there is only a single output.
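hgm's point about the last layer can be made concrete: with a single output there is no weight matrix left, just one weight per hidden unit plus a bias, i.e. a plain dot product. A minimal sketch in C (sizes and names are illustrative, not the actual network layout):

```c
#include <stdint.h>

#define PREV 32  /* illustrative width of the previous hidden layer */

/* Single-output final layer: a 1-d weight array and one bias,
   i.e. out = bias + dot(weights, activations). */
int32_t output_layer(const int8_t act[PREV], const int8_t w[PREV],
                     int32_t bias)
{
    int32_t sum = bias;
    for (int i = 0; i < PREV; i++)
        sum += (int32_t)w[i] * act[i];
    return sum;
}
```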
Hacking around CFish NNUE
Moderators: hgm, Rebel, chrisw
-
- Posts: 775
- Joined: Sat Sep 08, 2018 5:37 pm
- Location: Ukraine
- Full name: Maksim Korzh
Re: Hacking around CFish NNUE
Didactic chess engines:
https://www.chessprogramming.org/Maksim_Korzh
Chess programming YouTube channel:
https://www.youtube.com/channel/UCB9-pr ... KKqDgXhsMQ
-
- Posts: 775
- Joined: Sat Sep 08, 2018 5:37 pm
- Location: Ukraine
- Full name: Maksim Korzh
Re: Hacking around CFish NNUE
Guys, I can't believe that eventually I've found exactly what I was looking for!
So I wanted to see the following implementation: Take FEN string as input -> get NNUE score as output
And OMG! Here it is! https://hxim.github.io/Stockfish-Evaluation-Guide/ (NNUE tab)
It allows user to upload NNUE in the browser and gives a score to whatever position is available on board! Can you believe it?!
So now I can implement it in C and embed into my engine and make a tutorial series on it!
Yes, it would be slow, inefficient but I'm interested in a proof of concept.
So thanks to everybody participating, eventually you've helped me to find the right solution.
Didactic chess engines:
https://www.chessprogramming.org/Maksim_Korzh
Chess programming YouTube channel:
https://www.youtube.com/channel/UCB9-pr ... KKqDgXhsMQ
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Hacking around CFish NNUE
I just finished implementing the library without incremental updates.
https://github.com/dshawul/nnue-probe.git
It has a FEN interface and a pieces[], squares[] interface as well.
Funny thing is that incremental updates give only a 4.5% speedup on the start position, so it may not be worth it at all.
Code: Select all
DLLExport void _CDECL nnue_init(const char * evalFile);
DLLExport int _CDECL nnue_evaluate(int player, int* pieces, int* squares);
DLLExport int _CDECL nnue_evaluate_fen(const char* fen);
The "NNUE" implementation below was directly implemented in my engine with all the incremental updates etc.
The "NNUE without increment" is through the library.
NNUE is about 65% of the speed of classic.
Code: Select all
No NNUE = 2100 knps
NNUE = 1400 knps
NNUE without increment = 1337 knps
NNUE without increment + memcpy = 1100 knps
NNUE without incremental evaluation is just 4.5% slower on the start position.
When I added the updates in makemove with memcpy for Accumulator/DirtyPiece it was 14-18% slower.
But without it, it is so small not to worry about at all.
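For readers wondering what the incremental update being measured here actually does: on each move only a few input features change, so instead of recomputing the first layer's sums from scratch you subtract the weight columns of removed features and add those of new ones. A minimal sketch, with hypothetical sizes and names (not the actual CFish data layout):

```c
#include <stdint.h>
#include <string.h>

#define HIDDEN   256    /* hypothetical first-layer width */
#define FEATURES 41024  /* hypothetical HalfKP-style feature count */

static int16_t weights[FEATURES][HIDDEN]; /* one weight column per feature */

typedef struct { int16_t acc[HIDDEN]; } Accumulator;

/* Full refresh: sum the weight columns of all active features. */
void refresh(Accumulator *a, const int *active, int n)
{
    memset(a->acc, 0, sizeof a->acc);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < HIDDEN; j++)
            a->acc[j] += weights[active[i]][j];
}

/* Incremental update: a move only adds/removes a handful of features
   (the "DirtyPiece" information), so touch just those columns. */
void update(Accumulator *a, const int *added, int na,
            const int *removed, int nr)
{
    for (int i = 0; i < nr; i++)
        for (int j = 0; j < HIDDEN; j++)
            a->acc[j] -= weights[removed[i]][j];
    for (int i = 0; i < na; i++)
        for (int j = 0; j < HIDDEN; j++)
            a->acc[j] += weights[added[i]][j];
}
```

The speedup question above is then whether `update` over a few columns beats a full `refresh` over ~30 columns plus the bookkeeping (the memcpy of Accumulator/DirtyPiece in makemove) it requires.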
-
- Posts: 5646
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Hacking around CFish NNUE
Autovectorization might do fine on some parts but probably not on all parts. Also, some speed is gained by reordering the weights in the right way for the vector instruction set being used, which autovectorization won't be able to do (for example, SSE2 storing the weights as 16-bit ints is a huge win). It may be ugly, but it only needs to be done once (until the network architecture changes).
Daniel Shawul wrote: ↑Thu Oct 15, 2020 7:07 pm
I wonder why auto-vectorization is not used instead of the manual SIMD code NNUE currently has. There is separate code for AVX2, SSE3, SSE2, SSE etc., which is kind of ugly. Your code above can be easily auto-vectorized by the compiler, so I wonder why this approach is not taken. I don't see any operation preventing auto-vectorization in a simple dense network. The NNUE code either doesn't have easily vectorizable "default code" or compilers do a really bad job at it, as it seems it is 3x slower without vectorization.
But autovectorization is worth a try if one wants the cleanest possible code. One could maybe also use the gcc vector extensions.
If you tried the default code and found it to be 3x slower, that might be because the Makefile did not enable the avx2/sse instructions.
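As a concrete illustration of the kind of "default code" being discussed, here is a hedged sketch of a dense layer in plain C, written with constant trip counts and restrict-qualified pointers so gcc/clang have a fair chance of autovectorizing the inner loop. The sizes, the fixed-point shift, and the clamping are illustrative, not the actual NNUE layer:

```c
#include <stdint.h>

#define IN  512   /* illustrative layer sizes, not the real architecture */
#define OUT 32

/* Plain-C dense layer: for each output neuron, a dot product of one
   row-major row of int8 weights with the int8 inputs, accumulated in
   int32, rescaled, and passed through a clipped ReLU. */
void dense_layer(const int8_t *restrict in,
                 const int8_t *restrict weights, /* OUT rows of IN weights */
                 const int32_t *restrict bias,
                 int8_t *restrict out)
{
    for (int o = 0; o < OUT; o++) {
        int32_t sum = bias[o];
        for (int i = 0; i < IN; i++)
            sum += (int32_t)weights[o * IN + i] * in[i];
        sum >>= 6;  /* NNUE-style fixed-point rescale (value is a guess) */
        out[o] = (int8_t)(sum < 0 ? 0 : sum > 127 ? 127 : sum);
    }
}
```

This is roughly what syzygy's caveat is about: a compiler can vectorize this loop as written, but it cannot reorder the weight layout in memory the way the hand-written SSE2/AVX2 paths do.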
-
- Posts: 775
- Joined: Sat Sep 08, 2018 5:37 pm
- Location: Ukraine
- Full name: Maksim Korzh
Re: Hacking around CFish NNUE
Daniel Shawul wrote: ↑Fri Oct 16, 2020 12:56 am I just finished implementing the library without incremental updates.
https://github.com/dshawul/nnue-probe.git
It has a FEN interface and a pieces[],squares[] interface as well. Funny thing is that incremental updates gives only 4.5% speedup on the start position that it may not be worth it at all.
Code: Select all
DLLExport void _CDECL nnue_init(const char * evalFile);
DLLExport int _CDECL nnue_evaluate(int player, int* pieces, int* squares);
DLLExport int _CDECL nnue_evaluate_fen(const char* fen);
The "NNUE" implementation below was directly implemented in my engine with all the incremental update etc.
The "NNUE without increment" is through the library.
NNUE is about 65% of the speed of classic.
Code: Select all
No NNUE = 2100 knps
NNUE = 1400 knps
NNUE without increment = 1337 knps
NNUE without increment + memcpy = 1100 knps
NNUE without incremental evaluation is just 4.5% slower on the start position.
When I added the updates in makemove with memcpy for Accumulator/DirtyPiece it was 14-18% slower.
But without it, it is so small not to worry about at all.
OMG! Seems exactly what I was dreaming of!
Thank you so much, Daniel!
Can't be grateful enough!
Didactic chess engines:
https://www.chessprogramming.org/Maksim_Korzh
Chess programming YouTube channel:
https://www.youtube.com/channel/UCB9-pr ... KKqDgXhsMQ
-
- Posts: 775
- Joined: Sat Sep 08, 2018 5:37 pm
- Location: Ukraine
- Full name: Maksim Korzh
Re: Hacking around CFish NNUE
I've retrieved the score via FEN:
Daniel Shawul wrote: ↑Fri Oct 16, 2020 12:56 am
I just finished implementing the library without incremental updates.
https://github.com/dshawul/nnue-probe.git
It has a FEN interface and a pieces[],squares[] interface as well. Funny thing is that incremental updates gives only 4.5% speedup on the start position that it may not be worth it at all.
Code: Select all
DLLExport void _CDECL nnue_init(const char * evalFile);
DLLExport int _CDECL nnue_evaluate(int player, int* pieces, int* squares);
DLLExport int _CDECL nnue_evaluate_fen(const char* fen);
The "NNUE" implementation below was directly implemented in my engine with all the incremental update etc.
The "NNUE without increment" is through the library.
NNUE is about 65% of the speed of classic.
Code: Select all
No NNUE = 2100 knps
NNUE = 1400 knps
NNUE without increment = 1337 knps
NNUE without increment + memcpy = 1100 knps
NNUE without incremental evaluation is just 4.5% slower on the start position.
When I added the updates in makemove with memcpy for Accumulator/DirtyPiece it was 14-18% slower.
But without it, it is so small not to worry about at all.
But what confuses me slightly a bit is the output score (probably I'm doing something wrong):
Code: Select all
int main()
{
nnue_init("nn-04cf2b4ed1da.nnue");
int score = nnue_evaluate_fen("rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1");
printf("score: %d\n", score);
return 0;
}
E.g. the above code gives output: 108
while the same network in the JS interface gives: 57 (0.28)
To try it yourself, open https://hxim.github.io/Stockfish-Evaluation-Guide/ -> go to the NNUE tab, download a network (I used nn-04cf2b4ed1da.nnue from https://tests.stockfishchess.org/nns)
Is this a matter of different implementations, or did I do something horribly wrong?
Didactic chess engines:
https://www.chessprogramming.org/Maksim_Korzh
Chess programming YouTube channel:
https://www.youtube.com/channel/UCB9-pr ... KKqDgXhsMQ
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Hacking around CFish NNUE
There was a bug that I just fixed with decoding FEN.
maksimKorzh wrote: ↑Fri Oct 16, 2020 2:22 am
I've retrieved the score via FEN:
Daniel Shawul wrote: ↑Fri Oct 16, 2020 12:56 am
I just finished implementing the library without incremental updates.
https://github.com/dshawul/nnue-probe.git
It has a FEN interface and a pieces[],squares[] interface as well. Funny thing is that incremental updates gives only 4.5% speedup on the start position that it may not be worth it at all.
Code: Select all
DLLExport void _CDECL nnue_init(const char * evalFile);
DLLExport int _CDECL nnue_evaluate(int player, int* pieces, int* squares);
DLLExport int _CDECL nnue_evaluate_fen(const char* fen);
The "NNUE" implementation below was directly implemented in my engine with all the incremental update etc.
The "NNUE without increment" is through the library.
NNUE is about 65% of the speed of classic.
Code: Select all
No NNUE = 2100 knps
NNUE = 1400 knps
NNUE without increment = 1337 knps
NNUE without increment + memcpy = 1100 knps
NNUE without incremental evaluation is just 4.5% slower on the start position.
When I added the updates in makemove with memcpy for Accumulator/DirtyPiece it was 14-18% slower.
But without it, it is so small not to worry about at all.But what confuses me slightly a bit is the output score (probably I'm doing something wrong)Code: Select all
int main() { nnue_init("nn-04cf2b4ed1da.nnue"); int score = nnue_evaluate_fen("rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1"); printf("score: %d\n", score); return 0; }
e.g. above code gives output: 108
while same network in JS interface gives: 57 (0.28)
to try it yourself you can navigate here: file:///home/maksim/Desktop/nnue.html -> go to NNUE tab, download network (I used nn-04cf2b4ed1da.nnue from https://tests.stockfishchess.org/nns)
Is this the matter of different implementations or I did something horribly wrong?
Here is how you probe it from FEN
Code: Select all
from ctypes import *
nnue = cdll.LoadLibrary("libnnueprobe.so")
nnue.nnue_init("/home/daniel/Scorpio/nets-scorpio/nn-baeb9ef2d183.nnue")
score = nnue.nnue_evaluate_fen("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")
print("Score =", score)
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Hacking around CFish NNUE
Hmm.. I believe I had all the -mavx2 etc. defined without the -DUSE_AVX2 for that test, but maybe I made a mistake.
syzygy wrote: ↑Fri Oct 16, 2020 1:07 am
Autovectorization might do fine on some parts but probably not on all parts. Also, some speed is gained by reordering the weights in the right way for the vector instruction set being used, which autovectorization won't be able to do (for example, SSE2 storing the weights as 16-bit ints is a huge win). It may be ugly, but it only needs to be done once (until the network architecture changes).
Daniel Shawul wrote: ↑Thu Oct 15, 2020 7:07 pm
I wonder why auto-vectorization is not used instead of the manual SIMD code NNUE currently has. There is separate code for AVX2, SSE3, SSE2, SSE etc., which is kind of ugly. Your code above can be easily auto-vectorized by the compiler, so I wonder why this approach is not taken. I don't see any operation preventing auto-vectorization in a simple dense network. The NNUE code either doesn't have easily vectorizable "default code" or compilers do a really bad job at it, as it seems it is 3x slower without vectorization.
But autovectorization is worth a try if one wants the cleanest possible code. One could maybe also use the gcc vector extensions.
If you tried the default code and found it to be 3x slower, that might be because the Makefile did not enable the avx2/sse instructions.
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Hacking around CFish NNUE
It seems the issue is the compiler.
Daniel Shawul wrote: ↑Fri Oct 16, 2020 2:52 am
Hmm.. I believe I had all the -mavx2 etc. defined without the -DUSE_AVX2 for that test, but maybe I made a mistake.
syzygy wrote: ↑Fri Oct 16, 2020 1:07 am
Autovectorization might do fine on some parts but probably not on all parts. Also, some speed is gained by reordering the weights in the right way for the vector instruction set being used, which autovectorization won't be able to do (for example, SSE2 storing the weights as 16-bit ints is a huge win). It may be ugly, but it only needs to be done once (until the network architecture changes).
Daniel Shawul wrote: ↑Thu Oct 15, 2020 7:07 pm
I wonder why auto-vectorization is not used instead of the manual SIMD code NNUE currently has. There is separate code for AVX2, SSE3, SSE2, SSE etc., which is kind of ugly. Your code above can be easily auto-vectorized by the compiler, so I wonder why this approach is not taken. I don't see any operation preventing auto-vectorization in a simple dense network. The NNUE code either doesn't have easily vectorizable "default code" or compilers do a really bad job at it, as it seems it is 3x slower without vectorization.
But autovectorization is worth a try if one wants the cleanest possible code. One could maybe also use the gcc vector extensions.
If you tried the default code and found it to be 3x slower, that might be because the Makefile did not enable the avx2/sse instructions.
clang does vectorization well and now the slowdown is about 1.7x.
OTOH, gcc version 7.5 seems to have a problem, unless I missed some flag:
Code: Select all
gcc -c -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 -msse2 -msse -mavx2 vect.c
This doesn't report any vectorized loops.
GCC is actually 6.2x slower when I don't do incremental update, and with incremental update it is about 3x slower.
For clang, I used
Code: Select all
clang -c -O3 -ftree-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize vect.c
and it does report vectorized loops and is fast.
-
- Posts: 775
- Joined: Sat Sep 08, 2018 5:37 pm
- Location: Ukraine
- Full name: Maksim Korzh
Re: Hacking around CFish NNUE
Thank you so much for fixing it!
Daniel Shawul wrote: ↑Fri Oct 16, 2020 2:49 am
There was a bug that I just fixed with decoding FEN.
maksimKorzh wrote: ↑Fri Oct 16, 2020 2:22 am
I've retrieved the score via FEN:
Daniel Shawul wrote: ↑Fri Oct 16, 2020 12:56 am
I just finished implementing the library without incremental updates.
https://github.com/dshawul/nnue-probe.git
It has a FEN interface and a pieces[],squares[] interface as well. Funny thing is that incremental updates gives only 4.5% speedup on the start position that it may not be worth it at all.
Code: Select all
DLLExport void _CDECL nnue_init(const char * evalFile);
DLLExport int _CDECL nnue_evaluate(int player, int* pieces, int* squares);
DLLExport int _CDECL nnue_evaluate_fen(const char* fen);
The "NNUE" implementation below was directly implemented in my engine with all the incremental update etc.
The "NNUE without increment" is through the library.
NNUE is about 65% of the speed of classic.
Code: Select all
No NNUE = 2100 knps
NNUE = 1400 knps
NNUE without increment = 1337 knps
NNUE without increment + memcpy = 1100 knps
NNUE without incremental evaluation is just 4.5% slower on the start position.
When I added the updates in makemove with memcpy for Accumulator/DirtyPiece it was 14-18% slower.
But without it, it is so small not to worry about at all.
But what confuses me slightly a bit is the output score (probably I'm doing something wrong)
Code: Select all
int main()
{
    nnue_init("nn-04cf2b4ed1da.nnue");
    int score = nnue_evaluate_fen("rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1");
    printf("score: %d\n", score);
    return 0;
}
e.g. above code gives output: 108
while same network in JS interface gives: 57 (0.28)
to try it yourself you can navigate here: file:///home/maksim/Desktop/nnue.html -> go to NNUE tab, download network (I used nn-04cf2b4ed1da.nnue from https://tests.stockfishchess.org/nns)
Is this the matter of different implementations or I did something horribly wrong?
Here is how you probe it from FEN:
Code: Select all
from ctypes import *
nnue = cdll.LoadLibrary("libnnueprobe.so")
nnue.nnue_init("/home/daniel/Scorpio/nets-scorpio/nn-baeb9ef2d183.nnue")
score = nnue.nnue_evaluate_fen("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")
print("Score =", score)
Your snippet is in Python, but can't I use the lib in C? I had some errors when I imported nnue.h, but then I just added main() to the nnue.cpp file and called nnue_evaluate_fen() from there. In the future I just want to compile nnue.cpp and misc.cpp along with my engine, is that ok?
Didactic chess engines:
https://www.chessprogramming.org/Maksim_Korzh
Chess programming YouTube channel:
https://www.youtube.com/channel/UCB9-pr ... KKqDgXhsMQ