Making NNUE 60% faster
Moderator: Ras
-
- Posts: 391
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Making NNUE 60% faster
please, just don't kill me for being potentially off by 1
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
- Posts: 1955
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: Making NNUE 60% faster
I did not change a damn thing. You just quoted the definition of maddubs, and then used it wrong. Even though you had the answer right in front of you! Brainless.
dangi12012 wrote: ↑Tue Jun 13, 2023 8:53 pm Sopel of course strategically ignored this sentence completely:
"Valid in the context of FeatureTransform * L0 Transformation because of [0, 126] limit."
Which is incidentally the biggest layer and the one we care about the most.
Andrew changed his mind from "all of what is posted here is bunk" to - everyone knew it all along. Which is textbook gaslighting.
I rest my case - this thread is finished now.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Making NNUE 60% faster
Quick update: Here is the code I am using to update the indices.
It shaves off a few instructions for each move considered.
I am actually quite proud to have found these improvements, and I don't share the sentiment of "oh, it's only 1%", because if you find 10 of those you suddenly make a huge leap. Moreover, I repeatedly find that by refactoring code many times over you suddenly see ways to consolidate calculations, leading to the 60% overall that will be shown in a comparison repo. Yes, it is measured in isolation, but there is no other way to compare implementations, since code runs in a different context with differing cache pressure in every engine.
I am also making good progress on porting everything to CUDA, since as of a few weeks ago CUDA Toolkit 12.1 FINALLY supports C++20, which includes the <bit> header. That means we don't need to #ifdef everything and can use std::popcount() natively inside __device__ code.
https://godbolt.org/z/4WhEddEr4
You can see how activate_feature2 is a bit shorter and more efficient than activate_feature, with no downside whatsoever.
Code: Select all
//asm proof https://godbolt.org/z/4WhEddEr4
struct nnue_idx
{
int16_t* weight;
int32_t* psqt;
};
//Square, King, Piece => index lookups
static inline nnue_idx idx_own[12][64 * 64];
static inline nnue_idx idx_opp[12][64 * 64];
//Continuous memory accumulator for both colors, contains all state that nnue needs to keep track of
struct Accumulator {
alignas(64) std::int16_t accumulation[2 * traits];
std::int32_t psqt[2 * 8];
Piece occupancy[64];
const inline int32_t* own_psqt() const noexcept { return psqt; }
const inline int32_t* opp_psqt() const noexcept { return psqt + dims; }
};
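As a rough illustration of how these tables are meant to be consumed, here is a minimal sketch of adding a single feature to the own-perspective half of the accumulator - not the actual activate_feature2 from the Compiler Explorer link, just a sketch assuming AVX2, traits being a multiple of 16, dims == 8, and the square encoding matching whatever the init code below puts into the table:
Code: Select all
#include <immintrin.h>

//Sketch only: add one feature to the own-perspective half of the accumulator
static inline void add_feature(Accumulator& acc, int piece, int ksq, int sq) {
    const nnue_idx e   = idx_own[piece][(ksq << 6) | sq];  //one 2D lookup, pointers ready
    __m256i*       dst = reinterpret_cast<__m256i*>(acc.accumulation);
    const __m256i* src = reinterpret_cast<const __m256i*>(e.weight);
    for (int k = 0; k < traits / 16; ++k)                   //16 int16_t per __m256i
        dst[k] = _mm256_add_epi16(dst[k], _mm256_loadu_si256(src + k));
    for (int k = 0; k < 8; ++k)                             //PSQT buckets
        acc.psqt[k] += e.psqt[k];
}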
Here is the code I am using to initialize NNUE, where we rotate by 90° to get into the color-agnostic chess bitboard representation.
This also contains the code to map generic piece codes to NNUE piece codes, since engines and NNUE don't necessarily agree on what a piece = 3 actually is.
Code: Select all
// INIT
static bool Init(const char* fname) {
std::string description;
//All of this is just to load weights and set the index lookup used in nnue.
//SF schema: IndexType(orient(perspective, s, ksq) + PieceSquareIndex[perspective][pc] + PS_NB * KingBuckets[o_ksq]);
//Gigantua schema: feature_idx_own[pc][sq];
//2nd optimisation: return pointer instead of index!
//rotate king to right half of the board. required by nnue
int king_orient[2][64] = {
{ 7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0,
7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0 },
{ 63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56,
63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56 }
};
//Multiplied by 64 gives the right piece offset into NNUE
int piece_offset[2][16] = {
{ 0, 0, 2, 4, 6, 8, 10, 0, 0, 1, 3, 5, 7, 9, 10, 0 },
{ 0, 1, 3, 5, 7, 9, 10, 0, 0, 0, 2, 4, 6, 8, 10, 0 }
};
//flipDiagA8H1 of original definitions, gives the right king offset into NNUE
int king_buckets[2][64] = {
{ 0, 2816, 5632, 8448, 11264, 14080, 16896, 19712, 704, 3520, 6336, 9152, 11968, 14784, 17600, 20416, 1408, 4224, 7040, 9856, 12672, 15488, 18304, 21120, 2112, 4928, 7744, 10560, 13376, 16192, 19008, 21824,
2112, 4928, 7744, 10560, 13376, 16192, 19008, 21824, 1408, 4224, 7040, 9856, 12672, 15488, 18304, 21120, 704, 3520, 6336, 9152, 11968, 14784, 17600, 20416, 0, 2816, 5632, 8448, 11264, 14080, 16896, 19712 },
{ 19712, 16896, 14080, 11264, 8448, 5632, 2816, 0, 20416, 17600, 14784, 11968, 9152, 6336, 3520, 704, 21120, 18304, 15488, 12672, 9856, 7040, 4224, 1408, 21824, 19008, 16192, 13376, 10560, 7744, 4928, 2112,
21824, 19008, 16192, 13376, 10560, 7744, 4928, 2112, 21120, 18304, 15488, 12672, 9856, 7040, 4224, 1408, 20416, 17600, 14784, 11968, 9152, 6336, 3520, 704, 19712, 16896, 14080, 11264, 8448, 5632, 2816, 0 }
};
int piece_rotate[2][64] = {
{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7},
{ 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56,
63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63}
};
//map from 0..16 into correct nnue piece code
int piece_list[12] = { 1,2,3,4,5,6,9,10,11,12,13,14 };
for (int pc = 0; pc < 16; pc++) {
for (int ksq = 0; ksq < 64; ksq++) {
for (int sq = 0; sq < 64; sq++) {
int sq_flip = (((sq >> 3) | (sq << 3)) & 63) ^ 63; //flip sq too
int idx = (ksq << 6) | sq_flip;
int px = std::distance(piece_list, std::find(piece_list, piece_list + 12, pc));
if (px >= 12) continue; //we condense 16 into 12
//Move lea and shift by 12 to initialisation time!
int own_idx = ((sq ^ piece_rotate[0][ksq]) + piece_offset[0][pc] * 64 + king_buckets[0][ksq]);
int opp_idx = ((sq ^ piece_rotate[1][ksq]) + piece_offset[1][pc] * 64 + king_buckets[1][ksq]);
idx_own[px][idx] =
{
FeatureTransform<simdlevel>::weights + traits * own_idx,
FeatureTransform<simdlevel>::psqtWeights + dims * own_idx
};
idx_opp[px][idx] =
{
FeatureTransform<simdlevel>::weights + traits * opp_idx,
FeatureTransform<simdlevel>::psqtWeights + dims * opp_idx
};
}
}
}
std::ifstream nnue_file(fname, std::ios::binary);
if (!read_header(nnue_file, description)) return false;
if (!FeatureTransform<simdlevel>::read_parameters(nnue_file)) return false;
for (std::size_t i = 0; i < dims; ++i)
{
if (!network[i].read_parameters(nnue_file)) return false;
}
std::ignore = nnue_file.peek(); //EOF is only true after peeking
return nnue_file && nnue_file.eof();
}
The reason for the 90°-rotated board: when pawns walk from left to right (pawn >>= 1 instead of moving the pawn upwards with <<= 8), we can completely eliminate the check for the edge files, because shifting past either end of the 64-bit word drops the bit for free and nothing wraps around.
https://github.com/official-stockfish/S ... ard.h#L165
With color agnosticity and the 90° rotation, all of the code behind the URL above becomes the snippet below: successful removal of a template parameter, plus no more masking. And while template branches do not incur a runtime cost, we have freed up instruction cache for other things.
atk_d1 = https://tearth.dev/bitboard-viewer/ - layout 2 - 9241421688590303745, and that is how pawns capture diagonally here and move with >> 1.
Code: Select all
bb_func pawn_atk_d1(uint64_t pawns, uint64_t enemy) {
return (pawns >> 9) & enemy; //d1 9241421688590303745
}
bb_func pawn_atk_d2(uint64_t pawns, uint64_t enemy) {
return (pawns << 7) & enemy;
}
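For contrast, a minimal sketch of the same pawn attacks in a conventional, unrotated layout (A1 = bit 0, white moving north); the constants and names are the usual textbook ones, not taken from any particular repo, and they show the extra file masking the rotated layout avoids:
Code: Select all
#include <cstdint>

constexpr std::uint64_t FILE_A = 0x0101010101010101ULL;
constexpr std::uint64_t FILE_H = 0x8080808080808080ULL;

//Conventional layout: without the masks, captures would wrap to the opposite file
static inline std::uint64_t pawn_atk_nw(std::uint64_t pawns, std::uint64_t enemy) {
    return ((pawns & ~FILE_A) << 7) & enemy;
}
static inline std::uint64_t pawn_atk_ne(std::uint64_t pawns, std::uint64_t enemy) {
    return ((pawns & ~FILE_H) << 9) & enemy;
}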
All quite fun stuff.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 56
- Joined: Tue Sep 14, 2021 12:29 am
- Full name: .
Re: Making NNUE 60% faster
We've been waiting many months for this + a successful Fishtest test...
dangi12012 wrote: ↑Fri Jun 23, 2023 6:31 pm leading to 60% overall which will be shown in a comparison repo
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Making NNUE 60% faster
Let me make this perfectly clear:
ImNotStockfish wrote: ↑Sun Jul 23, 2023 12:32 pm We've been waiting many months for this + a successful Fishtest test...
dangi12012 wrote: ↑Fri Jun 23, 2023 6:31 pm leading to 60% overall which will be shown in a comparison repo
Your question actually shows an honest interest in that topic and I will release the comparison repo.
A Fishtest PR is not something I do or will do. There will be a repo with a conceptual interface (a C++ concept) of piece_added, piece_removed, piece_move, and behind it we will compare SF NNUE, my NNUE and other binary-compatible re-implementations (if they even exist) in a multithreaded manner, in single files without external dependencies.
What I did do is look at the raw idea of NNUE and implement it from scratch, starting from this screenshot only, finishing in April 2023.
https://github.com/glinscott/nnue-pytor ... led_v2.svg
I have compared it to the SF implementation of SFNNv5 and found a 60% performance difference.
If you look into the SF implementation you see memcpy, mixing of AVX2 and SSE code in L1, and, among other things, an accumulator split into two parts.
With some simple reshuffling upon load you get a better memory model and thus more MEvals/s in propagate.
Layers emit into a single memory location, and no memcpy is needed when you write directly into the right parts.
I won't reiterate all the ideas - they are in the thread. I provided many Compiler Explorer links which prove better assembly without any downside whatsoever.
For instance, my optimized propagation code lives in a single function:
Code: Select all
#define when_avx2 requires std::is_same_v<simdlevel, T>
static inline int propagate_layers(const __m256i* restrict acc, const __m256i* restrict w0, const __m256i* restrict b0) noexcept when_avx2
//body elided; the weight pointer is simply advanced with *w0++ inside the hot loop
Let me make this perfectly clear, part II:
Propagate, even with all the ideas above, is already fairly optimized; a 10% improvement was possible there with the information above.
All the other improvements come from having to do the incremental NNUE work only HALF of the time, if you consolidate the accumulator's added and removed features.
Think about how you could similarly consolidate this by using more memory (a sketch of the idea follows the snippet below).
Code: Select all
//We care about these two loops; one of them can be removed when we consolidate
//activated and deactivated features across the permutations of capture and quiet moves
for (unsigned k = 0; k < NumRegs; ++k)
    acc[k] = vec_add_16(acc[k], column[k]);

for (IndexType k = 0; k < NumRegs; ++k)
    acc[k] = vec_sub_16(acc[k], column[k]);
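As a minimal sketch of that consolidation (an illustration of the idea, not Stockfish's or the finished code), the common quiet-move case - one feature removed, one added - can touch each accumulator register once instead of twice; column_added/column_removed are assumed names here:
Code: Select all
//One fused pass: subtract the removed feature's column and add the added one's
//while the accumulator register is already loaded
for (unsigned k = 0; k < NumRegs; ++k)
    acc[k] = vec_add_16(vec_sub_16(acc[k], column_removed[k]), column_added[k]);
One way to read the "more memory" hint above is to go a step further and precompute combined columns for frequent remove/add pairs, trading table size for a single load per register.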
What I will do before that is finish NNUE in CUDA, because I think a 60% uplift over SF is not even close to what is possible.
My dream is essentially each CUDA core having its own slice of global memory for its own private accumulator and running MCTS or AB without thread divergence.
Todo list (dangi12012):
- Publication of my fast NNUE implementation.
- Publication of the advanced Galois-field movegen.
- Publication of Gigantua V2, taking advantage of a SIMD board, SIMD movegen and SIMD move application for twice the perft speed of the generator in my signature, but WITHOUT any templates - no bulk counting, no templates.
- Publication of NNUEv5 in a single file.
- Finishing the Leorik C# NNUE codepath.
Other todos: https://www.talkchess.com/forum3/viewto ... 10#p950354
TLDR:
I will release once I can say NNUE runs x amount faster on the GPU, which will be more than 60% guaranteed.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 195
- Joined: Thu Feb 04, 2021 10:24 pm
- Full name: Arnold Magnum
Re: Making NNUE 60% faster
It will probably be easier to improve the speed of Stockfish by 500% to 1000% on Apple devices with the M1, M2 or M3 chips (Pro, Max, Ultra) than to make NNUE 60% faster on other devices.
dangi12012 wrote: ↑Tue Jun 06, 2023 6:26 pm For the past few years people have said that movegen is not where engines spend most of their time - which of course is true, but the design philosophy of taking an idea (not the implementation), re-imagining memory layouts, intrinsics, etc., and pushing it to the fullest extent is always a nice thing to have.
So in that spirit I have taken a few months and re-implemented NNUE from scratch, starting just from the screenshot here:
https://github.com/glinscott/nnue-pytor ... chitecture
So what I want to share is applicable to AVX2 and some of AVX512. I want to share what is possible without spilling the source code for it. I was actually surprised that the innermost hot loop of NNUE can be optimized for all the AVX2 and AVX512 intrinsic types, and it was a very fun journey.
After I was done I compared what I have with what the official SF repo has implemented for the same ideas.
The incremental part of NNUE is optimal with good intrinsics etc. (I mean the part where we create and maintain 1536 int16_t by incrementally updating the state, into what they call accumulators). Having two of them is already not optimal, since both can fit in one linear array as input into the next part.
From an architecture perspective I can tell you that ALL of NNUE (including init, file loading and hash comparison) can fit in around 150 lines of code, with all of the actual non-incremental code (propagate_layers) fitting in 54 lines of readable C++.
Of course if compilers were perfect we would be done here, but even clang does not 1) reorder memory read from a file and reshuffle it in order to use optimal intrinsics, and 2) does not emit optimal intrinsics from loops to begin with. So add around 60 LOC for AVX2.
All that being said, it fits in a single file, which makes it maintainable, and as a .h there is no linking, which makes clang much more efficient compared to .cpp + .h.
Optimisations missing from SF repo
0) Board layout. I had the luck that in Gigantua my layout is color agnostic, making the player's pawns always move like this: pawn >>= 1.
Incidentally it seems that NNUE prefers this as well, and SF has to go through some hoops to align indices.
0a) Memory layout. Having the optimal board layout means that the binary network files are not directly compatible and need to be reshuffled. Here I can show you exactly what my philosophy is, and I can make choices that are not possible otherwise.
This code is called for each and every change in NNUE, which is between 2 and 6 times per move. Source:
Code: Select all
//SF schema: IndexType(orient(perspective, s, ksq) + PieceSquareIndex[perspective][pc] + PS_NB * KingBuckets[o_ksq]);
//Gigantua schema: feature_idx_own[pc][sq];
//2nd optimisation: return pointer instead of index!
https://github.com/official-stockfish/S ... hm.cpp#L30
Worse, the indices get put into a list, which is unnecessary when using a template visitor pattern. That makes the implementation of the same idea 20x faster for that snippet: 1 function, 4 lookups and some multiplications all get replaced by an instant 2D lookup.
Also, returning a pointer directly is a nice speedup compared to returning an index in this case.
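Restated as a sketch (the struct mirrors the pointer tables shown earlier in the thread; the index packing is an assumption):
Code: Select all
#include <cstdint>

//All of the orient/PieceSquareIndex/KingBuckets arithmetic collapses into one 2D load
//that already yields the weight pointers for the changed feature
struct feature_ptr { const std::int16_t* weight; const std::int32_t* psqt; };
static feature_ptr feature_table[12][64 * 64];   //filled once at init

static inline feature_ptr lookup(int piece, int ksq, int sq) {
    return feature_table[piece][(ksq << 6) | sq];
}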
1) These definitions mean the compiler has to iterate over multiple indices, which does not get optimized away even at -O3. You can consolidate AffineTransform and the ReLU into a single function.
https://github.com/official-stockfish/S ... ure.h#L117
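As an illustration of that consolidation, a scalar reference sketch (not the SIMD code; the weight-scale shift of 6 and the 0..127 clipping follow the usual Stockfish conventions and are assumptions here):
Code: Select all
#include <cstdint>
#include <algorithm>

//Fused AffineTransform + ClippedReLU: one pass produces the next layer's clipped
//uint8 activations directly, without keeping an intermediate int32 buffer around
static void affine_relu(const std::uint8_t* in, int in_dim,
                        const std::int8_t* weights, const std::int32_t* biases,
                        std::uint8_t* out, int out_dim, int shift = 6) {
    for (int o = 0; o < out_dim; ++o) {
        std::int32_t sum = biases[o];
        for (int i = 0; i < in_dim; ++i)
            sum += std::int32_t(weights[o * in_dim + i]) * in[i];
        out[o] = std::uint8_t(std::clamp(sum >> shift, 0, 127));
    }
}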
Going forward we can consolidate all layers into a simple function definition, and this is invoked with an up-to-date accumulator pointer. Notice how that is a single pointer even though NNUE updates both colors. It correctly returns the material bypass value, material = (material * 600 * 16) / (127 * (1 << weight_scale)), plus the positional bypass, plus the eval, as defined in the V6 network.
Code: Select all
static inline int propagate_layers(const std::int16_t* acc, int8_t* w0, int32_t* b0, int8_t* w1, int32_t* b1, int8_t* w2, int32_t* b2) noexcept
Expanding on that idea, we can even consolidate and reshuffle the memory layout of the weights to have all weights in a linear, padded layout that is perfect for AVX2 or AVX512:
Code: Select all
static inline int propagate_layers(const __m256i* restrict acc, const __m256i* restrict w0, const __m256i* restrict b0)
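A sketch of the kind of load-time reshuffle this implies (the row-major source order, the zero padding to whole registers and the function name are assumptions):
Code: Select all
#include <immintrin.h>
#include <cstdint>
#include <cstring>
#include <vector>

//Repack an [out][in] int8 weight matrix into one linear, 32-byte-aligned buffer,
//padded with zeros so every output row is a whole number of __m256i registers
static std::vector<__m256i> repack_weights(const std::int8_t* w, int out, int in) {
    const int regs_per_row = (in + 31) / 32;                     //32 int8 per __m256i
    std::vector<__m256i> packed(out * regs_per_row, _mm256_setzero_si256());
    for (int o = 0; o < out; ++o)
        std::memcpy(&packed[o * regs_per_row], w + o * in, in);  //tail stays zero
    return packed;
}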
2) These buffers can be removed completely - you don't need them and can work with registers directly; incidentally, the maximum usage is much smaller than defined here and fits in registers.
https://github.com/official-stockfish/S ... ture.h#L95
3) memcpy is used, which is quite slow when the domain already guarantees (but the stdlib cannot assume, so it has to run a few ifs) that the pointers are aligned and non-overlapping.
https://godbolt.org/z/15oYqMKjn
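A sketch of the kind of replacement meant here, assuming the caller can guarantee 32-byte alignment, no overlap and a size that is a multiple of one register (function name and signature are placeholders):
Code: Select all
#include <immintrin.h>
#include <cstdint>

//Explicit aligned vector copy: no runtime alignment/overlap checks like a general memcpy
static inline void copy_aligned_i16(std::int16_t* dst, const std::int16_t* src, int n) {
    //caller guarantees: n % 16 == 0, dst/src 32-byte aligned, no overlap
    for (int k = 0; k < n / 16; ++k)
        _mm256_store_si256(reinterpret_cast<__m256i*>(dst) + k,
                           _mm256_load_si256(reinterpret_cast<const __m256i*>(src) + k));
}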
4)
NNUE weights can be calculated faster by skipping some intrinsics for AVX2.
Making this much faster: https://github.com/official-stockfish/S ... 40-L211C40
This is a throwaway sentence above, but it is the most important part right here. If you read this, it has maybe 30% of the overall impact.
If you are an SF developer, read this sentence and you will understand instantly. Applicable to all AVX and AVX512 targets except VNNI (there it is already good).
The ReLU layer clips the inputs to 0..128, making the transformation from packed 16-bit to 32-bit accumulators unnecessary on every iteration.
So you don't need _mm256_madd_epi16 every iteration; only on every 32nd iteration is overflow possible. Skipping all of these intrinsics leads perfectly to this:
acc = _mm256_add_epi16(acc, _mm256_maddubs_epi16(input_simd[m], *w0++));
for 8 accumulators. Using 8 accumulators instead of 4 has another advantage:
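Spelled out as a sketch (an illustration of the deferred-widening idea, not the actual inner loop): accumulate the maddubs results in 16 bits and widen to 32 bits only once per block. Whether the block size is actually safe depends on the weight and activation ranges - which is exactly what was debated earlier in this thread - so the value 32 below is taken from the text, not verified:
Code: Select all
#include <immintrin.h>

//Deferred widening for one dot product: cheap 16-bit adds inside the block,
//one _mm256_madd_epi16 per block to widen into the 32-bit sum
static inline __m256i dot_deferred_widen(const __m256i* input, const __m256i* w, int in_regs) {
    const __m256i ones  = _mm256_set1_epi16(1);
    __m256i       sum32 = _mm256_setzero_si256();
    constexpr int BLOCK = 32;                       //overflow margin claimed above, unverified
    for (int base = 0; base < in_regs; base += BLOCK) {
        const int end   = base + BLOCK < in_regs ? base + BLOCK : in_regs;
        __m256i   sum16 = _mm256_setzero_si256();
        for (int m = base; m < end; ++m)
            sum16 = _mm256_add_epi16(sum16, _mm256_maddubs_epi16(input[m], w[m]));
        sum32 = _mm256_add_epi32(sum32, _mm256_madd_epi16(sum16, ones));
    }
    return sum32;                                   //8 partial int32 sums, reduce afterwards
}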
5)
Without register spilling it is possible to increase the internal accumulators to 8, making this function m256_haddx8 - that allows the function to never mix SSE and AVX, which is a slowdown.
https://github.com/official-stockfish/S ... #LL196C37-
For this one I can share my code.
Of course my style is to overload this function so that it does more than it says (adding the biases, for example).
Code: Select all
static inline __m256i accumulator_reduce(__m256i accs[8], __m256i bias) {
    const __m256i one = _mm256_set1_epi16(1);
    accs[0] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[0], one), _mm256_madd_epi16(accs[1], one));
    accs[1] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[2], one), _mm256_madd_epi16(accs[3], one));
    accs[2] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[4], one), _mm256_madd_epi16(accs[5], one));
    accs[3] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[6], one), _mm256_madd_epi16(accs[7], one));

    //a0 a1 a2 a3; b0 b1 b2 b3; c0 c1 c2 c3; d0 d1 d2 d3; a4 a5 a6 a7; b4 b5 b6 b7; c4 c5 c6 c7; d4 d5 d6 d7
    //e0 e1 e2 e3; f0 f1 f2 f3; g0 g1 g2 g3; h0 h1 h2 h3; e4 e5 e6 e7; f4 f5 f6 f7; g4 g5 g6 g7; h4 h5 h6 h7
    //a4 a5 a6 a7; b4 b5 b6 b7; c4 c5 c6 c7; d4 d5 d6 d7; e0 e1 e2 e3; f0 f1 f2 f3; g0 g1 g2 g3; h0 h1 h2 h3
    accs[0] = _mm256_hadd_epi32(accs[0], accs[1]);
    accs[1] = _mm256_hadd_epi32(accs[2], accs[3]);
    accs[2] = _mm256_permute2x128_si256(accs[0], accs[1], 0b100001);

    //Blend and add bias
    return _mm256_add_epi32(bias, _mm256_blend_epi32(
        _mm256_add_epi32(accs[0], accs[2]),
        _mm256_add_epi32(accs[1], accs[2]), 0b11110000));
}
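A short usage sketch for the function above (a wrapper with assumed names; it presumes each of the eight accumulators holds the 16-bit partial sums of one output neuron and that biases points to eight int32 biases):
Code: Select all
#include <immintrin.h>
#include <cstdint>

static inline void reduce_eight(__m256i accs[8], const std::int32_t* biases, std::int32_t out[8]) {
    const __m256i bias = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(biases));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), accumulator_reduce(accs, bias));
    //out[0..7] now hold the eight biased sums, ready for the next layer's clipped ReLU
}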
Disclaimer: some of this is not applicable when VNNI is available, but most of it is, and I can't say what the improved memory layout does for NEON, but going from many pointers to two aligned SIMD pointers should help.
Going forward I can say that the post format here is too limited and I should finally get around to publishing all of this knowledge on a separate blog-like website. All in all I described 7 ideas, out of around 35, that are relevant to NNUE performance and that increased performance.
Finally, here is my performance log. This is inferences per second on random positions (so including a full accumulator rebuild and no incrementality) on a single thread:
Code: Select all
//17.04.23 0.04 MNPS
//17.04.23 0.045 MNPS
//17.04.23 0.054 MNPS
//18.04.23 0.054 MNPS
//18.04.23 0.146 MNPS
//20.04.23 0.253 MNPS
//20.04.23 0.262 MNPS
//21.04.23 0.266 MNPS
//22.04.23 0.269 MNPS
//25.04.23 3.067 MNPS
//27.04.23 3.320 MNPS
//27.04.23 3.370 MNPS
//29.04.23 4.450 MNPS
//01.05.23 4.491 MNPS
//02.05.23 4.712 MNPS
It boils down to:
1) having everything readable inside a single function with const and non-const pointers and no outside references, etc. (helps the compiler A LOT)
2) improving the memory layout and reshuffling some weights to get from the incremental layer to the output faster with better intrinsics
3) decreasing the overall cost of intrinsics by finding redundancies from domain knowledge (for example, knowing a value is strictly below 128 suddenly lets you remove some instructions because it cannot overflow an integer)
-
- Posts: 2696
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Making NNUE 60% faster
Wishful thinking. The AI stuff is badly documented and very limited in what network architectures it can deal with; the point of NNUE is not needing support from a GPU, and even if it were, Nvidia would be the way.
Rasmus Althoff
https://www.ct800.net