Making NNUE 60% faster
Moderator: Ras
-
- Posts: 391
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Making NNUE 60% faster
please, just don't kill me for being potentially off by 1
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
- Posts: 1955
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: Making NNUE 60% faster
I did not change a damn thing. You just quoted the definition of maddubs, and then used it wrong. Even though you had the answer right in front of you! Brainless.
dangi12012 wrote: ↑Tue Jun 13, 2023 8:53 pm Sopel of course strategically ignored this sentence completely:
"Valid in the context of FeatureTransform * L0 Transformation because of [0, 126] limit."
Which is incidentally the biggest layer and the one we care about the most.
Andrew changed his mind from "all of what is posted here is bunk" to - everyone knew it all along. Which is textbook gaslighting.
I rest my case - this thread is finished now.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Making NNUE 60% faster
Quick update: Here is the code I am using to update the indices.
It shaves off a few instructions for each move considered.
I am actually quite proud to have found these improvements, and I don't share the sentiment of "oh, it's only 1%", because if you find 10 of those you suddenly make a huge leap. Moreover, I repeatedly find that by refactoring code many times over you suddenly see ways to consolidate calculations, leading to the 60% overall that will be shown in a comparison repo. Yes, it is measured in isolation, but there is no other way to compare implementations, since code runs in a different context with differing cache pressure in every engine.
I am also making good progress on porting everything to CUDA, since as of a few weeks ago CUDA Toolkit 12.1 FINALLY supports C++20, which includes the <bit> header. That means we don't need to #ifdef everything and can use std::popcount() natively inside __device__ code.
https://godbolt.org/z/4WhEddEr4
You can see how activate_feature2 is a bit shorter and more efficient than activate_feature, with no downside whatsoever.
Code: Select all
//asm proof https://godbolt.org/z/4WhEddEr4
struct nnue_idx
{
int16_t* weight;
int32_t* psqt;
};
//Square, King, Piece => index lookups
static inline nnue_idx idx_own[12][64 * 64];
static inline nnue_idx idx_opp[12][64 * 64];
//Continuous memory accumulator for both colors, contains all state that nnue needs to keep track of
struct Accumulator {
alignas(64) std::int16_t accumulation[2 * traits];
std::int32_t psqt[2 * 8];
Piece occupancy[64];
const inline int32_t* own_psqt() const noexcept { return psqt; }
const inline int32_t* opp_psqt() const noexcept { return psqt + dims; }
};
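As a rough illustration of how these tables are meant to be consumed, here is a minimal sketch of adding a single feature to the own-perspective half of the accumulator - not the actual activate_feature2 from the Compiler Explorer link, just a sketch assuming AVX2, traits being a multiple of 16, dims == 8, and the square encoding matching whatever the init code below puts into the table:
Code: Select all
#include <immintrin.h>

//Sketch only: add one feature to the own-perspective half of the accumulator
static inline void add_feature(Accumulator& acc, int piece, int ksq, int sq) {
    const nnue_idx e   = idx_own[piece][(ksq << 6) | sq];  //one 2D lookup, pointers ready
    __m256i*       dst = reinterpret_cast<__m256i*>(acc.accumulation);
    const __m256i* src = reinterpret_cast<const __m256i*>(e.weight);
    for (int k = 0; k < traits / 16; ++k)                   //16 int16_t per __m256i
        dst[k] = _mm256_add_epi16(dst[k], _mm256_loadu_si256(src + k));
    for (int k = 0; k < 8; ++k)                             //PSQT buckets
        acc.psqt[k] += e.psqt[k];
}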
Here is the code I am using to initialize NNUE, where we rotate by 90° to get into the color-agnostic chess bitboard representation.
This also contains the code to map generic piece codes to NNUE piece codes, since engines and NNUE don't necessarily agree on what a piece = 3 actually is.
Code: Select all
// INIT
static bool Init(const char* fname) {
std::string description;
//All of this is just to load weights and set the index lookup used in nnue.
//SF schema: IndexType(orient(perspective, s, ksq) + PieceSquareIndex[perspective][pc] + PS_NB * KingBuckets[o_ksq]);
//Gigantua schema: feature_idx_own[pc][sq];
//2nd optimisation: return pointer instead of index!
//rotate king to right half of the board. required by nnue
int king_orient[2][64] = {
{ 7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0,
7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0, 7, 7, 7, 7, 0, 0, 0, 0 },
{ 63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56,
63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56, 63, 63, 63, 63, 56, 56, 56, 56 }
};
//Multiplied by 64 gives the right piece offset into NNUE
int piece_offset[2][16] = {
{ 0, 0, 2, 4, 6, 8, 10, 0, 0, 1, 3, 5, 7, 9, 10, 0 },
{ 0, 1, 3, 5, 7, 9, 10, 0, 0, 0, 2, 4, 6, 8, 10, 0 }
};
//flipDiagA8H1 of original definitions, gives the right king offset into NNUE
int king_buckets[2][64] = {
{ 0, 2816, 5632, 8448, 11264, 14080, 16896, 19712, 704, 3520, 6336, 9152, 11968, 14784, 17600, 20416, 1408, 4224, 7040, 9856, 12672, 15488, 18304, 21120, 2112, 4928, 7744, 10560, 13376, 16192, 19008, 21824,
2112, 4928, 7744, 10560, 13376, 16192, 19008, 21824, 1408, 4224, 7040, 9856, 12672, 15488, 18304, 21120, 704, 3520, 6336, 9152, 11968, 14784, 17600, 20416, 0, 2816, 5632, 8448, 11264, 14080, 16896, 19712 },
{ 19712, 16896, 14080, 11264, 8448, 5632, 2816, 0, 20416, 17600, 14784, 11968, 9152, 6336, 3520, 704, 21120, 18304, 15488, 12672, 9856, 7040, 4224, 1408, 21824, 19008, 16192, 13376, 10560, 7744, 4928, 2112,
21824, 19008, 16192, 13376, 10560, 7744, 4928, 2112, 21120, 18304, 15488, 12672, 9856, 7040, 4224, 1408, 20416, 17600, 14784, 11968, 9152, 6336, 3520, 704, 19712, 16896, 14080, 11264, 8448, 5632, 2816, 0 }
};
int piece_rotate[2][64] = {
{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7},
{ 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56,
63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63}
};
//map from 0..16 into correct nnue piece code
int piece_list[12] = { 1,2,3,4,5,6,9,10,11,12,13,14 };
for (int pc = 0; pc < 16; pc++) {
for (int ksq = 0; ksq < 64; ksq++) {
for (int sq = 0; sq < 64; sq++) {
int sq_flip = (((sq >> 3) | (sq << 3)) & 63) ^ 63; //flip sq too
int idx = (ksq << 6) | sq_flip;
int px = std::distance(piece_list, std::find(piece_list, piece_list + 12, pc));
if (px >= 12) continue; //we condense 16 into 12
//Move lea and shift by 12 to initialisation time!
int own_idx = ((sq ^ piece_rotate[0][ksq]) + piece_offset[0][pc] * 64 + king_buckets[0][ksq]);
int opp_idx = ((sq ^ piece_rotate[1][ksq]) + piece_offset[1][pc] * 64 + king_buckets[1][ksq]);
idx_own[px][idx] =
{
FeatureTransform<simdlevel>::weights + traits * own_idx,
FeatureTransform<simdlevel>::psqtWeights + dims * own_idx
};
idx_opp[px][idx] =
{
FeatureTransform<simdlevel>::weights + traits * opp_idx,
FeatureTransform<simdlevel>::psqtWeights + dims * opp_idx
};
}
}
}
std::ifstream nnue_file(fname, std::ios::binary);
if (!read_header(nnue_file, description)) return false;
if (!FeatureTransform<simdlevel>::read_parameters(nnue_file)) return false;
for (std::size_t i = 0; i < dims; ++i)
{
if (!network[i].read_parameters(nnue_file)) return false;
}
std::ignore = nnue_file.peek(); //EOF is only true after peeking
return nnue_file && nnue_file.eof();
}
The reason for the 90°-rotated board: when pawns walk from left to right (pawn >>= 1 instead of moving the pawn upwards with <<= 8), we can completely eliminate the check for the edge files, because shifting past either end of the 64-bit word drops the bit for free and nothing wraps around.
https://github.com/official-stockfish/S ... ard.h#L165
With color agnosticity and the 90° rotation, all of the code behind the URL above becomes the snippet below: successful removal of a template parameter, plus no more masking. And while template branches do not incur a runtime cost, we have freed up instruction cache for other things.
atk_d1 = https://tearth.dev/bitboard-viewer/ - layout 2 - 9241421688590303745, and that is how pawns capture diagonally here and move with >> 1.
Code: Select all
bb_func pawn_atk_d1(uint64_t pawns, uint64_t enemy) {
return (pawns >> 9) & enemy; //d1 9241421688590303745
}
bb_func pawn_atk_d2(uint64_t pawns, uint64_t enemy) {
return (pawns << 7) & enemy;
}
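For contrast, a minimal sketch of the same pawn attacks in a conventional, unrotated layout (A1 = bit 0, white moving north); the constants and names are the usual textbook ones, not taken from any particular repo, and they show the extra file masking the rotated layout avoids:
Code: Select all
#include <cstdint>

constexpr std::uint64_t FILE_A = 0x0101010101010101ULL;
constexpr std::uint64_t FILE_H = 0x8080808080808080ULL;

//Conventional layout: without the masks, captures would wrap to the opposite file
static inline std::uint64_t pawn_atk_nw(std::uint64_t pawns, std::uint64_t enemy) {
    return ((pawns & ~FILE_A) << 7) & enemy;
}
static inline std::uint64_t pawn_atk_ne(std::uint64_t pawns, std::uint64_t enemy) {
    return ((pawns & ~FILE_H) << 9) & enemy;
}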
All quite fun stuff.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 56
- Joined: Tue Sep 14, 2021 12:29 am
- Full name: .
Re: Making NNUE 60% faster
We've been waiting many months for this + a successful Fishtest test...
dangi12012 wrote: ↑Fri Jun 23, 2023 6:31 pm leading to 60% overall which will be shown in a comparison repo
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Making NNUE 60% faster
Let me make this perfectly clear:
ImNotStockfish wrote: ↑Sun Jul 23, 2023 12:32 pm We've been waiting many months for this + a successful Fishtest test...
dangi12012 wrote: ↑Fri Jun 23, 2023 6:31 pm leading to 60% overall which will be shown in a comparison repo
Your question actually shows an honest interest in that topic and I will release the comparison repo.
A Fishtest PR is not something I do or will do. There will be a repo with a conceptual interface (a C++ concept) of piece_added, piece_removed, piece_move, and behind it we will compare SF NNUE, my NNUE and other binary-compatible re-implementations (if they even exist) in a multithreaded manner, in single files without external dependencies.
What I did do is look at the raw idea of NNUE and implement it from scratch, starting from this screenshot only, finishing in April 2023.
https://github.com/glinscott/nnue-pytor ... led_v2.svg
I have compared it to the SF implementation of SFNNv5 and found a 60% performance difference.
If you look into the SF implementation you see memcpy, mixing of AVX2 and SSE code in L1, and, among other things, an accumulator split into two parts.
With some simple reshuffling upon load you get a better memory model and thus more MEvals/s in propagate.
Layers emit into a single memory location, and no memcpy is needed when you write directly into the right parts.
I won't reiterate all the ideas - they are in the thread. I provided many Compiler Explorer links which prove better assembly without any downside whatsoever.
For instance, my optimized propagation code lives in a single function:
Code: Select all
#define when_avx2 requires std::is_same_v<simdlevel, T>
static inline int propagate_layers(const __m256i* restrict acc, const __m256i* restrict w0, const __m256i* restrict b0) noexcept when_avx2
//body elided; the weight pointer is simply advanced with *w0++ inside the hot loop
Let me make this perfectly clear, part II:
Propagate, even with all the ideas above, is already fairly optimized; a 10% improvement was possible there with the information above.
All the other improvements come from having to do the incremental NNUE work only HALF of the time, if you consolidate the accumulator's added and removed features.
Think about how you could similarly consolidate this by using more memory (a sketch of the idea follows the snippet below).
Code: Select all
//We care about these two loops; one of them can be removed when we consolidate
//activated and deactivated features across the permutations of capture and quiet moves
for (unsigned k = 0; k < NumRegs; ++k)
    acc[k] = vec_add_16(acc[k], column[k]);

for (IndexType k = 0; k < NumRegs; ++k)
    acc[k] = vec_sub_16(acc[k], column[k]);
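As a minimal sketch of that consolidation (an illustration of the idea, not Stockfish's or the finished code), the common quiet-move case - one feature removed, one added - can touch each accumulator register once instead of twice; column_added/column_removed are assumed names here:
Code: Select all
//One fused pass: subtract the removed feature's column and add the added one's
//while the accumulator register is already loaded
for (unsigned k = 0; k < NumRegs; ++k)
    acc[k] = vec_add_16(vec_sub_16(acc[k], column_removed[k]), column_added[k]);
One way to read the "more memory" hint above is to go a step further and precompute combined columns for frequent remove/add pairs, trading table size for a single load per register.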
What I will do before that is finish NNUE in CUDA, because I think a 60% uplift over SF is not even close to what is possible.
My dream is essentially each CUDA core having its own slice of global memory for its own private accumulator and running MCTS or AB without thread divergence.
Todo list (dangi12012):
- Publication of my fast NNUE implementation.
- Publication of the advanced Galois-field movegen.
- Publication of Gigantua V2, taking advantage of a SIMD board, SIMD movegen and SIMD move application for twice the perft speed of the generator in my signature, but WITHOUT any templates - no bulk counting, no templates.
- Publication of NNUEv5 in a single file.
- Finishing the Leorik C# NNUE codepath.
Other todos: https://www.talkchess.com/forum3/viewto ... 10#p950354
TLDR:
I will release once I can say NNUE runs x amount faster on the GPU, which will be more than 60% guaranteed.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 195
- Joined: Thu Feb 04, 2021 10:24 pm
- Full name: Arnold Magnum
Re: Making NNUE 60% faster
It will probably be easier to improve the speed of Stockfish by 500% to 1000% on Apple devices with the M1, M2 or M3 chips (Pro, Max, Ultra) than to make NNUE 60% faster on other devices.
dangi12012 wrote: ↑Tue Jun 06, 2023 6:26 pm For the past few years people have said that movegen is not where engines spend most of their time - which of course is true, but the design philosophy of taking an idea (not the implementation), re-imagining memory layouts, intrinsics, etc., and pushing it to the fullest extent is always a nice thing to have.
So in that spirit I have taken a few months and re-implemented NNUE from scratch, starting just from the screenshot here:
https://github.com/glinscott/nnue-pytor ... chitecture
So what I want to share is applicable to AVX2 and some of AVX512. I want to share what is possible without spilling the source code for it. I was actually surprised that the innermost hot loop of NNUE can be optimized for all the AVX2 and AVX512 intrinsic types, and it was a very fun journey.
After I was done I compared what I have with what the official SF repo has implemented for the same ideas.
The incremental part of NNUE is optimal with good intrinsics etc. (I mean the part where we create and maintain 1536 int16_t by incrementally updating the state, into what they call accumulators). Having two of them is already not optimal, since both can fit in one linear array as input into the next part.
From an architecture perspective I can tell you that ALL of NNUE (including init, file loading and hash comparison) can fit in around 150 lines of code, with all of the actual non-incremental code (propagate_layers) fitting in 54 lines of readable C++.
Of course if compilers were perfect we would be done here, but even clang does not 1) reorder memory read from a file and reshuffle it in order to use optimal intrinsics, and 2) does not emit optimal intrinsics from loops to begin with. So add around 60 LOC for AVX2.
All that being said, it fits in a single file, which makes it maintainable, and as a .h there is no linking, which makes clang much more efficient compared to .cpp + .h.
Optimisations missing from SF repo
0) Board layout. I had the luck that in Gigantua my layout is color agnostic, making the player's pawns always move like this: pawn >>= 1.
Incidentally it seems that NNUE prefers this as well, and SF has to go through some hoops to align indices.
0a) Memory layout. Having the optimal board layout means that the binary network files are not directly compatible and need to be reshuffled. Here I can show you exactly what my philosophy is, and I can make choices that are not possible otherwise.
This code is called for each and every change in NNUE, which is between 2 and 6 times per move. Source:
Code: Select all
//SF schema: IndexType(orient(perspective, s, ksq) + PieceSquareIndex[perspective][pc] + PS_NB * KingBuckets[o_ksq]);
//Gigantua schema: feature_idx_own[pc][sq];
//2nd optimisation: return pointer instead of index!
https://github.com/official-stockfish/S ... hm.cpp#L30
Worse, the indices get put into a list, which is unnecessary when using a template visitor pattern. That makes the implementation of the same idea 20x faster for that snippet: 1 function, 4 lookups and some multiplications all get replaced by an instant 2D lookup.
Also, returning a pointer directly is a nice speedup compared to returning an index in this case.
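Restated as a sketch (the struct mirrors the pointer tables shown earlier in the thread; the index packing is an assumption):
Code: Select all
#include <cstdint>

//All of the orient/PieceSquareIndex/KingBuckets arithmetic collapses into one 2D load
//that already yields the weight pointers for the changed feature
struct feature_ptr { const std::int16_t* weight; const std::int32_t* psqt; };
static feature_ptr feature_table[12][64 * 64];   //filled once at init

static inline feature_ptr lookup(int piece, int ksq, int sq) {
    return feature_table[piece][(ksq << 6) | sq];
}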
1) These definitions mean the compiler has to iterate over multiple indices, which does not get optimized away even at -O3. You can consolidate AffineTransform and the ReLU into a single function.
https://github.com/official-stockfish/S ... ure.h#L117
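As an illustration of that consolidation, a scalar reference sketch (not the SIMD code; the weight-scale shift of 6 and the 0..127 clipping follow the usual Stockfish conventions and are assumptions here):
Code: Select all
#include <cstdint>
#include <algorithm>

//Fused AffineTransform + ClippedReLU: one pass produces the next layer's clipped
//uint8 activations directly, without keeping an intermediate int32 buffer around
static void affine_relu(const std::uint8_t* in, int in_dim,
                        const std::int8_t* weights, const std::int32_t* biases,
                        std::uint8_t* out, int out_dim, int shift = 6) {
    for (int o = 0; o < out_dim; ++o) {
        std::int32_t sum = biases[o];
        for (int i = 0; i < in_dim; ++i)
            sum += std::int32_t(weights[o * in_dim + i]) * in[i];
        out[o] = std::uint8_t(std::clamp(sum >> shift, 0, 127));
    }
}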
Going forward we can consolidate all layers into a simple function definition, and this is invoked with an up-to-date accumulator pointer. Notice how that is a single pointer even though NNUE updates both colors. It correctly returns the material bypass value, material = (material * 600 * 16) / (127 * (1 << weight_scale)), plus the positional bypass, plus the eval, as defined in the V6 network.
Code: Select all
static inline int propagate_layers(const std::int16_t* acc, int8_t* w0, int32_t* b0, int8_t* w1, int32_t* b1, int8_t* w2, int32_t* b2) noexcept
Expanding on that idea, we can even consolidate and reshuffle the memory layout of the weights to have all weights in a linear, padded layout that is perfect for AVX2 or AVX512:
Code: Select all
static inline int propagate_layers(const __m256i* restrict acc, const __m256i* restrict w0, const __m256i* restrict b0)
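A sketch of the kind of load-time reshuffle this implies (the row-major source order, the zero padding to whole registers and the function name are assumptions):
Code: Select all
#include <immintrin.h>
#include <cstdint>
#include <cstring>
#include <vector>

//Repack an [out][in] int8 weight matrix into one linear, 32-byte-aligned buffer,
//padded with zeros so every output row is a whole number of __m256i registers
static std::vector<__m256i> repack_weights(const std::int8_t* w, int out, int in) {
    const int regs_per_row = (in + 31) / 32;                     //32 int8 per __m256i
    std::vector<__m256i> packed(out * regs_per_row, _mm256_setzero_si256());
    for (int o = 0; o < out; ++o)
        std::memcpy(&packed[o * regs_per_row], w + o * in, in);  //tail stays zero
    return packed;
}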
2) These buffers can be removed completely - you don't need them and can work with registers directly; incidentally, the maximum usage is much smaller than defined here and fits in registers.
https://github.com/official-stockfish/S ... ture.h#L95
3) memcpy is used, which is quite slow when the domain already guarantees (but the stdlib cannot assume, so it has to run a few ifs) that the pointers are aligned and non-overlapping.
https://godbolt.org/z/15oYqMKjn
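A sketch of the kind of replacement meant here, assuming the caller can guarantee 32-byte alignment, no overlap and a size that is a multiple of one register (function name and signature are placeholders):
Code: Select all
#include <immintrin.h>
#include <cstdint>

//Explicit aligned vector copy: no runtime alignment/overlap checks like a general memcpy
static inline void copy_aligned_i16(std::int16_t* dst, const std::int16_t* src, int n) {
    //caller guarantees: n % 16 == 0, dst/src 32-byte aligned, no overlap
    for (int k = 0; k < n / 16; ++k)
        _mm256_store_si256(reinterpret_cast<__m256i*>(dst) + k,
                           _mm256_load_si256(reinterpret_cast<const __m256i*>(src) + k));
}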
4)
NNUE weights can be calculated faster by skipping some intrinsics for AVX2.
Making this much faster: https://github.com/official-stockfish/S ... 40-L211C40
This is a throwaway sentence above, but it is the most important part right here. If you read this, it has maybe 30% of the overall impact.
If you are an SF developer, read this sentence and you will understand instantly. Applicable to all AVX and AVX512 targets except VNNI (there it is already good).
The ReLU layer clips the inputs to 0..128, making the transformation from packed 16-bit to 32-bit accumulators unnecessary on every iteration.
So you don't need _mm256_madd_epi16 every iteration; only on every 32nd iteration is overflow possible. Skipping all of these intrinsics leads perfectly to this:
acc = _mm256_add_epi16(acc, _mm256_maddubs_epi16(input_simd[m], *w0++));
for 8 accumulators. Using 8 accumulators instead of 4 has another advantage:
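Spelled out as a sketch (an illustration of the deferred-widening idea, not the actual inner loop): accumulate the maddubs results in 16 bits and widen to 32 bits only once per block. Whether the block size is actually safe depends on the weight and activation ranges - which is exactly what was debated earlier in this thread - so the value 32 below is taken from the text, not verified:
Code: Select all
#include <immintrin.h>

//Deferred widening for one dot product: cheap 16-bit adds inside the block,
//one _mm256_madd_epi16 per block to widen into the 32-bit sum
static inline __m256i dot_deferred_widen(const __m256i* input, const __m256i* w, int in_regs) {
    const __m256i ones  = _mm256_set1_epi16(1);
    __m256i       sum32 = _mm256_setzero_si256();
    constexpr int BLOCK = 32;                       //overflow margin claimed above, unverified
    for (int base = 0; base < in_regs; base += BLOCK) {
        const int end   = base + BLOCK < in_regs ? base + BLOCK : in_regs;
        __m256i   sum16 = _mm256_setzero_si256();
        for (int m = base; m < end; ++m)
            sum16 = _mm256_add_epi16(sum16, _mm256_maddubs_epi16(input[m], w[m]));
        sum32 = _mm256_add_epi32(sum32, _mm256_madd_epi16(sum16, ones));
    }
    return sum32;                                   //8 partial int32 sums, reduce afterwards
}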
5)
Without register spilling it is possible to increase the internal accumulators to 8, making this function m256_haddx8 - that allows the function to never mix SSE and AVX, which is a slowdown.
https://github.com/official-stockfish/S ... #LL196C37-
For this one I can share my code.
Of course my style is to overload this function so that it does more than it says (adding the biases, for example).
Code: Select all
static inline __m256i accumulator_reduce(__m256i accs[8], __m256i bias) {
    const __m256i one = _mm256_set1_epi16(1);
    accs[0] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[0], one), _mm256_madd_epi16(accs[1], one));
    accs[1] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[2], one), _mm256_madd_epi16(accs[3], one));
    accs[2] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[4], one), _mm256_madd_epi16(accs[5], one));
    accs[3] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[6], one), _mm256_madd_epi16(accs[7], one));

    //a0 a1 a2 a3; b0 b1 b2 b3; c0 c1 c2 c3; d0 d1 d2 d3; a4 a5 a6 a7; b4 b5 b6 b7; c4 c5 c6 c7; d4 d5 d6 d7
    //e0 e1 e2 e3; f0 f1 f2 f3; g0 g1 g2 g3; h0 h1 h2 h3; e4 e5 e6 e7; f4 f5 f6 f7; g4 g5 g6 g7; h4 h5 h6 h7
    //a4 a5 a6 a7; b4 b5 b6 b7; c4 c5 c6 c7; d4 d5 d6 d7; e0 e1 e2 e3; f0 f1 f2 f3; g0 g1 g2 g3; h0 h1 h2 h3
    accs[0] = _mm256_hadd_epi32(accs[0], accs[1]);
    accs[1] = _mm256_hadd_epi32(accs[2], accs[3]);
    accs[2] = _mm256_permute2x128_si256(accs[0], accs[1], 0b100001);

    //Blend and add bias
    return _mm256_add_epi32(bias, _mm256_blend_epi32(
        _mm256_add_epi32(accs[0], accs[2]),
        _mm256_add_epi32(accs[1], accs[2]), 0b11110000));
}
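A short usage sketch for the function above (a wrapper with assumed names; it presumes each of the eight accumulators holds the 16-bit partial sums of one output neuron and that biases points to eight int32 biases):
Code: Select all
#include <immintrin.h>
#include <cstdint>

static inline void reduce_eight(__m256i accs[8], const std::int32_t* biases, std::int32_t out[8]) {
    const __m256i bias = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(biases));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), accumulator_reduce(accs, bias));
    //out[0..7] now hold the eight biased sums, ready for the next layer's clipped ReLU
}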
Disclaimer: some of this is not applicable when VNNI is available, but most of it is, and I can't say what the improved memory layout does for NEON, but going from many pointers to two aligned SIMD pointers should help.
Going forward I can say that the post format here is too limited and I should finally get around to publishing all of this knowledge on a separate blog-like website. All in all I described 7 ideas, out of around 35, that are relevant to NNUE performance and that increased performance.
Finally, here is my performance log. This is inferences per second on random positions (so including a full accumulator rebuild and no incrementality) on a single thread:
Code: Select all
//17.04.23 0.04 MNPS
//17.04.23 0.045 MNPS
//17.04.23 0.054 MNPS
//18.04.23 0.054 MNPS
//18.04.23 0.146 MNPS
//20.04.23 0.253 MNPS
//20.04.23 0.262 MNPS
//21.04.23 0.266 MNPS
//22.04.23 0.269 MNPS
//25.04.23 3.067 MNPS
//27.04.23 3.320 MNPS
//27.04.23 3.370 MNPS
//29.04.23 4.450 MNPS
//01.05.23 4.491 MNPS
//02.05.23 4.712 MNPS
It boils down to:
1) having everything readable inside a single function with const and non-const pointers and no outside references, etc. (helps the compiler A LOT)
2) improving the memory layout and reshuffling some weights to get from the incremental layer to the output faster with better intrinsics
3) decreasing the overall cost of intrinsics by finding redundancies from domain knowledge (for example, knowing a value is strictly below 128 suddenly lets you remove some instructions because it cannot overflow an integer)
-
- Posts: 2696
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Making NNUE 60% faster
Wishful thinking. The AI stuff is badly documented and very limited in what network architectures it can deal with; the point of NNUE is not needing support from a GPU, and even if it were, Nvidia would be the way.
Rasmus Althoff
https://www.ct800.net