Making NNUE 60% faster

syzygy · Post by **syzygy** » Mon Jun 12, 2023 11:20 pm

chrisw wrote: ↑Mon Jun 12, 2023 7:07 pmSF devs were historically only interested in risk analysis, if it worked (gave Elo) it stayed in - which seems reasonable. Overflow isn't going to break anything (assuming the NNUE out is clamped to something sensible).

I think the main problem is that it will give different node counts on different CPUs.

connor_mcmonigle · Post by **connor_mcmonigle** » Tue Jun 13, 2023 5:13 am

dangi12012 wrote: ↑Mon Jun 12, 2023 8:31 pm ...
Finally we arrive at a point where a specific question is asked and I have the feeling that this thread is productive and intellectually honest.
Let me put one different point aside first - above is the argument that consolidation of added_idx and removed_idx increases cache pressure - its exactly the opposite. Instead of having to load 2 independent cache lines in chunks of 1024 shorts = 2048byte we only need to do that once. reading half as much memory and at 1 location instead of 2. Together with faster idx calculation that is the most substantial improvement.
...

This is simply wrong. This isn't how cache pressure works. Sure, you're loading half as many cache lines per update. However, you're also accessing on the order of N^2 as many unique indices (where N is the cardinality of the base feature set). Consequently, the effective cache hit rate is significantly lower due to the increased pressure. You've even observed this phenomenon yourself as you mention that it only proved 1.7x faster in your toy benchmark.

AndrewGrant · Post by **AndrewGrant** » Tue Jun 13, 2023 6:20 pm

_mm256_add_epi16 is not a saturated addition lol. You are not "safe" because you only care about the high bits due to the later activations.

I will provide actual code, and not a zip file and a god damn mountain of useless self-jerking text.

Code: Select all

#include <iostream>
#include <immintrin.h>

int main() {

    __m256i A = _mm256_set1_epi16(32000);
    __m256i Out = _mm256_add_epi16(A, A);

    int16_t buffer[16];
    _mm256_storeu_si256((__m256i*) buffer, Out);

    std::cout << buffer[0];
}

Code: Select all

$ g++ temp.c -march=native && ./a.out
-1536

Anyway, we've all concluded what we've already known, which is that the maddubs comment is not true.

dangi12012 · Post by **dangi12012** » Tue Jun 13, 2023 6:54 pm

You are both missing the point about _mm256_maddubs_epi16 entirely.

Find a mistake in the following statements - if not you 2 are disqualified from writing any posts here under this thread again. Andrew for calling bunk on page1 while admitting to using improved weight layout himself on page 3 - and missing the point on _mm256_maddubs_epi16.
Sopel for not knowing the 0..126 limit of featureactivation and missing the point for 1+ week on _mm256_maddubs_epi16.

Lets do the math:
input_simd[m] is in range [0, 126] because it is after activation
`*w0` is in range [-128, 127] because it's the weight and could assume any int8 value.

_mm256_maddubs_epi16
Vertically multiply each unsigned 8-bit integer from a with the corresponding signed 8-bit integer from b, producing intermediate signed 16-bit integers. Horizontally add adjacent pairs of intermediate signed 16-bit integers, and pack the saturated results in dst.
mm256_maddubs_epi16 is therefore in range [-128, 127] * [0, 126] == [-16128, 16002] for all 16 slots. (w0 * input_simd)

_mm256_add_epi16
We can sum "once". We add two values together in here to achieve a final:
[-32256, 32004] of a max possible [-32768, 32767]

At which point we need to upcast to 32bit integers for accumulation into a bigger int32 accumulator.
_mm256_madd_epi16(x, ONE)
--------------------------------------------------------------------------------------------------------------------------------
Meaning - yes the code in February was bugged (I see it that way too). But the code in master is not optimal either.
https://github.com/official-stockfish/S ... 7ceb83dbf8
So this improvement is valid - and 5% imrovement of the 60% of this thread.

Old:

Code: Select all

__m256i product0 = _mm256_maddubs_epi16(a0, b0);
__m256i product1 = _mm256_maddubs_epi16(a1, b1);
product0 = _mm256_madd_epi16(product0, _mm256_set1_epi16(1));
product1 = _mm256_madd_epi16(product1, _mm256_set1_epi16(1));
acc = _mm256_add_epi32(acc, _mm256_add_epi32(product0, product1));

New and valid in the context of FeatureTransform * L0 Transformation because of [0, 126] limit:

Code: Select all

__m256i product0 = _mm256_maddubs_epi16(a0, b0);
__m256i product1 = _mm256_maddubs_epi16(a1, b1);
__m256i sum = _mm256_add_epi16(product0, product1);
acc = _mm256_add_epi32(acc, _mm256_madd_epi16(sum, _mm256_set1_epi16(1)));

What I did on top of that is that "once" can be replaced by "up to 32" depending on the actual weights too. And now you better find a flaw in my math or be silent forever.

AndrewGrant · Post by **AndrewGrant** » Tue Jun 13, 2023 7:13 pm

Okay so TLDR:

1. 60% number is wrong
2. maddubs can not be massively chained together as you stated
3. Stockfish team already knows about this, and they know they could do 4x maddubs if they wanted to, since I told them how.

I'm glad you _finally_ addressed the only question that Sopel or I cared about even remotely here.

Sopel · Post by **Sopel** » Tue Jun 13, 2023 8:33 pm

dangi12012 wrote: ↑Tue Jun 13, 2023 6:54 pm Sopel for not knowing the 0..126 limit of featureactivation and missing the point for 1+ week on _mm256_maddubs_epi16.

Or maybe you don't know that only after the first layer the activation results in values in range 0..126, but later we have normal crelu that results in values in range 0..127? And maybe I've known for a long time how exactly _mm256_maddubs_epi16 works and under which conditions you'd be right, though you failed to specify them and I deemed them unreasonable? You're, as always, very quick to assume yourself as correct.

Sopel · Post by **Sopel** » Tue Jun 13, 2023 8:45 pm

dangi12012 wrote: ↑Tue Jun 13, 2023 6:54 pm _mm256_maddubs_epi16
Vertically multiply each unsigned 8-bit integer from a with the corresponding signed 8-bit integer from b, producing intermediate signed 16-bit integers. Horizontally add adjacent pairs of intermediate signed 16-bit integers, and pack the saturated results in dst.
mm256_maddubs_epi16 is therefore in range [-128, 127] * [0, 126] == [-16128, 16002] for all 16 slots. (w0 * input_simd)

You can't be serious. This is plain wrong. I refer you to the documentation https://www.intel.com/content/www/us/en ... xpand=4270. A single _mm256_maddubs_epi16 call sums two multiplication results.

dangi12012 · Post by **dangi12012** » Tue Jun 13, 2023 8:53 pm

Sopel of course strategically ignored this sentence completely:
"Valid in the context of FeatureTransform * L0 Transformation because of [0, 126] limit."
Which is incidentally the biggest layer and the one we care about the most.

Andrew changed his mind from "all of what is posted here is bunk" to - everyone knew it all along. Which is textbook gaslighting.

I rest my case - this thread is finished now.

Sopel · Post by **Sopel** » Tue Jun 13, 2023 8:56 pm

I don't know if it's on purpose, or you're just completely fucked as a human, but you're just insufferable.

dangi12012 · Post by **dangi12012** » Tue Jun 13, 2023 9:25 pm

Sidenote:
Im actually also suprised that you allow merges that subtracts elo.
https://github.com/official-stockfish/S ... 7ceb83dbf8
A few years back the policy was exactly the opposite as a reader above pointed out - as we dont care about what happens maybe once in 10 Billion nodes if it increases strenght? There are for example many engines that dont care about some multithreading aspects as atomics may be more expensive than very rare wrong results in the overall strength.
Not judging - suprised that thats the policy now. Others would not care if we commit playdough into the sourcecode - when the outcome plays stronger chess. Personally agnostic on that issue.

So you thought you found a bug and fixed it - but actually you made stockfish weaker as an engine. Additionally you wrote a post where you claim featuretransform goes from "input_simd[m]` is in range <0, 127>, because it is after activation" which is false also. Thats enough embarassment. Together with the post above that is against talkchess and SF contributor guidelines...

Also dont get me started on atomics. Andrew uses threadID indexed arrays - SF uses thread_local buffers. One of which has to be better - its a binary choice. If it depends on the arch it should be a preprocessor macro in either case.

Interesting stuff everything.
I rest my case again.

I might come back when I can announce some working NNUE cuda with their excellent intrinsics or the single .h NNUE file is good enough.
Too often I see the same 5 file structure with lots and lots of macros. Not so good for beginners and with KISS principles everything fits in a single readable file similar to this: https://github.com/lithander/Leorik/pul ... 560cab91c8

Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster