feature transformer update with AVX2

jdart · Post by **jdart** » Wed Jan 24, 2024 8:12 pm

I have recently reworked the Arasan code that updates the first layer of the network. This layer is the feature transformer, which basically does a sparse vector multiplication: given a set of indices, it applies the layer's weights and biases for those index values to compute or update a 1024-byte accumulator. That code is in simd.h (https://github.com/jdart1/nnue/blob/SFv4/simd.h), in the methods update and fullUpdate.
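As a rough illustration of what such a feature transformer update does, here is a minimal scalar sketch. It is not the code from simd.h: the names `FeatureTransformer`, `refreshAccumulator`, `updateAccumulator`, the feature count, and the choice of 512 `int16_t` elements (which gives 1024 bytes) are all assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical dimensions, for illustration only.
constexpr std::size_t kNumFeatures = 64;  // number of input features (assumed)
constexpr std::size_t kAccSize = 512;     // 512 x int16_t = 1024 bytes (assumed)
static_assert(kAccSize * sizeof(std::int16_t) == 1024, "accumulator is 1024 bytes");

using Accumulator = std::array<std::int16_t, kAccSize>;

struct FeatureTransformer {
    // One row of kAccSize weights per input feature, stored row-major.
    std::vector<std::int16_t> weights = std::vector<std::int16_t>(kNumFeatures * kAccSize);
    Accumulator biases{};

    const std::int16_t *row(unsigned feature) const {
        assert(feature < kNumFeatures);
        return &weights[feature * kAccSize];
    }
};

// Full refresh: start from the biases and add the weight row of every active feature.
inline void refreshAccumulator(const FeatureTransformer &ft,
                               const std::vector<unsigned> &active,
                               Accumulator &acc) {
    acc = ft.biases;
    for (unsigned f : active) {
        const std::int16_t *w = ft.row(f);
        for (std::size_t i = 0; i < kAccSize; ++i)
            acc[i] = static_cast<std::int16_t>(acc[i] + w[i]);
    }
}

// Incremental update: adjust an existing accumulator in place by adding the
// rows of newly active features and subtracting the rows of removed ones.
inline void updateAccumulator(const FeatureTransformer &ft,
                              const std::vector<unsigned> &added,
                              const std::vector<unsigned> &removed,
                              Accumulator &acc) {
    for (unsigned f : added) {
        const std::int16_t *w = ft.row(f);
        for (std::size_t i = 0; i < kAccSize; ++i)
            acc[i] = static_cast<std::int16_t>(acc[i] + w[i]);
    }
    for (unsigned f : removed) {
        const std::int16_t *w = ft.row(f);
        for (std::size_t i = 0; i < kAccSize; ++i)
            acc[i] = static_cast<std::int16_t>(acc[i] - w[i]);
    }
}
```

The incremental path touches only the rows of features that changed, which is why it is usually preferred over a full refresh when few features differ between positions.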

When I measure the performance of this code, I find:

NEON and AVX512 variants perform much better than the previous version, which lacked these specialized methods.

When compiling for AVX2, GCC generates quite poor code. The whole idea is to stage data into registers, update, then write back to memory. GCC keeps the data on the stack, not in registers. It only uses a couple of the 16 AVX2 vector registers. MSVC is pretty bad, too.

Clang does much better, not just using the registers better, but also doing some loop unrolling. However, neither GCC nor clang builds work on Windows, due to a longstanding and apparently unfixable bug (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412).
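The "stage data into registers, update, then write back" pattern mentioned above might look roughly like the following sketch. It is not the code from simd.h: `updateTiled`, the tile size of 8 registers, and the 512-element `int16_t` accumulator are assumptions for illustration, and a scalar fallback is included so the example also builds without AVX2. Whether the local `regs` array really stays in vector registers is up to the compiler, which is exactly the behavior being compared here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

constexpr std::size_t kAccSize = 512;  // 512 x int16_t = 1024 bytes (assumed)

// Computes out = prev + sum(addRows) - sum(subRows), element-wise.
// Every pointer is assumed to reference kAccSize int16_t values.
inline void updateTiled(const std::int16_t *prev,
                        const std::int16_t *const *addRows, std::size_t nAdd,
                        const std::int16_t *const *subRows, std::size_t nSub,
                        std::int16_t *out) {
    assert(prev != nullptr && out != nullptr);
#if defined(__AVX2__)
    constexpr std::size_t kLanes = 16;  // int16_t values per 256-bit register
    constexpr std::size_t kTile = 8;    // registers kept live per pass (assumed)
    static_assert(kAccSize % (kTile * kLanes) == 0, "tile must divide accumulator");
    for (std::size_t base = 0; base < kAccSize; base += kTile * kLanes) {
        __m256i regs[kTile];
        // Stage one tile of the accumulator (unaligned loads, no alignment assumed).
        for (std::size_t k = 0; k < kTile; ++k)
            regs[k] = _mm256_loadu_si256(
                reinterpret_cast<const __m256i *>(prev + base + k * kLanes));
        // Apply all added and removed rows while the tile is staged.
        for (std::size_t r = 0; r < nAdd; ++r)
            for (std::size_t k = 0; k < kTile; ++k)
                regs[k] = _mm256_add_epi16(regs[k], _mm256_loadu_si256(
                    reinterpret_cast<const __m256i *>(addRows[r] + base + k * kLanes)));
        for (std::size_t r = 0; r < nSub; ++r)
            for (std::size_t k = 0; k < kTile; ++k)
                regs[k] = _mm256_sub_epi16(regs[k], _mm256_loadu_si256(
                    reinterpret_cast<const __m256i *>(subRows[r] + base + k * kLanes)));
        // Write the finished tile back to memory once.
        for (std::size_t k = 0; k < kTile; ++k)
            _mm256_storeu_si256(
                reinterpret_cast<__m256i *>(out + base + k * kLanes), regs[k]);
    }
#else
    // Scalar fallback with the same result for values that do not overflow.
    for (std::size_t i = 0; i < kAccSize; ++i) {
        int v = prev[i];
        for (std::size_t r = 0; r < nAdd; ++r) v += addRows[r][i];
        for (std::size_t r = 0; r < nSub; ++r) v -= subRows[r][i];
        out[i] = static_cast<std::int16_t>(v);
    }
#endif
}
```

Processing the accumulator in tiles means each tile is loaded and stored once regardless of how many rows are applied, rather than making a full pass over memory per row.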

I am a bit stymied by this issue. I don't really want to drop down into assembly code, and anyway MSVC doesn't even support inline assembly on x64.

syzygy · Post by **syzygy** » Thu Jan 25, 2024 1:50 am

jdart wrote: ↑Wed Jan 24, 2024 8:12 pm However, neither GCC nor clang builds work on Windows, due to a longstanding and apparently unfixable bug (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412).

I remember some versions of gcc had an alignment problem on Windows, but I thought that gcc 9.x had solved that:

Code:

// Old gcc on Windows is unable to provide a 32-byte aligned stack.
// We need to hack around this when using AVX2 and AVX512.
#if     defined(__GNUC__ ) && (__GNUC__ < 9) && defined(_WIN32) \
    && !defined(__clang__) && !defined(__INTEL_COMPILER) \
    &&  defined(USE_AVX2)
#define ALIGNMENT_HACK
#endif

The hack isn't very nice but I guess it worked:

Code:

Value nnue_evaluate(const Position *pos)
{
  int32_t out_value;
#ifdef ALIGNMENT_HACK // work around a bug in old gcc on Windows
  uint8_t buf[sizeof(struct NetData) + 63];
  struct NetData *b = (struct NetData *)(buf + ((((uintptr_t)buf-1) ^ 0x3f) & 0x3f));
#define B(x) (b->x)
#else
  struct NetData buf;
#define B(x) (buf.x)
#endif
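The pointer arithmetic in the hack above can be checked in isolation. The sketch below is an illustration only: the helper names `hackOffset`, `alignUp64` and `alignInBuffer` are made up for the example. It shows that the expression `((addr - 1) ^ 0x3f) & 0x3f` yields the number of bytes needed to reach the next 64-byte boundary (which also satisfies the 32-byte requirement), and that 63 bytes of slack in the buffer are enough.

```cpp
#include <cassert>
#include <cstdint>

// Offset computed by the hack: bytes to skip so that addr + offset is a
// multiple of 64. Equivalent to 63 - ((addr - 1) & 63).
inline std::uintptr_t hackOffset(std::uintptr_t addr) {
    return ((addr - 1) ^ 0x3f) & 0x3f;
}

// The more conventional way to round an address up to a 64-byte boundary.
inline std::uintptr_t alignUp64(std::uintptr_t addr) {
    return (addr + 63) & ~static_cast<std::uintptr_t>(63);
}

// Returns a 64-byte aligned pointer inside buf. The caller must have
// over-allocated buf by at least 63 bytes, as the hack does.
inline unsigned char *alignInBuffer(unsigned char *buf) {
    assert(buf != nullptr);
    return buf + hackOffset(reinterpret_cast<std::uintptr_t>(buf));
}
```

Because the offset is always in the range 0 to 63, a buffer of `sizeof(struct NetData) + 63` bytes always has room for the aligned object.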

jdart · Post by **jdart** » Thu Jan 25, 2024 4:43 am

I believe the problem is that the compiler can spill data to temp variables on the stack and those are not aligned. Those memory accesses are invisible to the C++ coder and not controlled by them.

It is still an issue on newer compilers. My clang version on Windows is version 16.0.5.

syzygy · Post by **syzygy** » Sun Jan 28, 2024 4:35 pm

jdart wrote: ↑Thu Jan 25, 2024 4:43 am I believe the problem is that the compiler can spill data to temp variables on the stack and those are not aligned. Those memory accesses are invisible to the C++ coder and not controlled by them.

It is still an issue on newer compilers. My clang version on Windows is version 16.0.5.

In Stockfish the test is for gcc 9.2 and older in combination with Windows (and !clang):

Code:

    #if defined(__GNUC__) && (__GNUC__ < 9 || (__GNUC__ == 9 && __GNUC_MINOR__ <= 2)) \
      && defined(_WIN32) && !defined(__clang__)
        #define ALIGNAS_ON_STACK_VARIABLES_BROKEN
    #endif

I assume clang on Windows is still supported, so I don't quite understand why it works for Stockfish (without hack) but not for you. Does clang issue a warning?

Older gcc supported stack alignment only up to MAX_STACK_ALIGNMENT/MAX_SUPPORTED_STACK_ALIGNMENT, which was 16 on Windows and 32 on Linux. This indeed seems to have been fixed in 9.3:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89357

jdart · Post by **jdart** » Wed Jan 31, 2024 2:04 am

After some futzing with the Makefile, I have gotten it to build with clang-cl (the version of clang that integrates with MSVC tools). That seems to run ok with AVX2 on. Will look again also at regular clang/cygwin and see if that can be made to work.

"bench" command results:

MSVC compile with PGO: 681k nps (1 core)
clang-cl compile with PGO: 857k nps
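
For scale, taking the two bench figures above at face value, the clang-cl build searches roughly 26% more nodes per second than the MSVC build:

```latex
\frac{857\ \text{knps}}{681\ \text{knps}} \approx 1.258
```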

syzygy · Post by **syzygy** » Thu Feb 01, 2024 7:06 pm

jdart wrote: ↑Wed Jan 31, 2024 2:04 am After some futzing with the Makefile, I have gotten it to build with clang-cl (the version of clang that integrates with MSVC tools). That seems to run ok with AVX2 on. Will look again also at regular clang/cygwin and see if that can be made to work.

"bench" command results:

MSVC compile with PGO: 681k nps (1 core)
clang-cl compile with PGO: 857k nps

Nice improvement.
Can you get it to work with gcc-9.3 or higher? E.g. on Fedora I can easily install a mingw g++-13.2.1 cross-compiler (and it looks like mingw clang should be available as well, but I did not try).

syzygy · Post by **syzygy** » Thu Feb 01, 2024 7:20 pm

syzygy wrote: ↑Thu Feb 01, 2024 7:06 pm
jdart wrote: ↑Wed Jan 31, 2024 2:04 am After some futzing with the Makefile, I have gotten it to build with clang-cl (the version of clang that integrates with MSVC tools). That seems to run ok with AVX2 on. Will look again also at regular clang/cygwin and see if that can be made to work.

"bench" command results:

MSVC compile with PGO: 681k nps (1 core)
clang-cl compile with PGO: 857k nps
Nice improvement.
Can you get it to work with gcc-9.3 or higher? E.g. on Fedora I can easily install a mingw g++-13.2.1 cross-compiler (and it looks like mingw clang should be available as well, but I did not try).

Replacing "-lc -lm" with "-static" in NN_LIBS, I got nnue_test.exe to compile with mingw g++-13.2.1 and to run on top of wine with 0 errors.

AndrewGrant · Post by **AndrewGrant** » Fri Feb 02, 2024 1:17 pm

Without looking at any code, nor asm output, it has been my experience with NNUE both in Ethereal and in Torch, that clang always outperforms gcc in those sections of code, even if in aggregate clang does not appear much faster than gcc.

I've arrived at countless cases where I observe gcc introducing an unnecessary data dependence that causes the pipeline to stall, whereas clang intelligently reorders the operations to avoid the issue. However, I can't say for sure whether these visible "flaws" in gcc's output actually account for the difference in execution speed.
