I have recently reworked Arasan's code for updating the first layer of the network, the feature transformer. It basically does a sparse vector multiplication: it takes a set of indices and applies the weights and biases of the network layer for those index values to compute or update a 1024-byte accumulator. That code is in simd.h (https://github.com/jdart1/nnue/blob/SFv4/simd.h), in the methods update and fullUpdate.
When I measure the performance of this code, I find:
NEON and AVX512 variants perform much better than the previous version without these specialized methods
When compiling for AVX2, GCC generates quite poor code. The whole idea is to stage data into registers, update, then write back to memory. GCC keeps the data on the stack, not in registers. It only uses a couple of the 16 AVX2 vector registers. MSVC is pretty bad, too.
Clang does much better: it not only makes better use of the registers but also does some loop unrolling. However, neither GCC nor clang builds work on Windows, due to a longstanding and apparently unfixable bug (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412).
I am a bit stymied by this issue. I don't really want to drop down into assembly code, and anyway MSVC doesn't even support that.
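For readers who have not seen this kind of update, here is a minimal sketch of the "stage into registers, update, write back" idea described above. It is not the code from simd.h: the names updateAccumulator, kAccSize and kNumFeatures, the sizes, and the weight layout are all assumptions made for illustration. Whether the regs array really ends up in vector registers is up to the compiler, which is exactly the problem being discussed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical sizes for this sketch: 512 int16 outputs = 1024 bytes.
constexpr std::size_t kAccSize = 512;
constexpr std::size_t kNumFeatures = 64;

// Assumed layout: weights[f * kAccSize + i] is the contribution of
// feature f to accumulator element i.
void updateAccumulator(std::int16_t *acc, const std::int16_t *weights,
                       const std::vector<unsigned> &added,
                       const std::vector<unsigned> &removed) {
    for (unsigned f : added) assert(f < kNumFeatures);
    for (unsigned f : removed) assert(f < kNumFeatures);
#if defined(__AVX2__)
    constexpr std::size_t kLanes = 16;  // int16 lanes per 256-bit register
    constexpr std::size_t kRegs = 8;    // accumulator vectors staged per chunk
    constexpr std::size_t kChunk = kLanes * kRegs;
    static_assert(kAccSize % kChunk == 0, "chunk must divide the accumulator");
    for (std::size_t base = 0; base < kAccSize; base += kChunk) {
        __m256i regs[kRegs];
        // Stage one chunk of the accumulator into local vectors.
        // Unaligned loads keep the sketch safe for any buffer.
        for (std::size_t r = 0; r < kRegs; ++r)
            regs[r] = _mm256_loadu_si256(
                reinterpret_cast<const __m256i *>(acc + base + r * kLanes));
        // Add the weight column of every newly active feature.
        for (unsigned f : added) {
            const std::int16_t *col = weights + f * kAccSize + base;
            for (std::size_t r = 0; r < kRegs; ++r)
                regs[r] = _mm256_add_epi16(
                    regs[r], _mm256_loadu_si256(
                        reinterpret_cast<const __m256i *>(col + r * kLanes)));
        }
        // Subtract the weight column of every feature that became inactive.
        for (unsigned f : removed) {
            const std::int16_t *col = weights + f * kAccSize + base;
            for (std::size_t r = 0; r < kRegs; ++r)
                regs[r] = _mm256_sub_epi16(
                    regs[r], _mm256_loadu_si256(
                        reinterpret_cast<const __m256i *>(col + r * kLanes)));
        }
        // Write the finished chunk back to memory once.
        for (std::size_t r = 0; r < kRegs; ++r)
            _mm256_storeu_si256(
                reinterpret_cast<__m256i *>(acc + base + r * kLanes), regs[r]);
    }
#else
    // Scalar fallback with the same result, used when AVX2 is not enabled.
    for (unsigned f : added)
        for (std::size_t i = 0; i < kAccSize; ++i)
            acc[i] = static_cast<std::int16_t>(acc[i] + weights[f * kAccSize + i]);
    for (unsigned f : removed)
        for (std::size_t i = 0; i < kAccSize; ++i)
            acc[i] = static_cast<std::int16_t>(acc[i] - weights[f * kAccSize + i]);
#endif
}
```

The point of the chunking is that each accumulator element is loaded and stored once per update, however many features change. A real implementation would normally use aligned buffers and aligned loads.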
feature transformer update with AVX2
Moderator: Ras
-
- Posts: 4391
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
-
- Posts: 5667
- Joined: Tue Feb 28, 2012 11:56 pm
Re: feature transformer update with AVX2
jdart wrote: ↑Wed Jan 24, 2024 8:12 pm
However, neither GCC nor clang compiles work on Windows, due to a longstanding and apparently unfixable bug (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412).

I remember some versions of gcc had an alignment problem on Windows, but I thought that gcc 9.x had solved that:
Code: Select all
// Old gcc on Windows is unable to provide a 32-byte aligned stack.
// We need to hack around this when using AVX2 and AVX512.
#if defined(__GNUC__ ) && (__GNUC__ < 9) && defined(_WIN32) \
&& !defined(__clang__) && !defined(__INTEL_COMPILER) \
&& defined(USE_AVX2)
#define ALIGNMENT_HACK
#endif
Code: Select all
Value nnue_evaluate(const Position *pos)
{
  int32_t out_value;
#ifdef ALIGNMENT_HACK // work around a bug in old gcc on Windows
  uint8_t buf[sizeof(struct NetData) + 63];
  struct NetData *b = (struct NetData *)(buf + ((((uintptr_t)buf-1) ^ 0x3f) & 0x3f));
#define B(x) (b->x)
#else
  struct NetData buf;
#define B(x) (buf.x)
#endif
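The pointer arithmetic in the hack above over-allocates the buffer by 63 bytes and then skips forward to the next 64-byte boundary. A standalone sketch of just that arithmetic follows; the helper names offsetTo64 and alignUp64 are made up for illustration and do not appear in the quoted source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same expression as in the hack: the number of bytes (0..63) to skip
// so that address p lands on a 64-byte boundary.
inline std::size_t offsetTo64(std::uintptr_t p) {
    return ((p - 1) ^ 0x3f) & 0x3f;
}

// Hypothetical helper: returns the first 64-byte aligned address inside buf.
// The caller must have over-allocated buf by at least 63 bytes.
inline unsigned char *alignUp64(unsigned char *buf) {
    unsigned char *p = buf + offsetTo64(reinterpret_cast<std::uintptr_t>(buf));
    assert(reinterpret_cast<std::uintptr_t>(p) % 64 == 0);
    return p;
}
```

Note that this only controls where a named buffer lives. It does nothing for temporaries that the compiler itself places on the stack.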
-
- Posts: 4391
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: feature transformer update with AVX2
I believe the problem is that the compiler can spill data to temp variables on the stack and those are not aligned. Those memory accesses are invisible to the C++ coder and not controlled by them.
It is still an issue on newer compilers. My clang version on Windows is version 16.0.5.
-
- Posts: 5667
- Joined: Tue Feb 28, 2012 11:56 pm
Re: feature transformer update with AVX2
jdart wrote: ↑Thu Jan 25, 2024 4:43 am
I believe the problem is that the compiler can spill data to temp variables on the stack and those are not aligned. Those memory accesses are invisible to the C++ coder and not controlled by them.
It is still an issue on newer compilers. My clang version on Windows is version 16.0.5.

In Stockfish the test is for gcc 9.2 and older in combination with Windows (and !clang):
Code: Select all
#if defined(__GNUC__) && (__GNUC__ < 9 || (__GNUC__ == 9 && __GNUC_MINOR__ <= 2)) \
&& defined(_WIN32) && !defined(__clang__)
#define ALIGNAS_ON_STACK_VARIABLES_BROKEN
#endif
Older gcc supported stack alignment only up to MAX_STACK_ALIGNMENT/MAX_SUPPORTED_STACK_ALIGNMENT, which was 16 on Windows and 32 on Linux. This indeed seems to have been fixed in 9.3:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89357
-
- Posts: 4391
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: feature transformer update with AVX2
After some futzing with the Makefile, I have gotten it to build with clang-cl (the version of clang that integrates with MSVC tools). That seems to run ok with AVX2 on. Will look again also at regular clang/cygwin and see if that can be made to work.
"bench" command results:
MSVC compile with PGO: 681k nps (1 core)
clang-cl compile with PGO: 857k nps
-
- Posts: 5667
- Joined: Tue Feb 28, 2012 11:56 pm
Re: feature transformer update with AVX2
jdart wrote: ↑Wed Jan 31, 2024 2:04 am
After some futzing with the Makefile, I have gotten it to build with clang-cl (the version of clang that integrates with MSVC tools). That seems to run ok with AVX2 on. Will look again also at regular clang/cygwin and see if that can be made to work.
"bench" command results:
MSVC compile with PGO: 681k nps (1 core)
clang-cl compile with PGO: 857k nps

Nice improvement.
Can you get it to work with gcc-9.3 or higher? E.g. on Fedora I can easily install a mingw g++-13.2.1 cross-compiler (and it looks like mingw clang should be available as well, but I did not try).
-
- Posts: 5667
- Joined: Tue Feb 28, 2012 11:56 pm
Re: feature transformer update with AVX2
syzygy wrote: ↑Thu Feb 01, 2024 7:06 pm
jdart wrote: ↑Wed Jan 31, 2024 2:04 am
After some futzing with the Makefile, I have gotten it to build with clang-cl (the version of clang that integrates with MSVC tools). That seems to run ok with AVX2 on. Will look again also at regular clang/cygwin and see if that can be made to work.
"bench" command results:
MSVC compile with PGO: 681k nps (1 core)
clang-cl compile with PGO: 857k nps
Nice improvement.
Can you get it to work with gcc-9.3 or higher? E.g. on Fedora I can easily install a mingw g++-13.2.1 cross-compiler (and it looks like mingw clang should be available as well, but I did not try).

Replacing "-lc -lm" with "-static" in NN_LIBS, I got nnue_test.exe to compile with mingw g++-13.2.1 and to run on top of wine with 0 errors.
-
- Posts: 1945
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: feature transformer update with AVX2
Without looking at any code, nor asm output, it has been my experience with NNUE both in Ethereal and in Torch, that clang always outperforms gcc in those sections of code, even if in aggregate clang does not appear much faster than gcc.
I've run into countless cases where I observe gcc introducing an unnecessary data dependence that causes the pipeline to stall, whereas clang intelligently reorders the operations to avoid the issue. That said, I can't say for sure whether these visible "flaws" in gcc's output actually account for the difference in execution speed.
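To make "data dependence" concrete, here is a generic sketch, not taken from Ethereal, Torch, or any compiler output. The function names are hypothetical. In the first function every addition has to wait for the previous one; in the second, four independent chains can overlap in the pipeline. An optimizing compiler is free to transform either version, so this only shows the shape of the problem, not what gcc or clang actually emit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One long dependency chain: each add needs the result of the previous add.
std::uint32_t sumSingleChain(const std::uint32_t *v, std::size_t n) {
    std::uint32_t s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

// Four independent chains: the adds within one iteration do not depend
// on each other, so the CPU can execute them in parallel.
std::uint32_t sumFourChains(const std::uint32_t *v, std::size_t n) {
    assert(n % 4 == 0);  // keeps the sketch short; no tail loop
    std::uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Both functions return the same value for unsigned integers, since unsigned addition is associative even when it wraps.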