VS compile AVX512

chrisw · Post by **chrisw** » Mon Apr 10, 2023 6:46 pm

Several weirdnesses.
First off, it’s slower (AMD 5790X, iirc) than just leaving the code to compile as AVX2 (I’m working on accumulator updates only right now, btw).
Second, the compiler isn’t fully optimising what it should be.

Briefly:
__m512i registers[16];

When I work on the registers in a loop, it unrolls and has each register loaded okay, but insists on also saving the result of each operation to RAM as well, which it knows perfectly well it doesn’t need to do.

__m256i registers[16];
It does fine.

And if I declare them unrolled,
__m512i r0, r1, r2, etc.
and manually load and work on them, it also works fine.

Can’t find a way to get the avx512 optimiser to do it right.
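For readers following along, here is a minimal sketch of the "named locals" workaround described above. It is not chrisw's actual code: the accumulator width (`kAccSize`), the function name and the add/remove layout are all assumptions made for illustration. The AVX-512 path is only compiled when the compiler advertises AVX512F and AVX512BW; otherwise a plain scalar loop with the same result is used.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__AVX512F__) && defined(__AVX512BW__)
#include <immintrin.h>
#endif

// Hypothetical accumulator width: 128 int16 lanes = four 512-bit registers.
constexpr std::size_t kAccSize = 128;

// Applies one "added feature" and one "removed feature" weight column.
inline void update_accumulator(std::int16_t* acc,
                               const std::int16_t* added,
                               const std::int16_t* removed) {
#if defined(__AVX512F__) && defined(__AVX512BW__)
    // Named locals instead of `__m512i regs[4]`: each value is loaded once,
    // updated in registers, and stored once at the end.
    __m512i r0 = _mm512_loadu_si512(acc + 0);
    __m512i r1 = _mm512_loadu_si512(acc + 32);
    __m512i r2 = _mm512_loadu_si512(acc + 64);
    __m512i r3 = _mm512_loadu_si512(acc + 96);

    r0 = _mm512_add_epi16(r0, _mm512_loadu_si512(added + 0));
    r1 = _mm512_add_epi16(r1, _mm512_loadu_si512(added + 32));
    r2 = _mm512_add_epi16(r2, _mm512_loadu_si512(added + 64));
    r3 = _mm512_add_epi16(r3, _mm512_loadu_si512(added + 96));

    r0 = _mm512_sub_epi16(r0, _mm512_loadu_si512(removed + 0));
    r1 = _mm512_sub_epi16(r1, _mm512_loadu_si512(removed + 32));
    r2 = _mm512_sub_epi16(r2, _mm512_loadu_si512(removed + 64));
    r3 = _mm512_sub_epi16(r3, _mm512_loadu_si512(removed + 96));

    _mm512_storeu_si512(acc + 0, r0);
    _mm512_storeu_si512(acc + 32, r1);
    _mm512_storeu_si512(acc + 64, r2);
    _mm512_storeu_si512(acc + 96, r3);
#else
    // Scalar fallback with the same arithmetic, used when AVX-512 is not enabled.
    for (std::size_t i = 0; i < kAccSize; ++i)
        acc[i] = static_cast<std::int16_t>(acc[i] + added[i] - removed[i]);
#endif
}
```

Comparing the generated assembly of this version against an array-based one should show whether the extra stores are still there.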

And even when the code is manually done, it’s still way slower than the AVX2 equivalent.

Anyone?

The bright side is the new AMD is blisteringly fast and makes up for the disappointing(?) AVX512 performance.

Joost Buijs · Post by **Joost Buijs** » Tue Apr 11, 2023 8:31 am

The Visual Studio compiler is not very good at auto-vectorization; the LLVM compiler (Clang) is a lot better in this respect.
Of course, when you use handwritten SIMD code (like most programmers seem to do) the choice of compiler won't matter much.
You can integrate the Clang compiler into VS by selecting it in the VS installer program. The latest version of VS will install Clang 15.0.1 by default.

It is also possible to integrate an external version of LLVM into VS by placing a file called 'Directory.Build.props' in your project folder.
E.g. for the latest LLVM 16.0.1 it should contain:

Code: Select all

<Project>
  <PropertyGroup>
    <LLVMInstallDir>C:\Program Files\LLVM</LLVMInstallDir>
    <LLVMToolsVersion>16.0.1</LLVMToolsVersion>
  </PropertyGroup>
</Project>

Don't expect too much from AVX-512. I usually see an N/S gain of 20 to 25% compared to the AVX2 version of my engine, basically because the NN inference code runs faster. This is on my Intel Core i9-10980XE, which has AVX-512; I haven't yet tried one of the new Zen4 AMD machines (I'm still waiting for the new Zen4 Threadrippers to arrive). Copying the accumulator in particular (which actually has to be done twice for each ply because you have to do ReLU) is a real performance hog.

I also wonder what the effect of the AMD AVX-512 implementation is. AMD seems to use two 256-bit operations while Intel uses single 512-bit operations.

dangi12012 · Post by **dangi12012** » Tue Apr 11, 2023 10:10 am

Clang is much more effective because MSVC leaves the intrinsics you write mostly untouched.

Clang does reorder and change intrinsics. For example, when you write ineffective intrinsic functions it will change them to more appropriate ones.
For example, an "and" followed by an "or" will certainly get fused into a single _mm512_ternarylogic_epi32/_epi64 (vpternlog) by Clang. MSVC has started to do this but is far behind, staying at around 80% of Clang's performance in most code I have (sometimes 50%, sometimes 110%, but on average 80% when it comes to AVX).
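To make the and/or fusion concrete, here is a small sketch of how `(a & b) | c` maps onto one ternary-logic operation. The function names are made up for illustration. The immediate is the truth table of the expression: evaluating it on the constants 0xF0, 0xCC and 0xAA gives `(0xF0 & 0xCC) | 0xAA = 0xEA`. The scalar function models the lookup bit by bit, and the intrinsic path is only compiled when AVX512F is enabled.

```cpp
#include <cstdint>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Scalar model of a ternary-logic lookup: for every bit position, the three
// input bits (a is the most significant) select one bit of the 8-bit table.
inline std::uint64_t ternlog_scalar(std::uint64_t a, std::uint64_t b,
                                    std::uint64_t c, unsigned imm8) {
    std::uint64_t r = 0;
    for (int i = 0; i < 64; ++i) {
        const unsigned idx = static_cast<unsigned>(
            (((a >> i) & 1u) << 2) | (((b >> i) & 1u) << 1) | ((c >> i) & 1u));
        r |= static_cast<std::uint64_t>((imm8 >> idx) & 1u) << i;
    }
    return r;
}

// (a & b) | c as a single ternary-logic operation with table 0xEA.
inline std::uint64_t and_or(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
#if defined(__AVX512F__)
    const __m512i va = _mm512_set1_epi64(static_cast<long long>(a));
    const __m512i vb = _mm512_set1_epi64(static_cast<long long>(b));
    const __m512i vc = _mm512_set1_epi64(static_cast<long long>(c));
    const __m512i vr = _mm512_ternarylogic_epi64(va, vb, vc, 0xEA);
    alignas(64) std::uint64_t out[8];
    _mm512_store_si512(out, vr);
    return out[0];  // all eight lanes hold the same value
#else
    return ternlog_scalar(a, b, c, 0xEA);
#endif
}
```

The same trick gives 0x96 for a three-way XOR, since `0xF0 ^ 0xCC ^ 0xAA = 0x96`.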

Auto vectorisation is not as fast as changing your data structures and using alignas(64) manually to be vectorisation friendly.
That being said MSVC and Clang are the top compilers for C++ atm.

dangi12012 · Post by **dangi12012** » Tue Apr 11, 2023 10:19 am

Also, Clang has many more subflags to enable or disable individual features, as well as -march=native.
MSVC only has the generic /arch:AVX512 flag, which for instance gives no granular control over GFNI (_mm_gf2p8affine_epi64_epi8) or other extensions.
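One way to see what a given set of flags actually enabled is to check the predefined feature macros at compile time. The sketch below is only an illustration and the function name is made up. `__AVX2__` and the `__AVX512*__` macros are defined by Clang, GCC and MSVC when the matching option is on; `__GFNI__` and `__VAES__` are the Clang/GCC names, and I am assuming here that MSVC may not define them at all.

```cpp
#include <string>

// Returns a space-separated list of the SIMD feature macros that were
// defined when this translation unit was compiled, or "none".
inline std::string compiled_simd_features() {
    std::string s;
#if defined(__AVX2__)
    s += "AVX2 ";
#endif
#if defined(__AVX512F__)
    s += "AVX512F ";
#endif
#if defined(__AVX512BW__)
    s += "AVX512BW ";
#endif
#if defined(__GFNI__)   // Clang/GCC macro name, e.g. with -mgfni
    s += "GFNI ";
#endif
#if defined(__VAES__)   // Clang/GCC macro name, e.g. with -mvaes
    s += "VAES ";
#endif
    if (s.empty()) s = "none";
    return s;
}
```

Printing the result once at startup makes it obvious whether a build really got the extensions you expected.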

Fun fact on that front: a viable Zobrist hash replacement is to churn your board through this:
_mm512_movepi8_mask(_mm512_aesenc_epi128(BB, Random_KEY))
these are quite handy: https://www.intel.com/content/www/us/en ... S,SHA,VAES

This is very fast since there are no longer 2-3 lookups per turn into the Zobrist table: 1.2 billion hashes/s per thread in my tests.
Fun times ahead.
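A rough sketch of what such a hash function could look like follows. The `Board512` layout (eight 64-bit bitboards packed into 512 bits), the key handling and the function name are assumptions, not dangi12012's actual code. The AES path is only compiled when the compiler reports VAES, AVX512F and AVX512BW; the fallback is a plain multiply-xorshift mixer that is NOT equivalent to the AES version and is there only so the sketch builds everywhere.

```cpp
#include <cstdint>

#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512BW__)
#include <immintrin.h>
#define HAVE_VAES512 1
#else
#define HAVE_VAES512 0
#endif

// Hypothetical board layout: eight 64-bit bitboards = 512 bits.
struct Board512 {
    std::uint64_t bb[8];
};

inline std::uint64_t board_hash(const Board512& board, const Board512& key) {
#if HAVE_VAES512
    // One AES round per 128-bit lane, then the top bit of each of the
    // 64 result bytes is collected into a 64-bit mask.
    const __m512i v = _mm512_loadu_si512(board.bb);
    const __m512i k = _mm512_loadu_si512(key.bb);
    return static_cast<std::uint64_t>(
        _mm512_movepi8_mask(_mm512_aesenc_epi128(v, k)));
#else
    // Stand-in mixer (not AES), used only when VAES is unavailable.
    std::uint64_t h = 0;
    for (int i = 0; i < 8; ++i) {
        std::uint64_t x = board.bb[i] ^ key.bb[i];
        x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
        x ^= x >> 27; x *= 0x94D049BB133111EBULL;
        x ^= x >> 31;
        h = ((h << 7) | (h >> 57)) ^ x;
    }
    return h;
#endif
}
```

Note that this recomputes the hash from the whole board each time rather than updating it incrementally as a Zobrist scheme does, which is the trade-off the post is describing.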

dangi12012 · Post by **dangi12012** » Tue Apr 11, 2023 10:26 am

And to come back to your question: as in Joost Buijs' answer above, it's simplest to make your project a CMake project (which in turn lets you develop on any platform) and, for Windows, configure it like this:

Code: Select all

cmake -G "Visual Studio 17 2022" -T ClangCL

That way you use Clang on all platforms, as it's currently the fastest for intrinsics.

Joost Buijs · Post by **Joost Buijs** » Tue Apr 11, 2023 12:18 pm

Indeed, a CMake project is a good option if you want to be portable between different OSes. Unfortunately, there are still things missing from the C++ standard library, like sockets or memory-mapped I/O, that you would need to be 100% portable. I don't know if a library like Boost can solve this because I have simply never used it.

chrisw · Post by **chrisw** » Tue Apr 11, 2023 5:20 pm

Joost Buijs wrote: ↑Tue Apr 11, 2023 8:31 am The Visual Studio compiler is not very good at auto-vectorization; the LLVM compiler (Clang) is a lot better in this respect.
Of course, when you use handwritten SIMD code (like most programmers seem to do) the choice of compiler won't matter much.
You can integrate the Clang compiler into VS by selecting it in the VS installer program. The latest version of VS will install Clang 15.0.1 by default.

I think I dl-ed clang into VS but can’t find anything on any menu that says so

Bo Persson · Post by **Bo Persson** » Tue Apr 11, 2023 6:56 pm

chrisw wrote: ↑Tue Apr 11, 2023 5:20 pm I think I dl-ed clang into VS but can’t find anything on any menu that says so

Check your Project Properties. In the Platform Toolset drop down box you should have LLVM as an alternative.

chrisw · Post by **chrisw** » Tue Apr 11, 2023 7:34 pm

Bo Persson wrote: ↑Tue Apr 11, 2023 6:56 pm
chrisw wrote: ↑Tue Apr 11, 2023 5:20 pm I think I dl-ed clang into VS but can’t find anything on any menu that says so
Check your Project Properties. In the Platform Toolset drop down box you should have LLVM as an alternative.

Ahh! That's where it was hiding. 100 new warnings and one successful compile later, and crash. Well, it's a start, probably too much to hope for instant success ...

chrisw · Post by **chrisw** » Tue Apr 11, 2023 8:07 pm

chrisw wrote: ↑Tue Apr 11, 2023 7:34 pm
Bo Persson wrote: ↑Tue Apr 11, 2023 6:56 pm
chrisw wrote: ↑Tue Apr 11, 2023 5:20 pm I think I dl-ed clang into VS but can’t find anything on any menu that says so
Check your Project Properties. In the Platform Toolset drop down box you should have LLVM as an alternative.
Ahh! That's where it was hiding. 100 new warnings and one successful compile later, and crash. Well, it's a start, probably too much to hope for instant success ...

Including immintrin.h in more places helped.

Some intrinsics it seems unable to find, e.g.:

_mm_srli_si128
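For reference, a minimal use of `_mm_srli_si128` that should build with both MSVC and Clang is sketched below. The post doesn't say why Clang failed to find it, so this is only a checklist in code form: the intrinsic is an SSE2 one declared in `<emmintrin.h>` (which `<immintrin.h>` also pulls in), and its byte count has to be a compile-time constant. The helper name and the non-x86 fallback are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>  // SSE2: declares _mm_srli_si128
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Shifts a 16-byte block right by 4 bytes (out[i] = in[i + 4]), zero-filling
// the top 4 bytes.
inline void shift_right_4_bytes(const std::uint8_t* in, std::uint8_t* out) {
#if HAVE_SSE2
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    v = _mm_srli_si128(v, 4);  // the count must be a compile-time constant
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
#else
    // Portable fallback with the same result.
    std::memcpy(out, in + 4, 12);
    std::memset(out + 12, 0, 4);
#endif
}
```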
