Several weirdnesses.
First off, the AVX512 build is slower (on an AMD 7950X, iirc) than just leaving the code to compile as AVX2 (I’m working on accumulator updates only right now, btw).
Second, the compiler isn’t fully optimising what it should.
Briefly, with
__m512i registers[16];
when I work on the registers in a loop, it unrolls the loop and loads each register okay, but insists on also storing the result of each operation back to RAM, which it knows perfectly well it doesn’t need to do.
With
__m256i registers[16];
it does fine.
And if I declare them unrolled,
__m512i r0, r1, r2, etc.
and manually load and work on them, it also works fine.
I can’t find a way to get the AVX512 optimiser to do it right.
And even when the code is manually unrolled, it’s still way slower than the AVX2 equivalent.
Anyone?
The bright side is the new AMD is blisteringly fast and makes up for the disappointing(?) AVX512 performance.
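The named-register form described above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: the accumulator size `kAccSize`, the tile of four registers, the function names and the feature-index list are all assumptions made for the example. The AVX-512 path assumes GCC or Clang (non-MSVC mode) on x86-64, since it relies on the `target` attribute and `__builtin_cpu_supports`; elsewhere it falls back to the scalar loop. Whether a given compiler really keeps `r0`..`r3` in registers has to be checked in the generated assembly.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__GNUC__) && defined(__x86_64__)
#include <immintrin.h>
#define SKETCH_HAS_AVX512_PATH 1
#else
#define SKETCH_HAS_AVX512_PATH 0
#endif

constexpr std::size_t kAccSize = 256;  // hypothetical number of int16 accumulator entries
constexpr std::size_t kTile = 128;     // 4 registers x 32 int16 lanes each

// Scalar reference: acc[i] += weights[f][i] for every active feature f.
void add_features_scalar(std::int16_t* acc, const std::int16_t* weights,
                         const std::vector<int>& features) {
    for (int f : features) {
        const std::int16_t* w = weights + static_cast<std::size_t>(f) * kAccSize;
        for (std::size_t i = 0; i < kAccSize; ++i)
            acc[i] = static_cast<std::int16_t>(acc[i] + w[i]);
    }
}

#if SKETCH_HAS_AVX512_PATH
// Same update, but each tile lives in four named __m512i locals, so the
// accumulator is loaded once, updated for all features, and stored once.
__attribute__((target("avx512f,avx512bw")))
void add_features_avx512(std::int16_t* acc, const std::int16_t* weights,
                         const std::vector<int>& features) {
    for (std::size_t base = 0; base < kAccSize; base += kTile) {
        __m512i r0 = _mm512_loadu_si512(acc + base + 0);
        __m512i r1 = _mm512_loadu_si512(acc + base + 32);
        __m512i r2 = _mm512_loadu_si512(acc + base + 64);
        __m512i r3 = _mm512_loadu_si512(acc + base + 96);
        for (int f : features) {
            const std::int16_t* w =
                weights + static_cast<std::size_t>(f) * kAccSize + base;
            r0 = _mm512_add_epi16(r0, _mm512_loadu_si512(w + 0));
            r1 = _mm512_add_epi16(r1, _mm512_loadu_si512(w + 32));
            r2 = _mm512_add_epi16(r2, _mm512_loadu_si512(w + 64));
            r3 = _mm512_add_epi16(r3, _mm512_loadu_si512(w + 96));
        }
        _mm512_storeu_si512(acc + base + 0, r0);
        _mm512_storeu_si512(acc + base + 32, r1);
        _mm512_storeu_si512(acc + base + 64, r2);
        _mm512_storeu_si512(acc + base + 96, r3);
    }
}
#endif

// Picks the AVX-512 path only when the running CPU reports AVX512BW.
void add_features(std::int16_t* acc, const std::int16_t* weights,
                  const std::vector<int>& features) {
#if SKETCH_HAS_AVX512_PATH
    if (__builtin_cpu_supports("avx512bw")) {
        add_features_avx512(acc, weights, features);
        return;
    }
#endif
    add_features_scalar(acc, weights, features);
}
```

Comparing the output of `add_features` against `add_features_scalar` on the same inputs is a cheap sanity check before looking at timings.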
VS compile AVX512
-
- Posts: 1573
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: VS compile AVX512
The Visual Studio compiler is not very good at auto-vectorization; the LLVM compiler (Clang) is a lot better in this respect.
Of course, when you use handwritten SIMD code (like most programmers seem to do) the choice of compiler won't matter much.
You can integrate the Clang compiler into VS by selecting it in the VS installer program; the latest version of VS will install Clang 15.0.1 by default.
It is also possible to integrate an external version of LLVM into VS by placing a manifest file called 'Directory.build.props' in your project folder.
E.g. for the latest LLVM 16.0.1 it should contain:
Code: Select all
<Project>
<PropertyGroup>
<LLVMInstallDir>C:\Program Files\LLVM</LLVMInstallDir>
<LLVMToolsVersion>16.0.1</LLVMToolsVersion>
</PropertyGroup>
</Project>
Don't expect too much from AVX-512, I usually see a N/S gain of 20 to 25% compared to the AVX2 version of my engine, basically because the NN inference code runs faster. This is on my Intel Core i9-10980XE which has AVX-512, I've not tried yet on one of the new Zen4 AMD machines (I'm still waiting for the new Zen4 Threadrippers to arrive). Especially copying the accumulator (which actually has to be done twice for each ply because you have to do RELU) is a real performance hog.
I also wonder what the effect of the AMD AVX-512 implementation is, AMD seem to use two 256 bit operations while Intel uses single 512 bit operations.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: VS compile AVX512
Clang is much more effective because when you write intrinsics for MSVC they are mostly left untouched.
Clang does reorder and change intrinsics. For example, when you write inefficient intrinsic functions it will change them to more appropriate ones.
For example, an "and" followed by an "or" will certainly get fused into a single ternary-logic operation (_mm512_ternarylogic_epi32/_epi64) by Clang. MSVC has started to do this but is far behind, staying at around 80% of Clang's performance in most code I have (sometimes 50%, sometimes 110%, but on average 80% when it comes to AVX).
Auto-vectorisation is not as fast as changing your data structures and using alignas(64) manually to be vectorisation-friendly.
That being said, MSVC and Clang are the top compilers for C++ atm.
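As a small illustration of the alignas(64) point, here is a minimal sketch of a vectorisation-friendly layout. The struct name `Accumulator`, the size `kLanes` and the helper functions are hypothetical, chosen only for the example; whether the loop actually gets vectorised depends on the compiler and flags.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 256;  // hypothetical number of int16 entries

// 64-byte alignment matches one 512-bit register (and one cache line),
// so every vector-sized chunk of v[] starts on an aligned boundary.
struct alignas(64) Accumulator {
    std::int16_t v[kLanes];
};

static_assert(alignof(Accumulator) == 64, "Accumulator must be 64-byte aligned");
static_assert(sizeof(Accumulator) % 64 == 0, "size should be a multiple of 64 bytes");

// Plain element-wise loop over contiguous, aligned data: the kind of code
// an auto-vectoriser (or a hand-written intrinsic loop) handles well.
void add_weights(Accumulator& acc, const Accumulator& w) {
    for (std::size_t i = 0; i < kLanes; ++i)
        acc.v[i] = static_cast<std::int16_t>(acc.v[i] + w.v[i]);
}

// Helper to confirm at run time that an object really sits on a 64-byte boundary.
bool is_aligned_64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```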
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: VS compile AVX512
Also, Clang has many more sub-flags to enable or disable individual features, as well as -march=native.
MSVC has the generic /arch:AVX512 flag, which for instance gives no granular control over GFNI (_mm_gf2p8affine_epi64_epi8) or other extensions.
Fun fact on that front: a viable Zobrist hash replacement is to churn your board through this:
_mm512_movepi8_mask(_mm512_aesenc_epi128(BB, Random_KEY))
These are quite handy: https://www.intel.com/content/www/us/en ... S,SHA,VAES
This is very fast since there are no 2-3 lookups per turn into the Zobrist table: 1.2 billion hashes/s per thread in my tests.
Fun times ahead.
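One way to see which extensions a given set of flags actually switched on is to look at the compiler's predefined macros. The sketch below assumes the GCC/Clang macro names (`__AVX2__`, `__AVX512F__`, `__AVX512BW__`, `__GFNI__`, `__VAES__`); the function name is hypothetical, and MSVC defines only some of these.

```cpp
#include <string>

// Returns a space-separated list of the SIMD extensions this translation
// unit was compiled with, based purely on compile-time macros.
std::string enabled_simd_features() {
    std::string s;
#if defined(__AVX2__)
    s += "AVX2 ";
#endif
#if defined(__AVX512F__)
    s += "AVX512F ";
#endif
#if defined(__AVX512BW__)
    s += "AVX512BW ";
#endif
#if defined(__GFNI__)
    s += "GFNI ";
#endif
#if defined(__VAES__)
    s += "VAES ";
#endif
    return s;
}
```

With Clang or GCC, building once with -march=native and once with individual flags such as -mavx512f -mavx512bw -mgfni -mvaes should make the difference in the returned list visible.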
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: VS compile AVX512
And to come back to your question: similar to Joost Buijs' answer above, it's simplest to make your project a CMake project (which in turn enables you to develop on any platform) and, for Windows, generate the build with the ClangCL toolset:
Code: Select all
cmake -G "Visual Studio 17 2022" -T ClangCL
That way you use Clang on all platforms, as it's currently the fastest for intrinsics.
-
- Posts: 1573
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: VS compile AVX512
Indeed, a CMake project is a good option if you want to be portable between different OSes. Unfortunately there are still things missing from the C++ standard library, like sockets or memory-mapped I/O, that you need to be 100% portable. I don't know if a library like Boost can solve this because I simply never used it.
-
- Posts: 4346
- Joined: Tue Apr 03, 2012 4:28 pm
Re: VS compile AVX512
Joost Buijs wrote: ↑Tue Apr 11, 2023 8:31 am
The Visual Studio compiler is not very good at auto vectorization, the LLVM compiler (CLang) is a lot better in this respect.
Of course, when you use handwritten SIMD code (like most programmers seem to do) the choice of compiler won't matter much.
You can integrate the CLang compiler into VS by selecting it in the VS installer program, the latest version of VS will install CLang 15.0.1 by default.
It is also possible to integrate an external version of LLVM into VS by placing a manifest file called 'Directory.build.props' in your project folder.
E.g. for the latest LLVM 16.0.1 it should contain:
Code: Select all
<Project>
<PropertyGroup>
<LLVMInstallDir>C:\Program Files\LLVM</LLVMInstallDir>
<LLVMToolsVersion>16.0.1</LLVMToolsVersion>
</PropertyGroup>
</Project>
Don't expect too much from AVX-512, I usually see a N/S gain of 20 to 25% compared to the AVX2 version of my engine, basically because the NN inference code runs faster. This is on my Intel Core i9-10980XE which has AVX-512, I've not tried yet on one of the new Zen4 AMD machines (I'm still waiting for the new Zen4 Threadrippers to arrive). Especially copying the accumulator (which actually has to be done twice for each ply because you have to do RELU) is a real performance hog.
I also wonder what the effect of the AMD AVX-512 implementation is, AMD seem to use two 256 bit operations while Intel uses single 512 bit operations.
I think I downloaded Clang into VS, but I can’t find anything on any menu that says so.
-
- Posts: 245
- Joined: Sat Mar 11, 2006 8:31 am
- Location: Malmö, Sweden
- Full name: Bo Persson
-
- Posts: 4346
- Joined: Tue Apr 03, 2012 4:28 pm
Re: VS compile AVX512
Bo Persson wrote: ↑Tue Apr 11, 2023 6:56 pm
Check your Project Properties. In the Platform Toolset drop down box you should have LLVM as an alternative.
Ahh! That's where it was hiding. 100 new warnings and one successful compile later, and crash. Well, it's a start, probably too much to hope for instant success ...
-
- Posts: 4346
- Joined: Tue Apr 03, 2012 4:28 pm
Re: VS compile AVX512
chrisw wrote: ↑Tue Apr 11, 2023 7:34 pm
Bo Persson wrote: ↑Tue Apr 11, 2023 6:56 pm
Check your Project Properties. In the Platform Toolset drop down box you should have LLVM as an alternative.
Ahh! That's where it was hiding. 100 new warnings and one successful compile later, and crash. Well, it's a start, probably too much to hope for instant success ...
Including immintrin.h in more places helped.
Some intrinsics it still seems unable to find, e.g.:
_mm_srli_si128
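For reference, a minimal sketch of a translation unit that uses _mm_srli_si128. It assumes the usual requirements: the header is included in that same file, and the byte count is a compile-time constant. The function name and the scalar fallback for non-SSE2 builds are hypothetical additions for the example; this is not a diagnosis of the actual build error.

```cpp
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <immintrin.h>  // pulls in emmintrin.h, where the SSE2 intrinsics live
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Shifts a 16-byte block right by 4 bytes (towards byte 0), filling the
// top 4 bytes with zeros. in and out must not overlap.
void shift_right_4_bytes(const std::uint8_t* in, std::uint8_t* out) {
#if SKETCH_HAS_SSE2
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    v = _mm_srli_si128(v, 4);  // the shift count must be an immediate constant
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
#else
    // Scalar equivalent for builds without SSE2.
    std::memset(out, 0, 16);
    std::memcpy(out, in + 4, 12);
#endif
}
```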