but it didn't bring any gain on an AMD EPYC 7502P 32-Core Processor, gcc -O9. What is the reason for it?
1) There is no hardware support.
2) The compiler doesn't use hardware support.
3) The hardware popcount isn't that fast.
The POPCNT instruction is (somehow) part of the SSE4 (Streaming SIMD Extensions 4) instruction set, so if you don't enable it explicitly (by passing -mpopcnt to the compiler, or -march=nehalem for Intel, -march=barcelona for AMD), the compiler will replace the builtin popcount with its own implementation (which could actually be your implementation B).
Even doing so, do not expect an incredible speedup: for my engine the NPS difference between the non-POPCNT and the POPCNT build is around 2%.
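To make the comparison concrete, here is a minimal sketch of the two variants. The names popcount_builtin and popcount_swar are made up for illustration, and the SWAR routine is only one common software fallback; it is not necessarily what your implementation B or the compiler's own fallback looks like.

```cpp
#include <cstdint>

// Hypothetical name. Uses the GCC/Clang builtin: when built with -mpopcnt
// (or a -march value that implies it) this normally compiles to a single
// POPCNT instruction; otherwise the compiler emits a software routine.
inline int popcount_builtin(std::uint64_t b) {
    return __builtin_popcountll(b);
}

// Hypothetical name. A portable SWAR bit count that needs no hardware support.
inline int popcount_swar(std::uint64_t b) {
    b = b - ((b >> 1) & 0x5555555555555555ULL);                            // 2-bit sums
    b = (b & 0x3333333333333333ULL) + ((b >> 2) & 0x3333333333333333ULL);  // 4-bit sums
    b = (b + (b >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            // 8-bit sums
    return static_cast<int>((b * 0x0101010101010101ULL) >> 56);            // add the bytes
}
```

Both functions return the same value for every input, so any NPS difference between builds comes only from the instructions the compiler selects.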
For newer AMD architectures, AVX2 helps a whole lot.
I use these command line options when I compile Stockfish:
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -fprofile-generate -Wextra -Wshadow -DNDEBUG -O3 -mtune=native -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE2 -msse2 -c -o benchmark.o benchmark.cpp
The macro USE_AVX2 does not do anything except embellish the compiler information string in the program.
The flag that does the heavy lifting is:
-mavx2
Try it, you'll like it.
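If you want to confirm which of these flags actually took effect in a given build, one option is to look at the macros that GCC and Clang predefine when an instruction set is enabled. This is only a sketch: the helper name enabled_isa_flags is made up for illustration and is not part of Stockfish.

```cpp
#include <string>

// Hypothetical helper: lists the instruction-set macros the compiler
// predefined for this translation unit (__POPCNT__ with -mpopcnt,
// __AVX2__ with -mavx2, __SSE4_1__ with -msse4.1).
inline std::string enabled_isa_flags() {
    std::string s;
#if defined(__POPCNT__)
    s += "POPCNT ";
#endif
#if defined(__AVX2__)
    s += "AVX2 ";
#endif
#if defined(__SSE4_1__)
    s += "SSE4.1 ";
#endif
    if (s.empty())
        s = "none";  // none of the three flags were enabled for this build
    return s;
}
```

Printing the result at startup shows at a glance whether a build that was meant to use -mpopcnt or -mavx2 really got them.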
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
but it didn't bring any gain on an AMD EPYC 7502P 32-Core Processor, gcc -O9. What is the reason for it?
1) There is no hardware support.
2) The compiler doesn't use hardware support.
3) The hardware popcount isn't that fast.
Who can help with answers?
or
4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:
Bo Persson wrote: ↑Fri Aug 28, 2020 11:05 pm
popCount(unsigned long long):
xor eax, eax
popcnt rax, rdi
ret
I guess that you got your output because the compiler did not know if your chip had the popcnt instruction.
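To rule out reason 1) on the machine itself, a runtime check is possible. This is a sketch that assumes GCC or Clang on x86, where __builtin_cpu_supports is available; the wrapper name cpu_has_popcnt is made up for illustration, and on other targets it simply reports false.

```cpp
// Hypothetical wrapper: asks the running CPU whether it supports POPCNT.
// Assumes GCC or Clang on x86; other targets report false here.
inline bool cpu_has_popcnt() {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();  // safe to call more than once
    return __builtin_cpu_supports("popcnt") != 0;
#else
    return false;
#endif
}
```

If this returns true but the generated code still has no popcnt instruction, the compiler was simply not told about the target, which -mpopcnt or -march=native fixes.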