Zach Wegner wrote:
I thought the popcnt instruction was only available on Core i7s?
It comes with SSE4a and SSE4.2, so both the Core i7 and the Phenom have it.
If someone says their software is too slow I assume they are not using a 386.
But really, it makes sense to optimize for current technology.
Ryan Benitez wrote: Why not just use a faster popcnt?
In MSVC it's just:
#include <intrin.h>
#define count_1s(b) (__popcnt64(b))
The asm instruction is popcnt, so that is simple too. MSVC x64 does not accept inline asm, so that is the only frustrating part.
I thought the popcnt instruction was only available on Core i7s?
Anyways, the idea is workable IMO. The magic would be something like shifting the attack bit for every attackable square up to bit 60, and then the sum would be in the top 4 bits. Overflows make it considerably more complicated, but adapting the existing magic finders to do this shouldn't be too hard.
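To make the "shift every bit up to bit 60" idea concrete, here is a minimal sketch for one easy case where overflow happens to be harmless: counting the set bits on a single file. The names FILE_A, FILE_A_MAGIC and count_file_a are made up for this example, and the bit layout (a-file on bits 0, 8, ..., 56) is an assumption. A real attack-set magic would need the search described above.

```c
#include <stdint.h>

/* Assumed layout: the a-file occupies bits 0, 8, ..., 56. */
#define FILE_A       0x0101010101010101ULL

/* Sum of 2^(60 - 8k) for k = 0..7: one factor per file bit, each moving
   "its" bit up to bit 60. */
#define FILE_A_MAGIC 0x1010101010101010ULL

/* Hypothetical helper: population count of the a-file bits of b.
   Each set bit contributes 2^60 once; cross terms either overflow past
   bit 63 or stay below bit 56, so they never disturb the top 4 bits. */
static inline int count_file_a(uint64_t b)
{
    return (int)(((b & FILE_A) * FILE_A_MAGIC) >> 60);
}
```

This only works so cleanly because the masked bits are evenly spaced 8 apart and there are at most 8 of them. For irregular attack sets the cross terms can land in or carry into the top bits, which is the overflow problem mentioned above.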
Actually I believe it is supported on any processor that has the SSE4.2 flag set when you use the CPUID instruction.
At least the Intel C++ compiler has a built-in intrinsic to access the popcnt instruction. I don't use it, as I already have inline assembly code for MSB(), LSB() and PopCnt(), and just created a version of PopCnt() for SSE4.2 processors that uses the popcnt asm instruction.
BTW using popcnt was almost an immeasurable change when I was testing on the Nehalem box we had here a month ago. If you play your cards right, you can keep PopCnt() from being an important factor at all.
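For the CPUID check mentioned above, note that POPCNT has its own feature flag (CPUID leaf 1, ECX bit 23), separate from the SSE4.2 flag (ECX bit 20). The Phenom, for example, has popcnt without SSE4.2. A minimal sketch, assuming GCC or Clang on x86 with their <cpuid.h> header; the function name cpu_has_popcnt is made up for this example:

```c
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

/* Hypothetical helper: returns 1 if CPUID reports POPCNT, 0 otherwise. */
static int cpu_has_popcnt(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                    /* CPUID leaf 1 not available */
    return (int)((ecx >> 23) & 1u);  /* CPUID.01H:ECX bit 23 = POPCNT */
#else
    return 0;                        /* other platforms: assume no popcnt */
#endif
}
```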
Ryan Benitez wrote: I use PopCnt() a lot. It is the sacrifice I made for making the board as small as I can.
Makes perfect sense. There is no need to store and maintain redundant piece counters anymore. On AMD K10, popcnt has a latency of 2 cycles and is restricted to scheduling in pipe 2. I have no exact Intel data on popcnt handy, but I guess it is in a similar range. So popcnt can now be used as often as desired without tears, even on sparsely populated sets for center- or king-distance-related mobility, where without the popcnt instruction one would prefer Brian Kernighan's loop approach or perhaps an SSE2 dot product.
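For reference, Brian Kernighan's loop mentioned above looks roughly like this. It iterates once per set bit, which is why it suits sparsely populated sets. The function name is made up for this example.

```c
#include <stdint.h>

/* Kernighan-style population count: one iteration per set bit. */
static inline int popcount_kernighan(uint64_t b)
{
    int count = 0;
    while (b) {
        b &= b - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```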
I have one question regarding a runtime check of CPU capabilities.
How can this be used for a chess engine that is supposed to be released as a binary and downloaded and used by anyone on any computer?
I mean, I cannot use a function pointer to redirect at runtime to a fallback standard C implementation if the host CPU does not support POPCNT. That would be far too slow. This kind of function must be inlined. So, once I have checked whether or not the host CPU has the popcnt capability, what can I do?
A chess engine is not supposed to be compiled on _any_ PC that will run it; it is compiled once with the best optimization and distributed.
The only possibility I foresee is to create two compiles, one for CPUs with POPCNT and another for CPUs without POPCNT, but considering that we also need another two versions for 32 and 64 bits, we are ramping up fast on this combinatorial escalation.
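One possible answer to the inlining concern, offered as a sketch and not as something anyone in this thread does: keep the function pointer, but point it at a large routine instead of at popcount itself. The routine is compiled twice, once with popcnt enabled, so the bit count is still inlined inside each copy and the indirect call is paid only once per routine. The sketch assumes GCC or Clang on x86 (the target attribute and __builtin_cpu_supports); mobility_sw, mobility_hw and select_mobility are hypothetical names standing in for real engine code.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Portable fallback, inlined into the software variant below. */
static inline int popcount_sw(uint64_t b)
{
    int n = 0;
    while (b) { b &= b - 1; n++; }
    return n;
}

/* Hypothetical hot routine, software variant. */
static int mobility_sw(const uint64_t *bb, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) total += popcount_sw(bb[i]);
    return total;
}

#if HAVE_X86_DISPATCH
/* Same routine, compiled with popcnt enabled for this function only. */
__attribute__((target("popcnt")))
static int mobility_hw(const uint64_t *bb, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) total += __builtin_popcountll(bb[i]);
    return total;
}
#endif

typedef int (*mobility_fn)(const uint64_t *, int);

/* Called once at startup; the result is stored and reused. */
static mobility_fn select_mobility(void)
{
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("popcnt")) return mobility_hw;
#endif
    return mobility_sw;
}
```

The cost is code size, since every routine on the hot path exists twice, and the two copies have to be kept in sync, for instance by generating both from one source file. Whether that beats shipping separate executables is a judgment call.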
mcostalba wrote: I have one question regarding a runtime check of CPU capabilities.
Well, I don't do it at runtime. I have a Makefile that builds an executable which generates a second Makefile containing CPU info; I then include the generated Makefile in the original one, so the build is optimized for the platform it is running on. For cross-platform builds, I can explicitly enable or disable features by invoking make differently:
Or have this done in a Makefile for loop if I want to build the same source code but with different flags.
I guess if you are releasing it on the internet you could make multiple executables. This may not be the best way of doing it but this is what I do.
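On the source side, a build-time switch like the one described above could be consumed roughly as follows. This is a sketch only: USE_POPCNT is a hypothetical macro name (passed as, say, -DUSE_POPCNT by whatever the Makefile generates), not the flag used in the build system above. With GCC, the popcnt instruction is only emitted if the build also passes -mpopcnt or a suitable -march.

```c
#include <stdint.h>

#if defined(USE_POPCNT) && defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
/* MSVC x64: hardware popcnt via intrinsic. */
static inline int count_1s(uint64_t b) { return (int)__popcnt64(b); }

#elif defined(USE_POPCNT) && defined(__GNUC__)
/* GCC/Clang: builtin; becomes a popcnt instruction when -mpopcnt is set. */
static inline int count_1s(uint64_t b) { return __builtin_popcountll(b); }

#else
/* Fallback for CPUs without popcnt: parallel (SWAR) bit count. */
static inline int count_1s(uint64_t b)
{
    b = b - ((b >> 1) & 0x5555555555555555ULL);            /* 2-bit sums */
    b = (b & 0x3333333333333333ULL)
      + ((b >> 2) & 0x3333333333333333ULL);                /* 4-bit sums */
    b = (b + (b >> 4)) & 0x0F0F0F0F0F0F0F0FULL;            /* 8-bit sums */
    return (int)((b * 0x0101010101010101ULL) >> 56);       /* add bytes  */
}
#endif
```

Either way the call site stays the same, so the rest of the engine does not care which variant was built.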
The only possibility I foresee is to create two compiles, one for CPUs with POPCNT and another for CPUs without POPCNT, but considering that we also need another two versions for 32 and 64 bits, we are ramping up fast on this combinatorial escalation.
With a Makefile based build-system, this really isn't a problem (apart from the build-time).
Last edited by Pradu on Sun May 10, 2009 10:32 pm, edited 1 time in total.
we are ramping up fast on this combinatorial escalation.
I'm guessing this is the reason why many programs are distributed without binaries. For a closed-source program this could indeed be a problem, but at least for computer chess it isn't too bad. For example, if you have a chess engine that targets x86/x64 platforms, with or without popcnt, for AMD and for Intel processors, on Windows, Linux and Mac, that's only 2 × 2 × 2 × 3 = 24 executables. I guess you could take a few more options (perhaps around a hundred executables), and beyond that point it would indeed be a serious problem, but I don't know any workaround other than distributing the source code, which may or may not be possible.