Engin wrote: hi Richard,
i don't use hw popcount for 32 bit, because my own pre-calculated 16-bit lookup tables for popcount are faster than using two 32-bit hw popcounts; only in 64 bit is it faster and makes sense to me.
here is my lookup-table code:
MY_INLINE int my_pop_cnt64 (const uint64_t a)
{
    /* sum the pre-calculated bit counts of the four 16-bit chunks */
    if (a)
        return (count_word[0xffff &  a       ] +
                count_word[0xffff & (a >> 16)] +
                count_word[0xffff & (a >> 32)] +
                count_word[0xffff & (a >> 48)]);
    return (0);
}
count_word is a pre-calculated array indexed by 16-bit values (0...65535), and i only count the four 16-bit chunks.
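(For reference, a minimal sketch of how such a 16-bit count_word table could be pre-calculated, assuming one byte per entry; the initialisation code below is illustrative, not necessarily the code actually used.)

#include <stdint.h>

/* illustrative: 65536-entry table, one byte of bit count per 16-bit value */
static uint8_t count_word[65536];

static void init_count_word(void)
{
    /* popcount(i) == popcount(i >> 1) + lowest bit of i; count_word[0] stays 0 */
    for (unsigned i = 1; i < 65536; i++)
        count_word[i] = (uint8_t)(count_word[i >> 1] + (i & 1));
}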
Wow, you really progressed in programming skills; congratulations, Ustun!
From not knowing how to program at all a year or two ago, when we spoke for hours online and you made frantic efforts to get onto my skype+facebook, to being busy with pretty low-level subjects now. Well done!
Oh, on what you posted: just get rid of all that nonsense. A lookup table of 2^16 entries, in case they are stored as bytes, is 65536 bytes.
That's exactly the size of the AMD level-1 data cache (64KB).
So on intel you probably get screwed already, as the L1 data cache is only 32KB.
i assume you are just running a small test program, with RANDOM values to popcount each time; otherwise your test would be incorrect.
If you don't test random values, then the processor can just do the lookups from its L1d easily.
When running this in parallel with some sort of serious software that you didn't just cut'n'paste, it's total nonsense to waste such a big lookup table on this, as that program can use the cache very well itself.
In fact i can imagine it would really cripple the HyperThreading capabilities of the software if such massive lookup tables are screwing up the L1d.
So whatever test you do there, i wouldn't use such big lookup tables to get values out of a bitboard; replace them with alternatives, maybe SSE4a (see the sketch further down). i assume you can't afford a sixcore i7, is it? So you probably have an AMD sixcore, which is just 1/3 of the price of that i7.
Correct assumption?
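(A hedged sketch of such an alternative: the hardware popcount through the gcc builtin. Compiled with -mpopcnt, or -msse4.2 on intel / -mabm on an AMD K10-class chip, the builtin becomes a single POPCNT instruction; without those flags gcc falls back to its own bit-twiddling code. The function name is just an example.)

#include <stdint.h>

/* hardware popcount via the compiler builtin: no lookup table, no L1d pressure */
static inline int popcnt64_hw(uint64_t x)
{
    return __builtin_popcountll(x);
}

That counts all 64 bits in one instruction on chips that have POPCNT and costs no cache at all.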
Over here i have Xeon L5420s, some 64 cores.
They support SSSE3, however it's pretty much idiocy to write things in SIMD instructions for a program like Diep, which is a 32-bit program that doesn't use 64-bit bitboards.
That said, i still need a specialized type of LSB/FSB in Diep, as my attack tables store some bits; yet this happens very seldom in evaluation,
as i just use it in some seldom-executed code to figure out whether there are, for example, 2 pawns attacking the same square. My attack tables give that info obviously, as bits 15..30 have a bit set for each of the opponent's pieces attacking the square. So that's where the LSB/FSB comes in handy.
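(A hedged sketch of that kind of use; the exact bit layout of Diep's attack tables isn't given here, so the mask for bits 15..30 and the attacker indexing are assumptions, and __builtin_ctz just stands in for whatever specialized 32-bit LSB/FSB is actually used.)

#include <stdint.h>

/* hypothetical layout: bits 15..30 of an attack-table word carry one bit per
   attacking piece; a 32-bit bit scan (BSF) finds the first attacker */
#define ATTACKER_BITS 0x7FFF8000u        /* mask for bits 15..30 */

static inline int first_attacker(uint32_t attack_word)
{
    uint32_t bits = attack_word & ATTACKER_BITS;
    if (bits == 0)
        return -1;                       /* no attacker bits set */
    return __builtin_ctz(bits) - 15;     /* index of the lowest attacker bit */
}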
No popcount is needed in Diep though. Yet i need one in some other type of software i build in my spare time.
In Diep the LSB/FSB is 32 bits though. In general fewer instructions means superior speed there. i'm not a fan of magic-multiplication type calculations. Those can only get handled by the multiplication unit and need a small array which for sure won't be in the L1d by then. Very bad for HyperThreading, not to mention the newest AMD chip. Multiplications really are a problem on modern CPUs, so if you can avoid them, get rid of them. Luckily in 32 bits there are instructions to convert a bit to an integer, and for popcount there is this simple while loop that gets you the number of bits quickly; i use that, though not in Diep.
Looping goes really fast on modern CPUs. If you can expect at most 1 bit to be set, why not use it?
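(Something like this clear-the-lowest-bit loop, a common formulation of that while loop; the exact code here is illustrative. It runs once per set bit, so a word with at most one or two bits set costs almost nothing.)

#include <stdint.h>

/* count set bits by clearing the lowest set bit each pass; the loop body
   executes once per set bit */
static inline int popcount_sparse(uint32_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;      /* clear the lowest set bit */
        n++;
    }
    return n;
}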
For some prime number work i run here, to be precise a sieve, i'm using a 64-bit popcount. That runs mostly on AMD Barcelona cores here and feeds Tesla GPUs. The fastest way to do a 64-bit popcount there also of course isn't with arrays; that's always the inferior solution that is only fast in some immature tests, yet never in reality. i do it simply with some shifting. Benchmarking the magic-number version is pretty fast there in tests, yet i don't trust it for production code, again because tests are always so limited; so i'm just shifting a tad there to get the 64-bit popcount.
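(Something along these lines: a common shift-and-mask (SWAR) 64-bit popcount with no arrays and no multiplication. The exact formulation and the sieve loop below are illustrative, not the actual production code.)

#include <stdint.h>
#include <stddef.h>

/* 64-bit popcount using shifts and masks only: no table, no multiplication */
static inline int popcount64_swar(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);                            /* 2-bit sums */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);  /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            /* 8-bit sums */
    x += x >> 8;
    x += x >> 16;
    x += x >> 32;
    return (int)(x & 0x7F);
}

/* illustrative use: count candidates still marked in a sieve stored as a
   bit array of 64-bit words */
static size_t count_survivors(const uint64_t *sieve, size_t nwords)
{
    size_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += (size_t)popcount64_swar(sieve[i]);
    return total;
}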
Nothing beats intuition there. It saves me weeks of benchmarking, with in the end the thing i cut'n'pasted from the net being the fastest way to do a popcount anyway. i'm sure though that if you had SSE4 that might be faster, yet that means the entire code needs to run in SSE4, so you can forget about that as well.
Of course in AVX it's even faster than SSE4, yet are we sure we are on the same planet then? It means the entire code must be in AVX.
What the GIMPS guys are doing there is totally different btw, and a lot slower for me; yet again, the important thing there is that i'm running this on AMD and not on intel. They use a small prime base that just fits inside the L1/L2 including the sieve; they don't go outside a 256KB working-set size,
whereas i'm filling it up to 1MB with a bigger prime base.
They're entirely 32-bit code, whereas i'm entirely 64-bit code. 64 bits is *considerably* faster there than 32 bits, except for storing the primes (you can store a difference that just fits in 1 byte).
A bigger prime base gives a factor of 2 fewer composites to sieve. That is all irrelevant for the popcount here, yet it shows you that we all fill the caches up to the limit, and then there is no room left for big arrays that just do a popcount.
Not sure how they get away with it; probably they didn't measure accurately, or they didn't care as they have 70k machines anyway.
What really matters is how the resulting program behaves using that specific instruction during the entire game.
There really is no other test than that.
Measuring that accurately is really tricky, as you try to measure something of around a few tenths of a second over a long period of time, and the SMP has all sorts of nasty effects, together with the OS, on the caches.
It'll take days to measure that accurately i suspect.
Only then do you really know what's faster!
Yet my guess is that it won't involve massive big arrays just to count a few bits...
Vincent
p.s. are you Iran-Turkish or Turkish-Turkish?