You also do the useless 'if( a )' clause in front of it?

mar wrote:
Hi Engin,

Engin wrote:
MY_INLINE int my_pop_cnt64 (const uint64_t a)
{
    if(a) return (count_word[0xffff & a] +
                  count_word[0xffff & (a >> 16)] +
                  count_word[0xffff & (a >> 32)] +
                  count_word[0xffff & (a >> 48)] ) ;
    return (0);
}
Yes, I also use a LUT in cheng. Also a 64K table, 4 lookups.
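For readers following along, here is a minimal sketch of the 64K-table, 4-lookup approach under discussion. The thread only shows `count_word` being indexed, so the table declaration, the `init_count_word` helper and the `lut_pop_cnt64` name are assumptions made for illustration, not code from either engine. The sketch also leaves out the leading `if (a)` test, since a zero input simply reads `count_word[0]` four times.

```c
#include <stdint.h>

/* Hypothetical declaration: the thread never shows how count_word is
   declared or filled. One byte per 16-bit value gives a 64 KB table. */
static uint8_t count_word[1 << 16];

/* Fill the table once at startup, using
   popcount(i) = popcount(i >> 1) + (lowest bit of i). */
static void init_count_word(void)
{
    count_word[0] = 0;
    for (uint32_t i = 1; i < (1u << 16); i++)
        count_word[i] = (uint8_t)(count_word[i >> 1] + (i & 1u));
}

/* Four 16-bit lookups, summed. No branch on a == 0 is needed. */
static inline int lut_pop_cnt64(uint64_t a)
{
    return count_word[a & 0xffff]
         + count_word[(a >> 16) & 0xffff]
         + count_word[(a >> 32) & 0xffff]
         + count_word[a >> 48];
}
```

`init_count_word()` has to run once before the first call to `lut_pop_cnt64()`.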
Without it, it's 2 cycles, provided your results are in L1d; otherwise you have to get them out of L2 or L3, which is slow.
It's 4 cycles, yet L1d has lots of problems keeping up with the result.

Doing 8 256-byte lookups proved slower, and any other fancy formulas proved much slower (note I only tested in 32-bit mode).
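The two alternatives mentioned here could look roughly like the sketch below. The 256-byte table variant follows directly from the description; which "fancy formulas" were actually tried is not stated, so the SWAR bit-twiddling version is only one well-known candidate, and all names (`count_byte`, `lut8_pop_cnt64`, `swar_pop_cnt64`) are hypothetical.

```c
#include <stdint.h>

/* Hypothetical 256-byte table: bit count of every 8-bit value. */
static uint8_t count_byte[256];

static void init_count_byte(void)
{
    count_byte[0] = 0;
    for (unsigned i = 1; i < 256; i++)
        count_byte[i] = (uint8_t)(count_byte[i >> 1] + (i & 1u));
}

/* Eight byte-wide lookups instead of four 16-bit ones. */
static inline int lut8_pop_cnt64(uint64_t a)
{
    int n = 0;
    for (int i = 0; i < 8; i++) {
        n += count_byte[a & 0xff];
        a >>= 8;
    }
    return n;
}

/* Table-free SWAR formula: sum bits in pairs, then nibbles, then bytes,
   and add up the eight byte counts with one multiply. */
static inline int swar_pop_cnt64(uint64_t a)
{
    a = a - ((a >> 1) & 0x5555555555555555ULL);
    a = (a & 0x3333333333333333ULL) + ((a >> 2) & 0x3333333333333333ULL);
    a = (a + (a >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (int)((a * 0x0101010101010101ULL) >> 56);
}
```

The SWAR version relies on 64-bit arithmetic, which a 32-bit build has to emulate with register pairs, so timings taken in 32-bit mode may not carry over to a 64-bit build.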
Also, I see the above code uses a '+' (PLUS). You can also use | (OR) of course, which is faster on paper, except for the L1d problems again.
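One caveat on swapping '+' for '|': the four partial results are small integers, not disjoint bit fields, so OR-ing them does not in general give the total bit count. The sketch below checks this with a naive per-chunk counter; the helper names are hypothetical and the table lookup is replaced by a loop only to keep the example self-contained.

```c
#include <stdint.h>

/* Bit count of one 16-bit chunk of a, counted naively (stand-in for
   a count_word[] lookup). */
static int chunk_count(uint64_t a, int shift)
{
    uint32_t w = (uint32_t)((a >> shift) & 0xffff);
    int n = 0;
    while (w) {
        n += (int)(w & 1u);
        w >>= 1;
    }
    return n;
}

/* Partial counts combined with '+', as in the quoted function. */
static int combine_add(uint64_t a)
{
    return chunk_count(a, 0) + chunk_count(a, 16)
         + chunk_count(a, 32) + chunk_count(a, 48);
}

/* Partial counts combined with '|' instead. */
static int combine_or(uint64_t a)
{
    return chunk_count(a, 0) | chunk_count(a, 16)
         | chunk_count(a, 32) | chunk_count(a, 48);
}
```

For `a = 0x00010001` the two lowest chunks each hold one set bit, so `combine_add` returns 2 while `combine_or` returns 1. The two only agree when the partial counts happen to share no set bits, for example when all set bits fall inside a single 16-bit chunk.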
Of course it is faster if the other instructions are already in SIMD. SSE4 can do 2 at a time, which is always faster. AVX can do 4 at a time, which is even faster.

I'm not sure, but I believe only SSE4 popcnt was faster.
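For comparison, a hardware popcount is usually reached through a compiler builtin rather than hand-written assembly. The sketch below assumes GCC or Clang, where `__builtin_popcountll` exists; the wrapper name `hw_pop_cnt64` and the fallback loop are assumptions added for illustration.

```c
#include <stdint.h>

/* Hypothetical wrapper around the compiler's popcount builtin. */
static inline int hw_pop_cnt64(uint64_t a)
{
#if defined(__GNUC__) || defined(__clang__)
    /* Compiles to a single POPCNT instruction only when the target
       allows it (e.g. -mpopcnt or -msse4.2); otherwise the compiler
       substitutes a software routine. */
    return __builtin_popcountll(a);
#else
    /* Portable fallback: clear the lowest set bit until none remain. */
    int n = 0;
    while (a) {
        a &= a - 1;
        n++;
    }
    return n;
#endif
}
```

Because the instruction is only emitted when the build targets a CPU that has it, this is the variant that would need the extra binary discussed in this thread.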
So if you write a simple test, the compiler will usually fool everyone there.
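A rough sketch of a test loop that is harder for the compiler to fold away: the inputs come from a runtime generator and the accumulated result is written to a `volatile` sink. Everything here is hypothetical (the xorshift generator, the `pop_cnt64_under_test` stand-in, the `run_bench` name), and it does nothing about the cache effects mentioned earlier in the thread, since the loop keeps any table hot in L1d.

```c
#include <stdint.h>

/* Hypothetical xorshift64 generator, so inputs are not compile-time
   constants. The state must be non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Stand-in for the routine being measured (clears one set bit per
   iteration); swap in the LUT or builtin version to compare. */
static int pop_cnt64_under_test(uint64_t a)
{
    int n = 0;
    while (a) {
        a &= a - 1;
        n++;
    }
    return n;
}

/* Writing the total here keeps the loop from being deleted as dead code. */
static volatile uint64_t bench_sink;

/* Runs the routine over `iterations` pseudo-random inputs and returns
   the summed counts; wrap the call in a timer of your choice. */
static uint64_t run_bench(uint64_t seed, uint64_t iterations)
{
    uint64_t state = seed ? seed : 1;
    uint64_t total = 0;
    for (uint64_t i = 0; i < iterations; i++)
        total += (uint64_t)pop_cnt64_under_test(xorshift64(&state));
    bench_sink = total;
    return total;
}
```

Comparing the returned totals between two implementations also doubles as a correctness check before looking at any timings.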
No elo at all of course, except if you mess up.

However, even with popcnt the speedup was really negligible, so I never enabled it and I believe it's useless. At my level (~2500 elo engine +- rating list elo normalization offset) certainly.
One more binary for how much? 1, 2, 5 or 10 elo at best perhaps? Not worth it IMO, unless you want to make some nerds happy.
