SSE4 instructions

diep · Post by **diep** » Thu May 24, 2012 2:35 pm

mar wrote:
Engin wrote: MY_INLINE int my_pop_cnt64 (const uint64_t a)
{
if(a) return (count_word[0xffff & a] +
count_word[0xffff & (a >> 16)] +
count_word[0xffff & (a >> 32)] +
count_word[0xffff & (a >> 48)] ) ;

return (0);
}
Hi Engin,

yes i also use a LUT in cheng. Also 64K table, 4 lookups.

You also do the useless 'if( a )' clause in front of it?
Without it it's 2 cycles provided your results are in L1d, otherwise you have to get it out of L2 or L3 which is slow.

Doing 8 256-byte lookups proved slower, and any other fancy formulas proved much slower (note i only tested in 32-bit mode).

it's 4 cycles yet lots of problems for L1d to keep up with the result,
also i see the above code uses a '+' (PLUS). You can also use | (OR) of course, which is faster on paper, except L1d problems again.

I'm not sure but i believe only SSE4 popcnt was faster.

Of course it is faster if other instructions already are in SIMD. The SSE4 can do 2 at a time which always is faster. AVX can do 4 at a time which is even faster.

So if you write simple test, usually compiler will fool everyone there.

However even with popcnt the speedup was really negligible so i never enabled it and i believe it's useless. At my level (~2500 elo engine +- rating list elo normalization offset) certainly.
One more binary for how much? 1, 2, 5 or 10 elo at best perhaps? Not worth it IMO unless want to make some nerds happy

No elo at all of course, except if you mess up.

mar · Post by **mar** » Thu May 24, 2012 2:46 pm

diep wrote: You also do the useless 'if( a )' clause in front of it?
Without it it's 2 cycles provided your results are in L1d, otherwise you have to get it out of L2 or L3 which is slow.

Doing 8 256-byte lookups proved slower, and any other fancy formulas proved much slower (note i only tested in 32-bit mode).
it's 4 cycles yet lots of problems for L1d to keep up with the result,
also i see the above code uses a '+' (PLUS). You can also use | (OR) of course, which is faster on paper, except L1d problems again.

Yes I also have a zero-test before the lookups. I'm not a hardware expert and never optimized code to be cache-friendly. I always believed in high-level optimization, unless I was writing a software rasterizer or sample mixer.
Perhaps this obsession comes from the old times where we used to count instruction cycles because we had no cache at all

I don't claim anything, only report my old results.
I hope I tested directly time to depth in a certain position. If i used a huge loop with the same bitboard being fed to the popcount routine then that was certainly a lame test indeed.
PS. You can't use or instead of an addition. what if you have 0x10001. or would return popcnt 1 instead of correct 2.

bob · Post by **bob** » Thu May 24, 2012 5:41 pm

Engin wrote:i just wonder why my lookups are faster....

How are you testing? Your lookup table is 64kb, which is larger than Intel L1 data cache. If you just do repeated popcnt's on the same value, that test is worthless because you get the 4 entries needed loaded into L1. If you test on random 64 bit values, you get a better idea. But even that is not good enough, because that 64kb array will have to compete with all the OTHER data a real chess engine uses, which means that if even one of those accesses misses on L1 cache, you are immediately slower than a popcnt hardware instruction...

Engin · Post by **Engin** » Thu May 24, 2012 5:54 pm

hi vincent,

are you serious or just want kidding me?

yes, this lookups needs only 8 KB, thats acceptable, as i said i am using that only on 32 bit version, of course on 64 bit version i am using the SSE4 popcount, and also the _BitScanForward64 functions for first bit and _BitScanReverse64 for last bit what is extremly faster then my lookup tables.

Engin · Post by **Engin** » Thu May 24, 2012 6:07 pm

..and please we are talking here about SSE instructions not about whole code optimizations or improvements, i am working of course on different places at my sources.

Engin · Post by **Engin** » Thu May 24, 2012 6:24 pm

the if(a) is not userless because its quick return zero if there are not bits

Engin · Post by **Engin** » Thu May 24, 2012 6:31 pm

i am testing with fixed depths and different positions, and notice the time whats need, with 2x32bit and with 4x16 and found thats the lookups are slightly msec faster.

Engin · Post by **Engin** » Thu May 24, 2012 6:35 pm

well i had done that too, its not faster with 32bit version

Engin · Post by **Engin** » Thu May 24, 2012 6:41 pm

yes, for the first view its logicaly to use hw popcount even in 32bit think its faster to using 2x32bit then to do 4x16bit, i cant explain why the lookups are faster, maybe its the hardware cache or something else whatever.

Don · Post by **Don** » Thu May 24, 2012 11:09 pm

MM wrote:Hi all,

how many more knodes and how many elo points or how much more speed can sse4 instructions give an engine?

I don't know anything about programming and i wonder if there's a substantial difference in performance.

Thank you

Best Regards

Komodo seems to benefit more than other programs so your mileage may vary, but it's probably not a huge deal.

SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions