SSE4 instructions

Discussion of chess software programming and technical issues.

Moderator: Ras

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SSE4 instructions

Post by diep »

mar wrote:
Engin wrote: MY_INLINE int my_pop_cnt64 (const uint64_t a)
{
if(a) return (count_word[0xffff & a] +
count_word[0xffff & (a >> 16)] +
count_word[0xffff & (a >> 32)] +
count_word[0xffff & (a >> 48)] ) ;

return (0);
}
Hi Engin,

yes i also use a LUT in cheng. Also 64K table, 4 lookups.
You also do the useless 'if( a )' clause in front of it?
Without it it's 2 cycles provided your results are in L1d, otherwise you have to get it out of L2 or L3 which is slow.
Doing 8 256-byte lookups proved slower, and any other fancy formulas proved much slower (note i only tested in 32-bit mode).
it's 4 cycles yet lots of problems for L1d to keep up with the result,
also i see the above code uses a '+' (PLUS). You can also use | (OR) of course, which is faster on paper, except L1d problems again.
I'm not sure but i believe only SSE4 popcnt was faster.
Of course it is faster if other instructions already are in SIMD. The SSE4 can do 2 at a time which always is faster. AVX can do 4 at a time which is even faster.

So if you write simple test, usually compiler will fool everyone there.
However even with popcnt the speedup was really negligible so i never enabled it and i believe it's useless. At my level (~2500 elo engine +- rating list elo normalization offset) certainly.
One more binary for how much? 1, 2, 5 or 10 elo at best perhaps? Not worth it IMO unless want to make some nerds happy :wink:
No elo at all of course, except if you mess up.
mar
Posts: 2671
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: SSE4 instructions

Post by mar »

diep wrote: You also do the useless 'if( a )' clause in front of it?
Without it it's 2 cycles provided your results are in L1d, otherwise you have to get it out of L2 or L3 which is slow.
Doing 8 256-byte lookups proved slower, and any other fancy formulas proved much slower (note i only tested in 32-bit mode).
it's 4 cycles yet lots of problems for L1d to keep up with the result,
also i see the above code uses a '+' (PLUS). You can also use | (OR) of course, which is faster on paper, except L1d problems again.
Yes I also have a zero-test before the lookups. I'm not a hardware expert and never optimized code to be cache-friendly. I always believed in high-level optimization, unless I was writing a software rasterizer or sample mixer.
Perhaps this obsession comes from the old times where we used to count instruction cycles because we had no cache at all :)
I don't claim anything, only report my old results.
I hope I tested directly time to depth in a certain position. If i used a huge loop with the same bitboard being fed to the popcount routine then that was certainly a lame test indeed.
PS. You can't use or instead of an addition. what if you have 0x10001. or would return popcnt 1 instead of correct 2.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SSE4 instructions

Post by bob »

Engin wrote:i just wonder why my lookups are faster....
How are you testing? Your lookup table is 64kb, which is larger than Intel L1 data cache. If you just do repeated popcnt's on the same value, that test is worthless because you get the 4 entries needed loaded into L1. If you test on random 64 bit values, you get a better idea. But even that is not good enough, because that 64kb array will have to compete with all the OTHER data a real chess engine uses, which means that if even one of those accesses misses on L1 cache, you are immediately slower than a popcnt hardware instruction...
Engin
Posts: 1001
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: SSE4 instructions

Post by Engin »

hi vincent,

are you serious or just want kidding me?

yes, this lookups needs only 8 KB, thats acceptable, as i said i am using that only on 32 bit version, of course on 64 bit version i am using the SSE4 popcount, and also the _BitScanForward64 functions for first bit and _BitScanReverse64 for last bit what is extremly faster then my lookup tables.
Engin
Posts: 1001
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: SSE4 instructions

Post by Engin »

..and please we are talking here about SSE instructions not about whole code optimizations or improvements, i am working of course on different places at my sources.
Engin
Posts: 1001
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: SSE4 instructions

Post by Engin »

the if(a) is not userless because its quick return zero if there are not bits
Engin
Posts: 1001
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: SSE4 instructions

Post by Engin »

i am testing with fixed depths and different positions, and notice the time whats need, with 2x32bit and with 4x16 and found thats the lookups are slightly msec faster.
Engin
Posts: 1001
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: SSE4 instructions

Post by Engin »

well i had done that too, its not faster with 32bit version
Engin
Posts: 1001
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: SSE4 instructions

Post by Engin »

yes, for the first view its logicaly to use hw popcount even in 32bit think its faster to using 2x32bit then to do 4x16bit, i cant explain why the lookups are faster, maybe its the hardware cache or something else whatever.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: SSE4 instructions

Post by Don »

MM wrote:Hi all,

how many more knodes and how many elo points or how much more speed can sse4 instructions give an engine?

I don't know anything about programming and i wonder if there's a substantial difference in performance.

Thank you

Best Regards
Komodo seems to benefit more than other programs so your mileage may vary, but it's probably not a huge deal.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.