SSE4 instructions

bob · Post by **bob** » Wed May 23, 2012 5:39 pm

Engin wrote:
zamar wrote:It depends how heavy program's evaluation function is...

For program with light evaluation it's probably only 2 elo points.
For program with very heavy evaluation it might be even 10 elo points.

Stockfish gains around 4-5 elo points.
how heavy ? light evaluation +2 elo ?

i think its depends on how often you are using the popcount in the evaluation.

If we are just talking popcnt() I don't believe the gain is significant. If we are talking about the full-blown SSE4.2 stuff, there is potential gain there since that provides extra 64 bit registers and a pipeline that can help...

My first test on Nehlem was not very convincing. It _might_ have improved Crafty's speed by a couple of percentage points, no more. But I have taken great pains to avoid popcnt() whenever practical by pre-computing things like mobility scores...

Engin · Post by **Engin** » Wed May 23, 2012 6:35 pm

its not faster then on my pre-calculated 16bit popcount i am do, thats why i prefer my own then the ones with two 32bit popcount.

rvida · Post by **rvida** » Wed May 23, 2012 7:28 pm

bob wrote:
rbarreira wrote: Why? I'm pretty sure that doing "hw_popcount (low_32_bits) + hw_popcount (high_32_bits)" is faster than other methods of doing 64-bit popcnt on a 32-bit architecture.
I am not sure why one would do a pair of 32 bit operations on a machine that is obviously 64 bits (those are the only processors with hardware popcnt)...

He meant 64bit SSE4.2 capable processor running a 32bit OS (+32bit engine obviously). In such scenario it makes sense to use hw popcnt even in 32bit mode.

bob · Post by **bob** » Wed May 23, 2012 7:47 pm

rvida wrote:
bob wrote:
rbarreira wrote: Why? I'm pretty sure that doing "hw_popcount (low_32_bits) + hw_popcount (high_32_bits)" is faster than other methods of doing 64-bit popcnt on a 32-bit architecture.
I am not sure why one would do a pair of 32 bit operations on a machine that is obviously 64 bits (those are the only processors with hardware popcnt)...
He meant 64bit SSE4.2 capable processor running a 32bit OS (+32bit engine obviously). In such scenario it makes sense to use hw popcnt even in 32bit mode.

Even worse, as now you lose ALL the advantages the 64 bit cpu brings. Extra 8 registers for starters...

I'm still amazed that anyone considers doing this (32 bit O/S on a very recent 64 bit cpu...)

bob · Post by **bob** » Wed May 23, 2012 7:48 pm

Engin wrote:its not faster then on my pre-calculated 16bit popcount i am do, thats why i prefer my own then the ones with two 32bit popcount.

hardware popcnt has to be faster if this is done right. You do two 32 bit single instructions and add the results. A lookup requires 4 16-bit memory accesses, which is guaranteed to be slower unless everything is in L1 cache which is unlikely in a chess engine...

Engin · Post by **Engin** » Thu May 24, 2012 12:02 am

hi Richard,
i dont use hw popcount for 32 bit, because my own pre-calculated 16bit lookup tables for popcount ist faster then using two 32bit hw popcount, only in 64 bit its faster and make sense to me.

here are my code look up tables

MY_INLINE int my_pop_cnt64 (const uint64_t a)
{
if(a) return (count_word[0xffff & a] +
count_word[0xffff & (a >> 16)] +
count_word[0xffff & (a >> 32)] +
count_word[0xffff & (a >> 48)] ) ;

return (0);
}

the count_word is a 16 bit (0...65535) pre-calculated array, and i do only count the 4x16 bits

Engin · Post by **Engin** » Thu May 24, 2012 12:28 am

i just wonder why my lookups are faster....

diep · Post by **diep** » Thu May 24, 2012 2:02 am

Engin wrote:hi Richard,
i dont use hw popcount for 32 bit, because my own pre-calculated 16bit lookup tables for popcount ist faster then using two 32bit hw popcount, only in 64 bit its faster and make sense to me.

here are my code look up tables

MY_INLINE int my_pop_cnt64 (const uint64_t a)
{
if(a) return (count_word[0xffff & a] +
count_word[0xffff & (a >> 16)] +
count_word[0xffff & (a >> 32)] +
count_word[0xffff & (a >> 48)] ) ;

return (0);
}

the count_word is a 16 bit (0...65535) pre-calculated array, and i do only count the 4x16 bits

Wow you really progressed in programming skills, congratulations Ustun!
From not knowing how to program at all a year or 2 ago when we spoke for hours online, doing frantic efforts to get on my skype+facebook, to being busy with pretty low level subjects now. Well done!
Oh on what you posted. Just get rid of all that nonsense. Those lookup tables of 2^16 in case they are byte representation it's 65536 bytes.

That's exactly the size of the AMD level1 cache.
So probably at intel you get screwed already as the L1 datacache is 32KB.
In case you are just running a small testprogram, with as i assume RANDOM values each time from where you do the popcount, otherwise your test would be incorrect.

If you don't test random values, then the processor can just lookup from its L1d easily.

When running parallel with some sort of serious software that you didn't cut'n paste, then it's total nonsense to waste such a big lookup table as the program can use that very well.

In fact i can imagine it would really cripple Hyperthreading capabilities of software if you have such massive lookuptables screwing the L1d.

So whatever test you do there, i wouldn't use such big lookuptables to get values out of a bitboard and replace it by alternatives, maybe SSE4a, i assume that you can't afford a sixcore i7 is it? So you probably have an AMD sixcore which is just 1/3 of the price of that i7.

Correct assumption?

Over here i have Xeon L5420s and a core or 64.

They support SSSE, however it's pretty much idiocy to write things in SIMD instructions for a program like Diep which is a 32 bits program that doesn't have 64 bits bitboards.

That said i still need a specialized type of LSB/FSB in Diep as my attacktables store some bits, yet this happens very seldom in evaluation,
as i just use it to figure out in some seldom code whether there is for example 2 pawns attacking the same square. My attacktables give that info obviously as the bits 15..30 have a bit set from each of the opponents pieces. So that's where the LSB/FSB hops in handily.

No popcount need in Diep though. Yet i need that in some other type of software i build in my sparetime.

In Diep the LSB/FSB is 32 bits though. In general less instructions always is the superior speed then. I'm not a fan of magic type calculations. Those can only get handled in the multiplication unit and need a small array which for sure won't be in L1d by then. Very bad for HyperThreading not to mention the newest AMD chip. Multiplications really are a problem at modern CPU's, so if you can avoid them, get rid of them. Lucky there is in 32 bits instructions to convert a bit to an integer and popcount there is this simple while loop that gets you the number of bits quickly, used though not in Diep.

Looping goes really fast at modern cpu's. If you can expect just 1 bit at most to be counted, why not use it?

For some prime numbers i run here, to be precise a sieve, i'm using a 64 bits popcount. Now that runs mostly at AMD barcelona cores here and feeds Tesla GPU's. Fastest way to do 64 bits popcount there also of course isn't with arrays, that's always the inferior solution that is only fast in some immature tests, yet never in reality. I do that by simply some shifting. Benchmarking speed with magic number is pretty fast there in tests, yet i don't trust it for production code, again as doing tests is so limited always; so i'm just shifting there a tad to get the popcount in 64 bits

Nothing beats intuition there. Avoids me weeks of benchmarking with in the end the thing i cut'n pasted from the net there being the fastest way to do a Popcount anyway. I'm sure though that if you'd have SSE4 that might be faster, yet that means entire code needs to run in SSE4 so you can shake that as well.

Of course in AVX it's even faster than SSE4, yet are we sure we are on the same planet then? It means entire code must be in AVX.

What the gimps guys are doing there is total different btw, and a lot slower for me, yet again the important thing there is that i'm running this on AMD and not on intel - they use a small primebase that just fits in side the L1/L2 including the sieve. They don't go outside a 256KB working setsize,
whereas i'm filling it up until 1MB with a bigger primebase.

They're entirely 32 bits code, where i'm entirely 64 bits code. 64 bits is *considerable* faster there than 32 bits, except for storing the primes (can store a difference that just fits in 1 byte).

A bigger primebase has factor 2 less composites for sieving, yet this is all irrelevant here for the popcount, yet it shows you that we all fill up the caches to the limit always and then there is no room for big arrays doing just the popcount

Not sure how they get away with it, probably because they didn't measure accurately or they didn't care as they have got 70k machines anyway

What really matters is how the resulting program behaves using that specific instruction during the entire game.
There really is no other test than that.
Measuring that accurately is really tricky, as you try to measure something of around afew tenths of a second over a long period of time and the SMP has all sorts of nasty effects together with the OS onto the level caches.
It'll take days to measure that accurately i suspect.
Only then you really know what's faster!

Yet my guess is that it won't be involving with massive big arrays just to count a few bits...

Vincent

p.s. are you Iran-Turkish or Turkish-turkish?

lucasart · Post by **lucasart** » Thu May 24, 2012 1:03 pm

Engin wrote:hi Richard,
i dont use hw popcount for 32 bit, because my own pre-calculated 16bit lookup tables for popcount ist faster then using two 32bit hw popcount, only in 64 bit its faster and make sense to me.

here are my code look up tables

MY_INLINE int my_pop_cnt64 (const uint64_t a)
{
if(a) return (count_word[0xffff & a] +
count_word[0xffff & (a >> 16)] +
count_word[0xffff & (a >> 32)] +
count_word[0xffff & (a >> 48)] ) ;

return (0);
}

the count_word is a 16 bit (0...65535) pre-calculated array, and i do only count the 4x16 bits

My computer doesn't have SSE4 popcount instruction. Here's what I use:

=> for a sparsley populated bitboard (on average) which corresponds to most of my use cases

Code: Select all

unsigned count_bit(uint64_t b)
/* General purpose bit counting, using a simple loop */
{
	unsigned count = 0;
	while (b) {
		b &= b - 1;
		count++;
	}
	return count;
}

=> if the bitboard is not sparsley populated but has no more than 15 bits

Code: Select all

unsigned count_bit_max15(uint64_t b)
/* From Stockfish: count bits up to 15, without a loop */
{
	assert(count_bit(b) <= 15);
	b -= (b>>1) & 0x5555555555555555ULL;
	b = ((b>>2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
	b *= 0x1111111111111111ULL;
	return b >> 60;
}

=> else: that case never happens in practice... but in that case I suppose your lookup is the best choice

mar · Post by **mar** » Thu May 24, 2012 1:47 pm

Engin wrote: MY_INLINE int my_pop_cnt64 (const uint64_t a)
{
if(a) return (count_word[0xffff & a] +
count_word[0xffff & (a >> 16)] +
count_word[0xffff & (a >> 32)] +
count_word[0xffff & (a >> 48)] ) ;

return (0);
}

Hi Engin,

yes i also use a LUT in cheng. Also 64K table, 4 lookups.
Doing 8 256-byte lookups proved slower, and any other fancy formulas proved much slower (note i only tested in 32-bit mode).
I'm not sure but i believe only SSE4 popcnt was faster.
However even with popcnt the speedup was really negligible so i never enabled it and i believe it's useless. At my level (~2500 elo engine +- rating list elo normalization offset) certainly.
One more binary for how much? 1, 2, 5 or 10 elo at best perhaps? Not worth it IMO unless want to make some nerds happy

SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions

Re: SSE4 instructions