A simple change to Stockfish helps the MSC compiler

Dann Corbit · Post by **Dann Corbit** » Wed Mar 09, 2011 10:43 am

The Stockfish program was not using the 64 bit intrinsics when compiled with the Microsoft compiler. This version uses the 64 bit intrinsics:
http://cap.connx.com/chess-engines/new- ... ran2kdc.7z

mcostalba · Post by **mcostalba** » Wed Mar 09, 2011 12:20 pm

Dann Corbit wrote:The Stockfish program was not using the 64 bit intrinsics when compiled with the Microsoft compiler. This version uses the 64 bit intrinsics:
http://cap.connx.com/chess-engines/new- ... ran2kdc.7z

Is there one _documented_ reason why we should use MSVC intrinsics ?

Thanks
Marco

P.S: As a side note we would much better prefer a patch given in a diff format instead of full sources. If instead your post is addressed to users and not to SF developers then sorry for the noise and please ignore my reply.

Dann Corbit · Post by **Dann Corbit** » Wed Mar 09, 2011 8:35 pm

mcostalba wrote:
Dann Corbit wrote:The Stockfish program was not using the 64 bit intrinsics when compiled with the Microsoft compiler. This version uses the 64 bit intrinsics:
http://cap.connx.com/chess-engines/new- ... ran2kdc.7z
Is there one _documented_ reason why we should use MSVC intrinsics ?

Not that I know of. It does make for a faster version when *I* build it, but likely Jim Ablett's builds may still beat it.
The two files with a change of significance are bitboard.cpp and bitboard.h.
Here is documentation for the function call:
http://msdn.microsoft.com/en-us/library ... s.90).aspx

The code base that I modified is not the "stock" stockfish, but the fiddled version that uses Miguel's tablebase files and also has a change to the granularity used in searching and a few other things of that nature. I do not know if the other changes (besides tablebase) are even desirable.

Thanks
Marco

P.S: As a side note we would much better prefer a patch given in a diff format instead of full sources. If instead your post is addressed to users and not to SF developers then sorry for the noise and please ignore my reply.

My post was for the "stockfish wildcatters" mostly, but (of course) the regular stockfish team is welcome to examine, laugh at, and ignore my changes or even incorporate them if they so desire.

mcostalba · Post by **mcostalba** » Wed Mar 09, 2011 9:18 pm

Dann Corbit wrote:It does make for a faster version when *I* build it.

What do you mean with "faster version" ? Could you please post your test results (including how you measured it) ?

Thanks
Marco

Dann Corbit · Post by **Dann Corbit** » Wed Mar 09, 2011 9:33 pm

mcostalba wrote:
Dann Corbit wrote:It does make for a faster version when *I* build it.
What do you mean with "faster version" ? Could you please post your test results (including how you measured it) ?

Thanks
Marco

By "faster" I mean that NPS increases for all the searches. There is no other logical change that I measured. IOW, with only the intrinsic, the code runs faster for me.

For instance, for the root position of chess, a 22 ply search without the change gets 2908527 NPS on my machine and 3167383 NPS with the change.

It's a bit under 9% faster, which I find rather surprising. Of course it's probably 1-2 Elo improvement tops.

You may not see the same results, since I use PGO when doing builds and therefore the results are a little bit random.

I use the MS compiler. I do not know if the Intel compiler will have any benefit.

wgarvin · Post by **wgarvin** » Wed Mar 09, 2011 10:44 pm

I guess you are talking about this thing in bitboard.h?

Code: Select all

#elif defined&#40;_M_IA64 ) || defined (_M_X64 )
#include <intrin.h>

__forceinline Square first_1&#40;Bitboard b&#41; &#123;
	unsigned long index;
	_BitScanForward64&#40;&index, b&#41;;
	return &#40;Square&#41; index;
&#125;
__forceinline Square pop_1st_bit&#40;Bitboard* b&#41; &#123;
  const Square s = first_1&#40;*b&#41;;
  *b &= ~&#40;1ULL<<s&#41;;
  return s;
&#125;
#else // if !defined&#40;USE_BSFQ&#41;

I guess its not surprising that it is noticably faster than the fallback version (debruijn style):

Code: Select all

const int BitTable&#91;64&#93; = &#123;
  0, 1, 2, 7, 3, 13, 8, 19, 4, 25, 14, 28, 9, 34, 20, 40, 5, 17, 26, 38, 15,
  46, 29, 48, 10, 31, 35, 54, 21, 50, 41, 57, 63, 6, 12, 18, 24, 27, 33, 39,
  16, 37, 45, 47, 30, 53, 49, 56, 62, 11, 23, 32, 36, 44, 52, 55, 61, 22, 43,
  51, 60, 42, 59, 58
&#125;;

Square first_1&#40;Bitboard b&#41; &#123;
  return Square&#40;BitTable&#91;(&#40;b & -b&#41; * 0x218a392cd3d5dbfULL&#41; >> 58&#93;);
&#125;

Square pop_1st_bit&#40;Bitboard* b&#41; &#123;
  Bitboard bb = *b;
  *b &= (*b - 1&#41;;
  return Square&#40;BitTable&#91;(&#40;bb & -bb&#41; * 0x218a392cd3d5dbfULL&#41; >> 58&#93;);
&#125;

Edit: I would expect the Intel compiler to benefit similarly to the MS compiler. That file already contains a GCC asm version of first_1 and pop_1st_bit that uses a bsfq instruction, so I guess this change just helps the MS/Intel compilers keep up with GCC?

Dann Corbit · Post by **Dann Corbit** » Thu Mar 10, 2011 3:30 am

wgarvin wrote:I guess you are talking about this thing in bitboard.h?
Code: Select all
#elif defined&#40;_M_IA64 ) || defined (_M_X64 )
#include <intrin.h>

__forceinline Square first_1&#40;Bitboard b&#41; &#123;
	unsigned long index;
	_BitScanForward64&#40;&index, b&#41;;
	return &#40;Square&#41; index;
&#125;
__forceinline Square pop_1st_bit&#40;Bitboard* b&#41; &#123;
  const Square s = first_1&#40;*b&#41;;
  *b &= ~&#40;1ULL<<s&#41;;
  return s;
&#125;
#else // if !defined&#40;USE_BSFQ&#41;
I guess its not surprising that it is noticably faster than the fallback version (debruijn style):
Code: Select all
const int BitTable&#91;64&#93; = &#123;
  0, 1, 2, 7, 3, 13, 8, 19, 4, 25, 14, 28, 9, 34, 20, 40, 5, 17, 26, 38, 15,
  46, 29, 48, 10, 31, 35, 54, 21, 50, 41, 57, 63, 6, 12, 18, 24, 27, 33, 39,
  16, 37, 45, 47, 30, 53, 49, 56, 62, 11, 23, 32, 36, 44, 52, 55, 61, 22, 43,
  51, 60, 42, 59, 58
&#125;;

Square first_1&#40;Bitboard b&#41; &#123;
  return Square&#40;BitTable&#91;(&#40;b & -b&#41; * 0x218a392cd3d5dbfULL&#41; >> 58&#93;);
&#125;

Square pop_1st_bit&#40;Bitboard* b&#41; &#123;
  Bitboard bb = *b;
  *b &= (*b - 1&#41;;
  return Square&#40;BitTable&#91;(&#40;bb & -bb&#41; * 0x218a392cd3d5dbfULL&#41; >> 58&#93;);
&#125;
Edit: I would expect the Intel compiler to benefit similarly to the MS compiler. That file already contains a GCC asm version of first_1 and pop_1st_bit that uses a bsfq instruction, so I guess this change just helps the MS/Intel compilers keep up with GCC?

Yes. That is the change that I was referring to.

ernest · Post by **ernest** » Wed Mar 16, 2011 8:49 pm

Dann Corbit wrote:It's a bit under 9% faster, which I find rather surprising. Of course it's probably 1-2 Elo improvement tops.

Well, 9% faster is theoretically around +9 Elo...

Mincho Georgiev · Post by **Mincho Georgiev** » Wed Mar 16, 2011 10:27 pm

I'm experimenting with an MMX version of popcnt right now. The difference between mine implementation and the one from the AMD optimization guide is the inlining ability, while the AMD function manages own stack frame. So far I couldn't notice any speed difference on C2D cpu between these two and the integer implementation. Could anyone try and say if there is a difference for an engine that extensively is using popcnt and bitboards?

x86:

Code: Select all

static __forceinline unsigned int __stdcall popcnt&#40;unsigned __int64 v&#41;
&#123;
	static const __int64 c55 = 0x5555555555555555;
	static const __int64 c33 = 0x3333333333333333;
	static const __int64 c0f = 0x0f0f0f0f0f0f0f0f;
	
	__asm 
	&#123;  movd      mm0, dword ptr&#91;v&#93;
		punpckldq mm0, dword ptr&#91;v+4&#93;
		movq      mm1, mm0
		psrld     mm0, 1
		pand      mm0, &#91;c55&#93;
		psubd     mm1, mm0
		movq      mm0, mm1
		psrld     mm1, 2    
		pand      mm0, &#91;c33&#93;
		pand      mm1, &#91;c33&#93;
		paddd     mm0, mm1
		movq      mm1, mm0
		psrld     mm0, 4
		paddd     mm0, mm1
		pand      mm0, &#91;c0f&#93;
		pxor      mm1, mm1
		psadbw    mm0, mm1
		movd      eax, mm0
		emms
	&#125;
&#125;

x64:

Code: Select all

static __forceinline int popcnt&#40;unsigned __int64 v&#41;
&#123;	
	static const __int64 C55 = 0x5555555555555555;
	static const __int64 C33 = 0x3333333333333333;
	static const __int64 C0F = 0x0F0F0F0F0F0F0F0F;
	
	__asm 
	&#123;  mov       rax, qword ptr&#91;v&#93;
		movq      mm0, rax
		movq      mm1, mm0
		psrld     mm0, 1
		pand      mm0, &#91;C55&#93;
		psubd     mm1, mm0
		movq      mm0, mm1
		psrld     mm1, 2
		pand      mm0, &#91;C33&#93;
		pand      mm1, &#91;C33&#93;
		paddd     mm0, mm1
		movq      mm1, mm0
		psrld     mm0, 4
		paddd     mm0, mm1
		pand      mm0, &#91;C0F&#93;
		pxor      mm1, mm1
		psadbw    mm0, mm1
		movd      rax, mm0
		emms
	&#125;
&#125;

p.s. the 64 bit version could be optimized, this is just a quick draft to get it work.

Dann Corbit · Post by **Dann Corbit** » Wed Mar 16, 2011 11:35 pm

I am doing some popcnt experiments now.
Here is one C code fragment:

Code: Select all

//types and constants used in the functions below
typedef unsigned __int64 uint64;// assume this gives 64-bits
static const uint64    mx1 = 0x1111111111111111;        // binary&#58; 1000...
static const uint64    mx3 = 0x3333333333333333;        // binary&#58; 0011...
static const uint64    mx5 = 0x5555555555555555;        // binary&#58; 0101...
static const uint64    mx7 = 0x7777777777777777;        // binary&#58; 1110...
static const uint64    mxa = 0xAAAAAAAAAAAAAAAA;        // binary&#58; 1010...
static const uint64    mxc = 0xCCCCCCCCCCCCCCCC;        // binary&#58; 1100...
static const uint64    mxf = 0xffffffffffffffff;        // binary&#58; all ones

static const uint64    m04b = 0x0f0f0f0f0f0f0f0f;        // binary&#58;  4 zeros,  4 ones ...
static const uint64    mb04 = 0xF0F0F0F0F0F0F0F0;        // binary&#58;  4 ones,  4 zeros ...
static const uint64    m08b = 0x00ff00ff00ff00ff;        // binary&#58;  8 zeros,  8 ones ...
static const uint64    mb08 = 0xFF00FF00FF00FF00;        // binary&#58;  8 ones,  8 zeros ...
static const uint64    m16b = 0x0000ffff0000ffff;        // binary&#58; 16 zeros, 16 ones ...
static const uint64    mb16 = 0xFFFF0000FFFF0000;        // binary&#58; 16 ones, 16 zeros ...
static const uint64    m32b = 0x00000000ffffffff;        // binary&#58; 32 zeros, 32 ones
static const uint64    mb32 = 0xFFFFFFFF00000000;        // binary&#58; 32 ones, 32 zeros

static const uint64    mp = 0x0101010101010101;        // the sum of 256 to the power of 0,1,2,3... 10000000...

//This is a naive implementation, shown for comparison,
//and to help in understanding the better functions.
//It uses 24 arithmetic operations &#40;shift, add, and&#41;.
int             popcount_1&#40;uint64 x&#41;
&#123;
    x = &#40;x & mx5&#41; + (&#40;x >> 1&#41; & mx5&#41;;     // put count of each  2 bits into those  2 bits
    x = &#40;x & mx3&#41; + (&#40;x >> 2&#41; & mx3&#41;;     // put count of each  4 bits into those  4 bits
    x = &#40;x & m04b&#41; + (&#40;x >> 4&#41; & m04b&#41;;     // put count of each  8 bits into those  8 bits
    x = &#40;x & m08b&#41; + (&#40;x >> 8&#41; & m08b&#41;;     // put count of each 16 bits into those 16 bits
    x = &#40;x & m16b&#41; + (&#40;x >> 16&#41; & m16b&#41;;    // put count of each 32 bits into those 32 bits
    x = &#40;x & m32b&#41; + (&#40;x >> 32&#41; & m32b&#41;;    // put count of each 64 bits into those 64 bits
    return x;
&#125;

//This uses fewer arithmetic operations than any other known
//implementation on mxachines with slow multiplication.
//It uses 17 arithmetic operations.
int             popcount_2&#40;uint64 x&#41;
&#123;
    x -= &#40;x >> 1&#41; & mx5;         // put count of each 2 bits into those 2 bits
    x = &#40;x & mx3&#41; + (&#40;x >> 2&#41; & mx3&#41;;     // put count of each 4 bits into those 4 bits
    x = &#40;x + &#40;x >> 4&#41;) & m04b;    // put count of each 8 bits into those 8 bits
    x += x >> 8;                // put count of each 16 bits into their lowest 8 bits
    x += x >> 16;               // put count of each 32 bits into their lowest 8 bits
    x += x >> 32;               // put count of each 64 bits into their lowest 8 bits
    return x & 0x7f;
&#125;

//This uses fewer arithmetic operations than any other known
//implementation on mxachines with fast multiplication.
//It uses 12 arithmetic operations, one of which is a multiply.
int             popcount_3&#40;uint64 x&#41;
&#123;
    x -= &#40;x >> 1&#41; & mx5;         // put count of each 2 bits into those 2 bits
    x = &#40;x & mx3&#41; + (&#40;x >> 2&#41; & mx3&#41;;     // put count of each 4 bits into those 4 bits
    x = &#40;x + &#40;x >> 4&#41;) & m04b;    // put count of each 8 bits into those 8 bits
    return &#40;x * mp&#41; >> 56;      // returns left 8 bits of x + &#40;x<<8&#41; + &#40;x<<16&#41; + &#40;x<<24&#41; + ...
&#125;

//This is better when most bits in x are 0
//It uses 3 arithmetic operations and one comparison/branch per "1" bit in x.
int             popcount_4&#40;uint64 x&#41;
&#123;
    int          count;
    for &#40;count = 0; x; count++)
        x &= x - 1;
    return count;
&#125;

//This is better if most bits in x are 0.
//It uses 2 arithmetic operations and one comparison/branch  per "1" bit in x.
//It is the same as the previous function, but with the loop unrolled.
#define f&#40;y&#41; if (&#40;x &= x-1&#41; == 0&#41; return y;
int             popcount_5&#40;uint64 x&#41;
&#123;
    if &#40;x == 0&#41;
        return 0;
    f&#40;1&#41; f&#40;2&#41; f&#40;3&#41; f&#40;4&#41; f&#40;5&#41; f&#40;6&#41; f&#40;7&#41; f&#40;8&#41;
    f&#40;9&#41; f&#40;10&#41; f&#40;11&#41; f&#40;12&#41; f&#40;13&#41; f&#40;14&#41; f&#40;15&#41; f&#40;16&#41;
    f&#40;17&#41; f&#40;18&#41; f&#40;19&#41; f&#40;20&#41; f&#40;21&#41; f&#40;22&#41; f&#40;23&#41; f&#40;24&#41;
    f&#40;25&#41; f&#40;26&#41; f&#40;27&#41; f&#40;28&#41; f&#40;29&#41; f&#40;30&#41; f&#40;31&#41; f&#40;32&#41;
    f&#40;33&#41; f&#40;34&#41; f&#40;35&#41; f&#40;36&#41; f&#40;37&#41; f&#40;38&#41; f&#40;39&#41; f&#40;40&#41;
    f&#40;41&#41; f&#40;42&#41; f&#40;43&#41; f&#40;44&#41; f&#40;45&#41; f&#40;46&#41; f&#40;47&#41; f&#40;48&#41;
    f&#40;49&#41; f&#40;50&#41; f&#40;51&#41; f&#40;52&#41; f&#40;53&#41; f&#40;54&#41; f&#40;55&#41; f&#40;56&#41;
    f&#40;57&#41; f&#40;58&#41; f&#40;59&#41; f&#40;60&#41; f&#40;61&#41; f&#40;62&#41; f&#40;63&#41;
    return 64;
&#125;

int             popcount_6&#40;uint64 num&#41;
&#123;
    int             count = 0;
    while &#40;num&#41; &#123;
        num &= &#40;num - 1&#41;;
        count++;
    &#125;
    return count;
&#125;

unsigned        popcount_7&#40;uint64 v&#41;
&#123;
    uint64          u;
    u = v - (&#40;v >> 1&#41; & mx7&#41; - (&#40;v >> 2&#41; & mx3&#41; - (&#40;v >> 3&#41; & mx1&#41;;
    return (&#40;u + &#40;u >> 4&#41;) & m04b&#41; % 0xFF;        /* Modulus-255 */
&#125;

//Use this instead if most bits in x are 1 instead of 0
//It uses 2 arithmetic operations and one comparison/branch  per "0" bit in x.
//It uses the same loop unrolling trick as popcount_5.
#undef f
#define f&#40;y&#41; if (&#40;x |= x+1&#41; == mxf&#41; return 64-y;
int             popcount_8&#40;uint64 x&#41;
&#123;
    if &#40;x == 0&#41;
        return 0;
    f&#40;1&#41; f&#40;2&#41; f&#40;3&#41; f&#40;4&#41; f&#40;5&#41; f&#40;6&#41; f&#40;7&#41; f&#40;8&#41;
    f&#40;9&#41; f&#40;10&#41; f&#40;11&#41; f&#40;12&#41; f&#40;13&#41; f&#40;14&#41; f&#40;15&#41; f&#40;16&#41;
    f&#40;17&#41; f&#40;18&#41; f&#40;19&#41; f&#40;20&#41; f&#40;21&#41; f&#40;22&#41; f&#40;23&#41; f&#40;24&#41;
    f&#40;25&#41; f&#40;26&#41; f&#40;27&#41; f&#40;28&#41; f&#40;29&#41; f&#40;30&#41; f&#40;31&#41; f&#40;32&#41;
    f&#40;33&#41; f&#40;34&#41; f&#40;35&#41; f&#40;36&#41; f&#40;37&#41; f&#40;38&#41; f&#40;39&#41; f&#40;40&#41;
    f&#40;41&#41; f&#40;42&#41; f&#40;43&#41; f&#40;44&#41; f&#40;45&#41; f&#40;46&#41; f&#40;47&#41; f&#40;48&#41;
    f&#40;49&#41; f&#40;50&#41; f&#40;51&#41; f&#40;52&#41; f&#40;53&#41; f&#40;54&#41; f&#40;55&#41; f&#40;56&#41;
    f&#40;57&#41; f&#40;58&#41; f&#40;59&#41; f&#40;60&#41; f&#40;61&#41; f&#40;62&#41; f&#40;63&#41;
    return 64;
&#125;


unsigned        popcount_9&#40;uint64 v&#41;
&#123;
    uint64          even1 ;
    uint64          even2 ;
    uint64          even3 ;
    uint64          even4 ;
    uint64          even5 ;
    uint64          even6 ;
    uint64          odd1 ;
    uint64          odd2 ;
    uint64          odd3 ;
    uint64          odd4 ;
    uint64          odd5 ;
    uint64          odd6 ;
    uint64          sum2 ;
    uint64          sum4 ;
    uint64          sum6 ;
    uint64          sumx1 ;
    uint64          sumx3 ;
    uint64          sumx5 ;
    // popcount = sum of all 1-bits in v
    odd1 = v & mx5;
    even1 = v & mxa;
    even1 >>= 1;
    sumx1 = odd1 + even1;
    // popcount = sum of every 2-bit segments in sumx1, e.g. 0110 => 01 (=1&#41; + 10 (=2&#41; = 11 (=3&#41;
    odd2 = sumx1 & mx3;
    even2 = sumx1 & mxc;
    even2 >>= 2;
    sum2 = odd2 + even2;
    // popcount = sum of every 4-bit segments in sum2, but each 4-bit values is at most 4
    odd3 = sum2 & m04b;
    even3 = sum2 & mb04;
    even3 >>= 4;
    sumx3 = odd3 + even3;
    // popcount = sum of every 8-bit segments in sumx3, but each 8-bit values is at most 8
    odd4 = sumx3 & m08b;
    even4 = sumx3 & mb08;
    even4 >>= 8;
    sum4 = odd4 + even4;
    // popcount = sum of every 16-bit segments in sum4, but each 16-bit values is at most 16
    odd5 = sum4 & m16b;
    even5 = sum4 & mb16;
    even5 >>= 16;
    sumx5 = odd5 + even5;
    // popcount = sum of every 32-bit segments in sumx5, but each 32-bit values is at most 32
    odd6 = sumx5 & m32b;
    even6 = sumx5 & mb32;
    even6 >>= 32;
    sum6 = odd6 + even6;
    // Now sum6 is at most 64, which is the number of 1-bits
    return sum6;
&#125;

A simple change to Stockfish helps the MSC compiler

A simple change to Stockfish helps the MSC compiler

Re: A simple change to Stockfish helps the MSC compiler

Re: A simple change to Stockfish helps the MSC compiler

Re: A simple change to Stockfish helps the MSC compiler

Re: A simple change to Stockfish helps the MSC compiler

Re: A simple change to Stockfish helps the MSC compiler

Re: A simple change to Stockfish helps the MSC compiler

Re: A simple change to Stockfish helps the MSC compiler

Re: A simple change to Stockfish helps the MSC compiler

Re: A simple change to Stockfish helps the MSC compiler