The Stockfish program was not using the 64 bit intrinsics when compiled with the Microsoft compiler. This version uses the 64 bit intrinsics:
http://cap.connx.com/chess-engines/new- ... ran2kdc.7z
A simple change to Stockfish helps the MSC compiler
Moderators: hgm, Rebel, chrisw
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: A simple change to Stockfish helps the MSC compiler
Is there one _documented_ reason why we should use MSVC intrinsics ?Dann Corbit wrote:The Stockfish program was not using the 64 bit intrinsics when compiled with the Microsoft compiler. This version uses the 64 bit intrinsics:
http://cap.connx.com/chess-engines/new- ... ran2kdc.7z
Thanks
Marco
P.S: As a side note we would much better prefer a patch given in a diff format instead of full sources. If instead your post is addressed to users and not to SF developers then sorry for the noise and please ignore my reply.
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: A simple change to Stockfish helps the MSC compiler
Not that I know of. It does make for a faster version when *I* build it, but likely Jim Ablett's builds may still beat it.mcostalba wrote:Is there one _documented_ reason why we should use MSVC intrinsics ?Dann Corbit wrote:The Stockfish program was not using the 64 bit intrinsics when compiled with the Microsoft compiler. This version uses the 64 bit intrinsics:
http://cap.connx.com/chess-engines/new- ... ran2kdc.7z
The two files with a change of significance are bitboard.cpp and bitboard.h.
Here is documentation for the function call:
http://msdn.microsoft.com/en-us/library ... s.90).aspx
The code base that I modified is not the "stock" stockfish, but the fiddled version that uses Miguel's tablebase files and also has a change to the granularity used in searching and a few other things of that nature. I do not know if the other changes (besides tablebase) are even desirable.
My post was for the "stockfish wildcatters" mostly, but (of course) the regular stockfish team is welcome to examine, laugh at, and ignore my changes or even incorporate them if they so desire.Thanks
Marco
P.S: As a side note we would much better prefer a patch given in a diff format instead of full sources. If instead your post is addressed to users and not to SF developers then sorry for the noise and please ignore my reply.
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: A simple change to Stockfish helps the MSC compiler
What do you mean with "faster version" ? Could you please post your test results (including how you measured it) ?Dann Corbit wrote:It does make for a faster version when *I* build it.
Thanks
Marco
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: A simple change to Stockfish helps the MSC compiler
By "faster" I mean that NPS increases for all the searches. There is no other logical change that I measured. IOW, with only the intrinsic, the code runs faster for me.mcostalba wrote:What do you mean with "faster version" ? Could you please post your test results (including how you measured it) ?Dann Corbit wrote:It does make for a faster version when *I* build it.
Thanks
Marco
For instance, for the root position of chess, a 22 ply search without the change gets 2908527 NPS on my machine and 3167383 NPS with the change.
It's a bit under 9% faster, which I find rather surprising. Of course it's probably 1-2 Elo improvement tops.
You may not see the same results, since I use PGO when doing builds and therefore the results are a little bit random.
I use the MS compiler. I do not know if the Intel compiler will have any benefit.
-
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: A simple change to Stockfish helps the MSC compiler
I guess you are talking about this thing in bitboard.h?
I guess its not surprising that it is noticably faster than the fallback version (debruijn style):
Edit: I would expect the Intel compiler to benefit similarly to the MS compiler. That file already contains a GCC asm version of first_1 and pop_1st_bit that uses a bsfq instruction, so I guess this change just helps the MS/Intel compilers keep up with GCC?
Code: Select all
#elif defined(_M_IA64 ) || defined (_M_X64 )
#include <intrin.h>
__forceinline Square first_1(Bitboard b) {
unsigned long index;
_BitScanForward64(&index, b);
return (Square) index;
}
__forceinline Square pop_1st_bit(Bitboard* b) {
const Square s = first_1(*b);
*b &= ~(1ULL<<s);
return s;
}
#else // if !defined(USE_BSFQ)
Code: Select all
const int BitTable[64] = {
0, 1, 2, 7, 3, 13, 8, 19, 4, 25, 14, 28, 9, 34, 20, 40, 5, 17, 26, 38, 15,
46, 29, 48, 10, 31, 35, 54, 21, 50, 41, 57, 63, 6, 12, 18, 24, 27, 33, 39,
16, 37, 45, 47, 30, 53, 49, 56, 62, 11, 23, 32, 36, 44, 52, 55, 61, 22, 43,
51, 60, 42, 59, 58
};
Square first_1(Bitboard b) {
return Square(BitTable[((b & -b) * 0x218a392cd3d5dbfULL) >> 58]);
}
Square pop_1st_bit(Bitboard* b) {
Bitboard bb = *b;
*b &= (*b - 1);
return Square(BitTable[((bb & -bb) * 0x218a392cd3d5dbfULL) >> 58]);
}
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: A simple change to Stockfish helps the MSC compiler
Yes. That is the change that I was referring to.wgarvin wrote:I guess you are talking about this thing in bitboard.h?
I guess its not surprising that it is noticably faster than the fallback version (debruijn style):Code: Select all
#elif defined(_M_IA64 ) || defined (_M_X64 ) #include <intrin.h> __forceinline Square first_1(Bitboard b) { unsigned long index; _BitScanForward64(&index, b); return (Square) index; } __forceinline Square pop_1st_bit(Bitboard* b) { const Square s = first_1(*b); *b &= ~(1ULL<<s); return s; } #else // if !defined(USE_BSFQ)
Edit: I would expect the Intel compiler to benefit similarly to the MS compiler. That file already contains a GCC asm version of first_1 and pop_1st_bit that uses a bsfq instruction, so I guess this change just helps the MS/Intel compilers keep up with GCC?Code: Select all
const int BitTable[64] = { 0, 1, 2, 7, 3, 13, 8, 19, 4, 25, 14, 28, 9, 34, 20, 40, 5, 17, 26, 38, 15, 46, 29, 48, 10, 31, 35, 54, 21, 50, 41, 57, 63, 6, 12, 18, 24, 27, 33, 39, 16, 37, 45, 47, 30, 53, 49, 56, 62, 11, 23, 32, 36, 44, 52, 55, 61, 22, 43, 51, 60, 42, 59, 58 }; Square first_1(Bitboard b) { return Square(BitTable[((b & -b) * 0x218a392cd3d5dbfULL) >> 58]); } Square pop_1st_bit(Bitboard* b) { Bitboard bb = *b; *b &= (*b - 1); return Square(BitTable[((bb & -bb) * 0x218a392cd3d5dbfULL) >> 58]); }
-
- Posts: 2041
- Joined: Wed Mar 08, 2006 8:30 pm
Re: A simple change to Stockfish helps the MSC compiler
Well, 9% faster is theoretically around +9 Elo...Dann Corbit wrote:It's a bit under 9% faster, which I find rather surprising. Of course it's probably 1-2 Elo improvement tops.
-
- Posts: 454
- Joined: Sat Apr 04, 2009 6:44 pm
- Location: Bulgaria
Re: A simple change to Stockfish helps the MSC compiler
I'm experimenting with an MMX version of popcnt right now. The difference between mine implementation and the one from the AMD optimization guide is the inlining ability, while the AMD function manages own stack frame. So far I couldn't notice any speed difference on C2D cpu between these two and the integer implementation. Could anyone try and say if there is a difference for an engine that extensively is using popcnt and bitboards?
x86:
x64:
p.s. the 64 bit version could be optimized, this is just a quick draft to get it work.
x86:
Code: Select all
static __forceinline unsigned int __stdcall popcnt(unsigned __int64 v)
{
static const __int64 c55 = 0x5555555555555555;
static const __int64 c33 = 0x3333333333333333;
static const __int64 c0f = 0x0f0f0f0f0f0f0f0f;
__asm
{ movd mm0, dword ptr[v]
punpckldq mm0, dword ptr[v+4]
movq mm1, mm0
psrld mm0, 1
pand mm0, [c55]
psubd mm1, mm0
movq mm0, mm1
psrld mm1, 2
pand mm0, [c33]
pand mm1, [c33]
paddd mm0, mm1
movq mm1, mm0
psrld mm0, 4
paddd mm0, mm1
pand mm0, [c0f]
pxor mm1, mm1
psadbw mm0, mm1
movd eax, mm0
emms
}
}
Code: Select all
static __forceinline int popcnt(unsigned __int64 v)
{
static const __int64 C55 = 0x5555555555555555;
static const __int64 C33 = 0x3333333333333333;
static const __int64 C0F = 0x0F0F0F0F0F0F0F0F;
__asm
{ mov rax, qword ptr[v]
movq mm0, rax
movq mm1, mm0
psrld mm0, 1
pand mm0, [C55]
psubd mm1, mm0
movq mm0, mm1
psrld mm1, 2
pand mm0, [C33]
pand mm1, [C33]
paddd mm0, mm1
movq mm1, mm0
psrld mm0, 4
paddd mm0, mm1
pand mm0, [C0F]
pxor mm1, mm1
psadbw mm0, mm1
movd rax, mm0
emms
}
}
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: A simple change to Stockfish helps the MSC compiler
I am doing some popcnt experiments now.
Here is one C code fragment:
Here is one C code fragment:
Code: Select all
//types and constants used in the functions below
typedef unsigned __int64 uint64;// assume this gives 64-bits
static const uint64 mx1 = 0x1111111111111111; // binary: 1000...
static const uint64 mx3 = 0x3333333333333333; // binary: 0011...
static const uint64 mx5 = 0x5555555555555555; // binary: 0101...
static const uint64 mx7 = 0x7777777777777777; // binary: 1110...
static const uint64 mxa = 0xAAAAAAAAAAAAAAAA; // binary: 1010...
static const uint64 mxc = 0xCCCCCCCCCCCCCCCC; // binary: 1100...
static const uint64 mxf = 0xffffffffffffffff; // binary: all ones
static const uint64 m04b = 0x0f0f0f0f0f0f0f0f; // binary: 4 zeros, 4 ones ...
static const uint64 mb04 = 0xF0F0F0F0F0F0F0F0; // binary: 4 ones, 4 zeros ...
static const uint64 m08b = 0x00ff00ff00ff00ff; // binary: 8 zeros, 8 ones ...
static const uint64 mb08 = 0xFF00FF00FF00FF00; // binary: 8 ones, 8 zeros ...
static const uint64 m16b = 0x0000ffff0000ffff; // binary: 16 zeros, 16 ones ...
static const uint64 mb16 = 0xFFFF0000FFFF0000; // binary: 16 ones, 16 zeros ...
static const uint64 m32b = 0x00000000ffffffff; // binary: 32 zeros, 32 ones
static const uint64 mb32 = 0xFFFFFFFF00000000; // binary: 32 ones, 32 zeros
static const uint64 mp = 0x0101010101010101; // the sum of 256 to the power of 0,1,2,3... 10000000...
//This is a naive implementation, shown for comparison,
//and to help in understanding the better functions.
//It uses 24 arithmetic operations (shift, add, and).
int popcount_1(uint64 x)
{
x = (x & mx5) + ((x >> 1) & mx5); // put count of each 2 bits into those 2 bits
x = (x & mx3) + ((x >> 2) & mx3); // put count of each 4 bits into those 4 bits
x = (x & m04b) + ((x >> 4) & m04b); // put count of each 8 bits into those 8 bits
x = (x & m08b) + ((x >> 8) & m08b); // put count of each 16 bits into those 16 bits
x = (x & m16b) + ((x >> 16) & m16b); // put count of each 32 bits into those 32 bits
x = (x & m32b) + ((x >> 32) & m32b); // put count of each 64 bits into those 64 bits
return x;
}
//This uses fewer arithmetic operations than any other known
//implementation on mxachines with slow multiplication.
//It uses 17 arithmetic operations.
int popcount_2(uint64 x)
{
x -= (x >> 1) & mx5; // put count of each 2 bits into those 2 bits
x = (x & mx3) + ((x >> 2) & mx3); // put count of each 4 bits into those 4 bits
x = (x + (x >> 4)) & m04b; // put count of each 8 bits into those 8 bits
x += x >> 8; // put count of each 16 bits into their lowest 8 bits
x += x >> 16; // put count of each 32 bits into their lowest 8 bits
x += x >> 32; // put count of each 64 bits into their lowest 8 bits
return x & 0x7f;
}
//This uses fewer arithmetic operations than any other known
//implementation on mxachines with fast multiplication.
//It uses 12 arithmetic operations, one of which is a multiply.
int popcount_3(uint64 x)
{
x -= (x >> 1) & mx5; // put count of each 2 bits into those 2 bits
x = (x & mx3) + ((x >> 2) & mx3); // put count of each 4 bits into those 4 bits
x = (x + (x >> 4)) & m04b; // put count of each 8 bits into those 8 bits
return (x * mp) >> 56; // returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
}
//This is better when most bits in x are 0
//It uses 3 arithmetic operations and one comparison/branch per "1" bit in x.
int popcount_4(uint64 x)
{
int count;
for (count = 0; x; count++)
x &= x - 1;
return count;
}
//This is better if most bits in x are 0.
//It uses 2 arithmetic operations and one comparison/branch per "1" bit in x.
//It is the same as the previous function, but with the loop unrolled.
#define f(y) if ((x &= x-1) == 0) return y;
int popcount_5(uint64 x)
{
if (x == 0)
return 0;
f(1) f(2) f(3) f(4) f(5) f(6) f(7) f(8)
f(9) f(10) f(11) f(12) f(13) f(14) f(15) f(16)
f(17) f(18) f(19) f(20) f(21) f(22) f(23) f(24)
f(25) f(26) f(27) f(28) f(29) f(30) f(31) f(32)
f(33) f(34) f(35) f(36) f(37) f(38) f(39) f(40)
f(41) f(42) f(43) f(44) f(45) f(46) f(47) f(48)
f(49) f(50) f(51) f(52) f(53) f(54) f(55) f(56)
f(57) f(58) f(59) f(60) f(61) f(62) f(63)
return 64;
}
int popcount_6(uint64 num)
{
int count = 0;
while (num) {
num &= (num - 1);
count++;
}
return count;
}
unsigned popcount_7(uint64 v)
{
uint64 u;
u = v - ((v >> 1) & mx7) - ((v >> 2) & mx3) - ((v >> 3) & mx1);
return ((u + (u >> 4)) & m04b) % 0xFF; /* Modulus-255 */
}
//Use this instead if most bits in x are 1 instead of 0
//It uses 2 arithmetic operations and one comparison/branch per "0" bit in x.
//It uses the same loop unrolling trick as popcount_5.
#undef f
#define f(y) if ((x |= x+1) == mxf) return 64-y;
int popcount_8(uint64 x)
{
if (x == 0)
return 0;
f(1) f(2) f(3) f(4) f(5) f(6) f(7) f(8)
f(9) f(10) f(11) f(12) f(13) f(14) f(15) f(16)
f(17) f(18) f(19) f(20) f(21) f(22) f(23) f(24)
f(25) f(26) f(27) f(28) f(29) f(30) f(31) f(32)
f(33) f(34) f(35) f(36) f(37) f(38) f(39) f(40)
f(41) f(42) f(43) f(44) f(45) f(46) f(47) f(48)
f(49) f(50) f(51) f(52) f(53) f(54) f(55) f(56)
f(57) f(58) f(59) f(60) f(61) f(62) f(63)
return 64;
}
unsigned popcount_9(uint64 v)
{
uint64 even1 ;
uint64 even2 ;
uint64 even3 ;
uint64 even4 ;
uint64 even5 ;
uint64 even6 ;
uint64 odd1 ;
uint64 odd2 ;
uint64 odd3 ;
uint64 odd4 ;
uint64 odd5 ;
uint64 odd6 ;
uint64 sum2 ;
uint64 sum4 ;
uint64 sum6 ;
uint64 sumx1 ;
uint64 sumx3 ;
uint64 sumx5 ;
// popcount = sum of all 1-bits in v
odd1 = v & mx5;
even1 = v & mxa;
even1 >>= 1;
sumx1 = odd1 + even1;
// popcount = sum of every 2-bit segments in sumx1, e.g. 0110 => 01 (=1) + 10 (=2) = 11 (=3)
odd2 = sumx1 & mx3;
even2 = sumx1 & mxc;
even2 >>= 2;
sum2 = odd2 + even2;
// popcount = sum of every 4-bit segments in sum2, but each 4-bit values is at most 4
odd3 = sum2 & m04b;
even3 = sum2 & mb04;
even3 >>= 4;
sumx3 = odd3 + even3;
// popcount = sum of every 8-bit segments in sumx3, but each 8-bit values is at most 8
odd4 = sumx3 & m08b;
even4 = sumx3 & mb08;
even4 >>= 8;
sum4 = odd4 + even4;
// popcount = sum of every 16-bit segments in sum4, but each 16-bit values is at most 16
odd5 = sum4 & m16b;
even5 = sum4 & mb16;
even5 >>= 16;
sumx5 = odd5 + even5;
// popcount = sum of every 32-bit segments in sumx5, but each 32-bit values is at most 32
odd6 = sumx5 & m32b;
even6 = sumx5 & mb32;
even6 >>= 32;
sum6 = odd6 + even6;
// Now sum6 is at most 64, which is the number of 1-bits
return sum6;
}