Counting bits...

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Counting bits...

Post by mcostalba »

One of most time consuming parts in Stockfish is mobility evaluation. It is very slow because there is a lot of bit counting on many bitboards.

The best solution would be to use POPCNT, but it is not avaiable on every CPU and even the official x64 release binaries by Jim do not take advantage of it (although code is already there) mainly because of the lack of an i7 machine where to compile the POPCNT version of SF.

So bit counting is done with this software implementation:

Code: Select all

inline int count_1s_max_15(Bitboard b) {
  unsigned w = unsigned(b >> 32), v = unsigned(b);
  v -= (v >> 1) & 0x55555555; // 0-2 in 2 bits
  w -= (w >> 1) & 0x55555555;
  v = ((v >> 2) & 0x33333333) + (v & 0x33333333); // 0-4 in 4 bits
  w = ((w >> 2) & 0x33333333) + (w & 0x33333333);
  v += w; // 0-8 in 4 bits
  v *= 0x11111111;
  return int(v >> 28);
}
This is the case where the bits to count are maximum 15 (and is the common case for our needs).

Now the question. Instead of counting bitboard for each piece I was wondering if it was possible to count many bitboards in parallel, IOW translate the above version in a xmms form where we can count in parallel 2 bitboards or even better to use 2 xmms registers so to count up to 4 bitboards in parallel.

This would _greatly_ speed up SF, just as a reference the POPCNT version of SF is about 4% faster (I mean the whole engine, not just the bit counting routines).

Gerd, you are the master of xmms, do you think would be possible to parallelize the above routine ?

Thanks
Marco
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Counting bits...

Post by Gerd Isenberg »

mcostalba wrote: Gerd, you are the master of xmms, do you think would be possible to parallelize the above routine ?

Thanks
Marco
Yes, that should be possible.

For counting multiple bitboards with the same weight you may also consider the odd/major-tick, i.e. 4 popcounts for 15 bitboards.

Gerd
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Counting bits...

Post by mcostalba »

Gerd Isenberg wrote:
mcostalba wrote: Gerd, you are the master of xmms, do you think would be possible to parallelize the above routine ?

Thanks
Marco
Yes, that should be possible.

For counting multiple bitboards with the same weight you may also consider the odd/major-tick, i.e. 4 popcounts for 15 bitboards.

Gerd
Thank, I will look, bit weight is not a problem, I can weight after with a dedicated loop or even with a vectorial product <bit counts vector> X < weight vector >
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Counting bits...

Post by mcostalba »

mcostalba wrote:
Thank, I will look.
I have looked, an interesting trick. Unfortunatly the weights are different for each piece type, so we cannot gain much form there. It is like to 'or' the two bishop bitboard togheter because normally squares are disjointed being on different colored squares so to spare one popcount, but this is just a second order improvment (IOW a little trick that probably does not deserve the code ugliness increase).

I would think the 'best' approach could be to parallelize as much popcount as possible and then vector multiply the array of the results by the array of the weights. This is a very general and clean solution IMHO.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Counting bits...

Post by Gerd Isenberg »

mcostalba wrote: I have looked, an interesting trick. Unfortunatly the weights are different for each piece type, so we cannot gain much form there. It is like to 'or' the two bishop bitboard togheter because normally squares are disjointed being on different colored squares so to spare one popcount, but this is just a second order improvment (IOW a little trick that probably does not deserve the code ugliness increase).

I would think the 'best' approach could be to parallelize as much popcount as possible and then vector multiply the array of the results by the array of the weights. This is a very general and clean solution IMHO.
It should be quite simple to write a similar routine you posted. For instance with intrinsics on the fly and not tested - it popcounts four bitboards in two xmmregs, and returns a vector of four bytes inside an int:

Code: Select all

U32 popCount4max15&#40;const U64 & bb&#91;4&#93;) &#123;
   static const U64 MMX_ALIGN masks&#91;4&#93; = &#123;
      C64&#40;0x5555555555555555&#41;, 
      C64&#40;0x5555555555555555&#41;, 
      C64&#40;0x3333333333333333&#41;,
      C64&#40;0x3333333333333333&#41;
   &#125;;

   const __m128i* pM = &#40;const __m128i*) masks;
   const __m128i* pb = &#40;const __m128i*) &bb;
   __m128i v = pB&#91;0&#93;;
   __m128i w = pB&#91;1&#93;;
   v = _mm_sub_epi16&#40;v, _mm_and_si128&#40;_mm_srli_epi16&#40;v, 1&#41;, pM&#91;0&#93;));
   w = _mm_sub_epi16&#40;w, _mm_and_si128&#40;_mm_srli_epi16&#40;w, 1&#41;, pM&#91;0&#93;));
   v = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;v, 2&#41;, pM&#91;1&#93;), 
                     _mm_and_si128&#40;v, pM&#91;1&#93;));
   w = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;w, 2&#41;, pM&#91;1&#93;), 
                     _mm_and_si128&#40;w, pM&#91;1&#93;));
   v = _mm_sad_epu8 &#40;v, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
   w = _mm_sad_epu8 &#40;w, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0

   return  _mm_cvtsi128_si32&#40;v&#41;
        + (_mm_extract_epi16&#40;v, 4&#41; << 8&#41;
        + (_mm_cvtsi128_si32&#40;w&#41;    << 16&#41;
        + (_mm_extract_epi16&#40;w, 4&#41; << 24&#41;;
&#125;
You may also write an own C++ wrapper to use usual C syntax...
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Counting bits...

Post by Gerd Isenberg »

Gerd Isenberg wrote:
mcostalba wrote: I have looked, an interesting trick. Unfortunatly the weights are different for each piece type, so we cannot gain much form there. It is like to 'or' the two bishop bitboard togheter because normally squares are disjointed being on different colored squares so to spare one popcount, but this is just a second order improvment (IOW a little trick that probably does not deserve the code ugliness increase).

I would think the 'best' approach could be to parallelize as much popcount as possible and then vector multiply the array of the results by the array of the weights. This is a very general and clean solution IMHO.
It should be quite simple to write a similar routine you posted. For instance with intrinsics on the fly and not tested - it popcounts four bitboards in two xmmregs, and returns a vector of four bytes inside an int:

Code: Select all

U32 popCount4max15&#40;const U64 & bb&#91;4&#93;) &#123;
   static const U64 MMX_ALIGN masks&#91;6&#93; = &#123;
      C64&#40;0x5555555555555555&#41;, 
      C64&#40;0x5555555555555555&#41;, 
      C64&#40;0x3333333333333333&#41;,
      C64&#40;0x3333333333333333&#41;,
      C64&#40;0x0f0f0f0f0f0f0f0f&#41;,
      C64&#40;0x0f0f0f0f0f0f0f0f&#41;
   &#125;;

   const __m128i* pM = &#40;const __m128i*) masks;
   const __m128i* pb = &#40;const __m128i*) &bb;
   __m128i v = pB&#91;0&#93;;
   __m128i w = pB&#91;1&#93;;
   v = _mm_sub_epi16&#40;v, _mm_and_si128&#40;_mm_srli_epi16&#40;v, 1&#41;, pM&#91;0&#93;));
   w = _mm_sub_epi16&#40;w, _mm_and_si128&#40;_mm_srli_epi16&#40;w, 1&#41;, pM&#91;0&#93;));
   v = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;v, 2&#41;, pM&#91;1&#93;), 
                     _mm_and_si128&#40;v, pM&#91;1&#93;));
   w = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;w, 2&#41;, pM&#91;1&#93;), 
                     _mm_and_si128&#40;w, pM&#91;1&#93;));
   v = _mm_sad_epu8 &#40;v, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
   w = _mm_sad_epu8 &#40;w, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
   .... add nibbles
   return  _mm_cvtsi128_si32&#40;v&#41;
        + (_mm_extract_epi16&#40;v, 4&#41; << 8&#41;
        + (_mm_cvtsi128_si32&#40;w&#41;    << 16&#41;
        + (_mm_extract_epi16&#40;w, 4&#41; << 24&#41;;
&#125;
You may also write an own C++ wrapper to use usual C syntax...
oups, I missed the nibble add, but you got the idea how to do...
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Counting bits...

Post by mcostalba »

Gerd Isenberg wrote: oups, I missed the nibble add, but you got the idea how to do...
Hi Gerd,

I tried the following (there was a small compile error in yours)

Code: Select all

  #include <emmintrin.h>
  #define XMM_ALIGN __declspec&#40;align&#40;16&#41;)
  #define C64&#40;constantU64&#41; constantU64##ULL

  int popCount4max15&#40;const Bitboard* bb&#41; &#123;

   static const Bitboard XMM_ALIGN masks&#91;4&#93; = &#123;
      C64&#40;0x5555555555555555&#41;,
      C64&#40;0x5555555555555555&#41;,
      C64&#40;0x3333333333333333&#41;,
      C64&#40;0x3333333333333333&#41;
   &#125;;

   const __m128i* pM = &#40;const __m128i*) masks;
   const __m128i* pb = &#40;const __m128i*) &bb;

   __m128i v = pb&#91;0&#93;;
   __m128i w = pb&#91;1&#93;;
   
   v = _mm_sub_epi16&#40;v, _mm_and_si128&#40;_mm_srli_epi16&#40;v, 1&#41;, pM&#91;0&#93;));
   w = _mm_sub_epi16&#40;w, _mm_and_si128&#40;_mm_srli_epi16&#40;w, 1&#41;, pM&#91;0&#93;));

   v = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;v, 2&#41;, pM&#91;1&#93;), _mm_and_si128&#40;v, pM&#91;1&#93;));
   w = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;w, 2&#41;, pM&#91;1&#93;), _mm_and_si128&#40;w, pM&#91;1&#93;));

   v = _mm_sad_epu8 &#40;v, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
   w = _mm_sad_epu8 &#40;w, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0

   return  _mm_cvtsi128_si32&#40;v&#41;
        + (_mm_extract_epi16&#40;v, 4&#41; << 8&#41;
        + (_mm_cvtsi128_si32&#40;w&#41;    << 16&#41;
        + (_mm_extract_epi16&#40;w, 4&#41; << 24&#41;;
  &#125;


  void test&#40;)
  &#123;
       Bitboard bb&#91;4&#93;;
       bb&#91;0&#93; = C64&#40;0x0&#41;;
       bb&#91;1&#93; = C64&#40;0x1&#41;;
       bb&#91;2&#93; = C64&#40;0x3&#41;;
       bb&#91;3&#93; = C64&#40;0x7&#41;;

       int res = popCount4max15&#40;bb&#41;;

       std&#58;&#58;cout << res << std&#58;&#58;endl;
  &#125;


But results are not correct. Also the produced assembly I am not sure is so much cheaper then the standard version:

Code: Select all

  int popCount4max15&#40;const Bitboard* bb&#41; &#123;
004100D0  push        ebx  
004100D1  mov         ebx,esp 
004100D3  sub         esp,8 
004100D6  and         esp,0FFFFFFF0h 
004100D9  add         esp,4 
004100DC  push        ebp  
004100DD  mov         ebp,dword ptr &#91;ebx+4&#93; 
004100E0  mov         dword ptr &#91;esp+4&#93;,ebp 
004100E4  mov         ebp,esp 
004100E6  sub         esp,150h 
004100EC  mov         eax,dword ptr &#91;___security_cookie &#40;48E6DCh&#41;&#93; 
004100F1  xor         eax,ebp 
004100F3  mov         dword ptr &#91;ebp-4&#93;,eax 

   static const Bitboard XMM_ALIGN masks&#91;4&#93; = &#123;
      C64&#40;0x5555555555555555&#41;,
      C64&#40;0x5555555555555555&#41;,
      C64&#40;0x3333333333333333&#41;,
      C64&#40;0x3333333333333333&#41;
   &#125;;

   const __m128i* pM = &#40;const __m128i*) masks;
004100F6  mov         dword ptr &#91;ebp-8&#93;,offset masks &#40;47B7E0h&#41; 
   const __m128i* pb = &#40;const __m128i*) &bb;
004100FD  lea         eax,&#91;ebx+8&#93; 
00410100  mov         dword ptr &#91;ebp-0Ch&#93;,eax 

   __m128i v = pb&#91;0&#93;;
00410103  mov         ecx,dword ptr &#91;ebp-0Ch&#93; 
00410106  movdqa      xmm0,xmmword ptr &#91;ecx&#93; 
0041010A  movdqa      xmmword ptr &#91;ebp-20h&#93;,xmm0 
   __m128i w = pb&#91;1&#93;;
0041010F  mov         edx,dword ptr &#91;ebp-0Ch&#93; 
00410112  movdqa      xmm0,xmmword ptr &#91;edx+10h&#93; 
00410117  movdqa      xmmword ptr &#91;ebp-30h&#93;,xmm0 
   
   v = _mm_sub_epi16&#40;v, _mm_and_si128&#40;_mm_srli_epi16&#40;v, 1&#41;, pM&#91;0&#93;));
0041011C  movdqa      xmm0,xmmword ptr &#91;ebp-20h&#93; 
00410121  psrlw       xmm0,1 
00410126  movdqa      xmmword ptr &#91;ebp-40h&#93;,xmm0 
0041012B  mov         eax,dword ptr &#91;ebp-8&#93; 
0041012E  movdqa      xmm0,xmmword ptr &#91;eax&#93; 
00410132  movdqa      xmm1,xmmword ptr &#91;ebp-40h&#93; 
00410137  pand        xmm1,xmm0 
0041013B  movdqa      xmmword ptr &#91;ebp-50h&#93;,xmm1 
00410140  movdqa      xmm0,xmmword ptr &#91;ebp-50h&#93; 
00410145  movdqa      xmm1,xmmword ptr &#91;ebp-20h&#93; 
0041014A  psubw       xmm1,xmm0 
0041014E  movdqa      xmmword ptr &#91;ebp-60h&#93;,xmm1 
00410153  movdqa      xmm0,xmmword ptr &#91;ebp-60h&#93; 
00410158  movdqa      xmmword ptr &#91;ebp-20h&#93;,xmm0 
   w = _mm_sub_epi16&#40;w, _mm_and_si128&#40;_mm_srli_epi16&#40;w, 1&#41;, pM&#91;0&#93;));
0041015D  movdqa      xmm0,xmmword ptr &#91;ebp-30h&#93; 
00410162  psrlw       xmm0,1 
00410167  movdqa      xmmword ptr &#91;ebp-70h&#93;,xmm0 
0041016C  mov         ecx,dword ptr &#91;ebp-8&#93; 
0041016F  movdqa      xmm0,xmmword ptr &#91;ecx&#93; 
00410173  movdqa      xmm1,xmmword ptr &#91;ebp-70h&#93; 
00410178  pand        xmm1,xmm0 
0041017C  movdqa      xmmword ptr &#91;ebp-80h&#93;,xmm1 
00410181  movdqa      xmm0,xmmword ptr &#91;ebp-80h&#93; 
00410186  movdqa      xmm1,xmmword ptr &#91;ebp-30h&#93; 
0041018B  psubw       xmm1,xmm0 
0041018F  movdqa      xmmword ptr &#91;ebp-90h&#93;,xmm1 
00410197  movdqa      xmm0,xmmword ptr &#91;ebp-90h&#93; 
0041019F  movdqa      xmmword ptr &#91;ebp-30h&#93;,xmm0 

   v = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;v, 2&#41;, pM&#91;1&#93;), _mm_and_si128&#40;v, pM&#91;1&#93;));
004101A4  mov         edx,dword ptr &#91;ebp-8&#93; 
004101A7  movdqa      xmm0,xmmword ptr &#91;edx+10h&#93; 
004101AC  movdqa      xmm1,xmmword ptr &#91;ebp-20h&#93; 
004101B1  pand        xmm1,xmm0 
004101B5  movdqa      xmmword ptr &#91;ebp-0A0h&#93;,xmm1 
004101BD  movdqa      xmm0,xmmword ptr &#91;ebp-20h&#93; 
004101C2  psrlw       xmm0,2 
004101C7  movdqa      xmmword ptr &#91;ebp-0B0h&#93;,xmm0 
004101CF  mov         eax,dword ptr &#91;ebp-8&#93; 
004101D2  movdqa      xmm0,xmmword ptr &#91;eax+10h&#93; 
004101D7  movdqa      xmm1,xmmword ptr &#91;ebp-0B0h&#93; 
004101DF  pand        xmm1,xmm0 
004101E3  movdqa      xmmword ptr &#91;ebp-0C0h&#93;,xmm1 
004101EB  movdqa      xmm0,xmmword ptr &#91;ebp-0A0h&#93; 
004101F3  movdqa      xmm1,xmmword ptr &#91;ebp-0C0h&#93; 
004101FB  paddw       xmm1,xmm0 
004101FF  movdqa      xmmword ptr &#91;ebp-0D0h&#93;,xmm1 
00410207  movdqa      xmm0,xmmword ptr &#91;ebp-0D0h&#93; 
0041020F  movdqa      xmmword ptr &#91;ebp-20h&#93;,xmm0 
   w = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;w, 2&#41;, pM&#91;1&#93;), _mm_and_si128&#40;w, pM&#91;1&#93;));
00410214  mov         ecx,dword ptr &#91;ebp-8&#93; 
00410217  movdqa      xmm0,xmmword ptr &#91;ecx+10h&#93; 
0041021C  movdqa      xmm1,xmmword ptr &#91;ebp-30h&#93; 
00410221  pand        xmm1,xmm0 
00410225  movdqa      xmmword ptr &#91;ebp-0E0h&#93;,xmm1 
0041022D  movdqa      xmm0,xmmword ptr &#91;ebp-30h&#93; 
00410232  psrlw       xmm0,2 
00410237  movdqa      xmmword ptr &#91;ebp-0F0h&#93;,xmm0 
0041023F  mov         edx,dword ptr &#91;ebp-8&#93; 
00410242  movdqa      xmm0,xmmword ptr &#91;edx+10h&#93; 
00410247  movdqa      xmm1,xmmword ptr &#91;ebp-0F0h&#93; 
0041024F  pand        xmm1,xmm0 
00410253  movdqa      xmmword ptr &#91;ebp-100h&#93;,xmm1 
0041025B  movdqa      xmm0,xmmword ptr &#91;ebp-0E0h&#93; 
00410263  movdqa      xmm1,xmmword ptr &#91;ebp-100h&#93; 
0041026B  paddw       xmm1,xmm0 
0041026F  movdqa      xmmword ptr &#91;ebp-110h&#93;,xmm1 
00410277  movdqa      xmm0,xmmword ptr &#91;ebp-110h&#93; 
0041027F  movdqa      xmmword ptr &#91;ebp-30h&#93;,xmm0 

   v = _mm_sad_epu8 &#40;v, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
00410284  pxor        xmm0,xmm0 
00410288  movdqa      xmmword ptr &#91;ebp-120h&#93;,xmm0 
00410290  movdqa      xmm0,xmmword ptr &#91;ebp-120h&#93; 
00410298  movdqa      xmm1,xmmword ptr &#91;ebp-20h&#93; 
0041029D  psadbw      xmm1,xmm0 
004102A1  movdqa      xmmword ptr &#91;ebp-130h&#93;,xmm1 
004102A9  movdqa      xmm0,xmmword ptr &#91;ebp-130h&#93; 
004102B1  movdqa      xmmword ptr &#91;ebp-20h&#93;,xmm0 
   w = _mm_sad_epu8 &#40;w, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
004102B6  pxor        xmm0,xmm0 
004102BA  movdqa      xmmword ptr &#91;ebp-140h&#93;,xmm0 
004102C2  movdqa      xmm0,xmmword ptr &#91;ebp-140h&#93; 
004102CA  movdqa      xmm1,xmmword ptr &#91;ebp-30h&#93; 
004102CF  psadbw      xmm1,xmm0 
004102D3  movdqa      xmmword ptr &#91;ebp-150h&#93;,xmm1 
004102DB  movdqa      xmm0,xmmword ptr &#91;ebp-150h&#93; 
004102E3  movdqa      xmmword ptr &#91;ebp-30h&#93;,xmm0 

   return  _mm_cvtsi128_si32&#40;v&#41;
        + (_mm_extract_epi16&#40;v, 4&#41; << 8&#41;
        + (_mm_cvtsi128_si32&#40;w&#41;    << 16&#41;
        + (_mm_extract_epi16&#40;w, 4&#41; << 24&#41;;
004102E8  movdqa      xmm0,xmmword ptr &#91;ebp-20h&#93; 
004102ED  movd        eax,xmm0 
004102F1  movdqa      xmm0,xmmword ptr &#91;ebp-20h&#93; 
004102F6  pextrw      ecx,xmm0,4 
004102FB  shl         ecx,8 
004102FE  add         eax,ecx 
00410300  movdqa      xmm0,xmmword ptr &#91;ebp-30h&#93; 
00410305  movd        edx,xmm0 
00410309  shl         edx,10h 
0041030C  add         eax,edx 
0041030E  movdqa      xmm0,xmmword ptr &#91;ebp-30h&#93; 
00410313  pextrw      ecx,xmm0,4 
00410318  shl         ecx,18h 
0041031B  add         eax,ecx 
  &#125;
0041031D  mov         ecx,dword ptr &#91;ebp-4&#93; 
00410320  xor         ecx,ebp 
00410322  call        __security_check_cookie &#40;460D30h&#41; 
00410327  mov         esp,ebp 
00410329  pop         ebp  
0041032A  mov         esp,ebx 
0041032C  pop         ebx  
0041032D  ret              
[/code]
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Counting bits...

Post by Gerd Isenberg »

mcostalba wrote:
Gerd Isenberg wrote: oups, I missed the nibble add, but you got the idea how to do...
Hi Gerd,

I tried the following (there was a small compile error in yours)
Sorry, I already said I missed the nibble count so don't take the routine literally, but as a starting point for your own "research". I actually have some stress at work and am too short in time, sorry.

The _mm_sad_epu8 performs the final byte add but no nibble add, so the v = (v + (v >> 4)) & 0x0f0f0f0f0f0f0f0f for a full popcount was missing. If the intrinsics produce that horrorful assembly, you may have a look into the Problem with functions not inlining thread, to probably disable exceptions (Configuration Properties > Code Generation > Enable C++ Exceptions), otherwise use inline asm.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Counting bits...

Post by Desperado »

Hello Marco,

as i remember you are using magicBitboards, if not just
forget what i write now.

Because you mentioned, the most important "bitcounts" are
done for the mobility, you may think about to extend the
MagicTables with a precalculated bitCountValue.
So to say, simple lookup.

Only an idea, of course memory issues are important too....

Unfortunatelly i cannot help with the xmm stuff. good luck

cheers Michael
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Counting bits...

Post by Gerd Isenberg »

mcostalba wrote: Hi Gerd,

I tried the following (there was a small compile error in yours)

But results are not correct. Also the produced assembly I am not sure is so much cheaper then the standard version.
I put a routine on the cpw sse2 page, not restricted to max 15. The 32-bit assembly looks like that, while your posted assembly looks more like from a debug build.

Code: Select all

PUBLIC	?popCount4@@YAHQB_K@Z
CONST	SEGMENT
?masks@?1 
	DQ 5555555555555555H
	DQ 5555555555555555H
	DQ 3333333333333333H
	DQ 3333333333333333H
	DQ 0f0f0f0f0f0f0f0fH
	DQ 0f0f0f0f0f0f0f0fH
; Function compile flags&#58; /Ogtpy
CONST	ENDS
_TEXT	SEGMENT
_bb$ = 8
?popCount4@@YAHQB_K@Z PROC
	push	ebp
	mov	ebp, esp
	and	esp, -16
	mov	eax, DWORD PTR _bb$&#91;ebp&#93;
	movdqa	xmm0, XMMWORD PTR &#91;eax&#93;
	movdqa	xmm3, XMMWORD PTR &#91;eax+16&#93;
	movdqa	xmm4, XMMWORD PTR ?masks@?1
	movdqa	xmm1, xmm0
	psrlq	xmm0, 1
	pand	xmm0, xmm4
	psubb	xmm1, xmm0
	movdqa	xmm0, xmm1
	movdqa	xmm2, xmm3
	psrlq	xmm0, 2
	psrlq	xmm3, 1
	pand	xmm3, xmm4
	movdqa	xmm4, XMMWORD PTR ?masks@?1+32
	psubb	xmm2, xmm3
	movdqa	xmm3, XMMWORD PTR ?masks@?1+16
	pand	xmm1, xmm3
	pand	xmm0, xmm3
	paddb	xmm0, xmm1
	movdqa	xmm1, xmm2
	pand	xmm2, xmm3
	psrlq	xmm1, 2
	pand	xmm1, xmm3
	paddb	xmm1, xmm2
	movdqa	xmm2, xmm0
	psrlq	xmm0, 4
	paddb	xmm2, xmm0
	movdqa	xmm0, xmm1
	psrlq	xmm1, 4
	paddb	xmm0, xmm1
	pand	xmm2, xmm4
	pxor	xmm3, xmm3
	psadbw	xmm2, xmm3
	movdqa	xmm1, xmm2
	pand	xmm0, xmm4
	psadbw	xmm0, xmm3
	punpckhwd xmm2, xmm0
	punpcklwd xmm1, xmm0
	psllq	xmm2, 8
	por	xmm1, xmm2
	movd	eax, xmm1
	mov	esp, ebp
	pop	ebp
	ret	0
?popCount4@@YAHQB_K@Z ENDP
Hope that helps ...