## Counting bits...

Discussion of chess software programming and technical issues.

Moderators: Harvey Williamson, bob, hgm

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

### Counting bits...

One of most time consuming parts in Stockfish is mobility evaluation. It is very slow because there is a lot of bit counting on many bitboards.

The best solution would be to use POPCNT, but it is not avaiable on every CPU and even the official x64 release binaries by Jim do not take advantage of it (although code is already there) mainly because of the lack of an i7 machine where to compile the POPCNT version of SF.

So bit counting is done with this software implementation:

Code: Select all

``````inline int count_1s_max_15&#40;Bitboard b&#41; &#123;
unsigned w = unsigned&#40;b >> 32&#41;, v = unsigned&#40;b&#41;;
v -= &#40;v >> 1&#41; & 0x55555555; // 0-2 in 2 bits
w -= &#40;w >> 1&#41; & 0x55555555;
v = (&#40;v >> 2&#41; & 0x33333333&#41; + &#40;v & 0x33333333&#41;; // 0-4 in 4 bits
w = (&#40;w >> 2&#41; & 0x33333333&#41; + &#40;w & 0x33333333&#41;;
v += w; // 0-8 in 4 bits
v *= 0x11111111;
return int&#40;v >> 28&#41;;
&#125;
``````
This is the case where the bits to count are maximum 15 (and is the common case for our needs).

Now the question. Instead of counting bitboard for each piece I was wondering if it was possible to count many bitboards in parallel, IOW translate the above version in a xmms form where we can count in parallel 2 bitboards or even better to use 2 xmms registers so to count up to 4 bitboards in parallel.

This would _greatly_ speed up SF, just as a reference the POPCNT version of SF is about 4% faster (I mean the whole engine, not just the bit counting routines).

Gerd, you are the master of xmms, do you think would be possible to parallelize the above routine ?

Thanks
Marco

Gerd Isenberg
Posts: 2208
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

### Re: Counting bits...

mcostalba wrote: Gerd, you are the master of xmms, do you think would be possible to parallelize the above routine ?

Thanks
Marco
Yes, that should be possible.

For counting multiple bitboards with the same weight you may also consider the odd/major-tick, i.e. 4 popcounts for 15 bitboards.

Gerd

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

### Re: Counting bits...

Gerd Isenberg wrote:
mcostalba wrote: Gerd, you are the master of xmms, do you think would be possible to parallelize the above routine ?

Thanks
Marco
Yes, that should be possible.

For counting multiple bitboards with the same weight you may also consider the odd/major-tick, i.e. 4 popcounts for 15 bitboards.

Gerd
Thank, I will look, bit weight is not a problem, I can weight after with a dedicated loop or even with a vectorial product <bit counts vector> X < weight vector >

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

### Re: Counting bits...

mcostalba wrote:
Thank, I will look.
I have looked, an interesting trick. Unfortunatly the weights are different for each piece type, so we cannot gain much form there. It is like to 'or' the two bishop bitboard togheter because normally squares are disjointed being on different colored squares so to spare one popcount, but this is just a second order improvment (IOW a little trick that probably does not deserve the code ugliness increase).

I would think the 'best' approach could be to parallelize as much popcount as possible and then vector multiply the array of the results by the array of the weights. This is a very general and clean solution IMHO.

Gerd Isenberg
Posts: 2208
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

### Re: Counting bits...

mcostalba wrote: I have looked, an interesting trick. Unfortunatly the weights are different for each piece type, so we cannot gain much form there. It is like to 'or' the two bishop bitboard togheter because normally squares are disjointed being on different colored squares so to spare one popcount, but this is just a second order improvment (IOW a little trick that probably does not deserve the code ugliness increase).

I would think the 'best' approach could be to parallelize as much popcount as possible and then vector multiply the array of the results by the array of the weights. This is a very general and clean solution IMHO.
It should be quite simple to write a similar routine you posted. For instance with intrinsics on the fly and not tested - it popcounts four bitboards in two xmmregs, and returns a vector of four bytes inside an int:

Code: Select all

``````U32 popCount4max15&#40;const U64 & bb&#91;4&#93;) &#123;
static const U64 MMX_ALIGN masks&#91;4&#93; = &#123;
C64&#40;0x5555555555555555&#41;,
C64&#40;0x5555555555555555&#41;,
C64&#40;0x3333333333333333&#41;,
C64&#40;0x3333333333333333&#41;
&#125;;

const __m128i* pM = &#40;const __m128i*) masks;
const __m128i* pb = &#40;const __m128i*) &bb;
__m128i v = pB&#91;0&#93;;
__m128i w = pB&#91;1&#93;;
v = _mm_sub_epi16&#40;v, _mm_and_si128&#40;_mm_srli_epi16&#40;v, 1&#41;, pM&#91;0&#93;));
w = _mm_sub_epi16&#40;w, _mm_and_si128&#40;_mm_srli_epi16&#40;w, 1&#41;, pM&#91;0&#93;));
_mm_and_si128&#40;v, pM&#91;1&#93;));
_mm_and_si128&#40;w, pM&#91;1&#93;));
v = _mm_sad_epu8 &#40;v, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
w = _mm_sad_epu8 &#40;w, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0

return  _mm_cvtsi128_si32&#40;v&#41;
+ (_mm_extract_epi16&#40;v, 4&#41; << 8&#41;
+ (_mm_cvtsi128_si32&#40;w&#41;    << 16&#41;
+ (_mm_extract_epi16&#40;w, 4&#41; << 24&#41;;
&#125;
``````
You may also write an own C++ wrapper to use usual C syntax...

Gerd Isenberg
Posts: 2208
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

### Re: Counting bits...

Gerd Isenberg wrote:
mcostalba wrote: I have looked, an interesting trick. Unfortunatly the weights are different for each piece type, so we cannot gain much form there. It is like to 'or' the two bishop bitboard togheter because normally squares are disjointed being on different colored squares so to spare one popcount, but this is just a second order improvment (IOW a little trick that probably does not deserve the code ugliness increase).

I would think the 'best' approach could be to parallelize as much popcount as possible and then vector multiply the array of the results by the array of the weights. This is a very general and clean solution IMHO.
It should be quite simple to write a similar routine you posted. For instance with intrinsics on the fly and not tested - it popcounts four bitboards in two xmmregs, and returns a vector of four bytes inside an int:

Code: Select all

``````U32 popCount4max15&#40;const U64 & bb&#91;4&#93;) &#123;
static const U64 MMX_ALIGN masks&#91;6&#93; = &#123;
C64&#40;0x5555555555555555&#41;,
C64&#40;0x5555555555555555&#41;,
C64&#40;0x3333333333333333&#41;,
C64&#40;0x3333333333333333&#41;,
C64&#40;0x0f0f0f0f0f0f0f0f&#41;,
C64&#40;0x0f0f0f0f0f0f0f0f&#41;
&#125;;

const __m128i* pM = &#40;const __m128i*) masks;
const __m128i* pb = &#40;const __m128i*) &bb;
__m128i v = pB&#91;0&#93;;
__m128i w = pB&#91;1&#93;;
v = _mm_sub_epi16&#40;v, _mm_and_si128&#40;_mm_srli_epi16&#40;v, 1&#41;, pM&#91;0&#93;));
w = _mm_sub_epi16&#40;w, _mm_and_si128&#40;_mm_srli_epi16&#40;w, 1&#41;, pM&#91;0&#93;));
_mm_and_si128&#40;v, pM&#91;1&#93;));
_mm_and_si128&#40;w, pM&#91;1&#93;));
v = _mm_sad_epu8 &#40;v, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
w = _mm_sad_epu8 &#40;w, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
return  _mm_cvtsi128_si32&#40;v&#41;
+ (_mm_extract_epi16&#40;v, 4&#41; << 8&#41;
+ (_mm_cvtsi128_si32&#40;w&#41;    << 16&#41;
+ (_mm_extract_epi16&#40;w, 4&#41; << 24&#41;;
&#125;
``````
You may also write an own C++ wrapper to use usual C syntax...
oups, I missed the nibble add, but you got the idea how to do...

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

### Re: Counting bits...

Gerd Isenberg wrote: oups, I missed the nibble add, but you got the idea how to do...
Hi Gerd,

I tried the following (there was a small compile error in yours)

Code: Select all

``````  #include <emmintrin.h>
#define XMM_ALIGN __declspec&#40;align&#40;16&#41;)
#define C64&#40;constantU64&#41; constantU64##ULL

int popCount4max15&#40;const Bitboard* bb&#41; &#123;

static const Bitboard XMM_ALIGN masks&#91;4&#93; = &#123;
C64&#40;0x5555555555555555&#41;,
C64&#40;0x5555555555555555&#41;,
C64&#40;0x3333333333333333&#41;,
C64&#40;0x3333333333333333&#41;
&#125;;

const __m128i* pM = &#40;const __m128i*) masks;
const __m128i* pb = &#40;const __m128i*) &bb;

__m128i v = pb&#91;0&#93;;
__m128i w = pb&#91;1&#93;;

v = _mm_sub_epi16&#40;v, _mm_and_si128&#40;_mm_srli_epi16&#40;v, 1&#41;, pM&#91;0&#93;));
w = _mm_sub_epi16&#40;w, _mm_and_si128&#40;_mm_srli_epi16&#40;w, 1&#41;, pM&#91;0&#93;));

v = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;v, 2&#41;, pM&#91;1&#93;), _mm_and_si128&#40;v, pM&#91;1&#93;));
w = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;w, 2&#41;, pM&#91;1&#93;), _mm_and_si128&#40;w, pM&#91;1&#93;));

v = _mm_sad_epu8 &#40;v, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
w = _mm_sad_epu8 &#40;w, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0

return  _mm_cvtsi128_si32&#40;v&#41;
+ (_mm_extract_epi16&#40;v, 4&#41; << 8&#41;
+ (_mm_cvtsi128_si32&#40;w&#41;    << 16&#41;
+ (_mm_extract_epi16&#40;w, 4&#41; << 24&#41;;
&#125;

void test&#40;)
&#123;
Bitboard bb&#91;4&#93;;
bb&#91;0&#93; = C64&#40;0x0&#41;;
bb&#91;1&#93; = C64&#40;0x1&#41;;
bb&#91;2&#93; = C64&#40;0x3&#41;;
bb&#91;3&#93; = C64&#40;0x7&#41;;

int res = popCount4max15&#40;bb&#41;;

std&#58;&#58;cout << res << std&#58;&#58;endl;
&#125;

``````

But results are not correct. Also the produced assembly I am not sure is so much cheaper then the standard version:

Code: Select all

``````  int popCount4max15&#40;const Bitboard* bb&#41; &#123;
004100D0  push        ebx
004100D1  mov         ebx,esp
004100D3  sub         esp,8
004100D6  and         esp,0FFFFFFF0h
004100DC  push        ebp
004100DD  mov         ebp,dword ptr &#91;ebx+4&#93;
004100E0  mov         dword ptr &#91;esp+4&#93;,ebp
004100E4  mov         ebp,esp
004100E6  sub         esp,150h
004100EC  mov         eax,dword ptr &#91;___security_cookie &#40;48E6DCh&#41;&#93;
004100F1  xor         eax,ebp
004100F3  mov         dword ptr &#91;ebp-4&#93;,eax

static const Bitboard XMM_ALIGN masks&#91;4&#93; = &#123;
C64&#40;0x5555555555555555&#41;,
C64&#40;0x5555555555555555&#41;,
C64&#40;0x3333333333333333&#41;,
C64&#40;0x3333333333333333&#41;
&#125;;

const __m128i* pM = &#40;const __m128i*) masks;
004100F6  mov         dword ptr &#91;ebp-8&#93;,offset masks &#40;47B7E0h&#41;
const __m128i* pb = &#40;const __m128i*) &bb;
004100FD  lea         eax,&#91;ebx+8&#93;
00410100  mov         dword ptr &#91;ebp-0Ch&#93;,eax

__m128i v = pb&#91;0&#93;;
00410103  mov         ecx,dword ptr &#91;ebp-0Ch&#93;
00410106  movdqa      xmm0,xmmword ptr &#91;ecx&#93;
0041010A  movdqa      xmmword ptr &#91;ebp-20h&#93;,xmm0
__m128i w = pb&#91;1&#93;;
0041010F  mov         edx,dword ptr &#91;ebp-0Ch&#93;
00410112  movdqa      xmm0,xmmword ptr &#91;edx+10h&#93;
00410117  movdqa      xmmword ptr &#91;ebp-30h&#93;,xmm0

v = _mm_sub_epi16&#40;v, _mm_and_si128&#40;_mm_srli_epi16&#40;v, 1&#41;, pM&#91;0&#93;));
0041011C  movdqa      xmm0,xmmword ptr &#91;ebp-20h&#93;
00410121  psrlw       xmm0,1
00410126  movdqa      xmmword ptr &#91;ebp-40h&#93;,xmm0
0041012B  mov         eax,dword ptr &#91;ebp-8&#93;
0041012E  movdqa      xmm0,xmmword ptr &#91;eax&#93;
00410132  movdqa      xmm1,xmmword ptr &#91;ebp-40h&#93;
00410137  pand        xmm1,xmm0
0041013B  movdqa      xmmword ptr &#91;ebp-50h&#93;,xmm1
00410140  movdqa      xmm0,xmmword ptr &#91;ebp-50h&#93;
00410145  movdqa      xmm1,xmmword ptr &#91;ebp-20h&#93;
0041014A  psubw       xmm1,xmm0
0041014E  movdqa      xmmword ptr &#91;ebp-60h&#93;,xmm1
00410153  movdqa      xmm0,xmmword ptr &#91;ebp-60h&#93;
00410158  movdqa      xmmword ptr &#91;ebp-20h&#93;,xmm0
w = _mm_sub_epi16&#40;w, _mm_and_si128&#40;_mm_srli_epi16&#40;w, 1&#41;, pM&#91;0&#93;));
0041015D  movdqa      xmm0,xmmword ptr &#91;ebp-30h&#93;
00410162  psrlw       xmm0,1
00410167  movdqa      xmmword ptr &#91;ebp-70h&#93;,xmm0
0041016C  mov         ecx,dword ptr &#91;ebp-8&#93;
0041016F  movdqa      xmm0,xmmword ptr &#91;ecx&#93;
00410173  movdqa      xmm1,xmmword ptr &#91;ebp-70h&#93;
00410178  pand        xmm1,xmm0
0041017C  movdqa      xmmword ptr &#91;ebp-80h&#93;,xmm1
00410181  movdqa      xmm0,xmmword ptr &#91;ebp-80h&#93;
00410186  movdqa      xmm1,xmmword ptr &#91;ebp-30h&#93;
0041018B  psubw       xmm1,xmm0
0041018F  movdqa      xmmword ptr &#91;ebp-90h&#93;,xmm1
00410197  movdqa      xmm0,xmmword ptr &#91;ebp-90h&#93;
0041019F  movdqa      xmmword ptr &#91;ebp-30h&#93;,xmm0

v = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;v, 2&#41;, pM&#91;1&#93;), _mm_and_si128&#40;v, pM&#91;1&#93;));
004101A4  mov         edx,dword ptr &#91;ebp-8&#93;
004101A7  movdqa      xmm0,xmmword ptr &#91;edx+10h&#93;
004101AC  movdqa      xmm1,xmmword ptr &#91;ebp-20h&#93;
004101B1  pand        xmm1,xmm0
004101B5  movdqa      xmmword ptr &#91;ebp-0A0h&#93;,xmm1
004101BD  movdqa      xmm0,xmmword ptr &#91;ebp-20h&#93;
004101C2  psrlw       xmm0,2
004101C7  movdqa      xmmword ptr &#91;ebp-0B0h&#93;,xmm0
004101CF  mov         eax,dword ptr &#91;ebp-8&#93;
004101D2  movdqa      xmm0,xmmword ptr &#91;eax+10h&#93;
004101D7  movdqa      xmm1,xmmword ptr &#91;ebp-0B0h&#93;
004101DF  pand        xmm1,xmm0
004101E3  movdqa      xmmword ptr &#91;ebp-0C0h&#93;,xmm1
004101EB  movdqa      xmm0,xmmword ptr &#91;ebp-0A0h&#93;
004101F3  movdqa      xmm1,xmmword ptr &#91;ebp-0C0h&#93;
004101FF  movdqa      xmmword ptr &#91;ebp-0D0h&#93;,xmm1
00410207  movdqa      xmm0,xmmword ptr &#91;ebp-0D0h&#93;
0041020F  movdqa      xmmword ptr &#91;ebp-20h&#93;,xmm0
w = _mm_add_epi16&#40;_mm_and_si128&#40;_mm_srli_epi16&#40;w, 2&#41;, pM&#91;1&#93;), _mm_and_si128&#40;w, pM&#91;1&#93;));
00410214  mov         ecx,dword ptr &#91;ebp-8&#93;
00410217  movdqa      xmm0,xmmword ptr &#91;ecx+10h&#93;
0041021C  movdqa      xmm1,xmmword ptr &#91;ebp-30h&#93;
00410221  pand        xmm1,xmm0
00410225  movdqa      xmmword ptr &#91;ebp-0E0h&#93;,xmm1
0041022D  movdqa      xmm0,xmmword ptr &#91;ebp-30h&#93;
00410232  psrlw       xmm0,2
00410237  movdqa      xmmword ptr &#91;ebp-0F0h&#93;,xmm0
0041023F  mov         edx,dword ptr &#91;ebp-8&#93;
00410242  movdqa      xmm0,xmmword ptr &#91;edx+10h&#93;
00410247  movdqa      xmm1,xmmword ptr &#91;ebp-0F0h&#93;
0041024F  pand        xmm1,xmm0
00410253  movdqa      xmmword ptr &#91;ebp-100h&#93;,xmm1
0041025B  movdqa      xmm0,xmmword ptr &#91;ebp-0E0h&#93;
00410263  movdqa      xmm1,xmmword ptr &#91;ebp-100h&#93;
0041026F  movdqa      xmmword ptr &#91;ebp-110h&#93;,xmm1
00410277  movdqa      xmm0,xmmword ptr &#91;ebp-110h&#93;
0041027F  movdqa      xmmword ptr &#91;ebp-30h&#93;,xmm0

v = _mm_sad_epu8 &#40;v, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
00410284  pxor        xmm0,xmm0
00410288  movdqa      xmmword ptr &#91;ebp-120h&#93;,xmm0
00410290  movdqa      xmm0,xmmword ptr &#91;ebp-120h&#93;
00410298  movdqa      xmm1,xmmword ptr &#91;ebp-20h&#93;
004102A1  movdqa      xmmword ptr &#91;ebp-130h&#93;,xmm1
004102A9  movdqa      xmm0,xmmword ptr &#91;ebp-130h&#93;
004102B1  movdqa      xmmword ptr &#91;ebp-20h&#93;,xmm0
w = _mm_sad_epu8 &#40;w, _mm_setzero_si128 ()); // sum bytes 15..8&#58;7..0
004102B6  pxor        xmm0,xmm0
004102BA  movdqa      xmmword ptr &#91;ebp-140h&#93;,xmm0
004102C2  movdqa      xmm0,xmmword ptr &#91;ebp-140h&#93;
004102CA  movdqa      xmm1,xmmword ptr &#91;ebp-30h&#93;
004102D3  movdqa      xmmword ptr &#91;ebp-150h&#93;,xmm1
004102DB  movdqa      xmm0,xmmword ptr &#91;ebp-150h&#93;
004102E3  movdqa      xmmword ptr &#91;ebp-30h&#93;,xmm0

return  _mm_cvtsi128_si32&#40;v&#41;
+ (_mm_extract_epi16&#40;v, 4&#41; << 8&#41;
+ (_mm_cvtsi128_si32&#40;w&#41;    << 16&#41;
+ (_mm_extract_epi16&#40;w, 4&#41; << 24&#41;;
004102E8  movdqa      xmm0,xmmword ptr &#91;ebp-20h&#93;
004102ED  movd        eax,xmm0
004102F1  movdqa      xmm0,xmmword ptr &#91;ebp-20h&#93;
004102F6  pextrw      ecx,xmm0,4
004102FB  shl         ecx,8
00410300  movdqa      xmm0,xmmword ptr &#91;ebp-30h&#93;
00410305  movd        edx,xmm0
00410309  shl         edx,10h
0041030E  movdqa      xmm0,xmmword ptr &#91;ebp-30h&#93;
00410313  pextrw      ecx,xmm0,4
00410318  shl         ecx,18h
&#125;
0041031D  mov         ecx,dword ptr &#91;ebp-4&#93;
00410320  xor         ecx,ebp
00410327  mov         esp,ebp
00410329  pop         ebp
0041032A  mov         esp,ebx
0041032C  pop         ebx
0041032D  ret
``````
[/code]

Gerd Isenberg
Posts: 2208
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

### Re: Counting bits...

mcostalba wrote:
Gerd Isenberg wrote: oups, I missed the nibble add, but you got the idea how to do...
Hi Gerd,

I tried the following (there was a small compile error in yours)
Sorry, I already said I missed the nibble count so don't take the routine literally, but as a starting point for your own "research". I actually have some stress at work and am too short in time, sorry.

The _mm_sad_epu8 performs the final byte add but no nibble add, so the v = (v + (v >> 4)) & 0x0f0f0f0f0f0f0f0f for a full popcount was missing. If the intrinsics produce that horrorful assembly, you may have a look into the Problem with functions not inlining thread, to probably disable exceptions (Configuration Properties > Code Generation > Enable C++ Exceptions), otherwise use inline asm.

Posts: 647
Joined: Mon Dec 15, 2008 10:45 am

### Re: Counting bits...

Hello Marco,

as i remember you are using magicBitboards, if not just
forget what i write now.

Because you mentioned, the most important "bitcounts" are
done for the mobility, you may think about to extend the
MagicTables with a precalculated bitCountValue.
So to say, simple lookup.

Only an idea, of course memory issues are important too....

Unfortunatelly i cannot help with the xmm stuff. good luck

cheers Michael

Gerd Isenberg
Posts: 2208
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

### Re: Counting bits...

mcostalba wrote: Hi Gerd,

I tried the following (there was a small compile error in yours)

But results are not correct. Also the produced assembly I am not sure is so much cheaper then the standard version.
I put a routine on the cpw sse2 page, not restricted to max 15. The 32-bit assembly looks like that, while your posted assembly looks more like from a debug build.

Code: Select all

``````PUBLIC	?popCount4@@YAHQB_K@Z
CONST	SEGMENT
DQ 5555555555555555H
DQ 5555555555555555H
DQ 3333333333333333H
DQ 3333333333333333H
DQ 0f0f0f0f0f0f0f0fH
DQ 0f0f0f0f0f0f0f0fH
; Function compile flags&#58; /Ogtpy
CONST	ENDS
_TEXT	SEGMENT
_bb\$ = 8
?popCount4@@YAHQB_K@Z PROC
push	ebp
mov	ebp, esp
and	esp, -16
mov	eax, DWORD PTR _bb\$&#91;ebp&#93;
movdqa	xmm0, XMMWORD PTR &#91;eax&#93;
movdqa	xmm3, XMMWORD PTR &#91;eax+16&#93;
movdqa	xmm1, xmm0
psrlq	xmm0, 1
pand	xmm0, xmm4
psubb	xmm1, xmm0
movdqa	xmm0, xmm1
movdqa	xmm2, xmm3
psrlq	xmm0, 2
psrlq	xmm3, 1
pand	xmm3, xmm4
psubb	xmm2, xmm3
pand	xmm1, xmm3
pand	xmm0, xmm3
movdqa	xmm1, xmm2
pand	xmm2, xmm3
psrlq	xmm1, 2
pand	xmm1, xmm3
movdqa	xmm2, xmm0
psrlq	xmm0, 4
movdqa	xmm0, xmm1
psrlq	xmm1, 4
pand	xmm2, xmm4
pxor	xmm3, xmm3