Square Occupation vs Mobility / Square Control

Discussion of chess software programming and technical issues.

Moderator: Ras

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Square Occupation vs Mobility / Square Control

Post by mcostalba »

bob wrote: You only have to count the bits every 100 calls or so due to the cache trick I mentioned.
I have done some statistical tests, trying to cache the whole mobility bitboard, one for each piece and color.

The whole bitboard is cached, not split into ranks and files, to avoid a double lookup (and possibly some shifts).

But the cache hit rate is far lower than the 99% you have in Crafty. Here we are at 65% on average, and it doesn't seem to change with search depth.

I have also tried a quick speed test just to see if the idea can bring something. There is actually a speed-up, but a very, very small one, around 0.5% or even less. So I would think it isn't worth the increased code complexity.

I have noted that the cache hit rates of queens and rooks are lower than those of knights and bishops, as one would expect.

BTW I have a simple 32-bit system, no i7 or other luxury stuff with POPCNT support, so I would think that on that hardware we could even see a performance loss using this trick.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Square Occupation vs Mobility / Square Control

Post by bob »

mcostalba wrote:
bob wrote: You only have to count the bits every 100 calls or so due to the cache trick I mentioned.
I have done some statistical tests, trying to cache the whole mobility bitboard, one for each piece and color.

The whole bitboard is cached, not split into ranks and files, to avoid a double lookup (and possibly some shifts).
I use magic bitboards, so I don't do this. I just save the rank/file contents, using a simple AND instruction to remove everything else.

But the cache hit rate is far lower than the 99% you have in Crafty. Here we are at 65% on average, and it doesn't seem to change with search depth.

I have also tried a quick speed test just to see if the idea can bring something. There is actually a speed-up, but a very, very small one, around 0.5% or even less. So I would think it isn't worth the increased code complexity.

I have noted that the cache hit rates of queens and rooks are lower than those of knights and bishops, as one would expect.

BTW I have a simple 32-bit system, no i7 or other luxury stuff with POPCNT support, so I would think that on that hardware we could even see a performance loss using this trick.
I run on a PIV 2.8GHz for most of my development, the rest being on my Core 2 Duo laptop. I can detect _no_ speed loss on either, whether I keep the current code or use the old "count the squares" code like most are doing. This is true on the 32-bit PIV, on my Core 2 laptop, and on the cluster. I don't yet have an i7 machine of my own, although I have tested on one, and enabling POPCNT on the i7 gave hardly a measurable speed increase.

If your overall cache hit rate is only 66 percent, something seems wrong. You have 7 pieces where you might count mobility. Successive calls to Evaluate() will likely move one of those pieces about 1/2 the time, and a pawn the rest of the time.

In any case, it is an effective idea, and weighting the values of squares you can reach is clearly better than just counting up the number of squares you can reach and stopping there. This made a significant difference in our overall Elo as we tested it on the cluster. The idea came from Tracy who implemented most of the code.
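In rough C, the cache trick could look something like this; the names, the weight scheme, and the use of __builtin_ctzll are illustrative assumptions, not the actual Crafty code:

```c
#include <stdint.h>

typedef uint64_t U64;

/* One cache entry per piece: the occupancy that determined the last
 * mobility score, plus the score itself.  Illustrative, not Crafty's code. */
typedef struct {
    U64 relevantOcc;   /* occupied squares on the piece's lines last time */
    int score;         /* mobility score computed from that occupancy */
    int valid;
} MobCache;

/* Weighted popcount: sum the weights of the set squares.
 * __builtin_ctzll is a GCC/Clang builtin (use _BitScanForward64 on MSVC). */
static int weightedPopcount(U64 bb, const unsigned char weights[64])
{
    int sum = 0;
    while (bb) {
        sum += weights[__builtin_ctzll(bb)];
        bb &= bb - 1;               /* clear the least significant set bit */
    }
    return sum;
}

/* Recompute the mobility score only when the relevant occupancy changed. */
int cachedMobility(MobCache *c, U64 relevantOcc, U64 attacks,
                   const unsigned char weights[64])
{
    if (c->valid && c->relevantOcc == relevantOcc)
        return c->score;            /* cache hit: occupancy unchanged */
    c->relevantOcc = relevantOcc;
    c->score = weightedPopcount(attacks, weights);
    c->valid = 1;
    return c->score;
}
```

In a real engine you would test the occupancy before generating the attack set, since skipping that generation is where the saving comes from; here attacks is passed in just to keep the sketch self-contained.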
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Square Occupation vs Mobility / Square Control

Post by mcostalba »

bob wrote: If your overall cache hit rate is only 66 percent, something seems wrong. You have 7 pieces where you might count mobility. Successive calls to Evaluate() will likely move one of those pieces about 1/2 the time, and a pawn the rest of the time.
Well, evaluate() is mostly called from qsearch, where you try mostly captures, and captures strongly affect the mobility of the pieces; quiet moves, where evaluation is actually called, are far rarer.

Anyhow, this is only a hypothesis, because I have not analyzed it; just some quick statistics.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Square Occupation vs Mobility / Square Control

Post by Gerd Isenberg »

mcostalba wrote:
bob wrote: You only have to count the bits every 100 calls or so due to the cache trick I mentioned.
I have done some statistical tests, trying to cache the whole mobility bitboard, one for each piece and color.

The whole bitboard is cached, not split into ranks and files, to avoid a double lookup (and possibly some shifts).

But the cache hit rate is far lower than the 99% you have in Crafty. Here we are at 65% on average, and it doesn't seem to change with search depth.

I have also tried a quick speed test just to see if the idea can bring something. There is actually a speed-up, but a very, very small one, around 0.5% or even less. So I would think it isn't worth the increased code complexity.

I have noted that the cache hit rates of queens and rooks are lower than those of knights and bishops, as one would expect.

BTW I have a simple 32-bit system, no i7 or other luxury stuff with POPCNT support, so I would think that on that hardware we could even see a performance loss using this trick.
For weights in the unsigned 0..63 range you may try the "rotated" SSE2 dot product. It works on x86(-64), in both 64-bit and 32-bit mode.

Code: Select all

/* for average weights < 64 */
int dotProduct64(U64 bb, BYTE weightsRot90[] /* XMM_ALIGN */)
{
   static const U64 CACHE_ALIGN masks[8] = {
      C64(0x0101010101010101), C64(0x0202020202020202),
      C64(0x0404040404040404), C64(0x0808080808080808),
      C64(0x1010101010101010), C64(0x2020202020202020),
      C64(0x4040404040404040), C64(0x8080808080808080),
   };
   __m128i x0, x1, x2, x3, zr;
   __m128i * pM = (__m128i*) masks;
   x0  =  _mm_loadl_epi64 ((__m128i*) &bb); /* 32- and 64-bit safe; _mm_cvtsi64x_si128(bb) on x64 only */
   x0  =  _mm_unpacklo_epi64 (x0, x0);
   zr  =  _mm_setzero_si128();
   x3  =  _mm_andnot_si128 (x0, pM[3]);
   x2  =  _mm_andnot_si128 (x0, pM[2]);
   x1  =  _mm_andnot_si128 (x0, pM[1]);
   x0  =  _mm_andnot_si128 (x0, pM[0]);
   x3  =  _mm_cmpeq_epi8   (x3, zr);
   x2  =  _mm_cmpeq_epi8   (x2, zr);
   x1  =  _mm_cmpeq_epi8   (x1, zr);
   x0  =  _mm_cmpeq_epi8   (x0, zr);
   // multiply by "and" with -1 or 0
   __m128i* pw = (__m128i*) weightsRot90;
   x3 = _mm_and_si128 (x3, pw[3]);
   x2 = _mm_and_si128 (x2, pw[2]);
   x1 = _mm_and_si128 (x1, pw[1]);
   x0 = _mm_and_si128 (x0, pw[0]);
   // add all bytes (with saturation)
   x3 = _mm_adds_epu8 (x3, x2);
   x0 = _mm_adds_epu8 (x0, x1);
   x0 = _mm_adds_epu8 (x0, x3);
   x0 = _mm_sad_epu8  (x0, zr);
   return _mm_cvtsi128_si32 (x0)
        + _mm_extract_epi16 (x0, 4);
}
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Square Occupation vs Mobility / Square Control

Post by mcostalba »

Gerd Isenberg wrote: For weights in the unsigned 0..63 range you may try the "rotated" SSE2 dot product. It works on x86(-64), in both 64-bit and 32-bit mode.
This sounds very interesting, thanks!

Could you please write the corresponding C code, just to give a clear idea of what it does?

Thanks
Marco
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Square Occupation vs Mobility / Square Control

Post by Gerd Isenberg »

mcostalba wrote:
Gerd Isenberg wrote: For weights in the unsigned 0..63 range you may try the "rotated" SSE2 dot product. It works on x86(-64), in both 64-bit and 32-bit mode.
This sounds very interesting, thanks!

Could you please write the corresponding C code, just to give a clear idea of what it does?

Thanks
Marco

Code: Select all

int dotProduct(U64 bb, BYTE weights[])
{ 
   U64 bit = 1ULL;
   int accu = 0;
   for (int sq=0; sq < 64; sq++, bit += bit)
      if ( bb & bit) accu += weights[sq];
   return accu;
}
The SSE2 routine expands each bit of the bitboard into a byte of a 90-degree rotated 64-byte vector, containing 0 or 255 for a 0- or 1-bit. It then masks this vector with the (pre-calculated) rotated weights, and finally adds all the bytes together (with saturation). I guess that on a recent Core 2, K10 or i7, with 3 SSE instructions per cycle, this is really fast despite reading two cache lines from L1: about 16 cycles or even less.
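For what it's worth, the rotation of the weights is a one-time precomputation. Reading the masks table, group g of 16 weight bytes is matched against bits 2g and 2g+1 of every byte of the bitboard, which implies a file-major layout for weightsRot90; that reading is my assumption, so treat this sketch accordingly:

```c
#include <stdint.h>

typedef uint64_t U64;

/* Rotate a square-indexed weight table (sq = rank*8 + file, a1 = 0)
 * into the file-major layout the SSE2 dot product expects:
 * weightsRot90[file*8 + rank] = weights[rank*8 + file]. */
void rotateWeights90(const unsigned char weights[64],
                     unsigned char weightsRot90[64])
{
   for (int sq = 0; sq < 64; sq++) {
      int rank = sq >> 3;
      int file = sq &  7;
      weightsRot90[file * 8 + rank] = weights[sq];
   }
}
```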
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Square Occupation vs Mobility / Square Control

Post by Gerd Isenberg »

Also, for MSVC 6 with the Processor Pack, in inline assembly:

Code: Select all

int dotProduct64(BitBoard bb, unsigned char *weightsRot90)
{
   static const BitBoard CACHE_ALIGN masks[8] = {
      0x0101010101010101, 0x0202020202020202,
      0x0404040404040404, 0x0808080808080808,
      0x1010101010101010, 0x2020202020202020,
      0x4040404040404040, 0x8080808080808080,
   };
   __asm {
      movq       xmm0, [bb]
      lea        edx,  [masks]
      mov        eax,  [weightsRot90]
      punpcklqdq xmm0, xmm0
      pxor       xmm4, xmm4	; zero
      movdqa     xmm1, xmm0
      movdqa     xmm2, xmm0
      movdqa     xmm3, xmm0
      pandn      xmm0, [edx+0*16]
      pandn      xmm1, [edx+1*16]
      pandn      xmm2, [edx+2*16]
      pandn      xmm3, [edx+3*16]
      pcmpeqb    xmm0, xmm4
      pcmpeqb    xmm1, xmm4
      pcmpeqb    xmm2, xmm4
      pcmpeqb    xmm3, xmm4
      pand       xmm0, [eax+0*16] ; and with weights
      pand       xmm1, [eax+1*16]
      pand       xmm2, [eax+2*16]
      pand       xmm3, [eax+3*16]
      paddusb    xmm0, xmm1 ; add all bytes (with saturation)
      paddusb    xmm0, xmm2
      paddusb    xmm0, xmm3
      psadbw     xmm0, xmm4 ; horizontal add 2 * 8 byte to word
      movd       edx, xmm0
      pextrw     eax, xmm0, 4
      add        eax, edx   ; final add; result is returned in eax
   }
}
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Square Occupation vs Mobility / Square Control

Post by bob »

mcostalba wrote:
bob wrote: If your overall cache hit rate is only 66 percent, something seems wrong. You have 7 pieces where you might count mobility. Successive calls to Evaluate() will likely move one of those pieces about 1/2 the time, and a pawn the rest of the time.
Well, evaluate() is mostly called from qsearch, where you try mostly captures, and captures strongly affect the mobility of the pieces; quiet moves, where evaluation is actually called, are far rarer.

Anyhow, this is only a hypothesis, because I have not analyzed it; just some quick statistics.
If you capture a piece, you do not change mobility, since the same square is occupied before/after the capture. All that changes is the source square. It is not likely this square changes the mobility for all pieces, or even most pieces.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Square Occupation vs Mobility / Square Control

Post by mcostalba »

bob wrote: If you capture a piece, you do not change mobility, since the same square is occupied before/after the capture. All that changes is the source square. It is not likely this square changes the mobility for all pieces, or even most pieces.
Due to "stand pat", the cached mobility bitboard is two moves away from the current position, not one.

For instance, you are in position P and make move m1, then call evaluate and cache the mobility there. The evaluation value then allows you to cut off, so you return immediately with a fail-high score, which is a fail low for that move at the parent node P.

From P the search will try another move m2, and there you call evaluate again; but the cached mobility values refer to position P+m1, while you are now at P+m2.

So if m1 and m2 are moves of different pieces, you end up with two changed mobilities out of six pieces, i.e. roughly 33%, which could explain why my hit rate is about 65%.

BTW I have great doubts that you are able to reach a 99% hit rate. Could you please measure it directly?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Square Occupation vs Mobility / Square Control

Post by bob »

mcostalba wrote:
bob wrote: If you capture a piece, you do not change mobility, since the same square is occupied before/after the capture. All that changes is the source square. It is not likely this square changes the mobility for all pieces, or even most pieces.
Due to "stand pat", the cached mobility bitboard is two moves away from the current position, not one.

For instance, you are in position P and make move m1, then call evaluate and cache the mobility there. The evaluation value then allows you to cut off, so you return immediately with a fail-high score, which is a fail low for that move at the parent node P.

From P the search will try another move m2, and there you call evaluate again; but the cached mobility values refer to position P+m1, while you are now at P+m2.

So if m1 and m2 are moves of different pieces, you end up with two changed mobilities out of six pieces, i.e. roughly 33%, which could explain why my hit rate is about 65%.

BTW I have great doubts that you are able to reach a 99% hit rate. Could you please measure it directly?
First, your "two moves away" is simply wrong. When I enter q-search, I compute a stand-pat score, then try a capture, compute a stand-pat score, try a capture, etc. There is always one intervening move, except when I back up a couple of plies and start going forward again.

I don't remember the 99% hit rate quote, since I have never measured it. What I do remember is that once we added this, in place of our old mobility code that just summed squares and used the sum to index a small table to compute mobility scores (like most engines, including Stockfish/Glaurung, do), the cost was absolutely immeasurable. Our NPS did not change one bit. Before we did the cache trick it was more expensive, as we had turned one popcnt() into four.

That's been my claim all along. It is better than just a "sum of reachable squares" approach, and it can be extended to as many different "square classes" as you want, all the way up to the Cray Blitz approach of 64 different values for the squares.
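As a sketch of the "square classes" idea with more than one class but fewer than 64 (the masks and weights below are made up for illustration, and __builtin_popcountll is a GCC/Clang builtin):

```c
#include <stdint.h>

typedef uint64_t U64;

/* Illustrative square classes and weights: the four center squares
 * count most, the ring around them a bit less, everything else least. */
enum { CENTER, RING, REST, NCLASSES };

static const U64 classMask[NCLASSES - 1] = {
    0x0000001818000000ULL,   /* CENTER: d4, e4, d5, e5 */
    0x00003C24243C0000ULL    /* RING:   c3..f6 frame around the center */
};
static const int classWeight[NCLASSES] = { 4, 2, 1 };

/* Weighted mobility: one popcount per square class. */
int classMobility(U64 attacks)
{
    U64 rest = attacks;
    int score = 0;
    for (int c = 0; c < NCLASSES - 1; c++) {
        score += classWeight[c] * __builtin_popcountll(attacks & classMask[c]);
        rest &= ~classMask[c];   /* whatever no class claimed */
    }
    return score + classWeight[REST] * __builtin_popcountll(rest);
}
```

One popcount per class is exactly how "one popcnt() becomes four": add classes and you add popcounts, which is what makes the cache worthwhile.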

When I have time, I will add code to count mobility cache hits vs. misses, aggregated over all pieces, to get a number. If I had to guess, I would expect 75%. Why? Because the cache had no cost: we now do 4x the work each time we actually compute mobility, yet the total cost did not increase at all, and a 75% hit rate is about the break-even point for that.

Note that when I talk about 75% hits, I mean that 3 of every 4 times I call (say) EvaluateRooks(), the mobility is unchanged from the last time it was computed, because the rank/file occupancy for that rook is unchanged. That actually sounds low, but I will try to measure it and see.
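The instrumentation itself is just a couple of counters; something like this (illustrative, not Crafty's code):

```c
/* Illustrative hit/miss counters, bumped wherever the mobility cache
 * is consulted, with the rate reported at the end of a search. */
static unsigned long mobHits, mobMisses;

void recordProbe(int wasHit)
{
   if (wasHit) mobHits++;
   else        mobMisses++;
}

double mobHitRate(void)   /* percentage of probes that hit */
{
   unsigned long total = mobHits + mobMisses;
   return total ? 100.0 * (double)mobHits / (double)total : 0.0;
}
```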