Kindergarten bitboards without multiplying

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Piotr Cichy
Posts: 75
Joined: Sun Jul 30, 2006 11:13 pm
Location: Kalisz, Poland

Kindergarten bitboards without multiplying

Post by Piotr Cichy »

In Kindergarten bitboards we use 64-bit multiplying and 64-bit shift for getting mask of 6 necessary bits from Occupied bitboard. This mask is then used as index for precalculated table.

For computing masks for square f we use something like:

Code: Select all

   #define FILE_A 0x0101010101010101
   #define FILE_B 0x0202020202020202
   #define D_C2H7 0x0080402010080400

   #define MaskV(f) int(((Occupied>>(f&7))&FILE_A) * D_C2H7 >> 58)   // mask for vertical moves
   #define MaskH(f) int((Occupied&Mask_H[f]) * FILE_B >> 58)         // mask for horizontal moves
   #define MaskD(f) int((Occupied&Mask_D[f]) * FILE_B >> 58)         // mask for diagonal moves
   #define MaskA(f) int((Occupied&Mask_A[f]) * FILE_B >> 58)         // mask for antidiagonal moves
These masks are used for generating attacks:

Code: Select all

   #define AttacksV&#40;f&#41;       MovesF&#91;f>>3&#93;&#91;MaskV&#40;f&#41;&#93;<<&#40;f&7&#41;
   #define AttacksH&#40;f&#41;       MovesR&#91;f&7 &#93;&#91;MaskH&#40;f&#41;&#93;&Mask_H&#91;f&#93;
   #define AttacksD&#40;f&#41;       MovesR&#91;f&7 &#93;&#91;MaskD&#40;f&#41;&#93;&Mask_D&#91;f&#93;
   #define AttacksA&#40;f&#41;       MovesR&#91;f&7 &#93;&#91;MaskA&#40;f&#41;&#93;&Mask_A&#91;f&#93;

   #define RookAttacks&#40;f&#41;    &#40;AttacksV&#40;f&#41;|AttacksH&#40;f&#41;)
   #define BishopAttacks&#40;f&#41;  &#40;AttacksD&#40;f&#41;|AttacksA&#40;f&#41;)
   #define QueenAttacks&#40;f&#41;   &#40;RookAttacks&#40;f&#41;|BishopAttacks&#40;f&#41;)

In 32-bit environment computing masks by 64-bit multiplying is rather slow, but we can replace multiplying with some 32-bit shift and bit or operations.

First, we can define functions PackV and PackH, which pack 64-bit bitboard into 8-bit bitboard, where Nth bit is set if Nth file (or rank) of original bitboard has any square set. From this 8-bit mask we get inner 6 bits.

Code: Select all

   inline int PackV&#40;U64 b&#41;
   &#123;
     int m=int&#40;b>>32&#41;;
     m=&#40;m<<4&#41;|int&#40;b&#41;;
     m|=m>>14;
     m|=m>>7;
     return &#40;m&126&#41;>>1;
   &#125;

   inline int PackH&#40;U64 b&#41;
   &#123;
     int m=int&#40;b>>32&#41;|int&#40;b&#41;;
     m|=m>>16;
     m|=m>>8;
     return &#40;m&126&#41;>>1;
   &#125;
Note, that function PackV works only for bitboards with bits only on file A!

Now we can modify computing masks into:

Code: Select all

   #define MaskV&#40;f&#41; PackV&#40;&#40;Occupied>>&#40;f&7&#41;)&FILE_A&#41;
   #define MaskH&#40;f&#41; PackH&#40;Occupied&Mask_H&#91;f&#93;))   
   #define MaskD&#40;f&#41; PackH&#40;Occupied&Mask_D&#91;f&#93;))
   #define MaskA&#40;f&#41; PackH&#40;Occupied&Mask_A&#91;f&#93;))
On my Athlon64 this method is almost 3x faster!

There is yet small further optimisation.
If we remove >>1 from functions PackV and PackH, we get the index multiplied by 2. By interleaving tables with even indexes for ranks and odd for files we can use then this index directly as index+1 for files and index for ranks. Note, that compiler should optimise "tab[index+1]" into "(tab+1)[index]" and (tab+1) is constant at compilation time.

So finally masks and attacks are computed in following way:

Code: Select all

   inline int PackV&#40;U64 b&#41;
   &#123;
     int m=int&#40;b>>32&#41;;
     m=&#40;m<<4&#41;|int&#40;b&#41;;
     m|=m>>14;
     m|=m>>7;
     return m&126;
   &#125;

   inline int PackH&#40;U64 b&#41;
   &#123;
     int m=int&#40;b>>32&#41;|int&#40;b&#41;;
     m|=m>>16;
     m|=m>>8;
     return m&126;
   &#125;

   #define MaskV&#40;f&#41; PackV&#40;&#40;Occupied>>&#40;f&7&#41;)&FILE_A&#41;+1
   #define MaskH&#40;f&#41; PackH&#40;Occupied&Mask_H&#91;f&#93;))   
   #define MaskD&#40;f&#41; PackH&#40;Occupied&Mask_D&#91;f&#93;))
   #define MaskA&#40;f&#41; PackH&#40;Occupied&Mask_A&#91;f&#93;))

   #define AttacksV&#40;f&#41;       MovesFR&#91;f>>3&#93;&#91;MaskV&#40;f&#41;&#93;<<&#40;f&7&#41;
   #define AttacksH&#40;f&#41;       MovesFR&#91;f&7 &#93;&#91;MaskH&#40;f&#41;&#93;&Mask_H&#91;f&#93;
   #define AttacksD&#40;f&#41;       MovesFR&#91;f&7 &#93;&#91;MaskD&#40;f&#41;&#93;&Mask_D&#91;f&#93;
   #define AttacksA&#40;f&#41;       MovesFR&#91;f&7 &#93;&#91;MaskA&#40;f&#41;&#93;&Mask_A&#91;f&#93;

   #define RookAttacks&#40;f&#41;    &#40;AttacksV&#40;f&#41;|AttacksH&#40;f&#41;)
   #define BishopAttacks&#40;f&#41;  &#40;AttacksD&#40;f&#41;|AttacksA&#40;f&#41;)
   #define QueenAttacks&#40;f&#41;   &#40;RookAttacks&#40;f&#41;|BishopAttacks&#40;f&#41;)
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

Hi Piotr,
congrats, nice trick to apply a 32-bit shift in PackV aka collapsedRanksIndex. Guess your speed assessment (almost 3x faster) refers to 32-bit mode, where the shift right 32 is a noop as already available inside a 32-bit register, compared to a 64-bit multiplication in 32-bit mode. I once compared the shifts with mul in 64-bit mode and found mul clearly faster. Do you measure cycles with RDTSC or a simple loop test?

Cheers,
Gerd
Piotr Cichy
Posts: 75
Joined: Sun Jul 30, 2006 11:13 pm
Location: Kalisz, Poland

Re: Kindergarten bitboards without multiplying

Post by Piotr Cichy »

I measured both the cycles of single (inlined) call to PackV/PackH
and the cycles of loop calling PackV/PackH.

For single call I got 29 (for mul) and 10 (for Pack).
For loops (100000000 calls) I got: 1 728 175 538 and 608 045 461.

The time was measured with simple use of RDTSC:

inline __int64 Ticks()
{
__int64 x;
_asm {
rdtsc
mov dword ptr x+4,edx
mov dword ptr x,eax
}
return x;
}

I don't know how exact this is, but should be much better than GetTickCount() of Windows API.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Kindergarten bitboards without multiplying

Post by Gian-Carlo Pascutto »

You should use QueryPerformanceCounter. Your code is broken on SMP machines.

In any case there is no need to do 64 bit multiplies with Kindergarten.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Kindergarten bitboards without multiplying

Post by wgarvin »

I wonder if anyone has tried the PMOVMSKB instruction for files? It should be a pretty fast way to collect the bits. You could PCMPEQB with zero, and then PMOVMSKB. There's some additional cost to get it into an XMM register, but the result goes into a general-purpose register.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

Gian-Carlo Pascutto wrote: In any case there is no need to do 64 bit multiplies with Kindergarten.
Sure, imul64 on 32-bit is a call with three imuls. But Piotr's proposal seems even a faster solution than a dedicated 2 times 32-bit mul approach.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

Piotr Cichy wrote:I measured both the cycles of single (inlined) call to PackV/PackH
and the cycles of loop calling PackV/PackH.

For single call I got 29 (for mul) and 10 (for Pack).
For loops (100000000 calls) I got: 1 728 175 538 and 608 045 461.

The time was measured with simple use of RDTSC:

inline __int64 Ticks()
{
__int64 x;
_asm {
rdtsc
mov dword ptr x+4,edx
mov dword ptr x,eax
}
return x;
}

I don't know how exact this is, but should be much better than GetTickCount() of Windows API.
Sure, GetTickCount() resolution is far to coarse for that purpose, but you may try cpuid with eax zero before rdtsc, to make sure it doesn't executed out of order.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

wgarvin wrote:I wonder if anyone has tried the PMOVMSKB instruction for files? It should be a pretty fast way to collect the bits. You could PCMPEQB with zero, and then PMOVMSKB. There's some additional cost to get it into an XMM register, but the result goes into a general-purpose register.
Yes, I think on core2 and K10 a PMOVMSKB approach is very competitive.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Kindergarten bitboards without multiplying

Post by Gian-Carlo Pascutto »

Gerd Isenberg wrote:
Gian-Carlo Pascutto wrote: In any case there is no need to do 64 bit multiplies with Kindergarten.
Sure, imul64 on 32-bit is a call with three imuls. But Piotr's proposal seems even a faster solution than a dedicated 2 times 32-bit mul approach.
Ah, really? I would have expected that the 32 bit folded kindergarten (1 mul per direction) would be faster than a series of dependent shifts.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Kindergarten bitboards without multiplying

Post by Gian-Carlo Pascutto »

Gerd Isenberg wrote: Sure, GetTickCount() resolution is far to coarse for that purpose, but you may try cpuid with eax zero before rdtsc, to make sure it doesn't executed out of order.
Still doesn't guarantee it stays on the same CPU. Again: use QueryPerformanceCounter or get burned.