New simple and fast bitboard move generator

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

New simple and fast bitboard move generator

Post by Michael Sherwin »

Code: Select all

u32 RookIndex(u32 sq, u64 occ) {
    return = indexN[sq][FirstBit(occ & bitsN[sq])]
           | indexE[sq][FirstBit(occ & bitsE[sq])]
           | indexS[sq][LastBit (occ & bitsS[sq])]
           | indexW[sq][LastBit (occ & bitsW[sq])];
}
Since, bsf/bsr is very fast on core 2 this should be very competitive, in both 32 bit mode and 64 bit mode.

A rook on a1 only needs 3 bits + 3 bits = 6 bits of index, to hold the bit placements for the two directions. A rook on e4 needs 2 + 2 + 2 + 2 = 8 bits for the four directions. Remember not to include the final destination edge square.

This has the huge advantage that no 64 bit multiplication needs to be done and no shift back into place as well. Also it requires smaller lookup tables!

The returned index can be used to look up the mobility as well as the attack set.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg »

Michael Sherwin wrote:

Code: Select all

u32 RookIndex(u32 sq, u64 occ) {
    return = indexN[sq][FirstBit(occ & bitsN[sq])]
           | indexE[sq][FirstBit(occ & bitsE[sq])]
           | indexS[sq][LastBit (occ & bitsS[sq])]
           | indexW[sq][LastBit (occ & bitsW[sq])];
}
Since, bsf/bsr is very fast on core 2 this should be very competitive, in both 32 bit mode and 64 bit mode.

A rook on a1 only needs 3 bits + 3 bits = 6 bits of index, to hold the bit placements for the two directions. A rook on e4 needs 2 + 2 + 2 + 2 = 8 bits for the four directions. Remember not to include the final destination edge square.

This has the huge advantage that no 64 bit multiplication needs to be done and no shift back into place as well. Also it requires smaller lookup tables!

The returned index can be used to look up the mobility as well as the attack set.
Nice idea - but i fear too expensive due to the nine memory reads. In total 11 64-bit instructions, four 1/2KByte lookups and four 16KByte lookups, plus the final lookup by index. Note that you have to deal with bitscanning empty sets. May be you can gain some parallel speedup due to four independent instruction chains until the final ors and lookup, but i doubt that due to the eight leading (four 16KByte) memory-reads. Bitscans have latency of two cycles on core 2 duo, reciprocal thoughput of 1 cycle, thus two "parallel" bitscans can compete in two cycles as well...

My favourite lookup-approach seems still kindergarten for rooks and magic for bishops. Kindergarten rook takes in total 13 instructions including one imul, and two cheap lookups. Magic bishop requires three cheap lookups (from one cacheline), and/imul/shift, final <38KByte lookup.

Btw. 32-bit versions of magic and kindergarten don't need the expensive 64-bit mul call, but two 32-bit imuls. Bitscanning eight 32-bit words with 32-bit bsf/bsr is also not really cheap considering empty sets...

Gerd
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg »

Assuming bsf/bsr doesn't change the target register if source is zero, and zero is never a found index here, a pre-initialization of the target-register is not necessary if it is already the source. To use only one cacheline for the occ-ands instead of four, i suggest to pack four bitboards into a struct (quad-bitboard) and to have an array of 64 quad-bitboards. Is your index unique over all squares or a char-range only per square? I tried it on the fly with 16-bit short indices, to get some assembly to inspect.

Code: Select all

u64 bits&#91;64&#93;&#91;4&#93;;
unsigned short indexN&#91;64&#93;&#91;64&#93;;
unsigned short indexE&#91;64&#93;&#91;64&#93;;
unsigned short indexS&#91;64&#93;&#91;64&#93;;
unsigned short indexW&#91;64&#93;&#91;64&#93;;
u64 rookLookup&#91;...&#93;;

__forceinline
u64 rookAttacks&#40;u64 occ, u32 sq&#41; &#123;
    u64 n = occ & bits&#91;sq&#93;&#91;0&#93;;
    u64 e = occ & bits&#91;sq&#93;&#91;1&#93;;
    u64 s = occ & bits&#91;sq&#93;&#91;2&#93;;
    u64 w = occ & bits&#91;sq&#93;&#91;3&#93;;
    _BitScanForward64&#40;&#40;unsigned long*)&n, n&#41;;
    _BitScanForward64&#40;&#40;unsigned long*)&e, e&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&s, s&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&w, w&#41;;
    u32 idx = indexN&#91;sq&#93;&#91;n&#93; | indexE&#91;sq&#93;&#91;e&#93; | indexS&#91;sq&#93;&#91;s&#93; | indexW&#91;sq&#93;&#91;w&#93;;
    return rookLookup&#91;idx&#93;;
&#125;


occ$ = 8
sq$ = 16
?rookAttacks@@YA_K_KI@Z PROC
  00000	4c 8d 1d 00 00
	00 00		 lea	 r11, OFFSET FLAT&#58;__ImageBase
  00007	4c 8b c9	 mov	 r9, rcx
  0000a	8b c2		 mov	 eax, edx
  0000c	48 c1 e0 05	 shl	 rax, 5
  00010	44 8b d2	 mov	 r10d, edx
  00013	4e 8b 84 18 00
	00 00 00	 mov	 r8, QWORD PTR ?bits&#91;rax+r11&#93;
  0001b	4a 8b 94 18 08
	00 00 00	 mov	 rdx, QWORD PTR ?bits&#91;rax+r11+8&#93;
  00023	49 c1 e2 06	 shl	 r10, 6
  00027	48 23 d1	 and	 rdx, rcx
  0002a	4c 23 c1	 and	 r8, rcx
  0002d	4a 8b 8c 18 10
	00 00 00	 mov	 rcx, QWORD PTR ?bits&#91;rax+r11+16&#93;
  00035	4a 8b 84 18 18
	00 00 00	 mov	 rax, QWORD PTR ?bits&#91;rax+r11+24&#93;
  0003d	49 23 c9	 and	 rcx, r9
  00040	49 23 c1	 and	 rax, r9
  00043	48 0f bd c9	 bsr	 rcx, rcx
  00047	4d 0f bc c8	 bsf	 r9, r8
  0004b	4c 0f bc c2	 bsf	 r8, rdx
  0004f	48 0f bd c0	 bsr	 rax, rax
  00053	49 03 c2	 add	 rax, r10
  00056	41 0f b7 94 43
	00 00 00 00	 movzx	 edx, WORD PTR ?indexW&#91;r11+rax*2&#93;
  0005f	49 8d 04 0a	 lea	 rax, QWORD PTR &#91;r10+rcx&#93;
  00063	4b 8d 0c 02	 lea	 rcx, QWORD PTR &#91;r10+r8&#93;
  00067	41 0f b7 8c 4b
	00 00 00 00	 movzx	 ecx, WORD PTR ?indexE&#91;r11+rcx*2&#93;
  00070	41 0f b7 84 43
	00 00 00 00	 movzx	 eax, WORD PTR ?indexS&#91;r11+rax*2&#93;
  00079	48 0b d0	 or	 rdx, rax
  0007c	48 0b d1	 or	 rdx, rcx
  0007f	4b 8d 0c 0a	 lea	 rcx, QWORD PTR &#91;r10+r9&#93;
  00083	41 0f b7 8c 4b
	00 00 00 00	 movzx	 ecx, WORD PTR ?indexN&#91;r11+rcx*2&#93;
  0008c	48 0b d1	 or	 rdx, rcx
  0008f	49 8b 84 d3 00
	00 00 00	 mov	 rax, QWORD PTR ?rookLookup&#91;r11+rdx*8&#93;
  00097	c3		 ret	 0
?rookAttacks@@YA_K_KI@Z ENDP
Quite modest register usage and the independent instruction chains until the middle of the routine. May be it is improvable - at least code size wise in writing assembly. But again, so far you need to transer six cachelines by nine memory reads, instaed of two in kindergarten-bitboards out of 4.5 KByte. It would be nice to combine the four index-arrays to one, since a found 6-inner blocking square, already implies the attack direction as well. But in the case of no blocking square is found by bsf/bsr, one needs the direction.

For comparison - kindergarten rook-attacks. Shifts by cl seems the bottleneck to probably perform better ipc - but one has to consider shadow-registers and OOOE. The one-dimensional firstRankAttacks2-array saves one C-instruction with explicit *4 scaling instead of implicit *8. The resulting code was shorter and even saved a register with vc2005. Unfortunately the similar trick doesn't work to save the shr 3 with a one-dimensional aFileAttacks[8*64] array.

Code: Select all

extern u64 CACHE_ALIGN aFileAttacks&#91;64&#93;&#91;8&#93;;     // 4 KByte
extern u8  CACHE_ALIGN firstRankAttacks2&#91;64*8&#93;; // 0.5 KByte


u64 rookAttacks&#40;u64 occ, u32 sq&#41; &#123;
    u32 k = sq &  7;
    u32 i = sq & 56;
    u32 n = &#40;u32&#41;&#40;occ >> i&#41;    & 2*63;
    u64 d = firstRankAttacks2&#91;4*n+k&#93;; // &#91;n/2&#93;&#91;k&#93;
    u64 e = 0x0101010101010101 & &#40;occ >> k&#41;;
    u64 r = 0x0080402010080400 *  e   >> 58;
    u64 garten  =   aFileAttacks&#91;r&#93;&#91;i >> 3&#93;;
    return       (  d << i&#41; | &#40;garten << k&#41;;    
&#125;

occ$ = 8
sq$ = 16
?rookAttacks@@YA_K_KI@Z PROC
  00000	44 8b c2	 mov	 r8d, edx
  00003	83 e2 38	 and	 edx, 56
  00006	4c 8b d1	 mov	 r10, rcx
  00009	44 8b ca	 mov	 r9d, edx
  0000c	49 8b d2	 mov	 rdx, r10
  0000f	41 83 e0 07	 and	 r8d, 7
  00013	41 8b c8	 mov	 ecx, r8d
  00016	48 b8 01 01 01
	01 01 01 01 01	 mov	 rax, 0101010101010101H
  00020	4c 8d 1d 00 00
	00 00		 lea	 r11, OFFSET FLAT&#58;__ImageBase
  00027	48 d3 ea	 shr	 rdx, cl
  0002a	48 23 d0	 and	 rdx, rax
  0002d	48 b8 00 04 08
	10 20 40 80 00	 mov	 rax, 0080402010080400H
  00037	48 0f af d0	 imul	 rdx, rax
  0003b	49 8b c1	 mov	 rax, r9
  0003e	48 c1 e8 03	 shr	 rax, 3
  00042	48 c1 ea 3a	 shr	 rdx, 58
  00046	48 8d 14 d0	 lea	 rdx, QWORD PTR &#91;rax+rdx*8&#93;
  0004a	49 8b 84 d3 00
	00 00 00	 mov	 rax, QWORD PTR ?aFileAttacks&#91;r11+rdx*8&#93;
  00052	48 d3 e0	 shl	 rax, cl
  00055	49 8b c9	 mov	 rcx, r9
  00058	49 d3 ea	 shr	 r10, cl
  0005b	41 83 e2 7e	 and	 r10d, 126
  0005f	43 8d 14 90	 lea	 edx, DWORD PTR &#91;r8+r10*4&#93;
  00063	42 0f b6 94 1a
	00 00 00 00	 movzx	 edx, BYTE PTR ?firstRankAttacks2&#91;rdx+r11&#93;
  0006c	48 d3 e2	 shl	 rdx, cl
  0006f	48 0b c2	 or	 rax, rdx
  00072	c3		 ret	 0
Gerd
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: New simple and fast bitboard move generator

Post by Michael Sherwin »

Is your index unique over all squares or a char-range only per square?
The address of the base of the subtable in which to look up the attack set (or mobility) is meant to be included in at least one of the four lookups. I thought that, that would be best to save a couple instructions. Also I do use structures to hold the information for each square in one place to best utilyze the cash lines. The code I posted was just meant to be simple. However, what I did not think of was breaking it down into smaller chains of instructions that might be better ran in paralell! From examining the assembly it looks as though it all depends on the drag of the memory reads vs that of the imul's and shifts. Thanks!
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg »

In assembly i would try something like this, to use a shared index array with preinitialized registers {0(implicit), 7, 56, 63} to distinguish between different directions if no inner blocking square is found by bitscan. That takes only max two cachelines (if properly 64-byte cacheline-aligned) for the four 2-byte index lookups:

Code: Select all

_rookAttacks PROC
  shl  ecx, 5 ; sq*64

  mov  rax, rdx ; occ
  mov  r8, rdx
  mov  r9, rdx
  mov  r10d, 7
  mov  r11d, 56

  and  rax, QWORD PTR bits&#91;rcx&#93;
  and  r8,  QWORD PTR bits&#91;rcx + 8&#93;
  and  r9,  QWORD PTR bits&#91;rcx + 16&#93;
  and  rdx, QWORD PTR bits&#91;rcx + 24&#93;

  add  ecx, ecx ; *2 assumes short lookop

  bsf  rax, rax
  bsf  r10,  r8
  mov  r8d, 63
  bsr  r11,  r9
  bsr  r8, rdx

  movzx eax,WORD PTR index&#91;rcx + 2*rax&#93; ; *2 assumes short lookop
  or    ax, WORD PTR index&#91;rcx + 2*r10&#93; 
  or    ax, WORD PTR index&#91;rcx + 2*r11&#93; 
  or    ax, WORD PTR index&#91;rcx + 2*r8&#93; 
  
  mov   rax, QWORD PTR rookLookup&#91;8*rax&#93;
  ret	0
_rookAttacks ENDP

But if i try this C-Code, the generated assembly unfortunately looks very ugly.

Code: Select all

u64 rookAttacks&#40;u64 occ, u32 sq&#41; &#123;
    unsigned long ln = 0, le = 7, ls = 56, lw = 63;
    _BitScanForward64&#40;&ln, occ & bits&#91;sq&#93;&#91;0&#93;);
    _BitScanForward64&#40;&le, occ & bits&#91;sq&#93;&#91;1&#93;);
    _BitScanReverse64&#40;&ls, occ & bits&#91;sq&#93;&#91;2&#93;);
    _BitScanReverse64&#40;&lw, occ & bits&#91;sq&#93;&#91;3&#93;);
    u32 idx = index&#91;sq&#93;&#91;ln&#93; | index&#91;sq&#93;&#91;le&#93; | index&#91;sq&#93;&#91;ls&#93; | index&#91;sq&#93;&#91;lw&#93;;
    return rookLookup&#91;idx&#93;;
&#125;
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: New simple and fast bitboard move generator

Post by Michael Sherwin »

Hi Gerd,

That assembly looks fantastic!

Thanks again! :D

Mike
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg »

I am still not quite sure whether it is guaranteed to use bitscan that way on a core 2 duo. For older processors intel explicitly stated that the destination register might be undefined after bsf/bsr with zero source, while Amd makes it quite clear:
AMD64 Architecture
Programmer’s Manual
Volume 3:
General-Purpose and System
Instructions
Page 74

BSF Bit Scan Forward

Searches the value in a register or a memory location (second operand) for the leastsignificant set bit. If a set bit is found, the instruction clears the zero flag (ZF) and stores the index of the least-significant set bit in a destination register (first operand). If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. The bit index is an unsigned offset from bit 0 of the searched value.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: New simple and fast bitboard move generator

Post by Michael Sherwin »

Gerd Isenberg wrote:I am still not quite sure whether it is guaranteed to use bitscan that way on a core 2 duo. For older processors intel explicitly stated that the destination register might be undefined after bsf/bsr with zero source, while Amd makes it quite clear:
AMD64 Architecture
Programmer’s Manual
Volume 3:
General-Purpose and System
Instructions
Page 74

BSF Bit Scan Forward

Searches the value in a register or a memory location (second operand) for the leastsignificant set bit. If a set bit is found, the instruction clears the zero flag (ZF) and stores the index of the least-significant set bit in a destination register (first operand). If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. The bit index is an unsigned offset from bit 0 of the searched value.
I would hope that Intel was a bit smarter this time round!
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: New simple and fast bitboard move generator

Post by Michael Sherwin »

Would this be any better?

Code: Select all

typedef struct &#123;
  u64 bitsN;
  u64 bitsE;
  u64 bitsS;
  u64 bitsW;
  u32 idxN&#91;64&#93;;
  u32 idxE&#91;64&#93;;
  u32 idxS&#91;64&#93;;
  u32 idxW&#91;64&#93;;
&#125; square;

u64 rookAttacks&#40;u64 occ, square *sq&#41; &#123;
    unsigned long ln = 0, le = 7, ls = 56, lw = 63; 
    _BitScanForward64&#40;&ln, occ & sq->bitsN&#41;; 
    _BitScanForward64&#40;&le, occ & sq->bitsE&#41;; 
    _BitScanReverse64&#40;&ls, occ & sq->bitsS&#41;; 
    _BitScanReverse64&#40;&lw, occ & sq->bitsW&#41;; 
    u32 idx = sq->idxN&#91;ln&#93; | sq->idxE&#91;le&#93; | sq->idxS&#91;ls&#93; | sq->idxW&#91;lw&#93;; 
    return rookLookup&#91;idx&#93;; 
Edit: Or rather.

Code: Select all

typedef struct &#123;
  u32 idxN;
  u32 idxE;
  u32 idxS;
  u32 idxW;
&#125; dir;

typedef struct &#123;
  u64 bitsN;
  u64 bitsE;
  u64 bitsS;
  u64 bitsW;
  dir idx&#91;64&#93;;
&#125; square;

u64 rookAttacks&#40;u64 occ, square *sq&#41; &#123;
    unsigned long ln = 0, le = 7, ls = 56, lw = 63; 
    _BitScanForward64&#40;&ln, occ & sq->bitsN&#41;; 
    _BitScanForward64&#40;&le, occ & sq->bitsE&#41;; 
    _BitScanReverse64&#40;&ls, occ & sq->bitsS&#41;; 
    _BitScanReverse64&#40;&lw, occ & sq->bitsW&#41;; 
    u32 idx = sq->idx&#91;ln&#93;.idxN | sq->idx&#91;le&#93;.idxE | sq->idx&#91;ls&#93;.idxS | sq->idx&#91;lw&#93;.idxW; 
    return rookLookup&#91;idx&#93;; 
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Tony

Re: New simple and fast bitboard move generator

Post by Tony »

Gerd Isenberg wrote:
Michael Sherwin wrote:

Code: Select all

u32 RookIndex&#40;u32 sq, u64 occ&#41; &#123;
    return = indexN&#91;sq&#93;&#91;FirstBit&#40;occ & bitsN&#91;sq&#93;)&#93;
           | indexE&#91;sq&#93;&#91;FirstBit&#40;occ & bitsE&#91;sq&#93;)&#93;
           | indexS&#91;sq&#93;&#91;LastBit &#40;occ & bitsS&#91;sq&#93;)&#93;
           | indexW&#91;sq&#93;&#91;LastBit &#40;occ & bitsW&#91;sq&#93;)&#93;;
&#125;
Since, bsf/bsr is very fast on core 2 this should be very competitive, in both 32 bit mode and 64 bit mode.

A rook on a1 only needs 3 bits + 3 bits = 6 bits of index, to hold the bit placements for the two directions. A rook on e4 needs 2 + 2 + 2 + 2 = 8 bits for the four directions. Remember not to include the final destination edge square.

This has the huge advantage that no 64 bit multiplication needs to be done and no shift back into place as well. Also it requires smaller lookup tables!

The returned index can be used to look up the mobility as well as the attack set.
Nice idea - but i fear too expensive due to the nine memory reads. In total 11 64-bit instructions, four 1/2KByte lookups and four 16KByte lookups, plus the final lookup by index. Note that you have to deal with bitscanning empty sets. May be you can gain some parallel speedup due to four independent instruction chains until the final ors and lookup, but i doubt that due to the eight leading (four 16KByte) memory-reads. Bitscans have latency of two cycles on core 2 duo, reciprocal thoughput of 1 cycle, thus two "parallel" bitscans can compete in two cycles as well...

My favourite lookup-approach seems still kindergarten for rooks and magic for bishops. Kindergarten rook takes in total 13 instructions including one imul, and two cheap lookups. Magic bishop requires three cheap lookups (from one cacheline), and/imul/shift, final <38KByte lookup.

Btw. 32-bit versions of magic and kindergarten don't need the expensive 64-bit mul call, but two 32-bit imuls. Bitscanning eight 32-bit words with 32-bit bsf/bsr is also not really cheap considering empty sets...

Gerd
How about something like:

Code: Select all


u32 RookIndex&#40;u32 sq, u64 occ&#41; &#123;
    return = indexN&#91;File&#40;sq&#41;&#93;&#91;Rank&#40;FirstBit&#40;occ & bitsN&#91;sq&#93;))&#93;<<&#40;Rank&#40;sq&#41;<<3&#41;
           | &#40;indexE&#91;Rank&#40;sq&#41;&#93;&#91;File&#40;FirstBit&#40;occ & bitsE&#91;sq&#93;))&#93;>>&#40;File&#40;sq&#41;) & rankMask&#40;Rank&#40;sq&#41;))
           .....

&#125;
The idea is that for a rook on e4 and a piece on e6, we find the bitboard of
e1 and e3 and shift it up.

I don't know if the smaller lookuptables outweigh the extra shifts.

EDIT: We could also shift for the file, treating everything as being on the a-File.


Tony