New simple and fast bitboard move generator

Discussion of chess software programming and technical issues.

Moderators: hgm, Harvey Williamson, bob

Michael Sherwin
Posts: 2798
Joined: Fri May 26, 2006 1:00 am
Location: OH, USA

Re: New simple and fast bitboard move generator

Post by Michael Sherwin » Sun May 06, 2007 1:21 pm

Hi Gerd,

That assembly looks fantastic!

Thanks again! :D

Mike
Regards,
Mike

Gerd Isenberg
Posts: 2103
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg » Sun May 06, 2007 1:52 pm

I am still not quite sure whether it is guaranteed to use bitscan that way on a core 2 duo. For older processors intel explicitly stated that the destination register might be undefined after bsf/bsr with zero source, while Amd makes it quite clear:
AMD64 Architecture
Programmer’s Manual
Volume 3:
General-Purpose and System
Instructions
Page 74

BSF Bit Scan Forward

Searches the value in a register or a memory location (second operand) for the leastsignificant set bit. If a set bit is found, the instruction clears the zero flag (ZF) and stores the index of the least-significant set bit in a destination register (first operand). If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. The bit index is an unsigned offset from bit 0 of the searched value.

Michael Sherwin
Posts: 2798
Joined: Fri May 26, 2006 1:00 am
Location: OH, USA

Re: New simple and fast bitboard move generator

Post by Michael Sherwin » Sun May 06, 2007 3:15 pm

Gerd Isenberg wrote:I am still not quite sure whether it is guaranteed to use bitscan that way on a core 2 duo. For older processors intel explicitly stated that the destination register might be undefined after bsf/bsr with zero source, while Amd makes it quite clear:
AMD64 Architecture
Programmer’s Manual
Volume 3:
General-Purpose and System
Instructions
Page 74

BSF Bit Scan Forward

Searches the value in a register or a memory location (second operand) for the leastsignificant set bit. If a set bit is found, the instruction clears the zero flag (ZF) and stores the index of the least-significant set bit in a destination register (first operand). If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. The bit index is an unsigned offset from bit 0 of the searched value.
I would hope that Intel was a bit smarter this time round!
Regards,
Mike

Michael Sherwin
Posts: 2798
Joined: Fri May 26, 2006 1:00 am
Location: OH, USA

Re: New simple and fast bitboard move generator

Post by Michael Sherwin » Sun May 06, 2007 10:29 pm

Would this be any better?

Code: Select all

typedef struct {
  u64 bitsN;
  u64 bitsE;
  u64 bitsS;
  u64 bitsW;
  u32 idxN[64];
  u32 idxE[64];
  u32 idxS[64];
  u32 idxW[64];
} square;

u64 rookAttacks(u64 occ, square *sq) {
    unsigned long ln = 0, le = 7, ls = 56, lw = 63; 
    _BitScanForward64(&ln, occ & sq->bitsN); 
    _BitScanForward64(&le, occ & sq->bitsE); 
    _BitScanReverse64(&ls, occ & sq->bitsS); 
    _BitScanReverse64(&lw, occ & sq->bitsW); 
    u32 idx = sq->idxN[ln] | sq->idxE[le] | sq->idxS[ls] | sq->idxW[lw]; 
    return rookLookup[idx]; 
Edit: Or rather.

Code: Select all

typedef struct {
  u32 idxN;
  u32 idxE;
  u32 idxS;
  u32 idxW;
} dir;

typedef struct {
  u64 bitsN;
  u64 bitsE;
  u64 bitsS;
  u64 bitsW;
  dir idx[64];
} square;

u64 rookAttacks(u64 occ, square *sq) {
    unsigned long ln = 0, le = 7, ls = 56, lw = 63; 
    _BitScanForward64(&ln, occ & sq->bitsN); 
    _BitScanForward64(&le, occ & sq->bitsE); 
    _BitScanReverse64(&ls, occ & sq->bitsS); 
    _BitScanReverse64(&lw, occ & sq->bitsW); 
    u32 idx = sq->idx[ln].idxN | sq->idx[le].idxE | sq->idx[ls].idxS | sq->idx[lw].idxW; 
    return rookLookup[idx]; 
Regards,
Mike

Tony

Re: New simple and fast bitboard move generator

Post by Tony » Mon May 07, 2007 6:46 am

Gerd Isenberg wrote:
Michael Sherwin wrote:

Code: Select all

u32 RookIndex(u32 sq, u64 occ) {
    return = indexN[sq][FirstBit(occ & bitsN[sq])]
           | indexE[sq][FirstBit(occ & bitsE[sq])]
           | indexS[sq][LastBit (occ & bitsS[sq])]
           | indexW[sq][LastBit (occ & bitsW[sq])];
}
Since, bsf/bsr is very fast on core 2 this should be very competitive, in both 32 bit mode and 64 bit mode.

A rook on a1 only needs 3 bits + 3 bits = 6 bits of index, to hold the bit placements for the two directions. A rook on e4 needs 2 + 2 + 2 + 2 = 8 bits for the four directions. Remember not to include the final destination edge square.

This has the huge advantage that no 64 bit multiplication needs to be done and no shift back into place as well. Also it requires smaller lookup tables!

The returned index can be used to look up the mobility as well as the attack set.
Nice idea - but i fear too expensive due to the nine memory reads. In total 11 64-bit instructions, four 1/2KByte lookups and four 16KByte lookups, plus the final lookup by index. Note that you have to deal with bitscanning empty sets. May be you can gain some parallel speedup due to four independent instruction chains until the final ors and lookup, but i doubt that due to the eight leading (four 16KByte) memory-reads. Bitscans have latency of two cycles on core 2 duo, reciprocal thoughput of 1 cycle, thus two "parallel" bitscans can compete in two cycles as well...

My favourite lookup-approach seems still kindergarten for rooks and magic for bishops. Kindergarten rook takes in total 13 instructions including one imul, and two cheap lookups. Magic bishop requires three cheap lookups (from one cacheline), and/imul/shift, final <38KByte lookup.

Btw. 32-bit versions of magic and kindergarten don't need the expensive 64-bit mul call, but two 32-bit imuls. Bitscanning eight 32-bit words with 32-bit bsf/bsr is also not really cheap considering empty sets...

Gerd
How about something like:

Code: Select all


u32 RookIndex&#40;u32 sq, u64 occ&#41; &#123;
    return = indexN&#91;File&#40;sq&#41;&#93;&#91;Rank&#40;FirstBit&#40;occ & bitsN&#91;sq&#93;))&#93;<<&#40;Rank&#40;sq&#41;<<3&#41;
           | &#40;indexE&#91;Rank&#40;sq&#41;&#93;&#91;File&#40;FirstBit&#40;occ & bitsE&#91;sq&#93;))&#93;>>&#40;File&#40;sq&#41;) & rankMask&#40;Rank&#40;sq&#41;))
           .....

&#125;
The idea is that for a rook on e4 and a piece on e6, we find the bitboard of
e1 and e3 and shift it up.

I don't know if the smaller lookuptables outweigh the extra shifts.

EDIT: We could also shift for the file, treating everything as being on the a-File.


Tony

Gerd Isenberg
Posts: 2103
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg » Mon May 07, 2007 8:40 am

Michael Sherwin wrote:Would this be any better?

Code: Select all

typedef struct &#123;
  u64 bitsN;
  u64 bitsE;
  u64 bitsS;
  u64 bitsW;
  u32 idxN&#91;64&#93;;
  u32 idxE&#91;64&#93;;
  u32 idxS&#91;64&#93;;
  u32 idxW&#91;64&#93;;
&#125; square;

u64 rookAttacks&#40;u64 occ, square *sq&#41; &#123;
    unsigned long ln = 0, le = 7, ls = 56, lw = 63; 
    _BitScanForward64&#40;&ln, occ & sq->bitsN&#41;; 
    _BitScanForward64&#40;&le, occ & sq->bitsE&#41;; 
    _BitScanReverse64&#40;&ls, occ & sq->bitsS&#41;; 
    _BitScanReverse64&#40;&lw, occ & sq->bitsW&#41;; 
    u32 idx = sq->idxN&#91;ln&#93; | sq->idxE&#91;le&#93; | sq->idxS&#91;ls&#93; | sq->idxW&#91;lw&#93;; 
    return rookLookup&#91;idx&#93;; 
Edit: Or rather.

Code: Select all

typedef struct &#123;
  u32 idxN;
  u32 idxE;
  u32 idxS;
  u32 idxW;
&#125; dir;

typedef struct &#123;
  u64 bitsN;
  u64 bitsE;
  u64 bitsS;
  u64 bitsW;
  dir idx&#91;64&#93;;
&#125; square;

u64 rookAttacks&#40;u64 occ, square *sq&#41; &#123;
    unsigned long ln = 0, le = 7, ls = 56, lw = 63; 
    _BitScanForward64&#40;&ln, occ & sq->bitsN&#41;; 
    _BitScanForward64&#40;&le, occ & sq->bitsE&#41;; 
    _BitScanReverse64&#40;&ls, occ & sq->bitsS&#41;; 
    _BitScanReverse64&#40;&lw, occ & sq->bitsW&#41;; 
    u32 idx = sq->idx&#91;ln&#93;.idxN | sq->idx&#91;le&#93;.idxE | sq->idx&#91;ls&#93;.idxS | sq->idx&#91;lw&#93;.idxW; 
    return rookLookup&#91;idx&#93;; 

Yes, the pointer may safe some leas. Why did you use redundant indices for the four directions? If a blocking square is found, as well if not found, the preinitialized squares (0,7,56,63) already imply the direction.

Code: Select all

typedef struct &#123;
  u16 idx&#91;64&#93;;
  u64 bitsN;
  u64 bitsE;
  u64 bitsS;
  u64 bitsW;
&#125; square;

u64 rookAttacks&#40;u64 occ, square *sq&#41; &#123;
    unsigned long ln = 0, le = 7, ls = 56, lw = 63;
    _BitScanForward64&#40;&ln, occ & sq->bitsN&#41;;
    _BitScanForward64&#40;&le, occ & sq->bitsE&#41;;
    _BitScanReverse64&#40;&ls, occ & sq->bitsS&#41;;
    _BitScanReverse64&#40;&lw, occ & sq->bitsW&#41;;
    u32 idx = sq->idx&#91;ln&#93; | sq->idx&#91;le&#93; | sq->idx&#91;ls&#93; | sq->idx&#91;lw&#93;;
    return rookLookup&#91;idx&#93;;
&#125;
I am too short in time now - what is saw yesterday was quite strange - the vc2005 initilalized some variables on the stack, but failed to load it to the destination registers before the bitscans!? I will have a closer look later...

Gerd

Gerd Isenberg
Posts: 2103
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg » Mon May 07, 2007 8:51 am

Tony wrote: How about something like:

Code: Select all


u32 RookIndex&#40;u32 sq, u64 occ&#41; &#123;
    return = indexN&#91;File&#40;sq&#41;&#93;&#91;Rank&#40;FirstBit&#40;occ & bitsN&#91;sq&#93;))&#93;<<&#40;Rank&#40;sq&#41;<<3&#41;
           | &#40;indexE&#91;Rank&#40;sq&#41;&#93;&#91;File&#40;FirstBit&#40;occ & bitsE&#91;sq&#93;))&#93;>>&#40;File&#40;sq&#41;) & rankMask&#40;Rank&#40;sq&#41;))
           .....

&#125;
The idea is that for a rook on e4 and a piece on e6, we find the bitboard of
e1 and e3 and shift it up.

I don't know if the smaller lookuptables outweigh the extra shifts.

EDIT: We could also shift for the file, treating everything as being on the a-File.

Tony
Of course that is possible for shorter tables, but you bump up the 9 memory accesses with shifts by cl. That will make it terribly slow, since it foils the hope for parallel computation of all four directions.

Gerd

Michael Sherwin
Posts: 2798
Joined: Fri May 26, 2006 1:00 am
Location: OH, USA

Re: New simple and fast bitboard move generator

Post by Michael Sherwin » Mon May 07, 2007 11:06 am

Hi Gerd,
Why did you use redundant indices for the four directions?
To put it simple, this stuff is over my head. I have to struggle mightilly to make any of my ideas work. So, to understand someone elses ideas and code is near impossible for me--I'm not exagerating. But, I think that I understand your changes better now.

Mike
Regards,
Mike

Gerd Isenberg
Posts: 2103
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg » Mon May 07, 2007 8:08 pm

Gerd Isenberg wrote:
I am too short in time now - what is saw yesterday was quite strange - the vc2005 initilalized some variables on the stack, but failed to load it to the destination registers before the bitscans!? I will have a closer look later...

Gerd
It does not work, if you look to the assembly:

The vc2005 compiler initializes the same local variable on the stack with 0,7,56,63 but fails to initialize the destination registers of the bitscans. As a result of this unspecified usage of bsf/bsr the complete function is serialized!

Code: Select all

typedef struct
&#123;
    u64 bits&#91;4&#93;;
    unsigned short index&#91;64&#93;;
&#125; SQUARE;

u64 rookAttacks&#40;u64 occ, SQUARE *sq&#41; &#123;
    unsigned long n = 0, e = 7, s = 56, w = 63;
    _BitScanForward64&#40;&n, occ & sq->bits&#91;0&#93;);
    _BitScanForward64&#40;&e, occ & sq->bits&#91;1&#93;);
    _BitScanReverse64&#40;&s, occ & sq->bits&#91;2&#93;);
    _BitScanReverse64&#40;&w, occ & sq->bits&#91;3&#93;);
     u32 idx = sq->index&#91;n&#93; | sq->index&#91;e&#93; | sq->index&#91;s&#93; | sq->index&#91;w&#93;;
     return rookLookup&#91;idx&#93;;
&#125;


_TEXT	SEGMENT
w$ = 8
s$ = 8
e$ = 8
n$ = 8
occ$ = 8
sq$ = 16
?rookAttacks@@YA_K_KPEAUSQUARE@@@Z PROC
  00000	48 8b 02	 mov	 rax, QWORD PTR &#91;rdx&#93;
  00003	4c 8b d2	 mov	 r10, rdx
  00006	c7 44 24 08 00
	00 00 00	 mov	 DWORD PTR n$&#91;rsp&#93;, 0
  0000e	48 23 c1	 and	 rax, rcx
  00011	c7 44 24 08 07
	00 00 00	 mov	 DWORD PTR e$&#91;rsp&#93;, 7
  00019	c7 44 24 08 38
	00 00 00	 mov	 DWORD PTR s$&#91;rsp&#93;, 56
  00021	4c 0f bc c8	 bsf	 r9, rax
  00025	48 8b 42 08	 mov	 rax, QWORD PTR &#91;rdx+8&#93;
  00029	c7 44 24 08 3f
	00 00 00	 mov	 DWORD PTR w$&#91;rsp&#93;, 63
  00031	48 23 c1	 and	 rax, rcx
  00034	4c 0f bc c0	 bsf	 r8, rax
  00038	48 8b 42 10	 mov	 rax, QWORD PTR &#91;rdx+16&#93;
  0003c	48 23 c1	 and	 rax, rcx
  0003f	48 0f bd d0	 bsr	 rdx, rax
  00043	49 8b 42 18	 mov	 rax, QWORD PTR &#91;r10+24&#93;
  00047	41 0f b7 54 52
	20		 movzx	 edx, WORD PTR &#91;r10+rdx*2+32&#93;
  0004d	48 23 c1	 and	 rax, rcx
  00050	48 0f bd c8	 bsr	 rcx, rax
  00054	41 0f b7 44 4a
	20		 movzx	 eax, WORD PTR &#91;r10+rcx*2+32&#93;
  0005a	48 8d 0d 00 00
	00 00		 lea	 rcx, OFFSET FLAT&#58;?rookLookup
  00061	48 0b c2	 or	 rax, rdx
  00064	43 0f b7 54 42
	20		 movzx	 edx, WORD PTR &#91;r10+r8*2+32&#93;
  0006a	48 0b c2	 or	 rax, rdx
  0006d	43 0f b7 54 4a
	20		 movzx	 edx, WORD PTR &#91;r10+r9*2+32&#93;
  00073	48 0b c2	 or	 rax, rdx
  00076	48 8b 04 c1	 mov	 rax, QWORD PTR &#91;rcx+rax*8&#93;
  0007a	c3		 ret	 0
?rookAttacks@@YA_K_KPEAUSQUARE@@@Z ENDP
So i fear you have to do something like this to ensure that at least one bit is found by the bitscans.

Code: Select all

typedef struct
&#123;
	u64 inner&#91;4&#93;;
	u64 outer&#91;4&#93;;
	unsigned short index&#91;64&#93;;
&#125; SQUARE;

u64 rookAttacks&#40;u64 occ, SQUARE *sq&#41; &#123;
    u64 n = &#40;occ & sq->inner&#91;0&#93;) | sq->outer&#91;0&#93;;
    u64 e = &#40;occ & sq->inner&#91;1&#93;) | sq->outer&#91;1&#93;;
    u64 s = &#40;occ & sq->inner&#91;2&#93;) | sq->outer&#91;2&#93;;
    u64 w = &#40;occ & sq->inner&#91;3&#93;) | sq->outer&#91;3&#93;;
    _BitScanForward64&#40;&#40;unsigned long*)&n, n&#41;;
    _BitScanForward64&#40;&#40;unsigned long*)&e, e&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&s, s&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&w, w&#41;;
    u32 idx = sq->index&#91;n&#93; | sq->index&#91;e&#93; | sq->index&#91;s&#93; | sq->index&#91;w&#93;;
    return rookLookup&#91;idx&#93;;
&#125;


_TEXT	SEGMENT
occ$ = 8
sq$ = 16
  00000	4c 8b 42 10	 mov	 r8, QWORD PTR &#91;rdx+16&#93;
  00004	48 8b 42 18	 mov	 rax, QWORD PTR &#91;rdx+24&#93;
  00008	4c 8b 0a	 mov	 r9, QWORD PTR &#91;rdx&#93;
  0000b	4c 8b 52 08	 mov	 r10, QWORD PTR &#91;rdx+8&#93;
  0000f	48 23 c1	 and	 rax, rcx
  00012	4c 23 c9	 and	 r9, rcx
  00015	48 0b 42 38	 or	 rax, QWORD PTR &#91;rdx+56&#93;
  00019	4c 0b 4a 20	 or	 r9, QWORD PTR &#91;rdx+32&#93;
  0001d	4c 23 c1	 and	 r8, rcx
  00020	4c 0b 42 30	 or	 r8, QWORD PTR &#91;rdx+48&#93;
  00024	4c 23 d1	 and	 r10, rcx
  00027	48 0f bd c0	 bsr	 rax, rax
  0002b	4c 0b 52 28	 or	 r10, QWORD PTR &#91;rdx+40&#93;
  0002f	4c 8b da	 mov	 r11, rdx
  00032	49 0f bd c8	 bsr	 rcx, r8
  00036	41 0f b7 44 43
	40		 movzx	 eax, WORD PTR &#91;r11+rax*2+64&#93;
  0003c	41 0f b7 4c 4b
	40		 movzx	 ecx, WORD PTR &#91;r11+rcx*2+64&#93;
  00042	49 0f bc d2	 bsf	 rdx, r10
  00046	48 0b c1	 or	 rax, rcx
  00049	41 0f b7 4c 53
	40		 movzx	 ecx, WORD PTR &#91;r11+rdx*2+64&#93;
  0004f	4d 0f bc c9	 bsf	 r9, r9
  00053	48 0b c1	 or	 rax, rcx
  00056	43 0f b7 4c 4b
	40		 movzx	 ecx, WORD PTR &#91;r11+r9*2+64&#93;
  0005c	48 0b c1	 or	 rax, rcx
  0005f	48 8d 0d 00 00
	00 00		 lea	 rcx, OFFSET FLAT&#58;?rookLookup@@3PA_KA ; rookLookup
  00066	48 8b 04 c1	 mov	 rax, QWORD PTR &#91;rcx+rax*8&#93;
  0006a	c3		 ret	 0
?rookAttacks@@YA_K_KPEAUSQUARE@@@Z ENDP			; rookAttacks
Of cousre instead of sq->outer[0..3] you may use immediate constants 0xFF00000000000000, 0x8080808080808080, 0x00000000000000FF, 0x0101010101010101, but that blows up the source code, while you can put it in the same cacheline as the inner bits for the potential blockers...

Now you need somebody to count cycles on a core 2 duo.
I better don't make any predictions ;-)

Gerd

Gerd Isenberg
Posts: 2103
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg » Mon May 07, 2007 11:22 pm

This one with only 4KByte lookup in total is another alternative based on your bitscan idea. But it needs some more registers to keep bits[sq][dir] for a while. bits[square][dir] contains all ray bits including boarder and edges, but exclusive the square, so some empty sets for boarder squares and appropriate direction. outer[square][dir] contains exactly one boarder-bit - it may include the square itself.

Code: Select all

u64 bits&#91;64&#93;&#91;4&#93;;
u64 outer&#91;64&#93;&#91;4&#93;;

u64 rookAttacks&#40;u64 occ, u32 sq&#41; &#123;
    u64 n = &#40;occ & bits&#91;sq&#93;&#91;0&#93;) | outer&#91;sq&#93;&#91;0&#93;;
    u64 e = &#40;occ & bits&#91;sq&#93;&#91;1&#93;) | outer&#91;sq&#93;&#91;1&#93;;
    u64 s = &#40;occ & bits&#91;sq&#93;&#91;2&#93;) | outer&#91;sq&#93;&#91;2&#93;;
    u64 w = &#40;occ & bits&#91;sq&#93;&#91;3&#93;) | outer&#91;sq&#93;&#91;3&#93;;
    _BitScanForward64&#40;&#40;unsigned long*)&n, n&#41;;
    _BitScanForward64&#40;&#40;unsigned long*)&e, e&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&s, s&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&w, w&#41;;
    return bits&#91;sq&#93;&#91;0&#93; ^ bits&#91;n&#93;&#91;0&#93;
        ^  bits&#91;sq&#93;&#91;1&#93; ^ bits&#91;e&#93;&#91;1&#93;
        ^  bits&#91;sq&#93;&#91;2&#93; ^ bits&#91;s&#93;&#91;2&#93;
        ^  bits&#91;sq&#93;&#91;3&#93; ^ bits&#91;w&#93;&#91;3&#93;
        ;
&#125;

Post Reply