New simple and fast bitboard move generator

Discussion of chess software programming and technical issues.

Moderators: hgm, Harvey Williamson, bob

Gerd Isenberg
Posts: 2103
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg » Mon May 07, 2007 8:40 am

Michael Sherwin wrote:Would this be any better?

Code: Select all

typedef struct {
  u64 bitsN;
  u64 bitsE;
  u64 bitsS;
  u64 bitsW;
  u32 idxN[64];
  u32 idxE[64];
  u32 idxS[64];
  u32 idxW[64];
} square;

u64 rookAttacks(u64 occ, square *sq) {
    unsigned long ln = 0, le = 7, ls = 56, lw = 63; 
    _BitScanForward64(&ln, occ & sq->bitsN); 
    _BitScanForward64(&le, occ & sq->bitsE); 
    _BitScanReverse64(&ls, occ & sq->bitsS); 
    _BitScanReverse64(&lw, occ & sq->bitsW); 
    u32 idx = sq->idxN[ln] | sq->idxE[le] | sq->idxS[ls] | sq->idxW[lw]; 
    return rookLookup[idx]; 
Edit: Or rather.

Code: Select all

typedef struct {
  u32 idxN;
  u32 idxE;
  u32 idxS;
  u32 idxW;
} dir;

typedef struct {
  u64 bitsN;
  u64 bitsE;
  u64 bitsS;
  u64 bitsW;
  dir idx[64];
} square;

u64 rookAttacks(u64 occ, square *sq) {
    unsigned long ln = 0, le = 7, ls = 56, lw = 63; 
    _BitScanForward64(&ln, occ & sq->bitsN); 
    _BitScanForward64(&le, occ & sq->bitsE); 
    _BitScanReverse64(&ls, occ & sq->bitsS); 
    _BitScanReverse64(&lw, occ & sq->bitsW); 
    u32 idx = sq->idx[ln].idxN | sq->idx[le].idxE | sq->idx[ls].idxS | sq->idx[lw].idxW; 
    return rookLookup[idx]; 

Yes, the pointer may safe some leas. Why did you use redundant indices for the four directions? If a blocking square is found, as well if not found, the preinitialized squares (0,7,56,63) already imply the direction.

Code: Select all

typedef struct {
  u16 idx[64];
  u64 bitsN;
  u64 bitsE;
  u64 bitsS;
  u64 bitsW;
} square;

u64 rookAttacks(u64 occ, square *sq) {
    unsigned long ln = 0, le = 7, ls = 56, lw = 63;
    _BitScanForward64(&ln, occ & sq->bitsN);
    _BitScanForward64(&le, occ & sq->bitsE);
    _BitScanReverse64(&ls, occ & sq->bitsS);
    _BitScanReverse64(&lw, occ & sq->bitsW);
    u32 idx = sq->idx[ln] | sq->idx[le] | sq->idx[ls] | sq->idx[lw];
    return rookLookup[idx];
}
I am too short in time now - what is saw yesterday was quite strange - the vc2005 initilalized some variables on the stack, but failed to load it to the destination registers before the bitscans!? I will have a closer look later...

Gerd

Gerd Isenberg
Posts: 2103
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg » Mon May 07, 2007 8:51 am

Tony wrote: How about something like:

Code: Select all


u32 RookIndex(u32 sq, u64 occ) {
    return = indexN&#91;File&#40;sq&#41;&#93;&#91;Rank&#40;FirstBit&#40;occ & bitsN&#91;sq&#93;))&#93;<<&#40;Rank&#40;sq&#41;<<3&#41;
           | &#40;indexE&#91;Rank&#40;sq&#41;&#93;&#91;File&#40;FirstBit&#40;occ & bitsE&#91;sq&#93;))&#93;>>&#40;File&#40;sq&#41;) & rankMask&#40;Rank&#40;sq&#41;))
           .....

&#125;
The idea is that for a rook on e4 and a piece on e6, we find the bitboard of
e1 and e3 and shift it up.

I don't know if the smaller lookuptables outweigh the extra shifts.

EDIT: We could also shift for the file, treating everything as being on the a-File.

Tony
Of course that is possible for shorter tables, but you bump up the 9 memory accesses with shifts by cl. That will make it terribly slow, since it foils the hope for parallel computation of all four directions.

Gerd

Michael Sherwin
Posts: 2798
Joined: Fri May 26, 2006 1:00 am
Location: OH, USA

Re: New simple and fast bitboard move generator

Post by Michael Sherwin » Mon May 07, 2007 11:06 am

Hi Gerd,
Why did you use redundant indices for the four directions?
To put it simple, this stuff is over my head. I have to struggle mightilly to make any of my ideas work. So, to understand someone elses ideas and code is near impossible for me--I'm not exagerating. But, I think that I understand your changes better now.

Mike
Regards,
Mike

Gerd Isenberg
Posts: 2103
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg » Mon May 07, 2007 8:08 pm

Gerd Isenberg wrote:
I am too short in time now - what is saw yesterday was quite strange - the vc2005 initilalized some variables on the stack, but failed to load it to the destination registers before the bitscans!? I will have a closer look later...

Gerd
It does not work, if you look to the assembly:

The vc2005 compiler initializes the same local variable on the stack with 0,7,56,63 but fails to initialize the destination registers of the bitscans. As a result of this unspecified usage of bsf/bsr the complete function is serialized!

Code: Select all

typedef struct
&#123;
    u64 bits&#91;4&#93;;
    unsigned short index&#91;64&#93;;
&#125; SQUARE;

u64 rookAttacks&#40;u64 occ, SQUARE *sq&#41; &#123;
    unsigned long n = 0, e = 7, s = 56, w = 63;
    _BitScanForward64&#40;&n, occ & sq->bits&#91;0&#93;);
    _BitScanForward64&#40;&e, occ & sq->bits&#91;1&#93;);
    _BitScanReverse64&#40;&s, occ & sq->bits&#91;2&#93;);
    _BitScanReverse64&#40;&w, occ & sq->bits&#91;3&#93;);
     u32 idx = sq->index&#91;n&#93; | sq->index&#91;e&#93; | sq->index&#91;s&#93; | sq->index&#91;w&#93;;
     return rookLookup&#91;idx&#93;;
&#125;


_TEXT	SEGMENT
w$ = 8
s$ = 8
e$ = 8
n$ = 8
occ$ = 8
sq$ = 16
?rookAttacks@@YA_K_KPEAUSQUARE@@@Z PROC
  00000	48 8b 02	 mov	 rax, QWORD PTR &#91;rdx&#93;
  00003	4c 8b d2	 mov	 r10, rdx
  00006	c7 44 24 08 00
	00 00 00	 mov	 DWORD PTR n$&#91;rsp&#93;, 0
  0000e	48 23 c1	 and	 rax, rcx
  00011	c7 44 24 08 07
	00 00 00	 mov	 DWORD PTR e$&#91;rsp&#93;, 7
  00019	c7 44 24 08 38
	00 00 00	 mov	 DWORD PTR s$&#91;rsp&#93;, 56
  00021	4c 0f bc c8	 bsf	 r9, rax
  00025	48 8b 42 08	 mov	 rax, QWORD PTR &#91;rdx+8&#93;
  00029	c7 44 24 08 3f
	00 00 00	 mov	 DWORD PTR w$&#91;rsp&#93;, 63
  00031	48 23 c1	 and	 rax, rcx
  00034	4c 0f bc c0	 bsf	 r8, rax
  00038	48 8b 42 10	 mov	 rax, QWORD PTR &#91;rdx+16&#93;
  0003c	48 23 c1	 and	 rax, rcx
  0003f	48 0f bd d0	 bsr	 rdx, rax
  00043	49 8b 42 18	 mov	 rax, QWORD PTR &#91;r10+24&#93;
  00047	41 0f b7 54 52
	20		 movzx	 edx, WORD PTR &#91;r10+rdx*2+32&#93;
  0004d	48 23 c1	 and	 rax, rcx
  00050	48 0f bd c8	 bsr	 rcx, rax
  00054	41 0f b7 44 4a
	20		 movzx	 eax, WORD PTR &#91;r10+rcx*2+32&#93;
  0005a	48 8d 0d 00 00
	00 00		 lea	 rcx, OFFSET FLAT&#58;?rookLookup
  00061	48 0b c2	 or	 rax, rdx
  00064	43 0f b7 54 42
	20		 movzx	 edx, WORD PTR &#91;r10+r8*2+32&#93;
  0006a	48 0b c2	 or	 rax, rdx
  0006d	43 0f b7 54 4a
	20		 movzx	 edx, WORD PTR &#91;r10+r9*2+32&#93;
  00073	48 0b c2	 or	 rax, rdx
  00076	48 8b 04 c1	 mov	 rax, QWORD PTR &#91;rcx+rax*8&#93;
  0007a	c3		 ret	 0
?rookAttacks@@YA_K_KPEAUSQUARE@@@Z ENDP
So i fear you have to do something like this to ensure that at least one bit is found by the bitscans.

Code: Select all

typedef struct
&#123;
	u64 inner&#91;4&#93;;
	u64 outer&#91;4&#93;;
	unsigned short index&#91;64&#93;;
&#125; SQUARE;

u64 rookAttacks&#40;u64 occ, SQUARE *sq&#41; &#123;
    u64 n = &#40;occ & sq->inner&#91;0&#93;) | sq->outer&#91;0&#93;;
    u64 e = &#40;occ & sq->inner&#91;1&#93;) | sq->outer&#91;1&#93;;
    u64 s = &#40;occ & sq->inner&#91;2&#93;) | sq->outer&#91;2&#93;;
    u64 w = &#40;occ & sq->inner&#91;3&#93;) | sq->outer&#91;3&#93;;
    _BitScanForward64&#40;&#40;unsigned long*)&n, n&#41;;
    _BitScanForward64&#40;&#40;unsigned long*)&e, e&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&s, s&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&w, w&#41;;
    u32 idx = sq->index&#91;n&#93; | sq->index&#91;e&#93; | sq->index&#91;s&#93; | sq->index&#91;w&#93;;
    return rookLookup&#91;idx&#93;;
&#125;


_TEXT	SEGMENT
occ$ = 8
sq$ = 16
  00000	4c 8b 42 10	 mov	 r8, QWORD PTR &#91;rdx+16&#93;
  00004	48 8b 42 18	 mov	 rax, QWORD PTR &#91;rdx+24&#93;
  00008	4c 8b 0a	 mov	 r9, QWORD PTR &#91;rdx&#93;
  0000b	4c 8b 52 08	 mov	 r10, QWORD PTR &#91;rdx+8&#93;
  0000f	48 23 c1	 and	 rax, rcx
  00012	4c 23 c9	 and	 r9, rcx
  00015	48 0b 42 38	 or	 rax, QWORD PTR &#91;rdx+56&#93;
  00019	4c 0b 4a 20	 or	 r9, QWORD PTR &#91;rdx+32&#93;
  0001d	4c 23 c1	 and	 r8, rcx
  00020	4c 0b 42 30	 or	 r8, QWORD PTR &#91;rdx+48&#93;
  00024	4c 23 d1	 and	 r10, rcx
  00027	48 0f bd c0	 bsr	 rax, rax
  0002b	4c 0b 52 28	 or	 r10, QWORD PTR &#91;rdx+40&#93;
  0002f	4c 8b da	 mov	 r11, rdx
  00032	49 0f bd c8	 bsr	 rcx, r8
  00036	41 0f b7 44 43
	40		 movzx	 eax, WORD PTR &#91;r11+rax*2+64&#93;
  0003c	41 0f b7 4c 4b
	40		 movzx	 ecx, WORD PTR &#91;r11+rcx*2+64&#93;
  00042	49 0f bc d2	 bsf	 rdx, r10
  00046	48 0b c1	 or	 rax, rcx
  00049	41 0f b7 4c 53
	40		 movzx	 ecx, WORD PTR &#91;r11+rdx*2+64&#93;
  0004f	4d 0f bc c9	 bsf	 r9, r9
  00053	48 0b c1	 or	 rax, rcx
  00056	43 0f b7 4c 4b
	40		 movzx	 ecx, WORD PTR &#91;r11+r9*2+64&#93;
  0005c	48 0b c1	 or	 rax, rcx
  0005f	48 8d 0d 00 00
	00 00		 lea	 rcx, OFFSET FLAT&#58;?rookLookup@@3PA_KA ; rookLookup
  00066	48 8b 04 c1	 mov	 rax, QWORD PTR &#91;rcx+rax*8&#93;
  0006a	c3		 ret	 0
?rookAttacks@@YA_K_KPEAUSQUARE@@@Z ENDP			; rookAttacks
Of cousre instead of sq->outer[0..3] you may use immediate constants 0xFF00000000000000, 0x8080808080808080, 0x00000000000000FF, 0x0101010101010101, but that blows up the source code, while you can put it in the same cacheline as the inner bits for the potential blockers...

Now you need somebody to count cycles on a core 2 duo.
I better don't make any predictions ;-)

Gerd

Gerd Isenberg
Posts: 2103
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg » Mon May 07, 2007 11:22 pm

This one with only 4KByte lookup in total is another alternative based on your bitscan idea. But it needs some more registers to keep bits[sq][dir] for a while. bits[square][dir] contains all ray bits including boarder and edges, but exclusive the square, so some empty sets for boarder squares and appropriate direction. outer[square][dir] contains exactly one boarder-bit - it may include the square itself.

Code: Select all

u64 bits&#91;64&#93;&#91;4&#93;;
u64 outer&#91;64&#93;&#91;4&#93;;

u64 rookAttacks&#40;u64 occ, u32 sq&#41; &#123;
    u64 n = &#40;occ & bits&#91;sq&#93;&#91;0&#93;) | outer&#91;sq&#93;&#91;0&#93;;
    u64 e = &#40;occ & bits&#91;sq&#93;&#91;1&#93;) | outer&#91;sq&#93;&#91;1&#93;;
    u64 s = &#40;occ & bits&#91;sq&#93;&#91;2&#93;) | outer&#91;sq&#93;&#91;2&#93;;
    u64 w = &#40;occ & bits&#91;sq&#93;&#91;3&#93;) | outer&#91;sq&#93;&#91;3&#93;;
    _BitScanForward64&#40;&#40;unsigned long*)&n, n&#41;;
    _BitScanForward64&#40;&#40;unsigned long*)&e, e&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&s, s&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&w, w&#41;;
    return bits&#91;sq&#93;&#91;0&#93; ^ bits&#91;n&#93;&#91;0&#93;
        ^  bits&#91;sq&#93;&#91;1&#93; ^ bits&#91;e&#93;&#91;1&#93;
        ^  bits&#91;sq&#93;&#91;2&#93; ^ bits&#91;s&#93;&#91;2&#93;
        ^  bits&#91;sq&#93;&#91;3&#93; ^ bits&#91;w&#93;&#91;3&#93;
        ;
&#125;

Michael Sherwin
Posts: 2798
Joined: Fri May 26, 2006 1:00 am
Location: OH, USA

Re: New simple and fast bitboard move generator

Post by Michael Sherwin » Tue May 08, 2007 3:18 am

Gerd Isenberg wrote:This one with only 4KByte lookup in total is another alternative based on your bitscan idea. But it needs some more registers to keep bits[sq][dir] for a while. bits[square][dir] contains all ray bits including boarder and edges, but exclusive the square, so some empty sets for boarder squares and appropriate direction. outer[square][dir] contains exactly one boarder-bit - it may include the square itself.

Code: Select all

u64 bits&#91;64&#93;&#91;4&#93;;
u64 outer&#91;64&#93;&#91;4&#93;;

u64 rookAttacks&#40;u64 occ, u32 sq&#41; &#123;
    u64 n = &#40;occ & bits&#91;sq&#93;&#91;0&#93;) | outer&#91;sq&#93;&#91;0&#93;;
    u64 e = &#40;occ & bits&#91;sq&#93;&#91;1&#93;) | outer&#91;sq&#93;&#91;1&#93;;
    u64 s = &#40;occ & bits&#91;sq&#93;&#91;2&#93;) | outer&#91;sq&#93;&#91;2&#93;;
    u64 w = &#40;occ & bits&#91;sq&#93;&#91;3&#93;) | outer&#91;sq&#93;&#91;3&#93;;
    _BitScanForward64&#40;&#40;unsigned long*)&n, n&#41;;
    _BitScanForward64&#40;&#40;unsigned long*)&e, e&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&s, s&#41;;
    _BitScanReverse64&#40;&#40;unsigned long*)&w, w&#41;;
    return bits&#91;sq&#93;&#91;0&#93; ^ bits&#91;n&#93;&#91;0&#93;
        ^  bits&#91;sq&#93;&#91;1&#93; ^ bits&#91;e&#93;&#91;1&#93;
        ^  bits&#91;sq&#93;&#91;2&#93; ^ bits&#91;s&#93;&#91;2&#93;
        ^  bits&#91;sq&#93;&#91;3&#93; ^ bits&#91;w&#93;&#91;3&#93;
        ;
&#125;
Your hand assembly version looked extreamly fast while the new c code version does not seem very fast in comparison. Does the hand assembly work? Regardless, there is lots to work with now. Thanks.
Regards,
Mike

Gerd Isenberg
Posts: 2103
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg » Tue May 08, 2007 6:11 am

Michael Sherwin wrote: Your hand assembly version looked extreamly fast while the new c code version does not seem very fast in comparison. Does the hand assembly work? Regardless, there is lots to work with now. Thanks.
Not quite sure about x64 addressing modes in ML64 - RIP/EIP-relative? It should not be to hard to link it with C-code, considering calling conventions, scratch register usage etc.. Despite it's nice possible parallel computation of four independent chains, i still don't believe it may be faster than kindergarten-rookattacks, but i may be wrong.

How many cycles does it take to load one 64-bit cacheline (8 cacheline aligned bitboards) from L1 into eight general purpose registers on a core 2 duo?

Michael Sherwin
Posts: 2798
Joined: Fri May 26, 2006 1:00 am
Location: OH, USA

Re: New simple and fast bitboard move generator

Post by Michael Sherwin » Tue May 08, 2007 9:25 pm

Would this fix the problem?

The corner squares do not matter as far as occupancy goes, so just make sure the corner bits are set.

Also make sure the appropriate corner bit is set for the direction bits.

Now there will always be at least one set bit to find!

And it takes less instructions!!

Code: Select all

typedef struct &#123; 
  u16 idx&#91;64&#93;; 
  u64 bitsN; 
  u64 bitsE; 
  u64 bitsS; 
  u64 bitsW; 
&#125; square; 

u64 rookAttacks&#40;u64 occ, square *sq&#41; &#123; 
    unsigned long ln, le, ls, lw; 
    occ |= fourCorners;
    _BitScanForward64&#40;&ln, occ & sq->bitsN&#41;; 
    _BitScanForward64&#40;&le, occ & sq->bitsE&#41;; 
    _BitScanReverse64&#40;&ls, occ & sq->bitsS&#41;; 
    _BitScanReverse64&#40;&lw, occ & sq->bitsW&#41;; 
    u32 idx = sq->idx&#91;ln&#93; | sq->idx&#91;le&#93; | sq->idx&#91;ls&#93; | sq->idx&#91;lw&#93;; 
    return rookLookup&#91;idx&#93;; 
Regards,
Mike

Michael Sherwin
Posts: 2798
Joined: Fri May 26, 2006 1:00 am
Location: OH, USA

Re: New simple and fast bitboard move generator

Post by Michael Sherwin » Tue May 08, 2007 10:55 pm

Also it would seem, OIMBFOS, that better parallel execution at the end may be achived by ORing two of the direction indexs together and then the other two and finally those two results.
Regards,
Mike

Gerd Isenberg
Posts: 2103
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: New simple and fast bitboard move generator

Post by Gerd Isenberg » Wed May 09, 2007 3:18 am

Michael Sherwin wrote:Would this fix the problem?

The corner squares do not matter as far as occupancy goes, so just make sure the corner bits are set.

Also make sure the appropriate corner bit is set for the direction bits.

Now there will always be at least one set bit to find!

And it takes less instructions!!

Code: Select all

typedef struct &#123; 
  u16 idx&#91;64&#93;; 
  u64 bitsN; 
  u64 bitsE; 
  u64 bitsS; 
  u64 bitsW; 
&#125; square; 

u64 rookAttacks&#40;u64 occ, square *sq&#41; &#123; 
    unsigned long ln, le, ls, lw; 
    occ |= fourCorners;
    _BitScanForward64&#40;&ln, occ & sq->bitsN&#41;; 
    _BitScanForward64&#40;&le, occ & sq->bitsE&#41;; 
    _BitScanReverse64&#40;&ls, occ & sq->bitsS&#41;; 
    _BitScanReverse64&#40;&lw, occ & sq->bitsW&#41;; 
    u32 idx = sq->idx&#91;ln&#93; | sq->idx&#91;le&#93; | sq->idx&#91;ls&#93; | sq->idx&#91;lw&#93;; 
    return rookLookup&#91;idx&#93;; 
Yes, very good - that would work!

Post Reply