New simple and fast bitboard move generator
Moderators: hgm, Rebel, chrisw
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: New simple and fast bitboard move generator
Hi Gerd,
That assembly looks fantastic!
Thanks again!
Mike
That assembly looks fantastic!
Thanks again!
Mike
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: New simple and fast bitboard move generator
I am still not quite sure whether it is guaranteed to use bitscan that way on a core 2 duo. For older processors intel explicitly stated that the destination register might be undefined after bsf/bsr with zero source, while Amd makes it quite clear:
AMD64 Architecture
Programmer’s Manual
Volume 3:
General-Purpose and System
Instructions
Page 74
BSF Bit Scan Forward
Searches the value in a register or a memory location (second operand) for the leastsignificant set bit. If a set bit is found, the instruction clears the zero flag (ZF) and stores the index of the least-significant set bit in a destination register (first operand). If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. The bit index is an unsigned offset from bit 0 of the searched value.
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: New simple and fast bitboard move generator
I would hope that Intel was a bit smarter this time round!Gerd Isenberg wrote:I am still not quite sure whether it is guaranteed to use bitscan that way on a core 2 duo. For older processors intel explicitly stated that the destination register might be undefined after bsf/bsr with zero source, while Amd makes it quite clear:
AMD64 Architecture
Programmer’s Manual
Volume 3:
General-Purpose and System
Instructions
Page 74
BSF Bit Scan Forward
Searches the value in a register or a memory location (second operand) for the leastsignificant set bit. If a set bit is found, the instruction clears the zero flag (ZF) and stores the index of the least-significant set bit in a destination register (first operand). If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. The bit index is an unsigned offset from bit 0 of the searched value.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: New simple and fast bitboard move generator
Would this be any better?
Edit: Or rather.
Code: Select all
typedef struct {
u64 bitsN;
u64 bitsE;
u64 bitsS;
u64 bitsW;
u32 idxN[64];
u32 idxE[64];
u32 idxS[64];
u32 idxW[64];
} square;
u64 rookAttacks(u64 occ, square *sq) {
unsigned long ln = 0, le = 7, ls = 56, lw = 63;
_BitScanForward64(&ln, occ & sq->bitsN);
_BitScanForward64(&le, occ & sq->bitsE);
_BitScanReverse64(&ls, occ & sq->bitsS);
_BitScanReverse64(&lw, occ & sq->bitsW);
u32 idx = sq->idxN[ln] | sq->idxE[le] | sq->idxS[ls] | sq->idxW[lw];
return rookLookup[idx];
Code: Select all
typedef struct {
u32 idxN;
u32 idxE;
u32 idxS;
u32 idxW;
} dir;
typedef struct {
u64 bitsN;
u64 bitsE;
u64 bitsS;
u64 bitsW;
dir idx[64];
} square;
u64 rookAttacks(u64 occ, square *sq) {
unsigned long ln = 0, le = 7, ls = 56, lw = 63;
_BitScanForward64(&ln, occ & sq->bitsN);
_BitScanForward64(&le, occ & sq->bitsE);
_BitScanReverse64(&ls, occ & sq->bitsS);
_BitScanReverse64(&lw, occ & sq->bitsW);
u32 idx = sq->idx[ln].idxN | sq->idx[le].idxE | sq->idx[ls].idxS | sq->idx[lw].idxW;
return rookLookup[idx];
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Re: New simple and fast bitboard move generator
How about something like:Gerd Isenberg wrote:Nice idea - but i fear too expensive due to the nine memory reads. In total 11 64-bit instructions, four 1/2KByte lookups and four 16KByte lookups, plus the final lookup by index. Note that you have to deal with bitscanning empty sets. May be you can gain some parallel speedup due to four independent instruction chains until the final ors and lookup, but i doubt that due to the eight leading (four 16KByte) memory-reads. Bitscans have latency of two cycles on core 2 duo, reciprocal thoughput of 1 cycle, thus two "parallel" bitscans can compete in two cycles as well...Michael Sherwin wrote:Since, bsf/bsr is very fast on core 2 this should be very competitive, in both 32 bit mode and 64 bit mode.Code: Select all
u32 RookIndex(u32 sq, u64 occ) { return = indexN[sq][FirstBit(occ & bitsN[sq])] | indexE[sq][FirstBit(occ & bitsE[sq])] | indexS[sq][LastBit (occ & bitsS[sq])] | indexW[sq][LastBit (occ & bitsW[sq])]; }
A rook on a1 only needs 3 bits + 3 bits = 6 bits of index, to hold the bit placements for the two directions. A rook on e4 needs 2 + 2 + 2 + 2 = 8 bits for the four directions. Remember not to include the final destination edge square.
This has the huge advantage that no 64 bit multiplication needs to be done and no shift back into place as well. Also it requires smaller lookup tables!
The returned index can be used to look up the mobility as well as the attack set.
My favourite lookup-approach seems still kindergarten for rooks and magic for bishops. Kindergarten rook takes in total 13 instructions including one imul, and two cheap lookups. Magic bishop requires three cheap lookups (from one cacheline), and/imul/shift, final <38KByte lookup.
Btw. 32-bit versions of magic and kindergarten don't need the expensive 64-bit mul call, but two 32-bit imuls. Bitscanning eight 32-bit words with 32-bit bsf/bsr is also not really cheap considering empty sets...
Gerd
Code: Select all
u32 RookIndex(u32 sq, u64 occ) {
return = indexN[File(sq)][Rank(FirstBit(occ & bitsN[sq]))]<<(Rank(sq)<<3)
| (indexE[Rank(sq)][File(FirstBit(occ & bitsE[sq]))]>>(File(sq)) & rankMask(Rank(sq)))
.....
}
e1 and e3 and shift it up.
I don't know if the smaller lookuptables outweigh the extra shifts.
EDIT: We could also shift for the file, treating everything as being on the a-File.
Tony
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: New simple and fast bitboard move generator
Michael Sherwin wrote:Would this be any better?
Edit: Or rather.Code: Select all
typedef struct { u64 bitsN; u64 bitsE; u64 bitsS; u64 bitsW; u32 idxN[64]; u32 idxE[64]; u32 idxS[64]; u32 idxW[64]; } square; u64 rookAttacks(u64 occ, square *sq) { unsigned long ln = 0, le = 7, ls = 56, lw = 63; _BitScanForward64(&ln, occ & sq->bitsN); _BitScanForward64(&le, occ & sq->bitsE); _BitScanReverse64(&ls, occ & sq->bitsS); _BitScanReverse64(&lw, occ & sq->bitsW); u32 idx = sq->idxN[ln] | sq->idxE[le] | sq->idxS[ls] | sq->idxW[lw]; return rookLookup[idx];
Code: Select all
typedef struct { u32 idxN; u32 idxE; u32 idxS; u32 idxW; } dir; typedef struct { u64 bitsN; u64 bitsE; u64 bitsS; u64 bitsW; dir idx[64]; } square; u64 rookAttacks(u64 occ, square *sq) { unsigned long ln = 0, le = 7, ls = 56, lw = 63; _BitScanForward64(&ln, occ & sq->bitsN); _BitScanForward64(&le, occ & sq->bitsE); _BitScanReverse64(&ls, occ & sq->bitsS); _BitScanReverse64(&lw, occ & sq->bitsW); u32 idx = sq->idx[ln].idxN | sq->idx[le].idxE | sq->idx[ls].idxS | sq->idx[lw].idxW; return rookLookup[idx];
Yes, the pointer may safe some leas. Why did you use redundant indices for the four directions? If a blocking square is found, as well if not found, the preinitialized squares (0,7,56,63) already imply the direction.
Code: Select all
typedef struct {
u16 idx[64];
u64 bitsN;
u64 bitsE;
u64 bitsS;
u64 bitsW;
} square;
u64 rookAttacks(u64 occ, square *sq) {
unsigned long ln = 0, le = 7, ls = 56, lw = 63;
_BitScanForward64(&ln, occ & sq->bitsN);
_BitScanForward64(&le, occ & sq->bitsE);
_BitScanReverse64(&ls, occ & sq->bitsS);
_BitScanReverse64(&lw, occ & sq->bitsW);
u32 idx = sq->idx[ln] | sq->idx[le] | sq->idx[ls] | sq->idx[lw];
return rookLookup[idx];
}
Gerd
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: New simple and fast bitboard move generator
Of course that is possible for shorter tables, but you bump up the 9 memory accesses with shifts by cl. That will make it terribly slow, since it foils the hope for parallel computation of all four directions.Tony wrote: How about something like:
The idea is that for a rook on e4 and a piece on e6, we find the bitboard ofCode: Select all
u32 RookIndex(u32 sq, u64 occ) { return = indexN[File(sq)][Rank(FirstBit(occ & bitsN[sq]))]<<(Rank(sq)<<3) | (indexE[Rank(sq)][File(FirstBit(occ & bitsE[sq]))]>>(File(sq)) & rankMask(Rank(sq))) ..... }
e1 and e3 and shift it up.
I don't know if the smaller lookuptables outweigh the extra shifts.
EDIT: We could also shift for the file, treating everything as being on the a-File.
Tony
Gerd
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: New simple and fast bitboard move generator
Hi Gerd,
Mike
To put it simple, this stuff is over my head. I have to struggle mightilly to make any of my ideas work. So, to understand someone elses ideas and code is near impossible for me--I'm not exagerating. But, I think that I understand your changes better now.Why did you use redundant indices for the four directions?
Mike
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: New simple and fast bitboard move generator
It does not work, if you look to the assembly:Gerd Isenberg wrote:
I am too short in time now - what is saw yesterday was quite strange - the vc2005 initilalized some variables on the stack, but failed to load it to the destination registers before the bitscans!? I will have a closer look later...
Gerd
The vc2005 compiler initializes the same local variable on the stack with 0,7,56,63 but fails to initialize the destination registers of the bitscans. As a result of this unspecified usage of bsf/bsr the complete function is serialized!
Code: Select all
typedef struct
{
u64 bits[4];
unsigned short index[64];
} SQUARE;
u64 rookAttacks(u64 occ, SQUARE *sq) {
unsigned long n = 0, e = 7, s = 56, w = 63;
_BitScanForward64(&n, occ & sq->bits[0]);
_BitScanForward64(&e, occ & sq->bits[1]);
_BitScanReverse64(&s, occ & sq->bits[2]);
_BitScanReverse64(&w, occ & sq->bits[3]);
u32 idx = sq->index[n] | sq->index[e] | sq->index[s] | sq->index[w];
return rookLookup[idx];
}
_TEXT SEGMENT
w$ = 8
s$ = 8
e$ = 8
n$ = 8
occ$ = 8
sq$ = 16
?rookAttacks@@YA_K_KPEAUSQUARE@@@Z PROC
00000 48 8b 02 mov rax, QWORD PTR [rdx]
00003 4c 8b d2 mov r10, rdx
00006 c7 44 24 08 00
00 00 00 mov DWORD PTR n$[rsp], 0
0000e 48 23 c1 and rax, rcx
00011 c7 44 24 08 07
00 00 00 mov DWORD PTR e$[rsp], 7
00019 c7 44 24 08 38
00 00 00 mov DWORD PTR s$[rsp], 56
00021 4c 0f bc c8 bsf r9, rax
00025 48 8b 42 08 mov rax, QWORD PTR [rdx+8]
00029 c7 44 24 08 3f
00 00 00 mov DWORD PTR w$[rsp], 63
00031 48 23 c1 and rax, rcx
00034 4c 0f bc c0 bsf r8, rax
00038 48 8b 42 10 mov rax, QWORD PTR [rdx+16]
0003c 48 23 c1 and rax, rcx
0003f 48 0f bd d0 bsr rdx, rax
00043 49 8b 42 18 mov rax, QWORD PTR [r10+24]
00047 41 0f b7 54 52
20 movzx edx, WORD PTR [r10+rdx*2+32]
0004d 48 23 c1 and rax, rcx
00050 48 0f bd c8 bsr rcx, rax
00054 41 0f b7 44 4a
20 movzx eax, WORD PTR [r10+rcx*2+32]
0005a 48 8d 0d 00 00
00 00 lea rcx, OFFSET FLAT:?rookLookup
00061 48 0b c2 or rax, rdx
00064 43 0f b7 54 42
20 movzx edx, WORD PTR [r10+r8*2+32]
0006a 48 0b c2 or rax, rdx
0006d 43 0f b7 54 4a
20 movzx edx, WORD PTR [r10+r9*2+32]
00073 48 0b c2 or rax, rdx
00076 48 8b 04 c1 mov rax, QWORD PTR [rcx+rax*8]
0007a c3 ret 0
?rookAttacks@@YA_K_KPEAUSQUARE@@@Z ENDP
Code: Select all
typedef struct
{
u64 inner[4];
u64 outer[4];
unsigned short index[64];
} SQUARE;
u64 rookAttacks(u64 occ, SQUARE *sq) {
u64 n = (occ & sq->inner[0]) | sq->outer[0];
u64 e = (occ & sq->inner[1]) | sq->outer[1];
u64 s = (occ & sq->inner[2]) | sq->outer[2];
u64 w = (occ & sq->inner[3]) | sq->outer[3];
_BitScanForward64((unsigned long*)&n, n);
_BitScanForward64((unsigned long*)&e, e);
_BitScanReverse64((unsigned long*)&s, s);
_BitScanReverse64((unsigned long*)&w, w);
u32 idx = sq->index[n] | sq->index[e] | sq->index[s] | sq->index[w];
return rookLookup[idx];
}
_TEXT SEGMENT
occ$ = 8
sq$ = 16
00000 4c 8b 42 10 mov r8, QWORD PTR [rdx+16]
00004 48 8b 42 18 mov rax, QWORD PTR [rdx+24]
00008 4c 8b 0a mov r9, QWORD PTR [rdx]
0000b 4c 8b 52 08 mov r10, QWORD PTR [rdx+8]
0000f 48 23 c1 and rax, rcx
00012 4c 23 c9 and r9, rcx
00015 48 0b 42 38 or rax, QWORD PTR [rdx+56]
00019 4c 0b 4a 20 or r9, QWORD PTR [rdx+32]
0001d 4c 23 c1 and r8, rcx
00020 4c 0b 42 30 or r8, QWORD PTR [rdx+48]
00024 4c 23 d1 and r10, rcx
00027 48 0f bd c0 bsr rax, rax
0002b 4c 0b 52 28 or r10, QWORD PTR [rdx+40]
0002f 4c 8b da mov r11, rdx
00032 49 0f bd c8 bsr rcx, r8
00036 41 0f b7 44 43
40 movzx eax, WORD PTR [r11+rax*2+64]
0003c 41 0f b7 4c 4b
40 movzx ecx, WORD PTR [r11+rcx*2+64]
00042 49 0f bc d2 bsf rdx, r10
00046 48 0b c1 or rax, rcx
00049 41 0f b7 4c 53
40 movzx ecx, WORD PTR [r11+rdx*2+64]
0004f 4d 0f bc c9 bsf r9, r9
00053 48 0b c1 or rax, rcx
00056 43 0f b7 4c 4b
40 movzx ecx, WORD PTR [r11+r9*2+64]
0005c 48 0b c1 or rax, rcx
0005f 48 8d 0d 00 00
00 00 lea rcx, OFFSET FLAT:?rookLookup@@3PA_KA ; rookLookup
00066 48 8b 04 c1 mov rax, QWORD PTR [rcx+rax*8]
0006a c3 ret 0
?rookAttacks@@YA_K_KPEAUSQUARE@@@Z ENDP ; rookAttacks
Now you need somebody to count cycles on a core 2 duo.
I better don't make any predictions
Gerd
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: New simple and fast bitboard move generator
This one with only 4KByte lookup in total is another alternative based on your bitscan idea. But it needs some more registers to keep bits[sq][dir] for a while. bits[square][dir] contains all ray bits including boarder and edges, but exclusive the square, so some empty sets for boarder squares and appropriate direction. outer[square][dir] contains exactly one boarder-bit - it may include the square itself.
Code: Select all
u64 bits[64][4];
u64 outer[64][4];
u64 rookAttacks(u64 occ, u32 sq) {
u64 n = (occ & bits[sq][0]) | outer[sq][0];
u64 e = (occ & bits[sq][1]) | outer[sq][1];
u64 s = (occ & bits[sq][2]) | outer[sq][2];
u64 w = (occ & bits[sq][3]) | outer[sq][3];
_BitScanForward64((unsigned long*)&n, n);
_BitScanForward64((unsigned long*)&e, e);
_BitScanReverse64((unsigned long*)&s, s);
_BitScanReverse64((unsigned long*)&w, w);
return bits[sq][0] ^ bits[n][0]
^ bits[sq][1] ^ bits[e][1]
^ bits[sq][2] ^ bits[s][2]
^ bits[sq][3] ^ bits[w][3]
;
}