Assuming bsf/bsr doesn't change the target register if source is zero, and zero is never a found index here, a pre-initialization of the target-register is not necessary if it is already the source. To use only one cacheline for the occ-ands instead of four, i suggest to pack four bitboards into a struct (quad-bitboard) and to have an array of 64 quad-bitboards. Is your index unique over all squares or a char-range only per square? I tried it on the fly with 16-bit short indices, to get some assembly to inspect.
Code: Select all
u64 bits[64][4];
unsigned short indexN[64][64];
unsigned short indexE[64][64];
unsigned short indexS[64][64];
unsigned short indexW[64][64];
u64 rookLookup[...];
__forceinline
u64 rookAttacks(u64 occ, u32 sq) {
u64 n = occ & bits[sq][0];
u64 e = occ & bits[sq][1];
u64 s = occ & bits[sq][2];
u64 w = occ & bits[sq][3];
_BitScanForward64((unsigned long*)&n, n);
_BitScanForward64((unsigned long*)&e, e);
_BitScanReverse64((unsigned long*)&s, s);
_BitScanReverse64((unsigned long*)&w, w);
u32 idx = indexN[sq][n] | indexE[sq][e] | indexS[sq][s] | indexW[sq][w];
return rookLookup[idx];
}
occ$ = 8
sq$ = 16
?rookAttacks@@YA_K_KI@Z PROC
00000 4c 8d 1d 00 00
00 00 lea r11, OFFSET FLAT:__ImageBase
00007 4c 8b c9 mov r9, rcx
0000a 8b c2 mov eax, edx
0000c 48 c1 e0 05 shl rax, 5
00010 44 8b d2 mov r10d, edx
00013 4e 8b 84 18 00
00 00 00 mov r8, QWORD PTR ?bits[rax+r11]
0001b 4a 8b 94 18 08
00 00 00 mov rdx, QWORD PTR ?bits[rax+r11+8]
00023 49 c1 e2 06 shl r10, 6
00027 48 23 d1 and rdx, rcx
0002a 4c 23 c1 and r8, rcx
0002d 4a 8b 8c 18 10
00 00 00 mov rcx, QWORD PTR ?bits[rax+r11+16]
00035 4a 8b 84 18 18
00 00 00 mov rax, QWORD PTR ?bits[rax+r11+24]
0003d 49 23 c9 and rcx, r9
00040 49 23 c1 and rax, r9
00043 48 0f bd c9 bsr rcx, rcx
00047 4d 0f bc c8 bsf r9, r8
0004b 4c 0f bc c2 bsf r8, rdx
0004f 48 0f bd c0 bsr rax, rax
00053 49 03 c2 add rax, r10
00056 41 0f b7 94 43
00 00 00 00 movzx edx, WORD PTR ?indexW[r11+rax*2]
0005f 49 8d 04 0a lea rax, QWORD PTR [r10+rcx]
00063 4b 8d 0c 02 lea rcx, QWORD PTR [r10+r8]
00067 41 0f b7 8c 4b
00 00 00 00 movzx ecx, WORD PTR ?indexE[r11+rcx*2]
00070 41 0f b7 84 43
00 00 00 00 movzx eax, WORD PTR ?indexS[r11+rax*2]
00079 48 0b d0 or rdx, rax
0007c 48 0b d1 or rdx, rcx
0007f 4b 8d 0c 0a lea rcx, QWORD PTR [r10+r9]
00083 41 0f b7 8c 4b
00 00 00 00 movzx ecx, WORD PTR ?indexN[r11+rcx*2]
0008c 48 0b d1 or rdx, rcx
0008f 49 8b 84 d3 00
00 00 00 mov rax, QWORD PTR ?rookLookup[r11+rdx*8]
00097 c3 ret 0
?rookAttacks@@YA_K_KI@Z ENDP
Quite modest register usage and the independent instruction chains until the middle of the routine. May be it is improvable - at least code size wise in writing assembly. But again, so far you need to transer six cachelines by nine memory reads, instaed of two in kindergarten-bitboards out of 4.5 KByte. It would be nice to combine the four index-arrays to one, since a found 6-inner blocking square, already implies the attack direction as well. But in the case of no blocking square is found by bsf/bsr, one needs the direction.
For comparison - kindergarten rook-attacks. Shifts by cl seems the bottleneck to probably perform better ipc - but one has to consider shadow-registers and OOOE. The one-dimensional firstRankAttacks2-array saves one C-instruction with explicit *4 scaling instead of implicit *8. The resulting code was shorter and even saved a register with vc2005. Unfortunately the similar trick doesn't work to save the shr 3 with a one-dimensional aFileAttacks[8*64] array.
Code: Select all
extern u64 CACHE_ALIGN aFileAttacks[64][8]; // 4 KByte
extern u8 CACHE_ALIGN firstRankAttacks2[64*8]; // 0.5 KByte
u64 rookAttacks(u64 occ, u32 sq) {
u32 k = sq & 7;
u32 i = sq & 56;
u32 n = (u32)(occ >> i) & 2*63;
u64 d = firstRankAttacks2[4*n+k]; // [n/2][k]
u64 e = 0x0101010101010101 & (occ >> k);
u64 r = 0x0080402010080400 * e >> 58;
u64 garten = aFileAttacks[r][i >> 3];
return ( d << i) | (garten << k);
}
occ$ = 8
sq$ = 16
?rookAttacks@@YA_K_KI@Z PROC
00000 44 8b c2 mov r8d, edx
00003 83 e2 38 and edx, 56
00006 4c 8b d1 mov r10, rcx
00009 44 8b ca mov r9d, edx
0000c 49 8b d2 mov rdx, r10
0000f 41 83 e0 07 and r8d, 7
00013 41 8b c8 mov ecx, r8d
00016 48 b8 01 01 01
01 01 01 01 01 mov rax, 0101010101010101H
00020 4c 8d 1d 00 00
00 00 lea r11, OFFSET FLAT:__ImageBase
00027 48 d3 ea shr rdx, cl
0002a 48 23 d0 and rdx, rax
0002d 48 b8 00 04 08
10 20 40 80 00 mov rax, 0080402010080400H
00037 48 0f af d0 imul rdx, rax
0003b 49 8b c1 mov rax, r9
0003e 48 c1 e8 03 shr rax, 3
00042 48 c1 ea 3a shr rdx, 58
00046 48 8d 14 d0 lea rdx, QWORD PTR [rax+rdx*8]
0004a 49 8b 84 d3 00
00 00 00 mov rax, QWORD PTR ?aFileAttacks[r11+rdx*8]
00052 48 d3 e0 shl rax, cl
00055 49 8b c9 mov rcx, r9
00058 49 d3 ea shr r10, cl
0005b 41 83 e2 7e and r10d, 126
0005f 43 8d 14 90 lea edx, DWORD PTR [r8+r10*4]
00063 42 0f b6 94 1a
00 00 00 00 movzx edx, BYTE PTR ?firstRankAttacks2[rdx+r11]
0006c 48 d3 e2 shl rdx, cl
0006f 48 0b c2 or rax, rdx
00072 c3 ret 0
Gerd