bob wrote:
I was basing my assumption on this:
I would certainly consider a quad-bitboard - since it fits well in my way of thinking (I actually use it in software as one and only board representation). Eight Kogge-Stone hardware adders for each sliding direction, to have all sliding and "pseudo sliding attacks" (each distinct vertical nibble != 0 is considered a slider) in let say four cycles.
Although I should have said 256 bit rather than 128 bit...
Yes, but that has nothing to do with multiplication either, since it is only about using vectors of bitboards, processing each bitboard in parallel.
Kogge-Stone can "emulate" the "directional subtraction" with shifts in the appropriate directions. With the usual subtraction arithmetic:
Code: Select all
eastattacks := o ^ ((o ^ r) bytewiseMinus r);
and
Code: Select all
U64 eastAttacks(U64 rooks, U64 empty) {
   empty  = empty & notAFile;      // clear the a-file so east shifts cannot wrap to the next rank
   rooks |= empty & (rooks << 1);  // propagate rooks 1 square east into empty squares
   empty  = empty & (empty << 1);  // runs of 2 empty squares
   rooks |= empty & (rooks << 2);  // propagate 2 more squares
   empty  = empty & (empty << 2);  // runs of 4 empty squares
   rooks |= empty & (rooks << 4);  // propagate 4 more squares (7 in total)
   return notAFile & (rooks << 1); // final shift: attacks exclude the slider's own square
}
yield identical results, with empty == ~o and rooks == r.
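Spelled out in plain C, the subtraction form might look like this (a sketch; bytewiseMinus is written as a simple per-rank loop here just for clarity, and eastAttacksBySub is only an illustrative name):
Code: Select all
U64 bytewiseMinus(U64 a, U64 b) {        // per-rank subtraction, no borrow between ranks
   U64 d = 0;
   for (int r = 0; r < 64; r += 8)
      d |= (((a >> r) - (b >> r)) & 0xFF) << r;
   return d;
}

U64 eastAttacksBySub(U64 rooks, U64 occ) {
   occ |= rooks;                                    // sliders must be part of the occupancy
   return occ ^ bytewiseMinus(occ ^ rooks, rooks);  // o ^ ((o ^ r) bytewiseMinus r)
}
With rooks a subset of occ and empty == ~occ, this returns the same set as the Kogge-Stone eastAttacks above.
The other directions look like this: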
Code: Select all
U64 noWeAttacks(U64 bishops, U64 empty) {
   empty    = empty & notHFile;         // clear the h-file so noWe shifts (<< 7) cannot wrap
   bishops |= empty & (bishops << 7);   // propagate bishops 1 step northwest
   empty    = empty & (empty << 7);     // runs of 2 empty squares along the diagonal
   bishops |= empty & (bishops << 14);  // propagate 2 more steps
   empty    = empty & (empty << 14);    // runs of 4 empty squares
   bishops |= empty & (bishops << 28);  // propagate 4 more steps (7 in total)
   return notHFile & (bishops << 7);    // final shift yields the attacked squares
}
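Negative rays use right shifts with the opposite wrap mask; e.g. west might look like this (a sketch along the same lines):
Code: Select all
U64 westAttacks(U64 rooks, U64 empty) {
   empty  = empty & notHFile;      // clear the h-file so west shifts (>> 1) cannot wrap
   rooks |= empty & (rooks >> 1);
   empty  = empty & (empty >> 1);
   rooks |= empty & (rooks >> 2);
   empty  = empty & (empty >> 2);
   rooks |= empty & (rooks >> 4);
   return notHFile & (rooks >> 1);
}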
Independently, Kogge-Stone stuff (as well as o ^ (o - 2r)) can be applied with SSE2 SIMD instructions on vectors of two bitboards, or even an xmm-register pair (2*128 bit) for quad bitboards. Intel has already announced Advanced Vector Extensions with 256-bit SIMD registers, as vectors of 32 bytes, 16 shorts, 8 ints or 4 bitboards. Sliding piece attacks without _any_ lookup, pure register computation, suited to building attack getters from combinatorial logic and (wired) shifts within three or four cycles of sequential logic.
I actually use the following SSE2 code twice for a quad bitboard:
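With 256-bit integer SIMD (what later shipped as AVX2), the same o ^ (o - 2r) trick would cover a whole quad bitboard in a single pass. A sketch, assuming AVX2 and a hypothetical eastAttacks256 name:
Code: Select all
#include <immintrin.h>

// sketch: east attacks for all four bitboards of a quad bitboard at once (AVX2)
__m256i eastAttacks256(__m256i occ, __m256i rooks) {
   __m256i tmp;
   occ = _mm256_or_si256 (occ, rooks); // make rooks member of occupied
   tmp = _mm256_xor_si256(occ, rooks); // occ - rooks (rooks are a subset of occ)
   tmp = _mm256_sub_epi8 (tmp, rooks); // occ - 2*rooks, bytewise per rank
   return _mm256_xor_si256(occ, tmp);  // occ ^ (occ - 2*rooks)
}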
Code: Select all
__m128i eastAttacks (__m128i occ, __m128i rooks) {
   __m128i tmp;
   occ = _mm_or_si128 (occ, rooks); // make rooks member of occupied
   tmp = _mm_xor_si128(occ, rooks); // occ - rooks
   tmp = _mm_sub_epi8 (tmp, rooks); // occ - 2*rooks
   return _mm_xor_si128(occ, tmp);  // occ ^ (occ - 2*rooks)
}
which schedules like this in assembly and improves IPC:
Code: Select all
por xmm0, xmm1 ; ensure sliders are really member of occupied
por xmm4, xmm5 ; ensure sliders are really member of occupied
movdqa xmm2, xmm0 ; o
movdqa xmm6, xmm4 ; o
pxor xmm0, xmm1 ; o - r
pxor xmm4, xmm5 ; o - r
psubb xmm0, xmm1 ; o - 2r (bytewise)
psubb xmm4, xmm5 ; o - 2r (bytewise)
pxor xmm0, xmm2 ; o ^ (o - 2r) -> bb1 : bb0 east attacks
pxor xmm4, xmm6 ; o ^ (o - 2r) -> bb3 : bb2 east attacks
but 2*7 SSE2 Kogge-Stone fills for the other seven directions (one of them is sketched below). The nice thing with that (xmm) register-intensive stuff is that it can hide the latency of the (prefetched) hash probe.
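For illustration, the north fill for a bitboard pair might look like this (a sketch; north needs no wrap mask, and _mm_slli_epi64 shifts both 64-bit lanes independently):
Code: Select all
#include <emmintrin.h>

__m128i northAttacks(__m128i rooks, __m128i empty) {
   rooks = _mm_or_si128(rooks, _mm_and_si128(empty, _mm_slli_epi64(rooks,  8)));
   empty = _mm_and_si128(empty, _mm_slli_epi64(empty,  8));
   rooks = _mm_or_si128(rooks, _mm_and_si128(empty, _mm_slli_epi64(rooks, 16)));
   empty = _mm_and_si128(empty, _mm_slli_epi64(empty, 16));
   rooks = _mm_or_si128(rooks, _mm_and_si128(empty, _mm_slli_epi64(rooks, 32)));
   return _mm_slli_epi64(rooks, 8);   // final shift yields the attacked squares
}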