bob wrote:
I was basing my assumption on this:
I would certainly consider a quad-bitboard - since it fits well in my way of thinking (I actually use it in software as one and only board representation). Eight Kogge-Stone hardware adders for each sliding direction, to have all sliding and "pseudo sliding attacks" (each distinct vertical nibble != 0 is considered a slider) in let say four cycles.
Although I should have said 256 bit rather than 128 bit...
Yes, but that has nothing to do with multiplication either, since it is only about using vectors of bitboards, processing each bitboard in parallel.
Kogge-Stone can "emulate" the "directional subtraction" with shifts in the appropriate directions. With the usual subtraction arithmetic:
Code: Select all
eastattacks := o ^ ((o ^ r) bytewiseMinus r);
and
Code: Select all
U64 eastAttacks(U64 rooks, U64 empty) {
   empty  = empty & notAFile;      // clear the a-file so east shifts cannot wrap to the next rank
   rooks |= empty & (rooks << 1);  // propagate rooks 1 square east into empty squares
   empty  = empty & (empty << 1);  // runs of 2 empty squares
   rooks |= empty & (rooks << 2);  // propagate 2 more squares
   empty  = empty & (empty << 2);  // runs of 4 empty squares
   rooks |= empty & (rooks << 4);  // propagate 4 more squares (7 in total)
   return notAFile & (rooks << 1); // final shift: attacks exclude the slider's own square
}
yield identical results, with empty == ~o and rooks == r.
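Spelled out in plain C, the subtraction form might look like this (a sketch; bytewiseMinus is written as a simple per-rank loop here just for clarity, and eastAttacksBySub is only an illustrative name):
Code: Select all
U64 bytewiseMinus(U64 a, U64 b) {        // per-rank subtraction, no borrow between ranks
   U64 d = 0;
   for (int r = 0; r < 64; r += 8)
      d |= (((a >> r) - (b >> r)) & 0xFF) << r;
   return d;
}

U64 eastAttacksBySub(U64 rooks, U64 occ) {
   occ |= rooks;                                    // sliders must be part of the occupancy
   return occ ^ bytewiseMinus(occ ^ rooks, rooks);  // o ^ ((o ^ r) bytewiseMinus r)
}
With rooks a subset of occ and empty == ~occ, this returns the same set as the Kogge-Stone eastAttacks above.
The other directions look like this: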
Code: Select all
U64 noWeAttacks(U64 bishops, U64 empty) {
   empty    = empty & notHFile;         // clear the h-file so noWe shifts (<< 7) cannot wrap
   bishops |= empty & (bishops << 7);   // propagate bishops 1 step northwest
   empty    = empty & (empty << 7);     // runs of 2 empty squares along the diagonal
   bishops |= empty & (bishops << 14);  // propagate 2 more steps
   empty    = empty & (empty << 14);    // runs of 4 empty squares
   bishops |= empty & (bishops << 28);  // propagate 4 more steps (7 in total)
   return notHFile & (bishops << 7);    // final shift yields the attacked squares
}
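Negative rays use right shifts with the opposite wrap mask; e.g. west might look like this (a sketch along the same lines):
Code: Select all
U64 westAttacks(U64 rooks, U64 empty) {
   empty  = empty & notHFile;      // clear the h-file so west shifts (>> 1) cannot wrap
   rooks |= empty & (rooks >> 1);
   empty  = empty & (empty >> 1);
   rooks |= empty & (rooks >> 2);
   empty  = empty & (empty >> 2);
   rooks |= empty & (rooks >> 4);
   return notHFile & (rooks >> 1);
}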
Independently, Kogge-Stone stuff (as well as o ^ (o - 2r)) can be applied with SSE2 SIMD instructions on vectors of two bitboards, or even an xmm-register pair (2*128 bit) for quad bitboards. Intel has already announced Advanced Vector Extensions with 256-bit SIMD registers, as vectors of 32 bytes, 16 shorts, 8 ints or 4 bitboards. Sliding piece attacks without _any_ lookup, pure register computation, suited to building attack getters from combinatorial logic and (wired) shifts within three or four cycles of sequential logic.
I actually use the following SSE2 code twice for a quad bitboard:
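With 256-bit integer SIMD (what later shipped as AVX2), the same o ^ (o - 2r) trick would cover a whole quad bitboard in a single pass. A sketch, assuming AVX2 and a hypothetical eastAttacks256 name:
Code: Select all
#include <immintrin.h>

// sketch: east attacks for all four bitboards of a quad bitboard at once (AVX2)
__m256i eastAttacks256(__m256i occ, __m256i rooks) {
   __m256i tmp;
   occ = _mm256_or_si256 (occ, rooks); // make rooks member of occupied
   tmp = _mm256_xor_si256(occ, rooks); // occ - rooks (rooks are a subset of occ)
   tmp = _mm256_sub_epi8 (tmp, rooks); // occ - 2*rooks, bytewise per rank
   return _mm256_xor_si256(occ, tmp);  // occ ^ (occ - 2*rooks)
}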
Code: Select all
__m128i eastAttacks (__m128i occ, __m128i rooks) {
   __m128i tmp;
   occ = _mm_or_si128 (occ, rooks); // make rooks member of occupied
   tmp = _mm_xor_si128(occ, rooks); // occ - rooks
   tmp = _mm_sub_epi8 (tmp, rooks); // occ - 2*rooks
   return _mm_xor_si128(occ, tmp);  // occ ^ (occ - 2*rooks)
}
which schedules like this in assembly and improves IPC:
Code: Select all
por xmm0, xmm1 ; ensure sliders are really member of occupied
por xmm4, xmm5 ; ensure sliders are really member of occupied
movdqa xmm2, xmm0 ; o
movdqa xmm6, xmm4 ; o
pxor xmm0, xmm1 ; o - r
pxor xmm4, xmm5 ; o - r
psubb xmm0, xmm1 ; o - 2r (bytewise)
psubb xmm4, xmm5 ; o - 2r (bytewise)
pxor xmm0, xmm2 ; o ^ (o - 2r) -> bb1 : bb0 east attacks
pxor xmm4, xmm6 ; o ^ (o - 2r) -> bb3 : bb2 east attacks
but 2*7 SSE2 Kogge-Stone fills for the other seven directions (one of them is sketched below). The nice thing with that (xmm) register-intensive stuff is that it can hide the latency of the (prefetched) hash probe.
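For illustration, the north fill for a bitboard pair might look like this (a sketch; north needs no wrap mask, and _mm_slli_epi64 shifts both 64-bit lanes independently):
Code: Select all
#include <emmintrin.h>

__m128i northAttacks(__m128i rooks, __m128i empty) {
   rooks = _mm_or_si128(rooks, _mm_and_si128(empty, _mm_slli_epi64(rooks,  8)));
   empty = _mm_and_si128(empty, _mm_slli_epi64(empty,  8));
   rooks = _mm_or_si128(rooks, _mm_and_si128(empty, _mm_slli_epi64(rooks, 16)));
   empty = _mm_and_si128(empty, _mm_slli_epi64(empty, 16));
   rooks = _mm_or_si128(rooks, _mm_and_si128(empty, _mm_slli_epi64(rooks, 32)));
   return _mm_slli_epi64(rooks, 8);   // final shift yields the attacked squares
}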