New Architectures in Computer Chess

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New Architectures in Computer Chess

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote: I can't imagine using a different representation in the software part of the search, then converting to something useful in the FPGA.
This would be quite reasonable if the top few plies run in software. You can maintain the board in whatever is convenient there and then blast the position to the FPGA for the last few plies. No requirement to have the same board representation.

It could make sense to use this on the software side, but I agree with you that it makes no sense on the FPGA. And the software representation shouldn't have much effect on speed; why have the FPGA otherwise?
Except that this is not the way the Belle/Deep Thought engine worked. There was no convenient way to stuff a position into the hardware; the hardware physically maintained the chess position. Ken's software search still used the hardware make/unmake/evaluate facilities. Reading that paragraph in the thesis leaves you with a "What???"

BTW, the "hardware representation" is really an 8x8 "matrix" with a small amount of "state" for each individual square.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New Architectures in Computer Chess

Post by bob »

Zach Wegner wrote:
bob wrote:Not sure what you mean by "my paper"? My copy of the thesis? This is a bound copy that was sent to me, printed by "Gildeprint". Last numbered page is 153.

The FPGA part is certainly cloned from Belle, which means the board representation is fixed there. I can't imagine using a different representation in the software part of the search, then converting to something useful in the FPGA. The basic idea is that the FPGA hardware maintains _the_ board. It has things like Make, Unmake, Evaluate, and of course Search. It is not so easy to stuff a position into one of these things, and it really doesn't make any sense from a design perspective either.

I could be wrong, of course, but varying from the Belle approach would seem to be _very_ difficult.
Belle-style is certainly not the only feasible way to do hardware move generation. Some bitboard approaches work in the same FSM manner, while being a bit more flexible (and perhaps even easier to implement in hardware). Gerd's quad bitboard approach comes to mind, as well as Steffan Westcott's similar ideas.
It is difficult to think of a more traditional way that would fit into an FPGA design. But that's not the issue here, as Hydra _is_ based on the Belle/Deep-* design exactly. The "quad-bitboard" approach doesn't fit well in my way of thinking. The internal representation used in Belle was not a bitboard; it had 64 individual squares with local hardware for each, plus the traditional "hardware tree" to collapse N results into 1 in log2(N) cycles.
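As a software analogy (my own sketch, not Belle's actual circuitry), that "hardware tree" can be pictured as a tournament of two-input comparators that reduces 64 per-square answers to a single winner in log2(64) = 6 combining levels:

```c
#include <stdint.h>

/* Hypothetical software analogue of the Belle "hardware tree": collapse
   64 per-square results into one winner in log2(64) = 6 halving levels.
   In hardware each level would be a row of parallel two-input arbiters
   running in one gate delay; here the levels are just loop passes. */
static int collapse_max(const int vals[64])
{
    int level[64];
    for (int i = 0; i < 64; i++)
        level[i] = vals[i];
    for (int n = 64; n > 1; n /= 2)      /* 6 halving levels */
        for (int i = 0; i < n / 2; i++)
            level[i] = level[2*i] > level[2*i+1] ? level[2*i] : level[2*i+1];
    return level[0];
}
```

The point of the tree shape is that the whole reduction costs six comparator delays rather than 63 sequential comparisons.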
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: New Architectures in Computer Chess

Post by Gerd Isenberg »

bob wrote:It is difficult to think of a more traditional way that would fit into an FPGA design. But that's not the issue here, as Hydra _is_ based on the Belle/Deep-* design exactly. The "quad-bitboard" approach doesn't fit well in my way of thinking. The internal representation used in Belle was not a bitboard; it had 64 individual squares with local hardware for each, plus the traditional "hardware tree" to collapse N results into 1 in log2(N) cycles.
I would certainly consider a quad-bitboard, since it fits well in my way of thinking (I actually use it in software as my one and only board representation). Eight Kogge-Stone hardware adders, one for each sliding direction, to have all sliding and "pseudo-sliding" attacks (each distinct vertical nibble != 0 is considered a slider) in, let's say, four cycles.

MakeMove is branchless - the total position, that is, the quad-bitboard, hash keys, ep, castling, 50-move and piece counters, is 64 bytes:

Code: Select all

position.reversibleAspects   ^= move.xor();
position.irreversibleAspects += move.inc();
position.irreversibleAspects &= move.mask();
For eval, maybe some arrays of superfast DA converters and an analogue superfast neural network ;-)
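For readers who haven't seen it, here is a scalar 64-bit sketch of the Kogge-Stone fill Gerd is referring to (my own illustrative version, not his FPGA design): an occluded fill east in log2(8) = 3 parallel-prefix doubling steps, plus one final shift to turn filled squares into an attack set.

```c
#include <stdint.h>

/* Kogge-Stone occluded fill, east direction. gen = the sliders,
   pro = the empty squares (propagators). The notAFile mask keeps the
   left shifts from wrapping across ranks. Three doubling steps are
   enough to cover a whole 8-square rank. */
static const uint64_t notAFile = 0xfefefefefefefefeULL;

static uint64_t eastAttacks(uint64_t gen, uint64_t pro)
{
    pro &= notAFile;
    gen |= pro & (gen << 1);      /* reach 1 square  */
    pro &=       (pro << 1);
    gen |= pro & (gen << 2);      /* reach 3 squares */
    pro &=       (pro << 2);
    gen |= pro & (gen << 4);      /* reach 7 squares */
    return (gen << 1) & notAFile; /* fill -> attack set */
}
```

For a rook on a1 with a blocker on e1 (occupancy 0x11), this returns 0x1e, i.e. b1-e1 including the blocker. In hardware, each doubling step is one layer of AND/OR gates, which is why the per-direction latency is so small.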
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New Architectures in Computer Chess

Post by bob »

Gerd Isenberg wrote:
bob wrote:It is difficult to think of a more traditional way that would fit into an FPGA design. But that's not the issue here, as Hydra _is_ based on the Belle/Deep-* design exactly. The "quad-bitboard" approach doesn't fit well in my way of thinking. The internal representation used in Belle was not a bitboard; it had 64 individual squares with local hardware for each, plus the traditional "hardware tree" to collapse N results into 1 in log2(N) cycles.
I would certainly consider a quad-bitboard, since it fits well in my way of thinking (I actually use it in software as my one and only board representation). Eight Kogge-Stone hardware adders, one for each sliding direction, to have all sliding and "pseudo-sliding" attacks (each distinct vertical nibble != 0 is considered a slider) in, let's say, four cycles.

MakeMove is branchless - the total position, that is, the quad-bitboard, hash keys, ep, castling, 50-move and piece counters, is 64 bytes:

Code: Select all

position.reversibleAspects   ^= move.xor();
position.irreversibleAspects += move.inc();
position.irreversibleAspects &= move.mask();
For eval, maybe some arrays of superfast DA converters and an analogue superfast neural network ;-)
The problem is, the Belle design doesn't work like that. There are essentially 64 squares, each of which has data (piece on the square) and some hardware that can generate the attacks from that square. There is a priority-encoding mechanism used to answer the question "which is the most valuable piece attacked by any white piece (or black piece of course)?" or "which is the least-valuable piece attacking this specific square?" And they do their thing in parallel and then the results get de-multiplexed to a single value via some cute priority stuff.

Bitboards don't fit that model, making this a _complete_ redesign of the Belle machine, from the ground up. Which is why I said this idea makes no sense. The Belle design is a perfect fit for an FPGA approach, and it gets its speed from the inherent parallelism of each square being dealt with independently. Bitboards are apples-to-oranges in the comparison. Yes, a bitboard hardware engine is certainly doable. Whether it would be faster or slower than the Belle approach would take a lot of thought. But the complete redesign makes the idea silly, and I have no idea why it appears in this thesis.
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: New Architectures in Computer Chess

Post by Zach Wegner »

bob wrote:
Gerd Isenberg wrote:
bob wrote:It is difficult to think of a more traditional way that would fit into an FPGA design. But that's not the issue here, as Hydra _is_ based on the Belle/Deep-* design exactly. The "quad-bitboard" approach doesn't fit well in my way of thinking. The internal representation used in Belle was not a bitboard; it had 64 individual squares with local hardware for each, plus the traditional "hardware tree" to collapse N results into 1 in log2(N) cycles.
I would certainly consider a quad-bitboard, since it fits well in my way of thinking (I actually use it in software as my one and only board representation). Eight Kogge-Stone hardware adders, one for each sliding direction, to have all sliding and "pseudo-sliding" attacks (each distinct vertical nibble != 0 is considered a slider) in, let's say, four cycles.

MakeMove is branchless - the total position, that is, the quad-bitboard, hash keys, ep, castling, 50-move and piece counters, is 64 bytes:

Code: Select all

position.reversibleAspects   ^= move.xor();
position.irreversibleAspects += move.inc();
position.irreversibleAspects &= move.mask();
For eval, maybe some arrays of superfast DA converters and an analogue superfast neural network ;-)
The problem is, the Belle design doesn't work like that. There are essentially 64 squares, each of which has data (piece on the square) and some hardware that can generate the attacks from that square. There is a priority-encoding mechanism used to answer the question "which is the most valuable piece attacked by any white piece (or black piece of course)?" or "which is the least-valuable piece attacking this specific square?" And they do their thing in parallel and then the results get de-multiplexed to a single value via some cute priority stuff.

Bitboards don't fit that model, making this a _complete_ redesign of the Belle machine, from the ground up. Which is why I said this idea makes no sense. The Belle design is a perfect fit for an FPGA approach, and it gets its speed from the inherent parallelism of each square being dealt with independently. Bitboards are apples-to-oranges in the comparison. Yes, a bitboard hardware engine is certainly doable. Whether it would be faster or slower than the Belle approach would take a lot of thought. But the complete redesign makes the idea silly, and I have no idea why it appears in this thesis.
You're saying bitboards wouldn't work because they're not parallel? :) That's a new one.

In a bitboard design like this, you wouldn't have bitscans or movelists, but rather just move sets, i.e. attack bitboards. The MVV/LVA phase is basically this:

Code: Select all

for (enemy piece from queen to pawn)
{
   for (friendly piece from pawn to king)
     if (friendly piece attacks enemy piece)
       generate capture;
}
...with those loops of course unrolled in the hardware. Very parallel.

You could do some simple SEE approximations as well, to only try "good" captures in the first phase (PxQ, Qx undefended P). Seems like it would require less logic than the Belle method, and be faster too.
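To make the unrolled-loop idea concrete, here is a hedged C sketch of the MVV/LVA selection over attack sets (the piece names, array layout and helper function are my own assumptions, not Zach's design; in hardware the two loops would be fully parallel comparator networks rather than sequential code):

```c
#include <stdint.h>

/* MVV/LVA over bitboards: victims scanned queen-to-pawn, attackers
   pawn-to-king. attacks[p] = attack set of our piece type p (assumed
   precomputed); enemy[v] = opponent pieces of type v, pawn..queen.
   Returns 1 and fills *victim/*attacker if a capture exists, else 0. */
enum { PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };

static int best_capture(const uint64_t attacks[6], const uint64_t enemy[5],
                        int *victim, int *attacker)
{
    uint64_t all = 0;
    for (int p = PAWN; p <= KING; p++)
        all |= attacks[p];
    for (int v = QUEEN; v >= PAWN; v--) {       /* most valuable victim   */
        uint64_t targets = enemy[v] & all;
        if (!targets)
            continue;
        uint64_t to = targets & (0 - targets);  /* isolate one target sq  */
        for (int a = PAWN; a <= KING; a++)      /* least valuable attacker */
            if (attacks[a] & to) {
                *victim = v;
                *attacker = a;
                return 1;
            }
    }
    return 0;
}
```

With a pawn and a rook both attacking an enemy queen, this picks the queen as victim and the pawn as attacker, exactly the MVV/LVA ordering the pseudocode above describes.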
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New Architectures in Computer Chess

Post by bob »

Zach Wegner wrote:
bob wrote:
Gerd Isenberg wrote:
bob wrote:It is difficult to think of a more traditional way that would fit into an FPGA design. But that's not the issue here as Hydra _is_ based on the belle /deep-* design exactly. The "quad-bitboard" approach doesn't fit well in my way of thinking. The internal representation used in belle was not a bitboard, it had 64 individual squares with local hardware for each. Plus the traditional "hardware tree" to collapse N results into 1 in log2(N) cycles.
I would certainly consider a quad-bitboard - since it fits well in my way of thinking (I actually use it in software as one and only board representation). Eight Kogge-Stone hardware adders for each sliding direction, to have all sliding and "pseudo sliding attacks" (each distinct vertical nibble != 0 is considered a slider) in let say four cycles.

MakeMove is branchless - the total position, that is quad-bitboard, hashkeys, ep, castle, 50-move and piece counters are 64 bytes:

Code: Select all

position.reversibleAspects   ^= move.xor();
position.irreversibleAspects += move.inc();
position.irreversibleAspects &= move.mask();
For eval may be some arrays of superfast DA-converters, and an analogue superfast neural network ;-)
The problem is, the Belle design doesn't work like that. There are essentially 64 squares, each of which has data (piece on the square) and some hardware that can generate the attacks from that square. There is a priority-encoding mechanism used to answer the question "which is the most valuable piece attacked by any white piece (or black piece of course)?" or "which is the least-valuable piece attacking this specific square?" And they do their thing in parallel and then the results get de-multiplexed to a single value via some cute priority stuff.

bitboards don't fit that model, making this a _complete_ redesign of the belle machine, from the ground up. Which is why I said this idea makes no sense. The Belle design is a perfect fit for an FPGA approach, and it gets its speed from the inherent parallelism of each square being independently dealt with. Bitboards is like apples-to-oranges in the comparison. Yes, a bitboard hardware engine is certainly doable. Whether it would be faster or slower than the Belle approach would take a lot of thought. But the complete re-design makes the idea silly and I have no idea why that appears in this thesis.
You're saying bitboards wouldn't work because they're not parallel? :) That's a new one.
No, I said they would not work because they don't fit the parallel framework of the Belle chess machine architecture.


In a bitboard design like this, you wouldn't have bitscans or movelists, but rather just move sets, i.e. attack bitboards. The MVV/LVA phase is basically this:

Code: Select all

for (enemy piece from queen to pawn)
{
   for (friendly piece from pawn to king)
     if (friendly piece attacks enemy piece)
       generate capture;
}
...with those loops of course unrolled in the hardware. Very parallel.
You did catch my mention of "synchronous"? That means each operation has to finish in a fixed amount of time; loops don't happen. If you look at the Belle design, the activities for a single node are fixed in length, each and every time. There is a loop to take one move at a time, but it is unlike anything we do in normal chess engines. It can't even search the hash move or killer move first; it doesn't have hash or killer moves, and it doesn't generate moves the way we do. That was the point here.


You could do some simple SEE approximations as well, to only try "good" captures in the first phase (PxQ, Qx undefended P). Seems like it would require less logic than the Belle method, and be faster too.
SEE is inherently non-synchronous, since a variable number of pieces can bear on a square. That requires loops, which we do not want, since we always have to assume the maximum number of iterations when calculating the cycle time of the FSM.

Reading some of the Belle papers would be helpful; the design is very much unlike anything we do at present. Not because it is better, but because it is far faster on synchronous gate-array designs. These FPGAs don't have "instructions" to work with.
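For comparison, the usual software SEE has exactly the data-dependent trip count bob describes. A simplified swap-off sketch (my own, with the attacker values assumed to be supplied already in capture order, so the attack-regeneration and x-ray details are elided):

```c
/* Swap-off SEE on one square. victim_value is the piece first captured;
   attackers[i] is the value of the piece removed by recapture i+1, in
   the order the recaptures occur; n is the number of recaptures. Both
   loops have a trip count that varies with the position, which is what
   makes a fixed-cycle hardware version awkward. */
static int see_swap(int victim_value, const int attackers[], int n)
{
    int gain[32];
    gain[0] = victim_value;
    for (int i = 1; i <= n; i++)            /* build the gain ladder  */
        gain[i] = attackers[i - 1] - gain[i - 1];
    while (n > 0) {                         /* negamax backward pass  */
        if (-gain[n] < gain[n - 1])
            gain[n - 1] = -gain[n];
        n--;
    }
    return gain[0];
}
```

RxP with the pawn defended by a pawn comes out at -400; PxP similarly defended comes out at 0; with no defender the loop body never runs at all, which is precisely the variable iteration count at issue.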
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: New Architectures in Computer Chess

Post by Zach Wegner »

bob wrote:
In a bitboard design like this, you wouldn't have bitscans or movelists, but rather just move sets, i.e. attack bitboards. The MVV/LVA phase is basically this:

Code: Select all

for (enemy piece from queen to pawn)
{
   for (friendly piece from pawn to king)
     if (friendly piece attacks enemy piece)
       generate capture;
}
...with those loops of course unrolled in the hardware. Very parallel.
You did catch my mention of "synchronous"? That means each operation has to finish in a fixed amount of time; loops don't happen. If you look at the Belle design, the activities for a single node are fixed in length, each and every time. There is a loop to take one move at a time, but it is unlike anything we do in normal chess engines. It can't even search the hash move or killer move first; it doesn't have hash or killer moves, and it doesn't generate moves the way we do. That was the point here.
Yes, when I say that the loops are unrolled, I mean completely unrolled, with no breaking. That is, it tests all victims/attackers in parallel. Then there would be a similar priority mechanism, which would also extract the first bit of the victim and attacker bitboards (with the x&x-1 trick).
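For reference, the two standard bit tricks in play here (shown with a hypothetical value): x & -x isolates the lowest set bit of a bitboard, and x & (x - 1) clears it, so repeated application walks the set one square at a time.

```c
#include <stdint.h>

/* The "first bit" tricks: in two's complement, x & -x keeps only the
   lowest set bit, while x & (x - 1) removes it. Both are shallow
   carry-chain circuits in hardware, not loops. */
static uint64_t lsb_isolate(uint64_t x) { return x & (0 - x); }
static uint64_t lsb_clear(uint64_t x)   { return x & (x - 1); }
```

For x = 0x48 (bits 3 and 6 set), lsb_isolate gives 0x08 and lsb_clear gives 0x40, and the two results OR back together into the original set.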

I suppose it would look something like this:

Code: Select all

Make victim bitboards by &ing each opp. piece with all of our attacks
Find highest attacked piece out of 5, extract the first bit (to)
Get attacker bitboards by anding all piece bitboards with the victim bit
Find lowest attacking piece out of 6, extract the first bit (from)
Extract victim's bit from attacker's attack set, to make sure we generate a different move next cycle
So 11 ands, two priority encoders, two LSB isolations, and then an xor to get rid of the move.
SEE is inherently non-synchronous, since a variable number of pieces can bear on a square. That requires loops, which we do not want, since we always have to assume the maximum number of iterations when calculating the cycle time of the FSM.
I just said a simple SEE approximation. You can find whether victim pieces are defended/undefended, and if the attacker is more valuable than the victim, we can assume it's a bad capture and save it until the end. And by "saving" I mean we make two simultaneous passes over the MVV/LVA logic, one with "safe" captures, one without, and we only take the bad capture if there's no good ones. Not quite as good as real SEE, but certainly better than MVV/LVA.
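A minimal sketch of that "safe capture" test (my own formulation of Zach's idea; the value table and the precomputed attack-set argument are assumptions): a capture goes into the good pass if the victim is undefended, or if the attacker is no more valuable than the victim.

```c
#include <stdint.h>

/* Assumed centipawn values, indexed pawn..king (0..5). */
static const int values[6] = { 100, 300, 300, 500, 900, 10000 };

/* A capture is provisionally "safe" (first MVV/LVA pass) if the victim
   square is not covered by the opponent, or the attacker is no more
   valuable than the victim (PxQ, QxQ, ...). Otherwise it waits for the
   second pass. their_attacks is an assumed precomputed attack set. */
static int capture_is_safe(int attacker_type, int victim_type,
                           uint64_t victim_sq, uint64_t their_attacks)
{
    if ((victim_sq & their_attacks) == 0)
        return 1;                          /* undefended victim */
    return values[attacker_type] <= values[victim_type];
}
```

In hardware, both passes would be parallel copies of the MVV/LVA network, with a final priority mux preferring the safe pass, so the filter costs gates rather than extra cycles.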
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New Architectures in Computer Chess

Post by bob »

Zach Wegner wrote:
bob wrote:
In a bitboard design like this, you wouldn't have bitscans or movelists, but rather just move sets, i.e. attack bitboards. The MVV/LVA phase is basically this:

Code: Select all

for (enemy piece from queen to pawn)
{
   for (friendly piece from pawn to king)
     if (friendly piece attacks enemy piece)
       generate capture;
}
...with those loops of course unrolled in the hardware. Very parallel.
You did catch my mention of "synchronous"? That means each operation has to finish in a fixed amount of time; loops don't happen. If you look at the Belle design, the activities for a single node are fixed in length, each and every time. There is a loop to take one move at a time, but it is unlike anything we do in normal chess engines. It can't even search the hash move or killer move first; it doesn't have hash or killer moves, and it doesn't generate moves the way we do. That was the point here.
Yes, when I say that the loops are unrolled, I mean completely unrolled, with no breaking. That is, it tests all victims/attackers in parallel. Then there would be a similar priority mechanism, which would also extract the first bit of the victim and attacker bitboards (with the x&x-1 trick).
You can't cleanly unroll a loop that has a variable number of iterations; you have to unroll it for the maximum, which fixes the minimum cycle time at something nowhere near optimal. That's why Belle did not use this approach: otherwise everything is stretched to the worst case, and the cycle time is no better than normal hardware, because normal hardware _can_ exit the loops early.

I suppose it would look something like this:

Code: Select all

Make victim bitboards by &ing each opp. piece with all of our attacks
Find highest attacked piece out of 5, extract the first bit (to)
Get attacker bitboards by anding all piece bitboards with the victim bit
Find lowest attacking piece out of 6, extract the first bit (from)
Extract victim's bit from attacker's attack set, to make sure we generate a different move next cycle
So 11 ands, two priority encoders, two LSB isolations, and then an xor to get rid of the move.
You forgot a few "small details". We don't have "all attackers" so that gets computed in this process.
SEE is inherently non-synchronous, since a variable number of pieces can bear on a square. That requires loops, which we do not want, since we always have to assume the maximum number of iterations when calculating the cycle time of the FSM.
I just said a simple SEE approximation. You can find whether victim pieces are defended/undefended, and if the attacker is more valuable than the victim, we can assume it's a bad capture and save it until the end. And by "saving" I mean we make two simultaneous passes over the MVV/LVA logic, one with "safe" captures, one without, and we only take the bad capture if there's no good ones. Not quite as good as real SEE, but certainly better than MVV/LVA.
And again, now we have an even longer cycle time. You really do have to forget the concept of "loops" if you are going to do an FPGA implementation. "Two passes" is a no-no, because this is done in one cycle, so the cycle has to be long enough to do both, which slowly kills performance as it is applied to each loop above that you try to "unroll".

Ken's design _is_ an elegant solution to the problem, even if it makes a lot of concessions: no killer moves, no hash move, no move ordering other than searching all captures first in MVV/LVA order, no history counters/moves. Hsu and then Chrilly copied that design for a reason, rather than redoing it from scratch.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: New Architectures in Computer Chess

Post by Gerd Isenberg »

bob wrote: And again, now we have an even longer cycle time. You really do have to forget the concept of "loops" if you are going to do an FPGA implementation. "Two passes" is a no-no, because this is done in one cycle, so the cycle has to be long enough to do both, which slowly kills performance as it is applied to each loop above that you try to "unroll".

Ken's design _is_ an elegant solution to the problem, even if it makes a lot of concessions: no killer moves, no hash move, no move ordering other than searching all captures first in MVV/LVA order, no history counters/moves. Hsu and then Chrilly copied that design for a reason, rather than redoing it from scratch.
Kogge-Stone might be unrolled that way: parallel SIMD attacks in one particular direction require four (three if the sliders are already part of the occupancy) simple instructions: or, xor, sub, xor.

Code: Select all

__m128i eastAttacks (__m128i occ, __m128i rooks) {
   __m128i tmp;
   occ  = _mm_or_si128 (occ, rooks);  //  make rooks member of occupied
   tmp  = _mm_xor_si128(occ, rooks);  // occ - rooks
   tmp  = _mm_sub_epi8 (tmp, rooks);  // occ - 2*rooks
   return _mm_xor_si128(occ, tmp);    // occ ^ (occ - 2*rooks)
}
If we implement a special sub instruction on the FPGA that confines borrows to the appropriate lines for the other seven directions (as the SIMD-wise bytes do for ranks), all directions may be processed in parallel. Union and set-wise SEE is only combinatorial logic, where multiple target sets might be processed in parallel as well.
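A scalar 64-bit rendering of the same o ^ (o - 2r) subtraction trick (my own sketch; restricting the computation to one rank via a mask plays the role that the per-byte borrow of _mm_sub_epi8 plays in the SSE2 version above):

```c
#include <stdint.h>

/* East attacks of a rook along one rank via o ^ (o - 2r). The rank
   mask confines the subtraction's borrow to that rank, mimicking
   _mm_sub_epi8's per-byte borrow. occ = occupancy, rook = single
   rook bit, rankMask = the 8 bits of the rook's rank. */
static uint64_t rankEastAttacks(uint64_t occ, uint64_t rook, uint64_t rankMask)
{
    occ = (occ | rook) & rankMask;        /* make rook member of occ */
    return (occ ^ ((occ ^ rook) - rook))  /* occ ^ (occ - 2*rook)    */
           & rankMask;
}
```

A rook on a1 with a blocker on e1 (occupancy 0x11, mask 0xff) yields 0x1e, i.e. b1-e1 including the blocker; on an empty rank it yields 0xfe.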
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New Architectures in Computer Chess

Post by bob »

Gerd Isenberg wrote:
bob wrote: And again, now we have an even longer cycle time. You really do have to forget the concept of "loops" if you are going to do an FPGA implementation. "Two passes" is a no-no, because this is done in one cycle, so the cycle has to be long enough to do both, which slowly kills performance as it is applied to each loop above that you try to "unroll".

Ken's design _is_ an elegant solution to the problem, even if it makes a lot of concessions: no killer moves, no hash move, no move ordering other than searching all captures first in MVV/LVA order, no history counters/moves. Hsu and then Chrilly copied that design for a reason, rather than redoing it from scratch.
Kogge-Stone might be unrolled that way: parallel SIMD attacks in one particular direction require four (three if the sliders are already part of the occupancy) simple instructions: or, xor, sub, xor.

Code: Select all

__m128i eastAttacks (__m128i occ, __m128i rooks) {
   __m128i tmp;
   occ  = _mm_or_si128 (occ, rooks);  //  make rooks member of occupied
   tmp  = _mm_xor_si128(occ, rooks);  // occ - rooks
   tmp  = _mm_sub_epi8 (tmp, rooks);  // occ - 2*rooks
   return _mm_xor_si128(occ, tmp);    // occ ^ (occ - 2*rooks)
}
If we implement a special sub instruction on the FPGA that confines borrows to the appropriate lines for the other seven directions (as the SIMD-wise bytes do for ranks), all directions may be processed in parallel. Union and set-wise SEE is only combinatorial logic, where multiple target sets might be processed in parallel as well.
I would not disagree. But that is a _long_ way from a simple change to add a new board representation to the already-existing Belle design; it would be a complete rewrite. Another issue is memory: the Belle approach produces moves one at a time, because getting a wad of moves is unattractive when you then have to pick 'em off one at a time. I will add that a "multiply" is not exactly FPGA-friendly, since an FPGA doesn't have an intrinsic instruction set; you would get to design a finite state machine to do that multiply at the point where you need it, which would be messy.