Notes on a custom 8x8 chess processing array

sje · Post by **sje** » Tue Aug 12, 2014 6:14 pm

Notes on a custom 8x8 chess processing array

From back in the late 1980s, I started thinking about designing and building a custom 8x8 chess processing array. The array would have 64 PEs (Processing Elements) with four way connectivity organized as a torus (every PE has exactly four connections to its orthogonally adjacent PEs with toroidal wrap at the border ranks and files). The idea is to leverage relatively low cost , small SBCs (Single Board Computers) to make a test bed and possibly a production machine well suited for experiments with closely coupled chess algorithms.

In the Early Days, there were only a few candidate boards available, and the only place in the pre-web days to find out about these were the advertising pages in _Circuit Cellar Ink_, a magazine spinoff from the _Byte_ magazine column of the same name.

Three candidate boards from back then were the BASIC Stamp, the MIT Miniboard, and an Intel 8051 board preloaded with interpretive BASIC. I bought one of each of these and maybe a few other makes; the details are not remembered well today.

The BASIC Stamp was the least expensive, but also very limited in both speed and storage. The Miniboard was mostly aimed at roboticists, while the 8051 board (there were many different flavors) was rather more expensive; it had lots of storage but not a whole lot of speed.

So I tried a design based on the Zilog z80 combined with four Zilog 8 bit parallel I/O chips for handling the connectivity. Also involved were some latches and some interrupt logic. The result would have worked, but the cost was relatively high.

Ultimately, what doomed this and many other SBC array ideas was that the power/price ratio of a commodity PC was just too good and a simulation using PCs of the SBC array would outperform the physical array version.

--------

Over the past couple of years, we've seen some promising new SBCs: various Arduino models, the two different Raspberry Pi models, and the BeagleBone Black -- I have at least one of each of these. Yet in each case, it appears that a simulation using a high end PC beats the real thing.

All is not lost, however. Recently I've been considering an FPGA approach using one of the experimenter-friendly FPGA boards. This has the best chance of working, although the cost per board is relatively high with prices starting at US$80 and rising quickly with higher capacity FPGA chips.

One goal would be to implement on each board a full bitboard position structure with the complete 64x64 bit array for sq/sq attack status. Each chip would have one position environment plus the code to generate, execute, and retract moves doing full updating of the position data. This by itself would make an interesting perft() engine.

For playing chess, an 8x8 array of these boards could be organized as a one chess square to one FPGA board layout. With the boards closely synchronized, an elaborate positional evaluation based on calculating many components in parallel for all the squares and then summing these for a single result.

--------

One candidate FPGA board: http://www.makershed.com/products/mojo- ... ment-board

ZirconiumX · Post by **ZirconiumX** » Tue Aug 12, 2014 6:23 pm

Something like that would be very powerful for sure, and I reckon you could put some serious evaluation in there before you got a massive slowdown. The bottleneck would be transferring from PC to the CPA, because you have to prime each node with information, process it, then from each node, get the result.

I reckon a single FPGA could do a sixeable amount of that in parallel, and transfer would be a lot faster. It would be even faster if you performed a search in there, trying to squeeze use out of every single logic element in the array.

They would probably be much easier to use calculating perft though.

Matthew:out

sje · Post by **sje** » Tue Aug 12, 2014 7:11 pm

ZirconiumX wrote:The bottleneck would be transferring from PC to the CPA, because you have to prime each node with information, process it, then from each node, get the result.

Each PE knows where it is, so each would know how to participate in an avalanche, either merging or splitting data from/to adjacent nodes. Only one node needs to be able to communicate with the outside world, and that node can talk to all the other nodes using a maximum of 7+7 steps. (Or fewer steps if wired as a torus.)

Belle had something like this with an overlay of four 4x4 priority networks. Each quadrant of 16 cells was connected to of the four networks, then each of the networks was connected to a single 2x2 network, and the output of that 2x2 represented the best/worst/whatever value of some cell data taken form all 64 cells in only three steps.

However, many boards have an Ethernet port built-in, so a better approach might be to have a nine port Ethernet switch for each rank (or file) and then a ninth switch which connected those eight switches with one port left over for talking to the outside world. This is attractive because it's simple and because it takes advantage of fairly decent network hardware and also better handles the case where not all PEs finish a calculation at the same time. Hubs could be used instead of switches; cheaper, but with more collisions. Whit switches, UDP packets could be used which have a third of the time cost as do TCP packets.

sje · Post by **sje** » Tue Aug 12, 2014 9:54 pm

Note that it's not a requirement to build the entire 8x8 array to start programming and testing. Any smaller rectangular group with integral power of two sides will work with the physical array tessellated onto an 8x8 virtual array with time domain multiplexing.

Notes on a custom 8x8 chess processing array

Notes on a custom 8x8 chess processing array

Re: Notes on a custom 8x8 chess processing array

Re: Notes on a custom 8x8 chess processing array

Re: Notes on a custom 8x8 chess processing array