FPGA chess

stegemma · Post by **stegemma** » Thu Nov 27, 2014 8:07 pm

I've done a very simple estimate of how fast an FPGA could be. A starter board that costs about $30 has a 100 MHz clock. If you can do each step in one clock cycle, you can do:

1 - select the first move
2 - make the move
3 - next ply
...last ply:
4 - evaluate
5 - go back one ply
6 - undo last move

Of course this is oversimplified, but it means that you need 5 to 6 clock cycles per move. This is the minimum count, I think, but maybe there are better designs than the one I can imagine (I'm not an engineer).

If we can do a full make/unmake in only 5 clock cycles, then at 100 MHz we can reach at most 20M nodes per second (100 MHz / 5 cycles), for the cheapest FPGA board that I've found. That is not so impressive compared to today's CPUs. What is interesting is that we can build multiple processing units in one FPGA that work in parallel. I don't know the FPGA limits or how many "co-processors" we can fit on it, because that depends on the complexity of our design and on the FPGA itself.
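The arithmetic behind that estimate can be sketched in a few lines of Python. The function name `nodes_per_second` and the count of 4 parallel units are assumptions made for illustration only; the result is an upper bound that ignores move generation, memory access and any coordination between units.

```python
def nodes_per_second(clock_hz: float, cycles_per_node: int, units: int = 1) -> float:
    """Upper bound on nodes/s if every step takes exactly one clock cycle."""
    return units * clock_hz / cycles_per_node


# The 5 and 6 cycle counts come from the step list in the post.
for cycles in (5, 6):
    nps = nodes_per_second(100e6, cycles)
    print(f"{cycles} cycles/node at 100 MHz: {nps / 1e6:.1f} Mnps")

# Hypothetical: 4 independent search units in one FPGA, no overhead assumed.
parallel = nodes_per_second(100e6, 5, units=4)
print(f"4 units, 5 cycles/node: {parallel / 1e6:.1f} Mnps")
```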

Even if this speed is not so impressive, it would be more than anything I can personally do in C++, for now.

matthewlai · Post by **matthewlai** » Fri Nov 28, 2014 1:47 am

stegemma wrote: I've done a very simple estimate of how fast an FPGA could be. A starter board that costs about $30 has a 100 MHz clock. If you can do each step in one clock cycle, you can do:
1 - select the first move
2 - make the move
3 - next ply
...last ply:
4 - evaluate
5 - go back one ply
6 - undo last move
Of course this is oversimplified, but it means that you need 5 to 6 clock cycles per move. This is the minimum count, I think, but maybe there are better designs than the one I can imagine (I'm not an engineer).

If we can do a full make/unmake in only 5 clock cycles, then at 100 MHz we can reach at most 20M nodes per second (100 MHz / 5 cycles), for the cheapest FPGA board that I've found. That is not so impressive compared to today's CPUs. What is interesting is that we can build multiple processing units in one FPGA that work in parallel. I don't know the FPGA limits or how many "co-processors" we can fit on it, because that depends on the complexity of our design and on the FPGA itself.

Even if this speed is not so impressive, it would be more than anything I can personally do in C++, for now.

That's why I don't believe it's a good idea to do alpha-beta in hardware.

On the other hand, eval() takes about 1/3 of the time in a chess engine for example, and it's very simple to parallelize. It should be possible to do a pretty complicated eval() in a few clock cycles.

With search, it's possible to search multiple children at the same time by mapping them into duplicated hardware, but then alpha-beta efficiency decreases, etc.
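A toy model can illustrate why searching children in parallel hurts alpha-beta. The sketch below is not chess and not anyone's actual hardware design: it runs negamax alpha-beta on a small random tree (the constants `BRANCH`, `DEPTH`, the seed and the `width` parameter are all assumptions) and searches `width` children per batch. Children in the same batch cannot benefit from each other's improved bounds, and a cutoff is only noticed after the whole batch has finished.

```python
import random

BRANCH, DEPTH = 4, 6  # hypothetical toy tree, not a chess position
rng = random.Random(1)
LEAVES = [rng.randint(-100, 100) for _ in range(BRANCH ** DEPTH)]


def search(index, depth, alpha, beta, width, counter):
    """Negamax alpha-beta that searches `width` children per batch.

    All children in a batch use the alpha bound from the start of the
    batch, modelling duplicated hardware units running in lockstep.
    """
    if depth == 0:
        counter[0] += 1  # count evaluated leaves
        return LEAVES[index]
    best = -float("inf")
    children = [index * BRANCH + i for i in range(BRANCH)]
    for start in range(0, BRANCH, width):
        batch_alpha = alpha  # bound is frozen for the whole batch
        for child in children[start:start + width]:
            score = -search(child, depth - 1, -beta, -batch_alpha, width, counter)
            best = max(best, score)
        alpha = max(alpha, best)
        if alpha >= beta:  # cutoff is only checked between batches
            break
    return best


def run(width):
    """Return (root value, number of leaves evaluated) for a batch width."""
    counter = [0]
    value = search(0, DEPTH, -float("inf"), float("inf"), width, counter)
    return value, counter[0]


for width in (1, 2, 4):
    value, leaves = run(width)
    print(f"width={width}: value={value}, leaves evaluated={leaves}")
```

With `width=1` this is ordinary sequential alpha-beta. With `width` equal to the branching factor no cutoff can ever happen, so the search degenerates into full minimax over all `BRANCH ** DEPTH` leaves. The root value is the same in every case; only the amount of work changes.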

The clock/oscillator you have on the board actually has little to do with how fast you can clock your design. All modern FPGAs have integrated PLL circuits that allow you to generate a very wide range of clocks from a fixed input clock.

The actual maximum speed you can achieve with your design depends on design complexity (especially number of layers of logic between flip flops), FPGA speed grade, FPGA architecture, etc.

Most careful designs on low cost FPGAs can go to 200 MHz or so. At 250 and above you have to be extremely careful, making sure EVERYTHING is pipelined, etc.

Even at 20Mnps I would say that's still quite a feat, considering a low cost FPGA draws about 0.5W, and a CPU searching 20Mnps would be drawing on the order of 100W.
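To put the power comparison in numbers, here is a rough sketch using only the figures quoted above (20 Mnps, about 0.5 W for the FPGA, on the order of 100 W for the CPU). The function name `nodes_per_joule` is made up for this example, and the inputs are order-of-magnitude estimates rather than measurements.

```python
def nodes_per_joule(nodes_per_second: float, watts: float) -> float:
    """Search work done per joule of energy (1 W = 1 J/s)."""
    return nodes_per_second / watts


fpga = nodes_per_joule(20e6, 0.5)   # low-cost FPGA estimate from the post
cpu = nodes_per_joule(20e6, 100.0)  # CPU estimate from the post

print(f"FPGA: {fpga / 1e6:.1f} M nodes per joule")
print(f"CPU:  {cpu / 1e6:.1f} M nodes per joule")
print(f"Ratio: about {fpga / cpu:.0f}x in favor of the FPGA")
```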

BeyondCritics · Post by **BeyondCritics** » Fri Nov 28, 2014 10:55 am

Thank you for the explanation.
I have looked up the board you recommended, but unfortunately I am absolutely new to this subject and I feel my first board should be something designed for beginners. Working with it should not be too time-consuming either. Do you have something to recommend?

Joost Buijs · Post by **Joost Buijs** » Fri Nov 28, 2014 11:50 am

matthewlai wrote: That's why I don't believe it's a good idea to do alpha-beta in hardware.

On the other hand, eval() takes about 1/3 of the time in a chess engine for example, and it's very simple to parallelize. It should be possible to do a pretty complicated eval() in a few clock cycles.

The problem with doing only the evaluation in hardware is that you have to send information about the current position to the device. Standard PC buses like PCI-Express are way too slow and have too much latency for this.

To be successful you have to design a very fast custom interface to the processor doing the search. This is the main reason earlier designs did a few plies of search in hardware.

You also need a very fast processor if you want to send a new position to the device every 25 ns for evaluation.
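A back-of-the-envelope budget shows what one position every 25 ns would demand from the link. Only the 25 ns interval comes from the post. The position size (12 piece bitboards of 8 bytes plus 8 bytes of state) and the 1 µs round-trip time are placeholder assumptions, not measured values for PCI-Express or any other bus.

```python
POSITION_INTERVAL_S = 25e-9       # from the post: one position every 25 ns
BYTES_PER_POSITION = 12 * 8 + 8   # assumption: 12 bitboards + 8 bytes of state
ROUND_TRIP_S = 1e-6               # assumption: placeholder link round-trip time


def positions_per_second(interval_s: float) -> float:
    """Evaluation requests per second for a given spacing between them."""
    return 1.0 / interval_s


def bandwidth_bytes_per_second(interval_s: float, bytes_per_position: int) -> float:
    """Raw payload bandwidth needed, ignoring protocol overhead."""
    return positions_per_second(interval_s) * bytes_per_position


def requests_in_flight(interval_s: float, round_trip_s: float) -> float:
    """How many requests must overlap to hide the round-trip latency."""
    return round_trip_s / interval_s


rate = positions_per_second(POSITION_INTERVAL_S)
bandwidth = bandwidth_bytes_per_second(POSITION_INTERVAL_S, BYTES_PER_POSITION)
overlap = requests_in_flight(POSITION_INTERVAL_S, ROUND_TRIP_S)

print(f"Required rate: {rate / 1e6:.0f} M positions/s")
print(f"Payload bandwidth: {bandwidth / 1e9:.2f} GB/s")
print(f"Blocking calls at the assumed round trip: {1 / ROUND_TRIP_S / 1e6:.0f} M/s")
print(f"Requests that must be in flight: {overlap:.0f}")
```

Under these assumptions the latency is the harder constraint: a search that waits for each evaluation before continuing is limited by the round trip, however much bandwidth the bus has.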

It looks nice, but in practice it is hardly doable.

Maybe in the future it will be, when Intel comes out with a processor with an integrated FPGA.

stegemma · Post by **stegemma** » Fri Nov 28, 2014 11:56 am

I didn't know that the internal clock can differ from the external one, thanks for the information (I must study the FPGA architecture, for sure); this could mean a doubled speed.

The power consumption is another interesting point in favor of an FPGA, and maybe also the possibility of embedding the "chess machine" in a homemade chessboard.

It becomes very interesting...

matthewlai · Post by **matthewlai** » Fri Nov 28, 2014 3:05 pm

Joost Buijs wrote: The problem with doing only the evaluation in hardware is that you have to send information about the current position to the device. Standard PC buses like PCI-Express are way too slow and have too much latency for this.

To be successful you have to design a very fast custom interface to the processor doing the search. This is the main reason earlier designs did a few plies of search in hardware.

You also need a very fast processor if you want to send a new position to the device every 25 ns for evaluation.

It looks nice, but in practice it is hardly doable.

Maybe in the future it will be, when Intel comes out with a processor with an integrated FPGA.

Yep, we are specifically talking about tightly coupled architectures (FPGA and CPU on the same chip): either something like the Intel chip, or the Xilinx Zynq (ARM core + FPGA fabric), or a soft CPU synthesized on FPGA fabric, with custom instructions or memory-mapped peripherals.

matthewlai · Post by **matthewlai** » Fri Nov 28, 2014 3:13 pm

BeyondCritics wrote: Thank you for the explanation.
I have looked up the board you recommended, but unfortunately I am absolutely new to this subject and I feel my first board should be something designed for beginners. Working with it should not be too time-consuming either. Do you have something to recommend?

In that case I would recommend the Altera DE1 or DE2. The Cyclone II FPGA on there is a bit dated, but they are very popular choices for intro-to-FPGA courses in universities all around the world, so there should be a lot of resources available online specifically for those boards.

It's $150 though, while the Digilent ZYBO board is only $190 and contains a much newer and more capable FPGA, together with a 650 MHz dual-core ARM Cortex-A9.

Be sure to look on eBay if you are looking for the DE1 or DE2. There should be students trying to sell their boards after they are done with their course.

Milos · Post by **Milos** » Fri Nov 28, 2014 5:40 pm

Joost Buijs wrote: The problem with doing only the evaluation in hardware is that you have to send information about the current position to the device. Standard PC buses like PCI-Express are way too slow and have too much latency for this.

Not necessarily, check out the IBM POWER8+CAPI solution. PCIe Gen 3 is integrated into the processor through a so-called coherence bus, so the accelerator has the same address space as the main processor, i.e. there is no driver or OS overhead, and accelerators participate in "locks" like normal threads, yielding very low latency.
However, this is a high-end solution, so certainly not for hobbyists and chess enthusiasts (well, maybe for some quite rich chess enthusiasts).

BeyondCritics · Post by **BeyondCritics** » Fri Nov 28, 2014 7:31 pm

That seems to fit the purpose perfectly, thank you.

Joost Buijs · Post by **Joost Buijs** » Fri Nov 28, 2014 10:12 pm

Milos wrote:
Joost Buijs wrote: The problem with doing only the evaluation in hardware is that you have to send information about the current position to the device. Standard PC buses like PCI-Express are way too slow and have too much latency for this.
Not necessarily, check out the IBM POWER8+CAPI solution. PCIe Gen 3 is integrated into the processor through a so-called coherence bus, so the accelerator has the same address space as the main processor, i.e. there is no driver or OS overhead, and accelerators participate in "locks" like normal threads, yielding very low latency.
However, this is a high-end solution, so certainly not for hobbyists and chess enthusiasts (well, maybe for some quite rich chess enthusiasts).

When you want to do this professionally there are some solutions, but they are way out of reach for hobbyists.
I have always been very enthusiastic about computer chess, and in the 38 years that I have been busy with it I have spent about 70k on hardware, mainly for this purpose.
Now that I'm retired I have to be careful about where I spend my money, so I have to step back a little.

Recently I bought a new mainboard with a 5960X and I suppose I will have to live with this for the next 5 years.
