Pi64: Raspberry Pi 2B 64 element bramble

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

A real Pi64 example

Post by sje »

From the Brits, a real Pi64 example, although perhaps with different connectivity:


The Ethernet spanning tree appears to be 8x8 instead of 4x4x4.

http://www.gchq.gov.uk/press_and_media/ ... amble.aspx
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Pi64 Plan Next Step

Post by sje »

Pi64 Plan Next Step

After some more investigation, I have found a 24 port, 1 Gbps Ethernet switch model (Netgear JFS524) available at $65 each. Four of these could replace all twenty-one of the 5 port 100 Mbps switches for less than an additional $200 total. This might be worth the cost just to avoid cable clutter and a bucket of wall wart power supplies. Even though a Raspberry Pi has its Ethernet interface limited to 100 Mbps, there would still be a slight speed gain, as the number of processing elements reachable on the same switch would increase from four to sixteen.

I've had fairly good luck with Netgear kit; only one unit has gone sour over the past dozen years. I will note here that the power plug on these doesn't have the tightest fit, so a wandering house cat attracted by the blinking status LEDs can easily knock a unit offline.

----

There's more I/O on a Raspberry Pi than just the Ethernet port: an HDMI port, a camera port, and four USB ports. More interesting is the 40 pin GPIO header with 26 usable signals for connecting to one or more other Pi boards, allowing board-to-board chit-chat at wire speed without going through the network stack. Note: Belle needed 16 such signals to implement its cell-to-cell connectivity across its 8 plus 8 chess directions. This could be very useful for SquareMap mode, although not so much for NodeMap mode.
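
As a trivial first step, that kind of signaling can be done in software over a single GPIO line via the sysfs interface Raspbian exposes. A rough sketch follows; the pin number and the one-pulse "protocol" are placeholders, and a real SquareMap scheme would want one or more dedicated lines per wired neighbor:

/* Sketch: pulse one GPIO line to signal a neighboring Pi, using the
 * Linux sysfs GPIO interface.  Pin 17 is an arbitrary free pin on the
 * 40 pin header; the receiving board watches its own input pin. */
#include <stdio.h>
#include <stdlib.h>

static void gpio_write(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL) { perror(path); exit(EXIT_FAILURE); }
    fputs(value, f);
    fclose(f);
}

int main(void)
{
    gpio_write("/sys/class/gpio/export", "17");
    gpio_write("/sys/class/gpio/gpio17/direction", "out");

    gpio_write("/sys/class/gpio/gpio17/value", "1");   /* raise the line */
    gpio_write("/sys/class/gpio/gpio17/value", "0");   /* and drop it    */

    gpio_write("/sys/class/gpio/unexport", "17");
    return 0;
}

The neighbor configures its corresponding pin as an input and polls (or waits on an edge notification) for the pulse; anything approaching wire speed would mean driving the GPIO registers directly rather than going through sysfs.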

----

I'll be ordering one of the new quad core, 1 GiB, 900 MHz Raspberry Pi 2B models for experiments soon. If warranted, I'll then get three more to see what can be expected from a very simple grid. After that, maybe on to 16 boards (64 cores) or all 64 boards (256 cores) total.

----

Idea: If I do get a 64 element bramble running, then should I greedily keep it all to myself? I could post a few video clips of the many LEDs flashing in wondrous synchronization. I could mount the assembly on my living room ceiling so only house guests and the cats could gaze upon my work.

But better would be to offer occasional use of the bramble to my fellow authors for their experiments. I'd supply a C/C++ library to handle all IP based inter-element communication and other likely required routines. I'd also set up some kind of bridge through the firewall so that the bramble could be remotely controlled over the net in real time.

An applicant could send me the C/C++ source to their Pi program for me to compile (don't worry, I won't look at it and will erase everything when requested). Copies of data files would also be sent, remembering that the available SD card storage is somewhat limited, and I'd place these along with the program executable on each element. Preferably, each element would have the same set of data files so that I don't mess up and mismatch files between the cards. I'd supply a library routine which would return an integer 0..63 giving the element ID, so each program instance would know who it was and how to address other bramble elements (host names pi00..pi63). I've also got a spare USB2 320 GB hard drive which could be connected to one of the elements.
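
A rough sketch of what that element ID routine might look like, assuming nothing more than the pi00..pi63 host naming above (the function name is made up):

/* Return the element ID 0..63 derived from the local host name
 * (pi00..pi63), or -1 if the name doesn't match that pattern. */
#include <stdio.h>
#include <unistd.h>

int bramble_element_id(void)
{
    char name[64];
    int id;

    if (gethostname(name, sizeof name) != 0)
        return -1;
    if (sscanf(name, "pi%2d", &id) != 1 || id < 0 || id > 63)
        return -1;
    return id;
}

int main(void)
{
    printf("I am element %d\n", bramble_element_id());
    return 0;
}

With the ID in hand, an instance can form any peer's host name with a simple sprintf of "pi%02d".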

It might even be possible to offer periodic reserved time on the bramble if an author has a promising engine running reliably and wants to connect it, via their own controlling program, to an Internet chess server.

For a remote operator using a Mac/iPhone/iPad with FaceTime videochat capability, a view of the bramble in operation could be transmitted.

----

Another resource: https://en.wikipedia.org/wiki/Beowulf_cluster

Beowulf tools present on some Linux distributions might not be available in the bramble's default Raspbian distribution, but perhaps they could be ported.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Pi64 Local Network

Post by sje »

Pi64 Local Network

Using static IP addresses eliminates the need for a router to handle DHCP assignments and also lets each processing element know its topological relationship with any other element. My idea is to use IPv4 instead of IPv6 to lessen latency overhead. Each element will have an /etc/hosts file which will include a host name entry for each element:

10.0.1.100 pe00 # SquareMap mode a1
10.0.1.101 pe01 # SquareMap mode b1
...
10.0.1.163 pe63 # SquareMap mode h8

When each processing element boots, it will follow the typical Unix initialization sequence. The SD cards used by the elements will all have identical contents except for identity (file /etc/hostname) and addressing information (file /etc/network/interfaces).
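
For one element, the static addressing part of /etc/network/interfaces would look something like this (the netmask and gateway values are placeholder guesses consistent with the 10.0.1.x numbering above; each card differs only in the address line):

auto eth0
iface eth0 inet static
    address 10.0.1.100
    netmask 255.255.255.0
    gateway 10.0.1.1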

Using four 24 port 1 Gbps Ethernet switches, the element topology will be:

pe00..pe15: switch 0
pe16..pe31: switch 1
pe32..pe47: switch 2
pe48..pe63: switch 3

A fifth Ethernet switch with five 1 Gbps ports will connect to each of the other four switches and also to the house LAN. This will make each element equidistant from the controlling computer, although this is not too important.

So, each element will have 15 elements connected by a path which passes through only one switch and 48 elements connected by a path which passes through three switches. All elements connect to the house LAN or controlling computer through two switches.
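
In code, an element can work out its switch, and the number of switches on the path to any peer, directly from the element IDs. A rough sketch assuming the sixteen-per-switch layout above (names are illustrative only):

/* Which of the four 24 port switches an element hangs off (0..3). */
int element_switch(int id)
{
    return id / 16;
}

/* Switches on the path between two distinct elements: one shared
 * switch, or two edge switches plus the central switch. */
int switches_between(int a, int b)
{
    return (element_switch(a) == element_switch(b)) ? 1 : 3;
}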

Each of the 24 port switches will have seven unused ports. The single five port switch will have no unused ports. To reduce costs, all switches are of the unmanaged variety and do not provide PoE (Power over Ethernet).

----

Each processing element will have the usual daemons for handling incoming ssh and rsync requests. Regular maintenance will be handled via local automated scripts and also by commands issued from the controlling computer via ssh. Because the grid may be used by a non-local controlling computer, the elements will be living in UTC time. The operating system and utilities on each element will be updated weekly via the apt-get program.

Each non-local user will have their own user ID, replicated on each element; this is to help ensure privacy, security, and non-interference among users.

While each processing element will have the capability to send data via the net to non-local users, it is expected that the amount of such data and the number of requests will be relatively small. One idea here is to have a single designated node (e.g., pe00) act as a collection and distribution agent for the entire grid and also as the lone communicator with the controlling computer.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Pi64: Benchmark estimates

Post by sje »

Pi64: Benchmark estimates

Without a Raspberry Pi model 2B available for testing, any benchmark estimates may not be very accurate. But I'll try anyway.

On a single core 1 GHz ARMv7 CPU (BeagleBone Black), my program Oscar achieves a throughput of 580 kHz (580,000 nodes per second) running the BusyFEN perft(5) benchmark (no transposition assistance, no bulk counting).

A Raspberry Pi 2B has a quad core 900 MHz ARMv7 CPU. It should show performance numbers some 3.6 times that of the above (four cores at 0.9 times the clock): about 2.09 MHz running four Oscar instances.

A Pi64 grid might have a figure 64 times higher: about 134 MHz. In comparison, my quad core 3.4 GHz Intel Core i5 (i5-4670) box can run four Oscar instances and show a 25 MHz throughput.

So, the above Core i5 box should show a throughput about 12 times that of a single Raspberry Pi 2B, and a Pi64 grid should show a throughput about 5.4 times that of the Core i5 box.
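
Collecting the arithmetic in one place (everything beyond the measured 580 kHz figure is extrapolation):

\begin{align*}
\text{Pi 2B:}\quad & 580\ \text{kHz} \times 4 \times \tfrac{0.9\ \text{GHz}}{1.0\ \text{GHz}} \approx 2.09\ \text{MHz} \\
\text{Pi64:}\quad & 2.09\ \text{MHz} \times 64 \approx 134\ \text{MHz} \\
\text{Core i5 vs. Pi 2B:}\quad & 25 / 2.09 \approx 12 \\
\text{Pi64 vs. Core i5:}\quad & 134 / 25 \approx 5.4
\end{align*}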
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: Pi64: Benchmark estimates

Post by mvk »

I usually evaluate bulk performance by calculating "nodes per currency unit" over 3 years of operation, assuming a 90% write-off over those years and full load.
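
In rough formula form (the exact sheet isn't reproduced here, so folding in the electricity cost this way and crediting the remaining 10% residual value are assumptions):

\[
\frac{\text{nodes}}{\text{currency unit}} \;\approx\; \frac{\mathrm{NPS} \times T}{0.9\,C_{\mathrm{hardware}} + T \cdot P \cdot c_{\mathrm{energy}}}
\]

where T is three years at full load, NPS the sustained nodes per second, C_hardware the purchase price, P the power draw, and c_energy the electricity price.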


Do you have any such indication for this setup compared to the i5?
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Pi64: Benchmark estimates

Post by sje »

The Core i5 machine capital cost is about 40% of the Pi64 capital cost. So on that basis, and using the other available numbers, the Pi64 capital cost per throughput unit is about 46% that of the Core i5. Note: much guesswork here.
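
For what it's worth, the 46% figure is just the 40% capital cost ratio combined with the roughly 5.4x throughput estimate from the earlier post:

\[
\frac{C_{\mathrm{Pi64}} / T_{\mathrm{Pi64}}}{C_{\mathrm{i5}} / T_{\mathrm{i5}}}
  = \frac{C_{\mathrm{Pi64}}}{C_{\mathrm{i5}}} \cdot \frac{T_{\mathrm{i5}}}{T_{\mathrm{Pi64}}}
  = \frac{1}{0.4} \cdot \frac{1}{5.4} \approx 0.46
\]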
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: Pi64: Benchmark estimates

Post by mvk »

sje wrote:The Core i5 machine capital cost is about 40% of the Pi64 capital cost. So on that basis, and using the other available numbers, the Pi64 capital cost per throughput unit is about 46% that of the Core i5. Note: much guesswork here.
I got interested and dusted off my old Pi model B rev1 (256 MB) to get some performance numbers. As a result, it is now playing on FICS as Bliep(C), in turbo mode (1 GHz) from a 4 GB SD card. Very nice project; I should have done this much earlier.

What I notice is the low number of instructions per cycle compared to the x86 based systems; the difference is about a factor of 5. Assuming your dollar figure as euros (normally a good estimate here), 3 watts, and 4x the NPS of my rev1 (because of the 4 cores), and plugging these numbers into my sheet, the Pi 2 falls just a little short of the x86 systems. Everything looks good (cost, watts), but instructions per cycle is the caveat.


Still, it would perform much better than I expected, and the difference is small enough to fall within the error bounds of the estimate. I think I'm going to order a Pi 2 to find out more...

(Mind you, the x86 numbers are a bit outdated, so that works in the other direction.)
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Pi64: Benchmark estimates

Post by Joost Buijs »

I see you are still using the data you once took on my old 980x computer, which had a very outdated GNU compiler on it.
Maybe you should run the benchmark again sometime with a more recent compiler, because there is no way an AMD 8350 can beat an Intel 980x speed-wise.

I still have that old computer somewhere in the attic; although I don't use it anymore, it is still working.
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: Pi64: Benchmark estimates

Post by mvk »

Joost Buijs wrote:I see you are still using the data you once took on my old 980x computer, which had a very outdated GNU compiler on it.
Maybe you should run the benchmark again sometime with a more recent compiler, because there is no way an AMD 8350 can beat an Intel 980x speed-wise.

I still have that old computer somewhere in the attic; although I don't use it anymore, it is still working.
There is a core count and frequency difference:
8350: 8 * 4 GHz = 32 GHz, with 18 Mnps that gives 1777 cycles/node
980x: 6 * 3.7 GHz = 22.2 GHz, with 16.5 Mnps that gives 1345 cycles/node
Which seems about right. I don't mind measuring again, of course. Let's chat on FICS.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Pi64: Benchmark estimates

Post by Joost Buijs »

mvk wrote:
Joost Buijs wrote:I see you are still using the data you once took on my old 980x computer, which had a very outdated GNU compiler on it.
Maybe you should run the benchmark again sometime with a more recent compiler, because there is no way an AMD 8350 can beat an Intel 980x speed-wise.

I still have that old computer somewhere in the attic; although I don't use it anymore, it is still working.
There is a core count and frequency difference:
8350: 8 * 4 GHz = 32 GHz, with 18 Mnps that gives 1777 cycles/node
980x: 6 * 3.7 GHz = 22.2 GHz, with 16.5 Mnps that gives 1345 cycles/node
Which seems about right. I don't mind measuring again, of course. Let's chat on FICS.
I remember that when you did this test there were some problems getting the latest GCC working, so you tested with an old GCC version which clearly performed worse than expected.
Right now I'm busy getting my engine running under Windows 10 for the RPi 2B, which just came out.
When I'm finished with that I will put a decent version of GCC on the old computer, and when you have time you may rerun the test if you wish.