Charles Roberson
Joined: 13 Mar 2006 Posts: 1697 Location: North Carolina, USA
Post subject: Adhoc Supercomputer in a Day Posted: Sat Aug 04, 2012 11:35 pm
Adhoc Supercomputer in a Day
This chronicles my experiences trying to create an adhoc supercomputer of borrowed machines in a day.
Why an adhoc supercomputer? For those of us who enter computer chess events on a regular basis
(multiple times per year), keeping hardware at competitive performance levels can be expensive.
Every year somebody has bought newer, cheaper and faster hardware, and every few years newer technology
allows for more CPUs in a machine. My idea for resolving that issue is to create a temporary adhoc
supercomputer from borrowed personal computers. If enough machines of reasonable performance can be
utilized, one might be able to remain competitive every year without a major yearly expense.
The 2012 CCT took place on the weekend of February 25. I borrowed several machines
on Thursday night and Friday. The goal was to form a cluster with them (26 processors: 4 quads,
1 oct and a dual) and run some three-year-old code that I had written and embedded
in my chess program for running on a cluster.
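As a rough illustration of the planned cluster, here is a minimal Python sketch that writes out an Open MPI style hostfile (one "hostname slots=N" line per machine) and totals the processor count. The host names are hypothetical placeholders; only the machine mix (4 quads, 1 oct, 1 dual) comes from the post.

```python
# Hypothetical host names; only the core counts come from the post.
PLANNED_NODES = [
    ("quad1", 4),
    ("quad2", 4),
    ("quad3", 4),
    ("quad4", 4),
    ("oct1", 8),
    ("dual1", 2),
]


def hostfile_lines(nodes):
    """Return one 'hostname slots=N' line per machine (Open MPI hostfile syntax)."""
    return [f"{name} slots={cores}" for name, cores in nodes]


def total_slots(nodes):
    """Total number of processors the cluster would offer to mpirun."""
    return sum(cores for _, cores in nodes)


if __name__ == "__main__":
    print("\n".join(hostfile_lines(PLANNED_NODES)))
    print("total processors:", total_slots(PLANNED_NODES))
```

A file with these lines would typically be passed to mpirun with its --hostfile option, together with -np for the number of processes.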
1) Previous test experience.
I tried running two quads and a dual in a 110 square foot room. It was during the fall
in NC USA and the computers kept rebooting due to heat. With the windows open and three
large fans running, the computers rebooted about every 5 minutes.
2) The plan.
I plan on solving the heat problem by putting the cluster together in my garage in February. If it gets
too hot, I will open the garage door. The expected outdoor temperature is 40 to 59 degrees Fahrenheit. All
machines will be configured and tested in my kitchen and then migrated to the garage.
I'll use an inexpensive Gigabit switch to connect the computers with cat 6 cables. Only the primary
computer needs access to the internet. If the system is in the garage, I'll need to use some sort
of Wi-Fi solution to connect to my Wi-Fi network, giving the primary machine internet access. Where
possible, an extra hard drive will be installed in the borrowed computers to hold Linux. For the
other machines, I'll use an Ubuntu Live CD to run Linux without mounting the owner's hard drive.
This will consume some RAM for a RAM disk, but that should be fine assuming enough memory in the
computer.
3) The network hardware.
An inexpensive Gigabit switch was used as a network backbone connected to each computer with Cat 6
cables. This worked reasonably well, but there was an issue with lack of port buffering on the switch.
More on this in the programming architecture section. To solve the internet connectivity problem, I chose a
$50 Wi-Fi device with switchable functionality: a 3-position switch lets it operate as an access point,
repeater or Wi-Fi router. I set it to repeater mode and connected it to the Gigabit switch via a Cat 6
cable. This had a major benefit over using a USB Wi-Fi adapter: I didn't need to install any
device drivers on any computer, and it gave all computers internet access, allowing me to change which
machine was the primary at will. One machine that was being considered
as a primary had two Gigabit Ethernet ports; the rest had only one.
4) Getting Linux on all the machines.
a) The machines with spare hard-drives.
This took a lot of time. There were issues with CD reader sensitivity. One computer ran the Ubuntu 10.04
Live CD fine, but installation from it failed. After burning multiple CDs and observing the failure
at the same point during installation, I switched CD burners and achieved a successful install. Neither the
second nor the third machine would run Ubuntu 10.04. My own two machines had had Linux installed for
the last several years. I created a Live/Install CD of Ubuntu 11.10, and that Live CD ran on the second and
third loaners. I had the second loaner running Linux by 3:00 AM Saturday morning.
Much of the day had been consumed by purchasing parts and diagnosing the issue with the CD burner.
b) The machines using a Linux Live CD.
On Sunday morning (around 3 AM), I tried setting up a node with an Ubuntu Live CD. This went rather
well at first. The ssh server installed as before (see next section), network connectivity
worked out easily and a simple single-processor benchmark of my chess program ran cleanly. However,
I ran into a problem. In order to run an MPI program across several computers, you need to use the
same user account on all machines, and ssh needs this as well. So, I proceeded to add a user to the newly
booted system. This didn't work because Ubuntu's Live CD allows only two user accounts and both were already
taken. The two machines that were to run this way were the weakest machines, so I dropped this idea at
4 AM Sunday morning. This left me with 4 nodes: 2 quads and an oct.
5) The machines can talk to each other.
a) ssh
When I awoke on Saturday morning (7 AM, after 3 hours of sleep), 4 machines were running Linux, but only two of
them would work with each other without requiring me to type a password. My computers had an older version
of Ubuntu, which allowed the use of rsh (an easier method to deal with). The newer versions of Ubuntu
drop rsh completely and force ssh on you. For most uses this may be the best decision, but not necessarily
in this case. After I had tried several combinations of ssh configuration files, it was 8:30 AM, leaving me
30 minutes before the first round. So, I ran a single-processor version on one of the borrowed computers,
which happened to run Telepath 10% faster than my fastest machine. It also had twice the memory
of my best machine.
b) sshd
Once the first round was well underway, I asked the group about ssh expertise. Thanks to Dr. Bob Hyatt,
Jon Dart and some others, the root problem was found: the new versions of Ubuntu don't install sshd - the
ssh server. I installed it on the machines and received examples of config files from the online
group. Eventually, I found a workable combination of ssh configuration parameters that allowed the
two borrowed machines to access each other without a login prompt. With more effort, I was able to get
my fastest machine to access the others that way. However, my second machine was never able to
communicate with the others. This left me with 3 operational machines: 2 quads and an oct.
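A check like the one below could show quickly which nodes still ask for a password. This is only a sketch under assumptions: the host names are placeholders, and it tests logins from the machine it runs on, not every pair of machines. It relies on the OpenSSH client options BatchMode=yes (fail instead of prompting) and ConnectTimeout.

```python
import subprocess


def ssh_check_command(host, user=None, timeout=5):
    """Build an ssh command that fails instead of prompting for a password."""
    target = f"{user}@{host}" if user else host
    return [
        "ssh",
        "-o", "BatchMode=yes",               # never prompt; fail if a password is needed
        "-o", f"ConnectTimeout={timeout}",   # give up quickly on unreachable hosts
        target,
        "true",                              # trivial remote command
    ]


def passwordless_ok(host, user=None, runner=subprocess.run):
    """Return True if ssh to host succeeds without any interactive prompt."""
    try:
        result = runner(ssh_check_command(host, user), capture_output=True)
    except OSError:
        # The ssh client itself is missing or could not be started.
        return False
    return result.returncode == 0


def connectivity_report(hosts, check=passwordless_ok):
    """Map each host name to True/False for passwordless login."""
    return {host: check(host) for host in hosts}


if __name__ == "__main__":
    # Placeholder host names, not the machines from the post.
    for host, ok in connectivity_report(["loaner1", "loaner2"]).items():
        print(host, "ok" if ok else "needs attention")
```

Since MPI launches processes from one node onto the others, running the same check from each candidate primary machine would cover the cases described above.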
6) Open MPI versions.
My machines had Open MPI version 1.3.3. This version was no longer available on the Open MPI web site. The
Ubuntu package manager had a fairly new version, 1.4.4, which I installed straight
from Ubuntu on the new machines. Each machine was tested individually with all of its processors, successfully.
This was very good: I didn't need to recompile the program for the new version of Open MPI. Then
I ran a full system (16 processors) test, which failed. The older version of Open MPI didn't work with the newer
version, and installing the newer version on my machine failed. Given that this was Sunday morning at 5:30 AM, I
decided that two machines with a total of 12 processors might be sufficient. This worked! However, there was
an issue which happened at 8 PM Saturday night. I decided to get some sleep and resolve it in the morning.
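A version mismatch like this could be caught before a full system test by comparing what each node reports. The sketch below assumes version text of the form "mpirun (Open MPI) 1.4.4", as printed by mpirun --version, and groups nodes by release series (major.minor). The host names are placeholders, and treating a differing series as incompatible is an assumption drawn from the failure described above, not a rule from the Open MPI documentation.

```python
import re
from collections import defaultdict


def parse_open_mpi_version(text):
    """Extract (major, minor, patch) from 'mpirun --version' style output, or None."""
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def group_by_series(versions):
    """Group host names by (major, minor) release series."""
    groups = defaultdict(list)
    for host, version in versions.items():
        groups[version[:2]].append(host)
    return dict(groups)


if __name__ == "__main__":
    # Placeholder hosts; the version numbers are the two mentioned in the post.
    reported = {
        "my_machine": parse_open_mpi_version("mpirun (Open MPI) 1.3.3"),
        "loaner1": parse_open_mpi_version("mpirun (Open MPI) 1.4.4"),
        "loaner2": parse_open_mpi_version("mpirun (Open MPI) 1.4.4"),
    }
    groups = group_by_series(reported)
    if len(groups) > 1:
        print("mixed Open MPI series:", groups)
```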
7) The Programming Architecture.
I used Open MPI and coded an implementation of the YBW (Young Brothers Wait) concept. This was my own design
and implementation, which I had coded and tested 3 years earlier.
I awoke Sunday morning at 3 AM. After the failures of running machines in Live CD mode, I decided to address
the 2 machine/12 processor mini-cluster problems from the night before. Years ago, I had noticed that the
distributed version searched 3 times as many nodes as the single CPU implementation. I thought this
was due to the loss of shared transposition tables. Some of it was. Now that I had more CPUs to test with, I
found that wasn't all of it. An 8 CPU test on the 8 processor system yielded a 10x node count explosion, and
using an oct and a quad together for 12 processors yielded a 15x node count explosion. Clearly, this isn't due
to just the lack of shared transposition tables. After a little thought, I realized the issue was in how I handled
split points for type 2 and type 3 nodes. I thought up a quick fix for this which was quite simple.
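To make the node count explosion concrete, here is a toy, single-process Python simulation. It is not the Telepath code and not the fix mentioned above. It compares plain alpha-beta with a YBW-style search in which the eldest brother is searched first and the younger brothers are then searched as if in parallel: they all start from the same bound and cannot cut each other off. The tree shape, seed and function names are illustrative assumptions.

```python
import random

INF = float("inf")


def make_tree(depth, branching, rng):
    """Build a uniform game tree as nested lists; leaves are integer scores."""
    if depth == 0:
        return rng.randint(-100, 100)
    return [make_tree(depth - 1, branching, rng) for _ in range(branching)]


def alphabeta(node, alpha, beta, counter):
    """Serial negamax alpha-beta; counter[0] tallies visited nodes."""
    counter[0] += 1
    if not isinstance(node, list):
        return node
    best = -INF
    for child in node:
        score = -alphabeta(child, -beta, -alpha, counter)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # later siblings are pruned
    return best


def ybw_simulated(node, alpha, beta, counter):
    """YBW-style search with the parallel part simulated in one process."""
    counter[0] += 1
    if not isinstance(node, list):
        return node
    # The eldest brother is searched first, on its own, to establish a bound.
    best = -ybw_simulated(node[0], -beta, -alpha, counter)
    alpha = max(alpha, best)
    if alpha >= beta:
        return best
    # Younger brothers: every one starts from the same alpha and none can
    # stop the others, which is where the extra nodes come from.
    scores = [-ybw_simulated(child, -beta, -alpha, counter) for child in node[1:]]
    return max([best] + scores)


if __name__ == "__main__":
    tree = make_tree(depth=6, branching=4, rng=random.Random(2012))
    serial, parallel = [0], [0]
    value_serial = alphabeta(tree, -INF, INF, serial)
    value_parallel = ybw_simulated(tree, -INF, INF, parallel)
    print("values:", value_serial, value_parallel)
    print("node ratio:", parallel[0] / serial[0])
```

Both searches return the same root value; the ratio of the two node counts is the search overhead. In a real cluster the siblings can exchange improved bounds and abort each other, so the actual overhead depends on how and where the split points are chosen.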
8) Performance
Before the first round on Sunday morning, the 12 processor distributed system ran benchmarks around
30% faster than the single processor version. Of course, this is less than hoped for. However, each processor
of the 8 processor system had been benchmarked at 75% of the nodes per second of my best system.
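For a sense of scale, the usual parallel efficiency figure (speedup divided by processor count) can be worked out from these numbers. The sketch below reads "around 30% faster" as a speedup of 1.30, which is an interpretation of the benchmark statement, and it ignores the fact that the processors were not all equally fast.

```python
def parallel_efficiency(speedup, processors):
    """Speedup per processor: 1.0 would mean perfect linear scaling."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


if __name__ == "__main__":
    # 1.30 is an assumed reading of "around 30% faster"; 12 is from the post.
    efficiency = parallel_efficiency(1.30, 12)
    print(f"efficiency: {efficiency:.1%}")
```

Under those assumptions the efficiency comes out at roughly 11%, which fits the remark that the result was less than hoped for.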
Conclusion
In my academic and professional life, I have had dry runs in advance of events for any systems that
needed to work on a given day. In situations like this, it is difficult to borrow all the machines for more
than a day or two. Getting them for two consecutive weekends would be excellent, but impractical. On the
other hand, everyone who loaned a machine agreed to do so again for the next event and to give me an extra
day. Given the successful configuration of the 12 processor/two machine system, I am optimistic that the
next effort will meet with more success. Also, I never made it to the garage, and my wife was a very good
sport about all the equipment in the kitchen.
I'll not try this for the ACCA World Computer Rapid Chess Championships because it is in July, when
high temperatures are expected. Maybe I'll try for the Pan American event in October/November. Certainly the next CCT will
be a prime time to try again.
Subject | Author | Date/Time
Adhoc Supercomputer in a Day | Charles Roberson | Sat Aug 04, 2012 11:35 pm
Re: Adhoc Supercomputer in a Day | Joshua Shriver | Mon Aug 06, 2012 4:23 pm
Re: Adhoc Supercomputer in a Day | Robert Hyatt | Mon Aug 06, 2012 6:58 pm
Re: Adhoc Supercomputer in a Day | Joshua Shriver | Mon Aug 06, 2012 7:01 pm
Re: Adhoc Supercomputer in a Day | Robert Hyatt | Mon Aug 06, 2012 7:20 pm
Re: Adhoc Supercomputer in a Day | Ricardo Barreira | Mon Aug 06, 2012 7:31 pm
Re: Adhoc Supercomputer in a Day | Vincent Diepeveen | Tue Aug 07, 2012 1:45 am
Re: Adhoc Supercomputer in a Day | Jon Dart | Thu Aug 16, 2012 2:46 pm
Re: Adhoc Supercomputer in a Day | Vincent Diepeveen | Thu Aug 16, 2012 3:06 pm
Re: Adhoc Supercomputer in a Day | Jon Dart | Thu Aug 16, 2012 5:21 pm
Re: Adhoc Supercomputer in a Day | Ricardo Barreira | Fri Aug 17, 2012 11:09 am
Re: Adhoc Supercomputer in a Day | Vincent Diepeveen | Sat Aug 18, 2012 7:41 am
Re: Adhoc Supercomputer in a Day | Ricardo Barreira | Sat Aug 18, 2012 10:46 am
Re: Adhoc Supercomputer in a Day | Vincent Diepeveen | Sat Aug 18, 2012 3:13 pm
Re: Adhoc Supercomputer in a Day | Ricardo Barreira | Sat Aug 18, 2012 5:42 pm
Re: Adhoc Supercomputer in a Day | Vincent Diepeveen | Mon Aug 20, 2012 8:27 am
Re: Adhoc Supercomputer in a Day | Ricardo Barreira | Mon Aug 20, 2012 10:51 am
Re: Adhoc Supercomputer in a Day | Jon Dart | Tue Aug 21, 2012 1:26 am
Re: Adhoc Supercomputer in a Day | Vincent Diepeveen | Tue Aug 21, 2012 9:31 am