How much trouble would it be to make a cluster version of a commercial version of say todays programs and what would be the limit as far as the cores goes?
Could I make it a 2400 core,100,000 core version, unlimited?
I am asking this question as a non-programmer.
Cluster versions of chess programs.
Moderator: Ras
-
Kingghidorah
- Posts: 224
- Joined: Thu Nov 10, 2011 5:23 pm
- Location: CT,USA
-
Edmund
- Posts: 670
- Joined: Mon Dec 03, 2007 3:01 pm
- Location: Barcelona, Spain
Re: Cluster versions of chess programs.
With current state of the art algorithms the gain for each additional core is diminishing.
According to Hyatt's Paper "The DTS high-performance parallel tree search algorithm" http://www.cis.uab.edu/hyatt/search.html
one can expect the following speedup for 1-16 cores:
According to Hyatt's Paper "The DTS high-performance parallel tree search algorithm" http://www.cis.uab.edu/hyatt/search.html
one can expect the following speedup for 1-16 cores:
Code: Select all
+-------------+-----+-----+-----+-----+------+
|# processors | 1 | 2 | 4 | 8 | 16 |
+-------------+-----+-----+-----+-----+------+
|speedup | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 |
+-------------+-----+-----+-----+-----+------+
Table 3 DTS performance results
-
abulmo
- Posts: 151
- Joined: Thu Nov 12, 2009 6:31 pm
Re: Cluster versions of chess programs.
Using Brent's theorem, the time to search a position is :Edmund wrote:With current state of the art algorithms the gain for each additional core is diminishing.
According to Hyatt's Paper "The DTS high-performance parallel tree search algorithm" http://www.cis.uab.edu/hyatt/search.html
one can expect the following speedup for 1-16 cores:Code: Select all
+-------------+-----+-----+-----+-----+------+ |# processors | 1 | 2 | 4 | 8 | 16 | +-------------+-----+-----+-----+-----+------+ |speedup | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 | +-------------+-----+-----+-----+-----+------+ Table 3 DTS performance results
T(P) = W / P + C
Where T is the spent time, P the number of processor,
W the parallelizable work & C the non-parallelizable work.
T(1) = W + C
T(infinity) = C
speedup = T(1) / T(P)
infinite speedup = (W+C)/C
So it is possible to extrapolate the above table :
Code: Select all
+-------------+-----+-----+-----+-----+------+------+
|# processors | 1 | 2 | 4 | 8 | 16 | inf |
+-------------+-----+-----+-----+-----+------+------+
|speedup | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 | 36.4 |
+-------------+-----+-----+-----+-----+------+------+
DTS performance results extrapolated
Note : W & C and thus the speedups are not absolute values, they depend on the position analyzed and the time spend on it. So number greater than x36 are probably achievable.
Richard
-
sje
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Re: Cluster versions of chess programs.
It is possible to get a nearly linear speed increase proportional to an increase in the processor count. It's not pretty, though.
You have to give up alpha/beta and go with a plain minimax approach. With the right communication protocol, it would be possible to keep many millions of processors doing useful work, all contributing near equally to the final result. Proof: This has already been done with distributed perft() calculations which get a near linear speed increase.
You have to give up alpha/beta and go with a plain minimax approach. With the right communication protocol, it would be possible to keep many millions of processors doing useful work, all contributing near equally to the final result. Proof: This has already been done with distributed perft() calculations which get a near linear speed increase.
-
Milos
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: Cluster versions of chess programs.
And then you get a program running on million cores weaker than an alpha-beta program running on 16 cores. Really great idea...sje wrote:It is possible to get a nearly linear speed increase proportional to an increase in the processor count. It's not pretty, though.
You have to give up alpha/beta and go with a plain minimax approach. With the right communication protocol, it would be possible to keep many millions of processors doing useful work, all contributing near equally to the final result. Proof: This has already been done with distributed perft() calculations which get a near linear speed increase.
-
Kingghidorah
- Posts: 224
- Joined: Thu Nov 10, 2011 5:23 pm
- Location: CT,USA
Re: Cluster versions of chess programs.
So I guess we will be waiting for the 16 core running at 100GHz a popMilos wrote:And then you get a program running on million cores weaker than an alpha-beta program running on 16 cores. Really great idea...sje wrote:It is possible to get a nearly linear speed increase proportional to an increase in the processor count. It's not pretty, though.
You have to give up alpha/beta and go with a plain minimax approach. With the right communication protocol, it would be possible to keep many millions of processors doing useful work, all contributing near equally to the final result. Proof: This has already been done with distributed perft() calculations which get a near linear speed increase.
-
sje
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Minimax cluster
Assume there's a traditional A/B searcher running 100 threads on 100 processors. As has been shown, at that point increasing the processor count by any factor will not measurably help the throughput of the system. If fact, the additional communication and synchronization costs will cause overall performance to decrease at some point.
Now consider a 10,000 processor array running plain minimax. It will search as deeply as the A/B system in about the same time. If the mean branching factor in the minimax search is about 32, then increasing the minimax searcher processor count by a factor of 32 will make the system see an extra ply. But increasing the A/B system processor count by the same factor will not help it. At some point, the minimax array will became smarter than the A/B searcher.
Admittedly, a minimax search is not very practical. But the concept shows that the statement "search performance is asymptotic to some processor count limit" is just plain wrong for a minimax cluster.
Now consider a 10,000 processor array running plain minimax. It will search as deeply as the A/B system in about the same time. If the mean branching factor in the minimax search is about 32, then increasing the minimax searcher processor count by a factor of 32 will make the system see an extra ply. But increasing the A/B system processor count by the same factor will not help it. At some point, the minimax array will became smarter than the A/B searcher.
Admittedly, a minimax search is not very practical. But the concept shows that the statement "search performance is asymptotic to some processor count limit" is just plain wrong for a minimax cluster.
-
hgm
- Posts: 28440
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Minimax cluster
Of course this could even be made practical by putting single-core alpha-beta searches on top of a plain minimax tree for the plies close to the root. Then you already start out with something that is completely competative on a single core, so that the better asymptotics quickly catch up with a smart but poorly scaling algorithm.
Last edited by hgm on Wed Dec 21, 2011 1:56 pm, edited 1 time in total.
-
Milos
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: Minimax cluster
Of course it's asymptotic as long as you are using any type of shared resources (killers, history counters, TT). And if you have zero shared resources your program will be so terribly bad that even millions of parallel instances would not help.sje wrote:Admittedly, a minimax search is not very practical. But the concept shows that the statement "search performance is asymptotic to some processor count limit" is just plain wrong for a minimax cluster.
-
Uri Blass
- Posts: 11148
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Minimax cluster
As H.G.Muller explained you do not have to have a pure minimax.Milos wrote:Of course it's asymptotic as long as you are using any type of shared resources (killers, history counters, TT). And if you have zero shared resources your program will be so terribly bad that even millions of parallel instances would not help.sje wrote:Admittedly, a minimax search is not very practical. But the concept shows that the statement "search performance is asymptotic to some processor count limit" is just plain wrong for a minimax cluster.
He suggested putting single-core alpha-beta searches on top of a plain minimax tree for the plies close to the root.
If you can use million cores then
even better may be putting some clusters of 40 cores that use alpha beta on a top of a plain minimax tree for the plies close to the root.
everyone of the clusters can simply analyze a different position that is 3 or even 4 plies after the root and based on all the scores from all the clusters you may be able to get a move.