Cluster versions of chess programs.

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
Kingghidorah
Posts: 224
Joined: Thu Nov 10, 2011 5:23 pm
Location: CT,USA

Cluster versions of chess programs.

Post by Kingghidorah »

How much trouble would it be to make a cluster version of a commercial version of say todays programs and what would be the limit as far as the cores goes?

Could I make it a 2400 core,100,000 core version, unlimited?

I am asking this question as a non-programmer.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Cluster versions of chess programs.

Post by Edmund »

With current state of the art algorithms the gain for each additional core is diminishing.

According to Hyatt's Paper "The DTS high-performance parallel tree search algorithm" http://www.cis.uab.edu/hyatt/search.html

one can expect the following speedup for 1-16 cores:

Code: Select all

          +-------------+-----+-----+-----+-----+------+
          |# processors |  1  |  2  |  4  |  8  |  16  |
          +-------------+-----+-----+-----+-----+------+
          |speedup      | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 |
          +-------------+-----+-----+-----+-----+------+
                 Table 3 DTS performance results

abulmo
Posts: 151
Joined: Thu Nov 12, 2009 6:31 pm

Re: Cluster versions of chess programs.

Post by abulmo »

Edmund wrote:With current state of the art algorithms the gain for each additional core is diminishing.

According to Hyatt's Paper "The DTS high-performance parallel tree search algorithm" http://www.cis.uab.edu/hyatt/search.html

one can expect the following speedup for 1-16 cores:

Code: Select all

          +-------------+-----+-----+-----+-----+------+
          |# processors |  1  |  2  |  4  |  8  |  16  |
          +-------------+-----+-----+-----+-----+------+
          |speedup      | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 |
          +-------------+-----+-----+-----+-----+------+
                 Table 3 DTS performance results

Using Brent's theorem, the time to search a position is :
T(P) = W / P + C
Where T is the spent time, P the number of processor,
W the parallelizable work & C the non-parallelizable work.
T(1) = W + C
T(infinity) = C
speedup = T(1) / T(P)
infinite speedup = (W+C)/C
So it is possible to extrapolate the above table :

Code: Select all

          +-------------+-----+-----+-----+-----+------+------+
          |# processors |  1  |  2  |  4  |  8  |  16  | inf  |
          +-------------+-----+-----+-----+-----+------+------+
          |speedup      | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 | 36.4 |
          +-------------+-----+-----+-----+-----+------+------+
                 DTS performance results extrapolated
So, even with an infinite number of CPU, we won't get such a big speedup, only something like x36.

Note : W & C and thus the speedups are not absolute values, they depend on the position analyzed and the time spend on it. So number greater than x36 are probably achievable.
Richard
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Cluster versions of chess programs.

Post by sje »

It is possible to get a nearly linear speed increase proportional to an increase in the processor count. It's not pretty, though.

You have to give up alpha/beta and go with a plain minimax approach. With the right communication protocol, it would be possible to keep many millions of processors doing useful work, all contributing near equally to the final result. Proof: This has already been done with distributed perft() calculations which get a near linear speed increase.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Cluster versions of chess programs.

Post by Milos »

sje wrote:It is possible to get a nearly linear speed increase proportional to an increase in the processor count. It's not pretty, though.

You have to give up alpha/beta and go with a plain minimax approach. With the right communication protocol, it would be possible to keep many millions of processors doing useful work, all contributing near equally to the final result. Proof: This has already been done with distributed perft() calculations which get a near linear speed increase.
And then you get a program running on million cores weaker than an alpha-beta program running on 16 cores. Really great idea... :roll:
User avatar
Kingghidorah
Posts: 224
Joined: Thu Nov 10, 2011 5:23 pm
Location: CT,USA

Re: Cluster versions of chess programs.

Post by Kingghidorah »

Milos wrote:
sje wrote:It is possible to get a nearly linear speed increase proportional to an increase in the processor count. It's not pretty, though.

You have to give up alpha/beta and go with a plain minimax approach. With the right communication protocol, it would be possible to keep many millions of processors doing useful work, all contributing near equally to the final result. Proof: This has already been done with distributed perft() calculations which get a near linear speed increase.
And then you get a program running on million cores weaker than an alpha-beta program running on 16 cores. Really great idea... :roll:
So I guess we will be waiting for the 16 core running at 100GHz a pop
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Minimax cluster

Post by sje »

Assume there's a traditional A/B searcher running 100 threads on 100 processors. As has been shown, at that point increasing the processor count by any factor will not measurably help the throughput of the system. If fact, the additional communication and synchronization costs will cause overall performance to decrease at some point.

Now consider a 10,000 processor array running plain minimax. It will search as deeply as the A/B system in about the same time. If the mean branching factor in the minimax search is about 32, then increasing the minimax searcher processor count by a factor of 32 will make the system see an extra ply. But increasing the A/B system processor count by the same factor will not help it. At some point, the minimax array will became smarter than the A/B searcher.

Admittedly, a minimax search is not very practical. But the concept shows that the statement "search performance is asymptotic to some processor count limit" is just plain wrong for a minimax cluster.
User avatar
hgm
Posts: 28440
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Minimax cluster

Post by hgm »

Of course this could even be made practical by putting single-core alpha-beta searches on top of a plain minimax tree for the plies close to the root. Then you already start out with something that is completely competative on a single core, so that the better asymptotics quickly catch up with a smart but poorly scaling algorithm.
Last edited by hgm on Wed Dec 21, 2011 1:56 pm, edited 1 time in total.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Minimax cluster

Post by Milos »

sje wrote:Admittedly, a minimax search is not very practical. But the concept shows that the statement "search performance is asymptotic to some processor count limit" is just plain wrong for a minimax cluster.
Of course it's asymptotic as long as you are using any type of shared resources (killers, history counters, TT). And if you have zero shared resources your program will be so terribly bad that even millions of parallel instances would not help.
Uri Blass
Posts: 11148
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Minimax cluster

Post by Uri Blass »

Milos wrote:
sje wrote:Admittedly, a minimax search is not very practical. But the concept shows that the statement "search performance is asymptotic to some processor count limit" is just plain wrong for a minimax cluster.
Of course it's asymptotic as long as you are using any type of shared resources (killers, history counters, TT). And if you have zero shared resources your program will be so terribly bad that even millions of parallel instances would not help.
As H.G.Muller explained you do not have to have a pure minimax.

He suggested putting single-core alpha-beta searches on top of a plain minimax tree for the plies close to the root.

If you can use million cores then
even better may be putting some clusters of 40 cores that use alpha beta on a top of a plain minimax tree for the plies close to the root.

everyone of the clusters can simply analyze a different position that is 3 or even 4 plies after the root and based on all the scores from all the clusters you may be able to get a move.