Cluster versions of chess programs.

Kingghidorah · Post by **Kingghidorah** » Tue Dec 20, 2011 6:34 pm

How much trouble would it be to make a cluster version of a commercial version of say todays programs and what would be the limit as far as the cores goes?

Could I make it a 2400 core,100,000 core version, unlimited?

I am asking this question as a non-programmer.

Edmund · Post by **Edmund** » Tue Dec 20, 2011 10:04 pm

With current state of the art algorithms the gain for each additional core is diminishing.

According to Hyatt's Paper "The DTS high-performance parallel tree search algorithm" http://www.cis.uab.edu/hyatt/search.html

one can expect the following speedup for 1-16 cores:

Code: Select all

          +-------------+-----+-----+-----+-----+------+
          |# processors |  1  |  2  |  4  |  8  |  16  |
          +-------------+-----+-----+-----+-----+------+
          |speedup      | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 |
          +-------------+-----+-----+-----+-----+------+
                 Table 3 DTS performance results

abulmo · Post by **abulmo** » Wed Dec 21, 2011 12:03 am

Edmund wrote:With current state of the art algorithms the gain for each additional core is diminishing.

According to Hyatt's Paper "The DTS high-performance parallel tree search algorithm" http://www.cis.uab.edu/hyatt/search.html

one can expect the following speedup for 1-16 cores:
Code: Select all
          +-------------+-----+-----+-----+-----+------+
          |# processors |  1  |  2  |  4  |  8  |  16  |
          +-------------+-----+-----+-----+-----+------+
          |speedup      | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 |
          +-------------+-----+-----+-----+-----+------+
                 Table 3 DTS performance results

Using Brent's theorem, the time to search a position is :
T(P) = W / P + C
Where T is the spent time, P the number of processor,
W the parallelizable work & C the non-parallelizable work.
T(1) = W + C
T(infinity) = C
speedup = T(1) / T(P)
infinite speedup = (W+C)/C
So it is possible to extrapolate the above table :

Code: Select all

          +-------------+-----+-----+-----+-----+------+------+
          |# processors |  1  |  2  |  4  |  8  |  16  | inf  |
          +-------------+-----+-----+-----+-----+------+------+
          |speedup      | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 | 36.4 |
          +-------------+-----+-----+-----+-----+------+------+
                 DTS performance results extrapolated

So, even with an infinite number of CPU, we won't get such a big speedup, only something like x36.

Note : W & C and thus the speedups are not absolute values, they depend on the position analyzed and the time spend on it. So number greater than x36 are probably achievable.

sje · Post by **sje** » Wed Dec 21, 2011 1:27 am

It is possible to get a nearly linear speed increase proportional to an increase in the processor count. It's not pretty, though.

You have to give up alpha/beta and go with a plain minimax approach. With the right communication protocol, it would be possible to keep many millions of processors doing useful work, all contributing near equally to the final result. Proof: This has already been done with distributed perft() calculations which get a near linear speed increase.

Milos · Post by **Milos** » Wed Dec 21, 2011 3:09 am

sje wrote:It is possible to get a nearly linear speed increase proportional to an increase in the processor count. It's not pretty, though.

You have to give up alpha/beta and go with a plain minimax approach. With the right communication protocol, it would be possible to keep many millions of processors doing useful work, all contributing near equally to the final result. Proof: This has already been done with distributed perft() calculations which get a near linear speed increase.

And then you get a program running on million cores weaker than an alpha-beta program running on 16 cores. Really great idea...

Kingghidorah · Post by **Kingghidorah** » Wed Dec 21, 2011 12:02 pm

Milos wrote:
sje wrote:It is possible to get a nearly linear speed increase proportional to an increase in the processor count. It's not pretty, though.

You have to give up alpha/beta and go with a plain minimax approach. With the right communication protocol, it would be possible to keep many millions of processors doing useful work, all contributing near equally to the final result. Proof: This has already been done with distributed perft() calculations which get a near linear speed increase.
And then you get a program running on million cores weaker than an alpha-beta program running on 16 cores. Really great idea...

So I guess we will be waiting for the 16 core running at 100GHz a pop

sje · Post by **sje** » Wed Dec 21, 2011 1:10 pm

Assume there's a traditional A/B searcher running 100 threads on 100 processors. As has been shown, at that point increasing the processor count by any factor will not measurably help the throughput of the system. If fact, the additional communication and synchronization costs will cause overall performance to decrease at some point.

Now consider a 10,000 processor array running plain minimax. It will search as deeply as the A/B system in about the same time. If the mean branching factor in the minimax search is about 32, then increasing the minimax searcher processor count by a factor of 32 will make the system see an extra ply. But increasing the A/B system processor count by the same factor will not help it. At some point, the minimax array will became smarter than the A/B searcher.

Admittedly, a minimax search is not very practical. But the concept shows that the statement "search performance is asymptotic to some processor count limit" is just plain wrong for a minimax cluster.

hgm · Post by **hgm** » Wed Dec 21, 2011 1:55 pm

Of course this could even be made practical by putting single-core alpha-beta searches on top of a plain minimax tree for the plies close to the root. Then you already start out with something that is completely competative on a single core, so that the better asymptotics quickly catch up with a smart but poorly scaling algorithm.

Milos · Post by **Milos** » Wed Dec 21, 2011 1:55 pm

sje wrote:Admittedly, a minimax search is not very practical. But the concept shows that the statement "search performance is asymptotic to some processor count limit" is just plain wrong for a minimax cluster.

Of course it's asymptotic as long as you are using any type of shared resources (killers, history counters, TT). And if you have zero shared resources your program will be so terribly bad that even millions of parallel instances would not help.

Uri Blass · Post by **Uri Blass** » Wed Dec 21, 2011 4:14 pm

Milos wrote:
sje wrote:Admittedly, a minimax search is not very practical. But the concept shows that the statement "search performance is asymptotic to some processor count limit" is just plain wrong for a minimax cluster.
Of course it's asymptotic as long as you are using any type of shared resources (killers, history counters, TT). And if you have zero shared resources your program will be so terribly bad that even millions of parallel instances would not help.

As H.G.Muller explained you do not have to have a pure minimax.

He suggested putting single-core alpha-beta searches on top of a plain minimax tree for the plies close to the root.

If you can use million cores then
even better may be putting some clusters of 40 cores that use alpha beta on a top of a plain minimax tree for the plies close to the root.

everyone of the clusters can simply analyze a different position that is 3 or even 4 plies after the root and based on all the scores from all the clusters you may be able to get a move.

Cluster versions of chess programs.

Cluster versions of chess programs.

Re: Cluster versions of chess programs.

Re: Cluster versions of chess programs.

Re: Cluster versions of chess programs.

Re: Cluster versions of chess programs.

Re: Cluster versions of chess programs.

Minimax cluster

Re: Minimax cluster

Re: Minimax cluster

Re: Minimax cluster