Cluster Rybka

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Rybka

Post by bob »

diep wrote:Bob, the real bottom line is that a shared hash table across some dozens of cores is really important. Not only do you get the transposition effect, the transpositions also abort a lot of useless searches.

That makes having a shared hash table of some sort really important.

Getting that to work on a cluster with just a few nodes, where the total NPS far exceeds what the communication latency allows, is really complicated.

Your idea of delayed hash lookup has been tried already. It gives a speedup of 2 out of a possible 30.

With YBW another effect seems to be that having a hash table for the last few plies shortens the time it takes to put other CPUs to work. What I tried back in 2003 on a supercomputer was to split closer to the leaves in PV nodes than in non-PV nodes. That didn't help much either. A really significant measurement was of course not possible. That will soon be easier to do with 32-core Nehalem machines, 64 threads (4 sockets).

Do you see any chance of obtaining such a box for your department?

Vincent

p.s. I would argue that a cluster where the hash pressure is some 1000 times bigger than the communication speed to a remote hash is only interesting if you have really a lot of nodes in it.
We have two clusters already. The most recent is 70 nodes with dual quad-core Xeons, so 70 nodes with eight 2.33GHz processors per node, with the usual InfiniBand interconnect.

I ran on one node for the ACCA event over the weekend and averaged somewhere between 18M and 20M nodes per second... I'm going to do something with this cluster, to see what is possible. I'm not going to a distributed hash model, at least initially. But there should be a way to extract useful speedup from this much hardware... Even if it is sub-optimally implemented. 540 cores at 2.5M nodes per core per second is a lot of computing, even if 50% gets wasted...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Rybka

Post by bob »

Uri Blass wrote:
bob wrote:Those numbers sound pretty reasonable. I'm not so happy with "those" that report numbers that are simply fictional, and which anybody that has done any reading or research into parallel search could immediately recognize as bogus.

I still hope that one day someone will post some real numbers on Rybka's parallel speedup on an 8-way box, by running some "normal" positions to a fixed depth using 1, 2, 4 and 8 processors. He claims to scale better than any other program. Somehow I doubt it. Maybe "as good as" if he is lucky. But so far we just have urban legend to go on. Speedups for my program are quite easy to produce and anybody can do it.
I think that fixed depth may be misleading, because Rybka may play better at fixed depth with more cores thanks to doing less pruning at the same depth.

It is possible to test this simply by playing a fixed-depth match between Rybka single-core and Rybka 4-core.

Uri
I was more interested in the speedup scaling, since Vas has claimed publicly that his speedup is better than anybody else's... I just don't have any good boxes with Windows or I would try that myself...
Dann Corbit
Posts: 12777
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Cluster Rybka

Post by Dann Corbit »

I have no idea how well Rybka scales on a cluster, but here are the SMP scaling numbers:

From:
http://www.computerchess.org.uk/ccrl/40 ... t_all.html

Code:

Rank  Name                 Rating   +    -   Score  Av.Opp.  Draws  Games   LOS
  1   Rybka 3 64-bit 4CPU   3222   +22  -22  78.9%  -205.9   34.0%   817   98.2%
      Rybka 3 64-bit 2CPU   3180   +29  -28  76.9%  -184.5   37.1%   466   87.9%
      Rybka 3 64-bit 1CPU   3159   +20  -20  71.6%  -154.1   37.0%   913   99.2%
From
http://www.husvankempen.de/nunn/40_40%2 ... liste.html

Code:

No  Program            Elo    +   -  Games  Score  Av.Op.  Draws
 1  Rybka 3 x64 4CPU  3204   15  15   1871  79.8%   2965   28.6%
 2  Rybka 3 x64 2CPU  3163   13  13   2231  77.7%   2946   31.2%
 4  Rybka 3 x64 1CPU  3111   21  21    792  75.2%   2919   32.7%
So we have:
1 CPU -> 2 CPUs (+21 Elo, +52 Elo), average 36.5 Elo gain
2 CPUs -> 4 CPUs (+42 Elo, +41 Elo), average 41.5 Elo gain


We can compare with Zappa Mexico II:
From:
http://www.husvankempen.de/nunn/40_40%2 ... liste.html

Code:

No  Program                    Elo    +   -  Games  Score  Av.Op.  Draws
 8  Zappa Mexico II x64 4CPU  3025   13  13   1657  52.4%   3008   43.3%
29  Zappa Mexico II x64 2CPU  2964   11  11   2028  52.2%   2948   42.7%
52  Zappa Mexico II x64 1CPU  2923   27  27    350  58.3%   2865   46.3%
From:
http://www.computerchess.org.uk/ccrl/40 ... t_all.html

Code:

Rank  Name                         Rating   +    -   Score  Av.Opp.  Draws  Games   LOS
  3   Zappa Mexico II 64-bit 4CPU   3074   +17  -17  60.6%   -68.2   45.4%  1129   73.1%
      Zappa Mexico II 64-bit 2CPU   3016   +29  -29  55.4%   -35.2   44.6%   368   60.6%
      Zappa Mexico II 64-bit 1CPU   2962   +19  -19  48.4%   +12.1   44.1%   858   51.4%
So we have:
1 CPU -> 2 CPUs (+41 Elo, +54 Elo), average 47.5 Elo gain
2 CPUs -> 4 CPUs (+61 Elo, +58 Elo), average 59.5 Elo gain

Seems like the regular rule of thumb holds pretty well in both cases.
(around 40-60 Elo per doubling of speed)
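The per-doubling gains can be recomputed directly from the quoted rating lists; here is a short script doing so (the ratings are copied from the tables above, nothing else is assumed):

```python
# Elo gain per doubling of CPUs, computed from the ratings quoted above.

def gains_per_doubling(ratings):
    """ratings maps CPU count -> Elo; returns the gain at each 2x step."""
    cpus = sorted(ratings)
    return [ratings[b] - ratings[a] for a, b in zip(cpus, cpus[1:])]

rybka_ccrl = {1: 3159, 2: 3180, 4: 3222}   # CCRL list above
rybka_cegt = {1: 3111, 2: 3163, 4: 3204}   # CEGT list above
zappa_cegt = {1: 2923, 2: 2964, 4: 3025}
zappa_ccrl = {1: 2962, 2: 3016, 4: 3074}

print(gains_per_doubling(rybka_ccrl))  # [21, 42]
print(gains_per_doubling(rybka_cegt))  # [52, 41]
print(gains_per_doubling(zappa_cegt))  # [41, 61]
print(gains_per_doubling(zappa_ccrl))  # [54, 58]
```

(Note that the CCRL Rybka step from 1 to 2 CPUs works out to +21 from the quoted ratings.)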

As far as I know, nobody has ever done very well with clusters as far as efficiency goes (most experiments I have seen had terrible efficiency compared to SMP implementations).
jswaff

more data

Post by jswaff »

Without the gimped up search...

Code:

CPUs  Time  Speedup  Nodes   NPS
  1   1.00    1.00    1.00   1.00
  2   0.81    1.23    1.39   1.75
  3   0.82    1.22    1.68   2.18
  4   0.81    1.24    1.85   2.44
  5   0.83    1.21    2.03   2.59
  6   0.85    1.18    2.19   2.75
  7   0.84    1.18    2.24   2.84
  8   0.86    1.16    2.36   2.98
bob wrote:
jswaff wrote:I recently ran a "root splitting only" experiment with Prophet. Each thread would get a root move and search it, updating alpha and the PV when finished if score > alpha. With two processors I saw a speedup of 1.56; with 3, 1.89; with 4, 2.08; with 5, 2.17; with 6, 2.27; with 7, 2.27; with 8, 2.27. The root move ordering was simply: make sure the first move from the last iteration is first, and sort the rest by node count. So I have to agree, the algorithm sucks from a scalability perspective. Still, it's something.

The "unsynchronized depth" issue is sticky. I can't think of a sensible way to handle that.
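The root-splitting scheme jswaff describes can be sketched in a few lines. This is an illustrative toy, not Prophet's actual code; `search()` is a hypothetical stand-in that returns canned scores instead of running a real alpha-beta search:

```python
# Toy "root splitting only": each worker searches one root move to the
# same depth, and the best score/PV wins when the results come back.
from concurrent.futures import ThreadPoolExecutor

def search(move, depth):
    # placeholder scores; a real engine would search the subtree below `move`
    return {"Rfc8": 30, "Bd8": 25, "Kf8": 10}.get(move, 0)

def root_split(moves, depth, workers=8):
    best_move, alpha = None, float("-inf")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = pool.map(lambda m: search(m, depth), moves)
        for move, score in zip(moves, scores):   # map preserves move order
            if score > alpha:                    # update alpha/PV per result
                alpha, best_move = score, move
    return best_move, alpha

print(root_split(["Rfc8", "Bd8", "Kf8"], depth=18))  # ('Rfc8', 30)
```

Note that in this scheme alpha is only updated after a worker finishes, so a worker searching a poor move cannot benefit from a bound established by a sibling; that lost alpha-beta pruning is exactly what limits the speedup.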
2.27 is actually remarkable and suggests some real move ordering issues, unless you produced that number on some _tactical_ positions only. There you can get above 2, because tactical positions are those where move ordering is bad and the final "best move" is a "surprise" that rises to the top at the end of the search. Here are some numbers from yesterday's tournament:

Code:

               19     3:44   0.30   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    <HT>
               19->   3:53   0.30   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    <HT>
               20     4:05   0.27   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    27. Nc4 Nxc4 28. Bxc4
               20->   4:19   0.27   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    27. Nc4 Nxc4 28. Bxc4 (s=2)




               18    14.71   0.21   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Bd4 26. Rc1 Rxc1
                                    27. Qxc1
               18->  19.09   0.21   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Bd4 26. Rc1 Rxc1
                                    27. Qxc1
               19    24.20   0.20   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Rc7 26. Rc1 Rac8
                                    27. Rxc7 Rxc7
               19->  30.63   0.20   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Rc7 26. Rc1 Rac8
                                    27. Rxc7 Rxc7
               20     1:01   0.28   18. ... Rfc8 19. Bd2 Bd8 20. Ng3 Qa7
                                    21. Rf1 Nb7 22. b4 a5 23. Qe2 axb4
                                    24. axb4 Qxa1 25. Rxa1 Rxa1+ 26. Be1
                                    Rac1 27. Bxb5 R8c2 28. Qd3 Bxb5 29.
                                    Qxb5
               20->   1:21   0.28   18. ... Rfc8 19. Bd2 Bd8 20. Ng3 Qa7
                                    21. Rf1 Nb7 22. b4 a5 23. Qe2 axb4
                                    24. axb4 Qxa1 25. Rxa1 Rxa1+ 26. Be1
                                    Rac1 27. Bxb5 R8c2 28. Qd3 Bxb5 29.
                                    Qxb5

Let's just take the last search output (this came from the Rybka game, so it is a good example from a real game rather than a test position). The first move at the root took 14.7 seconds (depth=18), and the remainder of the moves took just over 4 seconds total. If you split at the root, the processor that gets that move (the best move, Rfc8) is going to take 14.7 seconds. Let's assume the other moves get searched in zero time (which is not possible, but is an "optimal" case). So rather than taking a total of 19 seconds, you are done in 14.7 seconds. Speedup = 19.09 / 14.7, or 1.3x being generous. For depth 19, the first move takes 24.2 seconds and all the rest take 6.4 seconds, so the speedup = 30.6 / 24.2. This is easy to understand. If, instead, you take a position where move ordering is bad, you might get this kind of output:

Code:

               17->  59.74   0.28   19. ... Kf8 20. Bd2 Qb6 21. Ng3 Qd8
                                    22. Be3 Nb7 23. b4 Rc3 24. Ne2 Rcc8
                                    25. Rc1 Rxc1 26. Qxc1 Rc8 27. Nc3 a5
                                    28. bxa5 Nxa5 (s=6)
               18     1:12   0.35   19. ... Kf8 20. Ng3 Qb8 21. Be3 Nb7
                                    22. b4 Rc3 23. Rb3 Rc8 24. Qd2 a5 25.
                                    Rc1 axb4 26. axb4 Rxc1+ 27. Qxc1 Ra4
                                    28. Qf1 Qe8 (s=5)
               18     1:16   0.29   19. ... Bd8 20. Ng3 Qb8 21. Bd2 Qa7
                                    22. Qe2 Nb7 23. b4 Bb6 24. Red1 h6
                                    25. Rbc1 Rab8 26. Rf1 Ra8 27. Bc3 Qb8
                                    (s=4)
               18->   1:27   0.29   19. ... Bd8 20. Ng3 Qb8 21. Bd2 Qa7
                                    22. Qe2 Nb7 23. b4 Bb6 24. Red1 h6
                                    25. Rbc1 Rab8 26. Rf1 Ra8 27. Bc3 Qb8
                                    (s=3)

In this case, we finish depth 17 at 60 seconds. At depth 18, it takes 12 seconds to find the score for Kf8, then another 4 seconds to discover that Bd8 is even better, then 11 seconds to search the rest of the moves. We get a larger gain here, since we search Kf8 and Bd8 at the root in parallel and get a score back for both in a total of 12 seconds rather than the serial 16 seconds. We save the 4-second search because it runs in parallel with the 12-second search, so we are done after 12 seconds total rather than after 16 (again, we assume all the other moves go in zero time, to be as favorable to this algorithm as possible).
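The arithmetic generalizes: under the most favorable assumption (every other root move finishes instantly), the split-at-root speedup is just the serial time divided by the longest single root-move search. A quick check using the times quoted above:

```python
# Best-case speedup for splitting at the root: the parallel time is the
# longest single root-move search; the serial time is the sum of all.

def best_case_speedup(move_times):
    return sum(move_times) / max(move_times)

# Well-ordered case (depth 19 above): best move 24.2s, the rest 6.4s.
print(round(best_case_speedup([24.2, 6.4]), 2))   # 1.26

# Badly-ordered case (depth 18 above): Kf8 takes 12s while Bd8 takes 4s
# in parallel, so the serial 16s for the two candidates becomes 12s.
print(round(best_case_speedup([12.0, 4.0]), 2))   # 1.33
```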

This kind of case is where you get the most from splitting at the root, and with today's ultra-low branching factor, what you get is very little.

You didn't say what you are doing elsewhere: LMR? Null move with R=3? Etc. The lower your branching factor, the worse this looks...

I brought this up because of the various discussions and comments yesterday, particularly people asking "Why all the different depths? Why is the depth bouncing up and down? How fast is this 5-node, 8-core-per-node cluster?" The answer is that it is no faster than a normal 8-core single-node box.

I am finally making preparations to do a _real_ cluster approach, and I know I am not going to get 100 Elo from a 5-node parallel search. You lose too much: no shared hash table, no shared killers, no shared anything whatsoever, so the losses are large. But if you can get a real 2.5x speedup out of 5 nodes, that would be significant, as it would be like searching 50M nodes per second on the hardware I used yesterday, where we were typically searching 20M nodes per second. More importantly, if I can get 35x on the big 70-node, 8-cores-per-node cluster, those will be serious numbers: 700M nodes per second is serious computing. I have no idea whether I can get that kind of speedup. I believe I can get 2.5x out of 5 nodes, but stretching that ratio to 70 is improbable. Still, anything > 1.0 is good...
bob wrote:I had a chance to watch Rybka's analysis during our PanAM game today, and what I saw was a far cry from what I was expecting.

Several were noticing that Rybka's depth and score were bouncing all over the place. I finally figured out why on a position where Crafty played QxQ and Rybka had one recapture move (RxQ) and several other moves. We were seeing all sorts of -9.0 scores being kibitzed. As I went through them, what I discovered was this: Rybka uses an algorithm first tried (and then discarded because it is useless) back around 1978 or so by Monty Newborn. What is happening is this: Each root move is given to a different cluster node when the first iteration of the search starts. Each move is searched independently until time runs out. The depth is not synchronized in any way, so when you look at Rybka's analysis, you might see this:

Code:

depth=18 time=5.08 node=7028735 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=20 time=6.09 node=8650750 speed=1357325 score=+0.75
pv="Ra1 Kf8 Kf2"

depth=19 time=6.09 node=8650750 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=21 time=7.11 node=10131160 speed=1357325
score=+0.75 pv="Ra1 Kf8 Kf2"

depth=21 time=8.20 node=11654740 speed=1420616
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=16 time=8.20 node=11654740 speed=1420616
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=20 time=10.16 node=15113065 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8" 

depth=17 time=10.16 node=15113065 speed=1520741
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=21 time=11.17 node=16464745 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"

depth=18 time=12.19 node=18357100 speed=1520741
score=+0.52 pv="Rc1 Kf7 Kf2"

depth=18 time=13.20 node=19605795 speed=1520741
score=+0.65 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=14.22 node=20854490 speed=1520741
score=+0.75 pv="Ra1 Kf8 Kf2 Ndc4"

depth=19 time=15.23 node=22244790 speed=1520741
score=+0.56 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=17.38 node=25377725 speed=1460504
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=20 time=17.38 node=25377725 speed=1460504
score=+0.55 pv="g4 Kf7 Ra1 Ke7"

depth=19 time=18.28 node=27213820 speed=1460504
score=+0.52 pv="Rc1 Kf7 Kf2 Ndc4"

depth=22 time=19.30 node=29466620 speed=1527572
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"
If you look at that at first glance, you wonder why the depth is bouncing around, going forward and backward all over the place. But if you look more carefully and pick out just the kibitzes for the same root move, ignoring the others, you will find that the depths and scores look sensible.
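Sorting the kibitz stream by the first move of each PV can be done mechanically; a sketch using a few of the quoted lines (times omitted, values copied from the output above):

```python
# Group kibitzed (depth, score, pv) lines by the root move at the head
# of each PV; within each group the depths then increase sensibly.

kibitzes = [
    (18, 0.82, "Kf2 Nec4 Ra1 Kf8"),
    (20, 0.75, "Ra1 Kf8 Kf2"),
    (19, 0.82, "Kf2 Nec4 Ra1 Kf8"),
    (21, 0.75, "Ra1 Kf8 Kf2"),
    (16, 0.74, "g4 Kf7 Ra1 Ke7"),
    (17, 0.74, "g4 Kf7 Ra1 Ke7"),
]

by_move = {}
for depth, score, pv in kibitzes:
    by_move.setdefault(pv.split()[0], []).append((depth, score))

for move, lines in sorted(by_move.items()):
    print(move, lines)
# Kf2 [(18, 0.82), (19, 0.82)]
# Ra1 [(20, 0.75), (21, 0.75)]
# g4 [(16, 0.74), (17, 0.74)]
```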

The primary fault with this is that the speedup is zero. If you watch a typical program, 90% of the search effort goes into the _first_ move. Here, each node is searching a "first move", which breaks alpha/beta completely. Even worse, there is a depth issue in that the root moves are often searched to different depths. Which is better, +1 at depth 17 or +0.01 at depth 21? That was the problem Monty discovered, and that is why this is no good, without even considering that the distributed search is not speeding up the process one bit.

I have no idea why such an algorithm would be used. There is no up-side to doing this and using a single node would be no slower. And it actually would be a bit faster since all root moves get searched to the same depth so that there is no issue in comparing scores from different depths.

At least the hyperbole about the "40 core monster" has been discovered. Yes, Rybka is extremely strong. But the cluster Rybka is absolutely no stronger and perhaps a bit weaker due to the unsynchronized depth issue.

live and learn...

BTW, after watching this game, it is apparent that the requirement to kibitz node counts, NPS, depth, etc. is worthless. :) The numbers being kibitzed are absolute nonsense with respect to the accepted meaning of the terms... Several of us noticed the NPS numbers matching at several different points. That never happens for the rest of us. :)
User avatar
Eelco de Groot
Posts: 4658
Joined: Sun Mar 12, 2006 2:40 am
Full name:   Eelco de Groot

Re: Cluster Rybka

Post by Eelco de Groot »

bob wrote:
M ANSARI wrote:I don't think you can make a judgement regarding the cluster by simply presuming that it is using a certain algo.
Note that I did not "presume" anything. I saw the actual output which clearly shows the underlying algorithm. And I doubt using 5 nodes is worse than using 1 although it is absolutely possible. But as I said, I am certain it is no better than using one. This is old stuff with 30 years of history behind it.

The classic way of thinking how a cluster works ... by sharing memory and acting as one large parallel unit might not be the most practical or best option using PC's. Personally I think there are better ways of using multiple cores in a cluster. It is a well known fact that while you can add some knowledge that works great in certain positions in an engine, many times that knowledge is not added because it reduces the overall strength of the engine. So instead of using an active cluster as you envision, maybe using a passive cluster is more useful and easier to implement. I am not sure Vas is using a passive cluster but I wouldn't be surprised.
I am absolutely certain of what he is doing, nothing else explains the output we saw yesterday and the data was obvious once I began to look at the output carefully. It was producing so much "noise" that at first I paid no attention until someone asked "why is its depth bouncing up, down, back up, and why are the scores all over the place?" A quick and careful look explained what was going on.


In the case of a passive cluster you cannot really make it play weaker than its core Master machine unless you are an idiot.

It is actually not hard at all to be quite good and still have it play worse. Message-passing overhead can creep in without your knowing it. It has happened many times here, as people using MPI/OpenMP/etc. write parallel programs that get worse as more nodes are added.
You can set up the other engines to be experts in special fields that only get activated if a certain pre-determined evaluation-change trigger is reached. I can think of a scenario where such an option would immediately add a few Elo points to a master engine: use the master without EGTBs and have the EGTBs handled by an independent slave computer. Not having to probe EGTBs alone would make the master computer stronger ... and of course it would still have the safety net of having any resulting 6-piece EGTB position solved by the slave. So an active cluster sharing memory does not necessarily have to be the only way to use additional cores; passive might be the way to go forward.

The strongest chess ever played has probably been in Freestyle chess ... a good Freestyle chess player usually has several motherboards running independently ... and usually runs several flavors of engines to make sure all moves get a sanity check ... and will sometimes probe a position to see what it will reveal. In certain situations that person will switch from one engine to another depending on the dynamics of the game. He will know which engine performs better where. Rybka is blessed because in its arsenal it has Rybka Human, Rybka Default and Rybka Dynamic. Since Vas is one of the top Freestyle players, I am sure he understands best how to try to mimic a Freestyle player, and the 100 ELO points would have resulted in actual testing against a simple 8 core Rybka 3. The Cluster Rybka by the way has played quite a few tournaments on Playchess where it faced other Octa Rybka's and apparently came out ahead quite easily. So the 100 ELO gain does seem to be reasonable.
Freestyle is a completely different animal. Totally unrelated to playing a single game against a single opponent, and playing the best move you can play using all available hardware. For oddball things, different solutions are possible. But for playing a single game, this is not a new topic. There is almost 30 years of work on distributed chess available, with probably every idea one might ever think of already tried and tested.

100 elo against a single opponent in normal chess is not going to come from this approach. You need a speedup of a factor of 4 to reach that. 5 nodes splitting at the root will be hard-pressed to produce a speedup of even 1.5x...
I think Vas stated that, for Rybka presumably, he can get 75 Elo for each doubling, so that makes the necessary speed-up already a lot less. I do not want to betray too much of Rybka's inner workings, so I am glad I do not actually know them, but speculatively I suppose that some of the general structure of Rybka's search also makes the distributed cluster setup more feasible. Rybka calculates with very varying depths, so it has to accommodate the fact that alpha-beta in general supposes every move gets searched to the same depth. If you want to do aggressive pruning, some of this will have to go; that seems obvious to me.
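The two per-doubling figures in play here (the roughly 50 Elo implied by "a factor of 4 for 100 Elo", and the 75 Elo reportedly claimed by Vas) translate into required speedups via the usual assumption that Elo grows linearly with the log of speed; a quick check:

```python
# Speedup required for a given Elo gain, assuming Elo scales with
# log2(speedup). The 50 and 75 Elo/doubling figures come from the thread.

def speedup_needed(elo_gain, elo_per_doubling):
    return 2 ** (elo_gain / elo_per_doubling)

print(speedup_needed(100, 50))             # 4.0
print(round(speedup_needed(100, 75), 2))   # 2.52
```

So under the 75 Elo/doubling claim, about a 2.5x speedup would suffice for 100 Elo, versus 4x at the usual 50 Elo/doubling.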

For Vas, I'm hoping that I'm not giving away big secrets here; this also seems rather straightforward :? Anyway, my bet is that the slower nodes are used almost exclusively for safety-checking all the nodes that are in principle labeled as 'refuted'. That way it does not matter if the ply depths are out of sync, because in the non-cluster version of Rybka the iterations would have more or less stopped for these (search-tree) nodes. That is the only way I can see Rybka reach its enormous depths :!: The Cluster Rybka version is in this sense really a safer version of normal Rybka MP, as several people on the Rybka forum also thought they could observe from watching it play.


Eelco
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
swami
Posts: 6659
Joined: Thu Mar 09, 2006 4:21 am

Re: Cluster Rybka

Post by swami »

Zach Wegner wrote:
diep wrote:What Cozzie & co are doing in Rybka who cares?

Deceiving others is their job, maybe. Not new.
Cozzie in Rybka? You must be confused...
Vincent seems to think, or has always held the belief, that Cozzie was the secret main programmer on the Rybka team. :?

That's got my vote for strangest conspiracy theory. :wink:
krazyken

Re: Cluster Rybka

Post by krazyken »

One place I'd expect a cluster to be more useful is in pondering. You should be able to get a 100% ponder hit rate with enough nodes.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Rybka

Post by bob »

krazyken wrote:One place I'd expect a cluster to be more useful is in pondering. You should be able to get a 100% ponder hit rate with enough nodes.
That doesn't work. If you spread nodes across all the moves you ponder, you search less deeply, which hurts rather than helps, since each candidate move gets pondered by only one node rather than by the entire group.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Rybka

Post by bob »

Eelco de Groot wrote:
bob wrote:
M ANSARI wrote:I don't think you can make a judgement regarding the cluster by simply presuming that it is using a certain algo.
Note that I did not "presume" anything. I saw the actual output which clearly shows the underlying algorithm. And I doubt using 5 nodes is worse than using 1 although it is absolutely possible. But as I said, I am certain it is no better than using one. This is old stuff with 30 years of history behind it.

The classic way of thinking how a cluster works ... by sharing memory and acting as one large parallel unit might not be the most practical or best option using PC's. Personally I think there are better ways of using multiple cores in a cluster. It is a well known fact that while you can add some knowledge that works great in certain positions in an engine, many times that knowledge is not added because it reduces the overall strength of the engine. So instead of using an active cluster as you envision, maybe using a passive cluster is more useful and easier to implement. I am not sure Vas is using a passive cluster but I wouldn't be surprised.
I am absolutely certain of what he is doing, nothing else explains the output we saw yesterday and the data was obvious once I began to look at the output carefully. It was producing so much "noise" that at first I paid no attention until someone asked "why is its depth bouncing up, down, back up, and why are the scores all over the place?" A quick and careful look explained what was going on.


In the case of a passive cluster you cannot really make it play weaker than its core Master machine unless you are an idiot.

It is actually not hard at all to be quite good and still have it play worse. Message-passing overhead can creep in without your knowing it. It has happened many times here, as people using MPI/OpenMP/etc. write parallel programs that get worse as more nodes are added.
You can set up the other engines to be experts in special fields that only get activated if a certain pre-determined evaluation-change trigger is reached. I can think of a scenario where such an option would immediately add a few Elo points to a master engine: use the master without EGTBs and have the EGTBs handled by an independent slave computer. Not having to probe EGTBs alone would make the master computer stronger ... and of course it would still have the safety net of having any resulting 6-piece EGTB position solved by the slave. So an active cluster sharing memory does not necessarily have to be the only way to use additional cores; passive might be the way to go forward.

The strongest chess ever played has probably been in Freestyle chess ... a good Freestyle chess player usually has several motherboards running independently ... and usually runs several flavors of engines to make sure all moves get a sanity check ... and will sometimes probe a position to see what it will reveal. In certain situations that person will switch from one engine to another depending on the dynamics of the game. He will know which engine performs better where. Rybka is blessed because in its arsenal it has Rybka Human, Rybka Default and Rybka Dynamic. Since Vas is one of the top Freestyle players, I am sure he understands best how to try to mimic a Freestyle player, and the 100 ELO points would have resulted in actual testing against a simple 8 core Rybka 3. The Cluster Rybka by the way has played quite a few tournaments on Playchess where it faced other Octa Rybka's and apparently came out ahead quite easily. So the 100 ELO gain does seem to be reasonable.
Freestyle is a completely different animal. Totally unrelated to playing a single game against a single opponent, and playing the best move you can play using all available hardware. For oddball things, different solutions are possible. But for playing a single game, this is not a new topic. There is almost 30 years of work on distributed chess available, with probably every idea one might ever think of already tried and tested.

100 elo against a single opponent in normal chess is not going to come from this approach. You need a speedup of a factor of 4 to reach that. 5 nodes splitting at the root will be hard-pressed to produce a speedup of even 1.5x...
I think Vas stated that, for Rybka presumably, he can get 75 Elo for each doubling, so that makes the necessary speed-up already a lot less. I do not want to betray too much of Rybka's inner workings, so I am glad I do not actually know them, but speculatively I suppose that some of the general structure of Rybka's search also makes the distributed cluster setup more feasible. Rybka calculates with very varying depths, so it has to accommodate the fact that alpha-beta in general supposes every move gets searched to the same depth. If you want to do aggressive pruning, some of this will have to go; that seems obvious to me.

For Vas, I'm hoping that I'm not giving away big secrets here; this also seems rather straightforward :? Anyway, my bet is that the slower nodes are used almost exclusively for safety-checking all the nodes that are in principle labeled as 'refuted'. That way it does not matter if the ply depths are out of sync, because in the non-cluster version of Rybka the iterations would have more or less stopped for these (search-tree) nodes. That is the only way I can see Rybka reach its enormous depths :!: The Cluster Rybka version is in this sense really a safer version of normal Rybka MP, as several people on the Rybka forum also thought they could observe from watching it play.


Eelco
Unfortunately, if you were to read the past 30 years' worth of parallel search research papers, you'd discover why this won't work in any shape, form or fashion. Everything you mentioned has been tried and found wanting. And no "program structure" is going to make a bad idea work. I also don't know what he gets per ply; generally an extra ply represents a factor-of-2 speedup, which is usually worth 50-75 Elo. But this split-at-the-root search is not going to get a factor of 2. And an individual node is not going to be able to make the search "safer", since one node can't search any deeper than another. But that's beside the point. At least the algorithm used is obvious now. I had originally thought he might have actually done a true distributed tree search. Turns out not.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Rybka

Post by bob »

Elo is a measure, but not a measure of parallel search efficiency, so I can't tell a thing about how the thing scales in terms of 1, 2, 4 and 8 processors...