Cluster Rybka

Eelco de Groot · Post by **Eelco de Groot** » Wed Nov 12, 2008 4:00 am

bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:Those numbers sound pretty reasonable. I'm not so happy with "those" that report numbers that are simply fictional, and which anybody that has done any reading or research into parallel search could immediately recognize as bogus.

I still hope that one day someone will post some real numbers on Rybka's parallel speedup on an 8-way box, by running some "normal" positions to a fixed depth using 1, 2, 4 and 8 processors. He claims to scale better than any other program. Somehow I doubt it. Maybe "as good as" if he is lucky. But so far we just have urban legend to go on. Speedups for my program are quite easy to produce and anybody can do it.
I think that fixed depth may be misleading because rybka may play better at fixed depth with more cores thanks to doing less pruning at the same depth.

It is possible to test it simply by playing fixed depth match between
rybka single core and rybka 4 cores.

Uri
I was more interested in the speedup scaling, since Vas has claimed publicly that his speedup is better than anybody else's... I just don't have any good boxes with windows or I would try that myself...
The problem is that fixed depth search times may be misleading because the same depth with 4 cpu does not mean the same as the same depth with 1 cpu
and the only good test to find effective speed up is by games between
rybka 4 cpu and rybka 1 cpu with unequal time control.

For example
If rybka 4 cpu can win by result like 5300:4700 after 10,000 ponder off games with 3:1 time handicap then you can say that the effective speed up is more than 3:1 and it may be a good idea to try 3.5:1 time handicap.

I have no time for this type of test and hopefully other can do it.

Uri
the problem is that from a parallel processing research point of view, "speedup" is _the_ number we want to see. That is a linear function, whereas Elo is not necessarily linear. Everyone seems to believe in diminishing returns, which means Elo doesn't linearly increase with speedup/search depth. Comparing SMP searches can only be done with time-to-ply measurements... We don't really care what the strength improvement is, just what is the parallel speedup...
time to ply measurement means nothing if the program does not play the same move at the same depth.

It is possible that smp rybka play better moves at depth 10 relative to single processor rybka because smp rybka does less pruning.

Uri
this is rare, but does happen. But it does not "mean nothing". There is no other viable way to measure parallel speedup. we don't need guesses, approximations, and such, when precise numbers are easy to obtain...

If it plays better moves by pruning less, then the sequential algorithm ought to prune less and play better as well. This argument is circular and leading nowhere...

Bob I think if one tries try to fully "emulate" a sequential algorithm, like alpha beta is, on a parallel system, you need a lot of communication between the processes, simply because of this sequential nature? Rybka Cluster runs on "ordinary" PCs, but fast ones, probably some octals but also maybe a few slower or smaller Quads, all owned by Lukas Cimiotti. So you start with uneven capabilities per clusternode. Maybe that is not true and they are five full octals, but that does not change the argument substantially. I think the computers communicate by some network setup with limited bandwith. Don't know much about it, but I think that with your Infiniband cluster Crafty could do a lot more, communication-wise? So I think it is a trade-off and Cluster Rybka just has to do the best she can with very limited message passing. The CPUs just need to work more independently than in a full sequential algorithm. In your example you would like the CPUs to work for 90% of their time on the first move, dividing the work very closely but this creates too much overhead in the PC-cluster.

Eelco

bob · Post by **bob** » Wed Nov 12, 2008 7:39 am

Eelco de Groot wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:Those numbers sound pretty reasonable. I'm not so happy with "those" that report numbers that are simply fictional, and which anybody that has done any reading or research into parallel search could immediately recognize as bogus.

I still hope that one day someone will post some real numbers on Rybka's parallel speedup on an 8-way box, by running some "normal" positions to a fixed depth using 1, 2, 4 and 8 processors. He claims to scale better than any other program. Somehow I doubt it. Maybe "as good as" if he is lucky. But so far we just have urban legend to go on. Speedups for my program are quite easy to produce and anybody can do it.
I think that fixed depth may be misleading because rybka may play better at fixed depth with more cores thanks to doing less pruning at the same depth.

It is possible to test it simply by playing fixed depth match between
rybka single core and rybka 4 cores.

Uri
I was more interested in the speedup scaling, since Vas has claimed publicly that his speedup is better than anybody else's... I just don't have any good boxes with windows or I would try that myself...
The problem is that fixed depth search times may be misleading because the same depth with 4 cpu does not mean the same as the same depth with 1 cpu
and the only good test to find effective speed up is by games between
rybka 4 cpu and rybka 1 cpu with unequal time control.

For example
If rybka 4 cpu can win by result like 5300:4700 after 10,000 ponder off games with 3:1 time handicap then you can say that the effective speed up is more than 3:1 and it may be a good idea to try 3.5:1 time handicap.

I have no time for this type of test and hopefully other can do it.

Uri
the problem is that from a parallel processing research point of view, "speedup" is _the_ number we want to see. That is a linear function, whereas Elo is not necessarily linear. Everyone seems to believe in diminishing returns, which means Elo doesn't linearly increase with speedup/search depth. Comparing SMP searches can only be done with time-to-ply measurements... We don't really care what the strength improvement is, just what is the parallel speedup...
time to ply measurement means nothing if the program does not play the same move at the same depth.

It is possible that smp rybka play better moves at depth 10 relative to single processor rybka because smp rybka does less pruning.

Uri
this is rare, but does happen. But it does not "mean nothing". There is no other viable way to measure parallel speedup. we don't need guesses, approximations, and such, when precise numbers are easy to obtain...

If it plays better moves by pruning less, then the sequential algorithm ought to prune less and play better as well. This argument is circular and leading nowhere...
Bob I think if one tries try to fully "emulate" a sequential algorithm, like alpha beta is, on a parallel system, you need a lot of communication between the processes, simply because of this sequential nature? Rybka Cluster runs on "ordinary" PCs, but fast ones, probably some octals but also maybe a few slower or smaller Quads, all owned by Lukas Cimiotti. So you start with uneven capabilities per clusternode. Maybe that is not true and they are five full octals, but that does not change the argument substantially. I think the computers communicate by some network setup with limited bandwith. Don't know much about it, but I think that with your Infiniband cluster Crafty could do a lot more, communication-wise? So I think it is a trade-off and Cluster Rybka just has to do the best she can with very limited message passing. The CPUs just need to work more independently than in a full sequential algorithm. In your example you would like the CPUs to work for 90% of their time on the first move, dividing the work very closely but this creates too much overhead in the PC-cluster.

Eelco

1. heterogeneous clusters are common, where the machines are a mixture. However there are known approaches to using machines of different speeds efficiently... The original "beowulf project" ("A pile of PCs" which was a paper written by NASA) was based on this very principle in fact.

2. Dividing at the root prevents gettng much of a speedup. Certainly dividing near the tips is also bad if communication is slow. But there is a happy median as deep blue had the _same_ issue on the IBM SP. I'll have some data in a few months as I am fixing to tackle this issue because we have too much horsepower in this cluster to not harness it for chess tournaments...

M ANSARI · Post by **M ANSARI** » Wed Nov 12, 2008 10:15 am

Again I think you assume that the Rybka cluster is active ... what if it is passive. In a passive cluster communication and bandwidth speed is not critical at all ... it only kicks in when a certain pre-determined trigger is reached. This is totally totally and again totally different than having to try and optimize how the search tasks are shared over a network which uses shared memory ... that most definetely would require a tremendous amount of revolutionary programming to offset the obvious latencies. Why make it complicated? Make it simple! I mean first try to do no harm ... if I had 5 equally endowed hardware platforms I would start off by configuring one of them as a Master as 3 PV ... would then use a script to glean off the first 3 PV's of the master and run them as single PV mode. That alone would improve my search tree ... if not by deeper search, then by a more likely "hit" of a good move by simple statistical probability of the non determenistic nature of MP engines. The Master and the 3 Slave PC's would all be run without EGTB's ... which should give added strength. Yet I would have the last PC running with all 6 EGTB's to provide a safety net in case an EGTB hit becomes critical. If I had more PC's I could use them by increasing he PV's of the Master ... if I had even more PC's I would put them to use as mini PV of the Slaves and so on. I could also add additional PC's with engines tuned to particular specialties. The obvious problem would be to how to design the artificial parser or the trigger of the passive evaluatio function ... but that does not seem too difficult of a task. You start of very conservatively and tune the function by testing it. This is a totally different approach to normal bandwidth hogging extremely difficult to program parallel architecture ... yet it probably will give just as good of a performance increase as massively parallel architecture. I don't really know what path Vas has taken in his cluster, but I have a feeling that he has gone the Master - Slave passive route. He could have ofcourse found ways to improve on a purely passive approach by setting up a system where search results in hash could be somehow saved and tabulated and available cross platform, I can think of several ways where this could be done circumventing large latency hits. The point is that there are many ways to skin a cat ... and not necessarily all of them are active parallel search.

bob · Post by **bob** » Wed Nov 12, 2008 5:44 pm

M ANSARI wrote:Again I think you assume that the Rybka cluster is active ... what if it is passive. In a passive cluster communication and bandwidth speed is not critical at all ... it only kicks in when a certain pre-determined trigger is reached. This is totally totally and again totally different than having to try and optimize how the search tasks are shared over a network which uses shared memory ... that most definetely would require a tremendous amount of revolutionary programming to offset the obvious latencies. Why make it complicated? Make it simple! I mean first try to do no harm ... if I had 5 equally endowed hardware platforms I would start off by configuring one of them as a Master as 3 PV ... would then use a script to glean off the first 3 PV's of the master and run them as single PV mode. That alone would improve my search tree ... if not by deeper search, then by a more likely "hit" of a good move by simple statistical probability of the non determenistic nature of MP engines. The Master and the 3 Slave PC's would all be run without EGTB's ... which should give added strength. Yet I would have the last PC running with all 6 EGTB's to provide a safety net in case an EGTB hit becomes critical. If I had more PC's I could use them by increasing he PV's of the Master ... if I had even more PC's I would put them to use as mini PV of the Slaves and so on. I could also add additional PC's with engines tuned to particular specialties. The obvious problem would be to how to design the artificial parser or the trigger of the passive evaluatio function ... but that does not seem too difficult of a task. You start of very conservatively and tune the function by testing it. This is a totally different approach to normal bandwidth hogging extremely difficult to program parallel architecture ... yet it probably will give just as good of a performance increase as massively parallel architecture. I don't really know what path Vas has taken in his cluster, but I have a feeling that he has gone the Master - Slave passive route. He could have ofcourse found ways to improve on a purely passive approach by setting up a system where search results in hash could be somehow saved and tabulated and available cross platform, I can think of several ways where this could be done circumventing large latency hits. The point is that there are many ways to skin a cat ... and not necessarily all of them are active parallel search.

This "active" vs "passive" is nonsense.

But that aside, I am looking at precisely one detail. Vas claimed 100 Elo for the cluster. That represents a roughly 3x speedup. I can say, with 100% certainty, that splitting _only_ at the root will _not_ produce a 3x speedup using 5 nodes. Or 10 nodes. Or 100 nodes. _anybody_ that has done any research in this area would reach the same conclusion almost instantly.

As far as the "first, do no harm" I would agree 100% that should be a goal. I am not convinced he has practiced that however. Here's why: Each node is searching root moves to _different_ depths. How do I know? I watched the output _carefully_. So that's an observed fact, not a guess. The problem is this: what do you do when node A searches move M1 to depth 19 with a score of +1, node A searches move M2 to a depth of 21 with a score of -.2? This is exactly what happens, and it is a serious problem. If you play move M1 because of the better score, by the time you get to depth 22 (if you could) you might find the score at -2. So you lose. If you play the move from depth 22 instead, with a score of -.2, you do know that -0.2 is not dead lost, but what if M1 is +3.0 by the time you get it out to 22 ply (again, which you can't)?? You miss a win. Monty Newborn first tried this "unsynchronized" parallel search in the late 1970's and decided "no good". Jonathan Schaeffer then worked on a distributed search through the 80's in Sun Phoenix. He split the search into two parts, a normal search, and a stripped-down tactical search. The stripped down search could get a couple of plies deeper. Now he has the same problem. The regular search gets a score of +.3 after some depth. The tactical search gets a score of -1.3 for the same (best) move after 2 extra plies. What to do? Tell the normal searcher "start over, but exclude that move from your root move list, it loses material". It might be that _all_ moves are going to lose that pawn. Or the tactical search goes 2 plies deeper and finds a different best move than the regular search. Now what? Again, searching two different root moves to two different depths is too full of problems to be considered.

The rest of what you wrote sounds good, but is simply wrong. This is a known issue,it has been addressed for almost 30+ years, and it doesn't work. If you are very careful, using a program with a branching factor of 5-6 (1980's era branching factor) you can average a speedup of 1.5X splitting at the root (and not doing unsynchronized search). With todays sub-2.0 branching factor, the potential speedup is barely measurable. The first move generally takes over 3/4 of the total search time. If you search 5 moves at the root in parallel. with no alpha/beta bounds since you won't have one until the first root move has been searched, then the first 5 moves are going to take exactly as long as the first single move would normally take. You have now used 75% of the available time. Now you parallel search the remainder of the moves, and you divide that 25% (usually much less, but I'll stay "optimistic" here to provide the best possible speedup) and this takes about 5% where before it took 25%, thanks to the 5 nodes searching in parallel. Your total search took 80% of the time it normally takes. Speedup = 1 / .80 = 1.25X faster. And that is _optimistic_.

I don't know where the rest of your propaganda came from, but asymmetric search is a lousy way to use a cluster. One with EGTBs, 4 without. Different nodes using different tuning. It is all complete nonsense because somehow you have to choose a best move, and when you change the program on each node, they are going to reach different depths. That's a critical failing point...

The concept of "Master - Slave" is bad enough because of the excessive wait time you accumulate, but having N different types of slaves simply will not help overall performance enough to be measurable. What is 1.25x speedup in terms of Elo? 10-15 maybe? Better than nothing, but not by a lot considering the cost of the additional hardware.

M ANSARI · Post by **M ANSARI** » Wed Nov 12, 2008 7:04 pm

If you have tried to play a Freestyle game using several PC's with different engine setups I think you will not think a passive cluster is "nonsense". No progaganda here, I don't speak for Vas and am simply trying to think out loud how he might have setup his cluster. I know Vas is one of the top Freestyle players and it makes sense to me that he would go with the approach of doing no harm as a priority when he more PC's or more cores.

Forget everything you know about parallel search and try to think outside the box for just a few minutes. The most important thing is to make sure that you don't reduce the strength of your original Master computer. If you try to play a Freestyle game and have a few engines running you will notice that every once in a while one of the PC's will see a move that the main computer will not. If you manually take that move and put it on the main computer to look at it will quickly agree with that move and incorporate that as the best move. In the case of Rybka, we had Rybka Winfinder which was amazing in finding tactical shots ... in some cases 10X or 20X faster than normal Rybka 2.3.2a on equivalent hardware ... yet it would play weaker overall than the default Rybka. I am sure that it is not difficult for a talented programmer to figure out a way to write some lines of code to mimick what a freestyle player does. Ofcourse someone like Vas who is versed in the actual setup of the engine could device different engines that can be tuned to excel in different parts of the game.

While some might think that the future of incoporating additional cores is in increasing depth of search, I tend to think that you reach a point of diminishing returns at some point ... where this point is I am not sure but it could possibly be in this generation of top end PC's or maybe in the next generation. In long time control games with very heavy hardware, tactical shots which are possible with deep search are very rare ... so vast search ability is not what is most important ... the quality of search or being able to note which lines are important and which can be discarded is much more critical. That is why IMHO the added cores that will be available in large quantities soon, could be put to better use by using different independent engines that are specialists in different categories. I guess there will always be some who think that brute force and massively parallel architectures sharing hash and memory are the way to go ... but I tend to think that the task of doing that efficiently is daunting and extremely difficult. Maybe there is an easier way.

bob · Post by **bob** » Wed Nov 12, 2008 7:37 pm

M ANSARI wrote:If you have tried to play a Freestyle game using several PC's with different engine setups I think you will not think a passive cluster is "nonsense". No progaganda here, I don't speak for Vas and am simply trying to think out loud how he might have setup his cluster. I know Vas is one of the top Freestyle players and it makes sense to me that he would go with the approach of doing no harm as a priority when he more PC's or more cores.

First, the ground rules. I am, and always have been talking about "standard chess". Not freestyle. Not man-machine combined. Not anything other than two opponents playing each other across the board using the normal FIDE rules of chess. I don't care about freestyle or any other "different" way of playing the game. In the event over the weekend, Rybka was _not_ playing freestyle chess. It was "normal chess". And that is what my comments are directed toward.

Forget everything you know about parallel search and try to think outside the box for just a few minutes. The most important thing is to make sure that you don't reduce the strength of your original Master computer. If you try to play a Freestyle game and have a few engines running you will notice that every once in a while one of the PC's will see a move that the main computer will not. If you manually take that move and put it on the main computer to look at it will quickly agree with that move and incorporate that as the best move. In the case of Rybka, we had Rybka Winfinder which was amazing in finding tactical shots ... in some cases 10X or 20X faster than normal Rybka 2.3.2a on equivalent hardware ... yet it would play weaker overall than the default Rybka. I am sure that it is not difficult for a talented programmer to figure out a way to write some lines of code to mimick what a freestyle player does. Ofcourse someone like Vas who is versed in the actual setup of the engine could device different engines that can be tuned to excel in different parts of the game.

Again, I don't need to "think outside the box" so long as we are talking about _normal_ chess, which was obviously what I am discussing. I'm not interested in multiple engines, multiple variations, etc... Just a single opponent that can play as strongly as possible. For freestyle, things might well be different. But we aren't playing freestyle tournaments for (say) the WCCC, the ACCA, the CCT, etc... And that is where I am coming from with respect to parallel search.

While some might think that the future of incoporating additional cores is in increasing depth of search, I tend to think that you reach a point of diminishing returns at some point ... where this point is I am not sure but it could possibly be in this generation of top end PC's or maybe in the next generation. In long time control games with very heavy hardware, tactical shots which are possible with deep search are very rare ... so vast search ability is not what is most important ... the quality of search or being able to note which lines are important and which can be discarded is much more critical. That is why IMHO the added cores that will be available in large quantities soon, could be put to better use by using different independent engines that are specialists in different categories. I guess there will always be some who think that brute force and massively parallel architectures sharing hash and memory are the way to go ... but I tend to think that the task of doing that efficiently is daunting and extremely difficult. Maybe there is an easier way.

There is always an "easier way". Unfortunately it is always "not as good as the right way" either... We still today see cases of A out-searching B. Parallel search can help, done correctly.

Dirt · Post by **Dirt** » Wed Nov 12, 2008 10:04 pm

bob wrote:First, the ground rules. I am, and always have been talking about "standard chess". Not freestyle. Not man-machine combined. Not anything other than two opponents playing each other across the board using the normal FIDE rules of chess. I don't care about freestyle or any other "different" way of playing the game. In the event over the weekend, Rybka was _not_ playing freestyle chess.

What's so special about freestyle? Consider a freestyler in the abstract; We perhaps have one 2200 Elo node (human) and three 3 3200 Elo nodes (Rybka). Somehow as a cluster they end up playing at a 3300 Elo level. Could we replace the human node with a computer node using similar methods and get an equal or better result? Maybe not, the humans task might be as difficult as random image identification. But if isn't so difficult, then we get a computer cluster playing freestyle type chess with no human involved.

Of course, I'm totally making up these numbers, but freestylers do seem to improve the "cluster" enough to usually win.

bob · Post by **bob** » Thu Nov 13, 2008 2:38 am

Dirt wrote:
bob wrote:First, the ground rules. I am, and always have been talking about "standard chess". Not freestyle. Not man-machine combined. Not anything other than two opponents playing each other across the board using the normal FIDE rules of chess. I don't care about freestyle or any other "different" way of playing the game. In the event over the weekend, Rybka was _not_ playing freestyle chess.
What's so special about freestyle? Consider a freestyler in the abstract; We perhaps have one 2200 Elo node (human) and three 3 3200 Elo nodes (Rybka). Somehow as a cluster they end up playing at a 3300 Elo level. Could we replace the human node with a computer node using similar methods and get an equal or better result? Maybe not, the humans task might be as difficult as random image identification. But if isn't so difficult, then we get a computer cluster playing freestyle type chess with no human involved.

Of course, I'm totally making up these numbers, but freestylers do seem to improve the "cluster" enough to usually win.

It is a different game, however. Finding the one best move from one single engine is a different issue and is the one that is (to me) interesting...

Michael Sherwin · Post by **Michael Sherwin** » Thu Nov 13, 2008 5:52 am

bob wrote:
krazyken wrote:One place I'd expect a cluster to be more useful is in pondering. You should be able to get a 100% ponder hit with enough nodes.
doesn't work. If you spread nodes across all moves you ponder, you search less deeply which hurts rather than helps, since each different move will only be pondered with one node, not the entire group.

Useing 8 cores to search each of 5 ponder moves. What would be the ponder hit rate then?

M ANSARI · Post by **M ANSARI** » Thu Nov 13, 2008 7:03 am

bob wrote:
Dirt wrote:
bob wrote:First, the ground rules. I am, and always have been talking about "standard chess". Not freestyle. Not man-machine combined. Not anything other than two opponents playing each other across the board using the normal FIDE rules of chess. I don't care about freestyle or any other "different" way of playing the game. In the event over the weekend, Rybka was _not_ playing freestyle chess.
What's so special about freestyle? Consider a freestyler in the abstract; We perhaps have one 2200 Elo node (human) and three 3 3200 Elo nodes (Rybka). Somehow as a cluster they end up playing at a 3300 Elo level. Could we replace the human node with a computer node using similar methods and get an equal or better result? Maybe not, the humans task might be as difficult as random image identification. But if isn't so difficult, then we get a computer cluster playing freestyle type chess with no human involved.

Of course, I'm totally making up these numbers, but freestylers do seem to improve the "cluster" enough to usually win.
It is a different game, however. Finding the one best move from one single engine is a different issue and is the one that is (to me) interesting...

That is true ... and massively parallel engines that can efficiently distribute their workload among infinite cores and share memory and hash will always be the holy grail. Unfortunately there are some latency issues that with today's hardware that make this task extremely difficult. There will always be a place for the single most powerful independent chess engine ... but there will also always be a place for creating the most powerful chess playing entity.

You are correct that they are different things ... the most powerful chess entity could be a simple script or driver that uses several totally different engines on different motherboards to come up with the statistically the strongest move. If a 2000 ELO player can play around 3300 ELO ... or at least 100 ELO stronger than the strongest available engine ... what would happen if an effective script or driver could be written to duplicate that feat using "artificial intelligence" that can duplicate what a human does. I would think that today's engines would be much stronger than 2000 ELO statically ... I really don't think that it would be that hard to have a program play to even 2400 ELO with minimal search ... or at least to improve on the 2000 ELO of a human.

It would be great if someone could come up with such software, then everyone would be able to create their own powerful chess entities with all sort of engine setups ... not necessarily all from the same engine author.

Cluster Rybka

Re: Cluster Rybka

Re: Cluster Rybka

Re: Cluster Rybka

Re: Cluster Rybka

Re: Cluster Rybka

Re: Cluster Rybka

Re: Cluster Rybka

Re: Cluster Rybka

Re: Cluster Rybka

Re: Cluster Rybka