Clustering etc. thread

Vasik Rajlich · Wed Mar 11, 2009 7:13 am

bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1) I think that it is impossible to know the algorithm Rybka is using based on output from a single position.

It is possible that something similar, though not exactly the same, is used when some illogical moves that lose the queen are analyzed, but this is not the whole story: the algorithm is based partly on "split only at the root" and partly on another idea.

2) I remember that Vas said 100 Elo based on testing at fast time control, and I suspect that at fast time control you get clearly more than 50 Elo per doubling, so in practice 5 nodes do not give a 4:1 speed improvement but clearly less (maybe 2.5:1).
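
The arithmetic behind point 2 can be sketched as follows. This is a rough illustration only: it assumes Elo gain is linear in the number of speed doublings, and the 75 Elo per doubling figure is an assumed value chosen to reproduce the "maybe 2.5:1" estimate (the post only says "clearly more than 50").

```python
def speedup_from_elo(elo_gain, elo_per_doubling):
    """Speed ratio implied by an Elo gain, assuming each doubling of speed
    is worth a fixed number of Elo points (a simplifying assumption)."""
    return 2.0 ** (elo_gain / elo_per_doubling)

# 100 Elo at 50 Elo per doubling is two doublings, i.e. 4:1.
print(speedup_from_elo(100, 50))
# At an assumed 75 Elo per doubling the same 100 Elo implies roughly 2.5:1.
print(round(speedup_from_elo(100, 75), 2))
```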

Uri
The effective speedup is probably somewhere between 2.5:1 and 3:1 for 5 nodes, which is what Lukas had when he tested all of this.

Now he's up to 9 nodes BTW

Vas
Can we stay in the real world? Splitting at the root can not produce a 2.5x speedup, when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone that has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing

Vas
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. More changes are better.

Sure. If you're not changing your mind, it doesn't matter what kind of speedup you have.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.

That's just life without shared memory. Any cluster implementation is going to have problems in a position like that.

I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either.
With infinite # of processors and splitting only at the root, you will get a lot more than 1.5x.

Vas
Sorry, but no you won't. The math is simple and comes directly from Amdahl's law and the original alpha/beta analysis by Knuth and Moore, followed by the parallel analysis I did in the Journal of Parallel Computing. The first move takes 50% of the time. And that is going to be done on one processor. If you can somehow solve the "window" problem, so that you can search the remainder of the ply-1 moves in parallel with a correct alpha/beta window, then you can shrink those to the time required to search one of them. But you have no alpha/beta window, so all moves searched in parallel (ignoring the first move) are going to take about as long as the first move (each one will take that long) because you do _not_ yet have the alpha bound from the first move to let you get the quick cutoffs on the rest of the moves.

Best case is 2x faster, since you could assume that any root move takes as long as any other when you do not have a good alpha bound. And what you can get "peak" is not going to be what you get "on average". >2x is just not going to happen except for rare cases. I have a position somewhere where you can get a 20x speedup like that. First move takes forever to find a very deep mate. Second move is a shorter mate in 1/20th the total nodes searched. Searching both at the same time finds the second mate before the first has even been "sniffed". But that is simply an exception. For the general case, 2x is the very best you can hope for, and it is not going to happen often...

I can give you a citation for the paper I wrote that uses the math from Knuth / Moore and extends it to cover alpha/beta in parallel. It is easy enough to understand and explains exactly why >2x is a rare case, not the norm...

The point is that that 2x is not the upper bound. You'll have 2x more time to dedicate to searching the best move and maybe 20x more time to dedicate to searching alternative moves.

It's not hard to do a simulation. Crafty A always plays its move after 10 seconds of search using its normal algorithm, while Crafty B spends 10 seconds on every single root move and then plays the one with the best score.

I would be curious to know the result here.

Vas

Vasik Rajlich · Wed Mar 11, 2009 7:16 am

bob wrote:
Uri Blass wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1) I think that it is impossible to know the algorithm Rybka is using based on output from a single position.

It is possible that something similar, though not exactly the same, is used when some illogical moves that lose the queen are analyzed, but this is not the whole story: the algorithm is based partly on "split only at the root" and partly on another idea.

2) I remember that Vas said 100 Elo based on testing at fast time control, and I suspect that at fast time control you get clearly more than 50 Elo per doubling, so in practice 5 nodes do not give a 4:1 speed improvement but clearly less (maybe 2.5:1).

Uri
The effective speedup is probably somewhere between 2.5:1 and 3:1 for 5 nodes, which is what Lukas had when he tested all of this.

Now he's up to 9 nodes BTW

Vas
Can we stay in the real world? Splitting at the root can not produce a 2.5x speedup, when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone that has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing

Vas
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. More changes are better.

Sure. If you're not changing your mind, it doesn't matter what kind of speedup you have.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.

That's just life without shared memory. Any cluster implementation is going to have problems in a position like that.

I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either.
With infinite # of processors and splitting only at the root, you will get a lot more than 1.5x.

Vas
No program today has a search that is good enough, but in theory, if you have a good search, the speedup from splitting at the root may be smaller, for the simple reason that a 1.5x speed improvement is worth more than one ply with no pruning.

I think that some types of bad evaluation can also cause the speedup to be smaller, or even negative.

Imagine that you search with no pruning, no extensions and no qsearch, and that you have an evaluation that gives no bonus for the side to move, so in practice you often evaluate even depths as better than odd depths (maybe it should be the opposite, I did not think it through, but the idea is the same).

If you split at the root you may get depth 7 for move A and depth 8 for move B and prefer move B not because it is better but only because you searched to even depth.

This problem does not happen without splitting at the root because without splitting at the root you always get the same depth for all moves.

Uri
You are badly misusing terminology.

1. "Splitting at the root" does not mean each root move gets searched to a different depth. I split at the root in current Crafty. That only means that I do a parallel search on the root moves as well as deeper moves, because this is a far more efficient way to search.

2. You are using what is called "unsynchronized search", where each node searches a different move (or set of moves) at the root, and when iteration N is done, N+1 is started without regard for how the other moves are progressing on other nodes.

This is useless.

How can you choose between depth 21, eval=+1.3, and depth 19, eval +2.5?? You can't. This has been tried in the past, by Newborn, by Schaeffer, and by others. There is no way to compute any sort of equivalence function so that you can decide which of the above is better. The depth 19 move might be even higher at depth 21. Or it might be way lower. The only way to discover this is to search both moves to the same depth. Anything else is beyond hopeless and is a coin flip. You can't even choose between depth 21, +1.5, and depth 19, -2.0, because by depth 21 the -2.0 score might be +5.0...

I use the same terminology that Uri does. Splitting at the root means splitting at the root. One unit handles moves a, b, c, another handles moves d, e, f. What the units do with those moves is an implementation detail.

Vas

Vasik Rajlich · Wed Mar 11, 2009 7:20 am

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1) I think that it is impossible to know the algorithm Rybka is using based on output from a single position.

It is possible that something similar, though not exactly the same, is used when some illogical moves that lose the queen are analyzed, but this is not the whole story: the algorithm is based partly on "split only at the root" and partly on another idea.

2) I remember that Vas said 100 Elo based on testing at fast time control, and I suspect that at fast time control you get clearly more than 50 Elo per doubling, so in practice 5 nodes do not give a 4:1 speed improvement but clearly less (maybe 2.5:1).

Uri
The effective speedup is probably somewhere between 2.5:1 and 3:1 for 5 nodes, which is what Lukas had when he tested all of this.

Now he's up to 9 nodes BTW

Vas
Can we stay in the real world? Splitting at the root can not produce a 2.5x speedup, when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone that has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing

Vas
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. More changes are better.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.

If both hold, you can get a speedup of N, where N is the number of times you actually change the best move at the root. Typically that is none in roughly 80% of the searches according to analysis on the "deep projects" from years ago, "Crafty goes deep" and then Heinz's "DarkThought goes deeper" or whatever it was called. So with luck, 20% of the time (actually more like 15%) you will change your mind at the root once, and get a 2x speedup. 80% of the time you get no speedup to speak of.

The bad part of that is that the more nodes / processors you add, you get nothing for them unless you change your mind at least once for each node so that all those moves have to be searched completely with no early exits due to cutoffs.

In 1978 when we started this, our branching factor was closer to 10 than to 2. So we potentially got more then than now. Today with EBF hovering around 2.0 or so, the first move takes >50% of the time. Which means all you can speed up is that remaining 50% which is not going to accomplish much, performance-wise. For example, one processor takes 20 minutes to complete a search. Two will take 15 minutes. Four will take 12.5 minutes. An infinite number could complete the search in 10 minutes minimum, and that is a stretch to assume all of those branches would produce trees no larger than the first move, which is unlikely.

I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either.
What you can achieve with an infinite number of processors by splitting at the root is one ply, and one ply can be equivalent to more than a speed improvement of 2:1.
IF_AND_ONLY_IF the condition I gave is satisfied, that is you change your mind at the root. Otherwise you can never beat 2x in normal positions. This is basic alpha/beta math.

The branching factor is irrelevant, because one ply plus searching to depth n may in practice be better than searching to depth n+1, since searching to depth n+1 may involve more pruning.

Who cares? The basic idea is to continue to use the same search strategy you already use. Otherwise you are admitting that your current search paradigm is not optimal. And you ought to fix that _first_ before you try to implement a poor search algorithm as a parallel search algorithm.

In practice the depth is not constant, but you can test this yourself: play version A against version B, where version A uses 2 seconds per move and version B searches for 1 second after every root move, and see who wins (version B should use something like 100 processors to be practically sure that the number of root moves is smaller than the number of processors, but the interesting question is whether version B can win the match or not).
If it wins, then you can increase the time of version A to find the effective speedup, i.e. the point where version A scores near 50% against version B.
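
The measurement described here (raise A's time until the match is level) can be sketched as a bisection over the time handicap. This is a minimal sketch under stated assumptions: `match_score` stands for a real engine match, which would be noisy and need many games per data point, and `toy_match_score` is a purely hypothetical stand-in model with made-up parameters, not measured data.

```python
import math

def effective_speedup(match_score, lo=1.0, hi=64.0, tol=0.01):
    """Find the time handicap h at which match_score(h) crosses 0.5.

    match_score(h) is the root-splitting side's score when the ordinary
    searcher gets h times as much time. It is assumed to decrease as h
    grows, with match_score(lo) > 0.5 > match_score(hi).
    """
    while hi / lo > 1.0 + tol:
        mid = math.sqrt(lo * hi)      # bisect on a log scale
        if match_score(mid) > 0.5:
            lo = mid                  # B still ahead: give A more time
        else:
            hi = mid
    return math.sqrt(lo * hi)

def toy_match_score(handicap, true_speedup=2.5, elo_per_doubling=70.0):
    """Hypothetical stand-in for a real match (illustrative numbers only)."""
    elo_diff = elo_per_doubling * math.log2(true_speedup / handicap)
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

print(round(effective_speedup(toy_match_score), 2))
```

With the toy model the search recovers the handicap that was built into it, which only checks the procedure; what a real A vs B match would show is exactly the open question in this thread.
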
Again, you can not average 2x faster splitting at the root. You might hit 2x, or on very rare occasions something beyond 2x. But you will _not_ average anywhere near 2x. And you will do well to average 1.5 x which is what we got back in the days where the effective branching factor was closer to 10, which makes parallel speedup easier to get.

Uri
I see no reason to continue to use the same search with a cluster when I admit that my algorithm with a cluster is not optimal but I have no time to design an optimal algorithm for a cluster.

Not using the same search is not admitting that my search without a cluster is not optimal, because the conditions are different.

Without a cluster, not pruning after the first ply may cause me to be too slow and reach a smaller depth.

When I use a cluster with many nodes (let's say 100 nodes, to allow me to do a full search of the root moves in all practical cases), I get the extra ply for free.

You can say that it is possible to do better with a cluster, and you are right. I do not claim that the speed improvement you get by splitting at the root is the best you can get from a cluster, only that I see no reason to claim that a speed improvement of more than 2:1 by splitting at the root is theoretically impossible.

I also see no reason that the hash cannot be saved after splitting at the root (if I understand correctly, this was suggested in this thread).

One node analyzed the right move, and this node has the hash information.

I guess that it may be possible to copy the hash information of this node to all the nodes and use it in the next search (there is obviously a connection between the nodes, otherwise they could not work together).

Uri
Uri is basically 100% correct here, even though he hasn't even worked on the problem.

The lack of shared memory imposes new constraints on your search and changes the things you need to do. In particular, without shared memory, you need to split the search tree into chunks which will remain relatively stable from iteration to iteration.

Vas

Vasik Rajlich · Wed Mar 11, 2009 7:27 am

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1) I think that it is impossible to know the algorithm Rybka is using based on output from a single position.

It is possible that something similar, though not exactly the same, is used when some illogical moves that lose the queen are analyzed, but this is not the whole story: the algorithm is based partly on "split only at the root" and partly on another idea.

2) I remember that Vas said 100 Elo based on testing at fast time control, and I suspect that at fast time control you get clearly more than 50 Elo per doubling, so in practice 5 nodes do not give a 4:1 speed improvement but clearly less (maybe 2.5:1).

Uri
The effective speedup is probably somewhere between 2.5:1 and 3:1 for 5 nodes, which is what Lukas had when he tested all of this.

Now he's up to 9 nodes BTW

Vas
Can we stay in the real world? Splitting at the root can not produce a 2.5x speedup, when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone that has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing

Vas
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. More changes are better.

Sure. If you're not changing your mind, it doesn't matter what kind of speedup you have.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.

That's just life without shared memory. Any cluster implementation is going to have problems in a position like that.

I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either.
With infinite # of processors and splitting only at the root, you will get a lot more than 1.5x.

Vas
No program today has a search that is good enough, but in theory, if you have a good search, the speedup from splitting at the root may be smaller, for the simple reason that a 1.5x speed improvement is worth more than one ply with no pruning.

I think that some types of bad evaluation can also cause the speedup to be smaller, or even negative.

Imagine that you search with no pruning, no extensions and no qsearch, and that you have an evaluation that gives no bonus for the side to move, so in practice you often evaluate even depths as better than odd depths (maybe it should be the opposite, I did not think it through, but the idea is the same).

If you split at the root you may get depth 7 for move A and depth 8 for move B and prefer move B not because it is better but only because you searched to even depth.

This problem does not happen without splitting at the root because without splitting at the root you always get the same depth for all moves.

Uri
You are badly misusing terminology.

1. "Splitting at the root" does not mean each root move gets searched to a different depth. I split at the root in current Crafty. That only means that I do a parallel search on the root moves as well as deeper moves, because this is a far more efficient way to search.

2. You are using what is called "unsynchronized search", where each node searches a different move (or set of moves) at the root, and when iteration N is done, N+1 is started without regard for how the other moves are progressing on other nodes.

This is useless.

How can you choose between depth 21, eval=+1.3, and depth 19, eval +2.5?? You can't. This has been tried in the past, by Newborn, by Schaeffer, and by others. There is no way to compute any sort of equivalence function so that you can decide which of the above is better. The depth 19 move might be even higher at depth 21. Or it might be way lower. The only way to discover this is to search both moves to the same depth. Anything else is beyond hopeless and is a coin flip. You can't even choose between depth 21, +1.5, and depth 19, -2.0, because by depth 21 the -2.0 score might be +5.0...
I can certainly choose, and the simplest choice is that the higher score is better.
Depth 19, eval=+2.5 is better than depth 21, eval=+1.3 for the simple reason that 2.5 > 1.3.

I believe that with a relatively good evaluation it is not such a bad choice (not that splitting at the root is good, but it may give an effective speed improvement of 2 or more, assuming that all root moves are searched).

I may be wrong and the only way to know is by testing.

Uri
There is absolutely no doubt you are wrong. How many times have you seen a program search a move to depth N with a + score, then at depth n+1 the score drops off the map and it switches to a new move? If you believe what you wrote, why not stop your search when the eval gets to a good point? Why continue to search deeper?

As far as a speedup of > 2, it simply will not/can not happen. Amdahl's law says that the overall speedup for an algorithm is:

speedup = T(1) / T(N)

T(N) = sequential_processing_time + parallel_processing_time / N

where N = number of processors.

If the first move takes 50% of the time, and that is the usual case, then you absolutely can not get a speedup > 2.0. It is impossible. Except for those cases where the first move is not best. Now you search the second move. And it takes the same amount of time. So you do get a bigger boost there. But that only happens about 15% of the time. 85% of the time the speedup is going to be way below 2.0 no matter how many CPUs you use. And there is simply no way to escape the basic sequential property of the alpha/beta algorithm.
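
The bound can be checked numerically. The sketch below uses the standard form of Amdahl's law, T(N) = serial + parallel / N, with the function and parameter names being illustrative choices. The 20-minute search with a 50% serial share is the example quoted earlier in the thread, and the model deliberately ignores the case where the best root move changes.

```python
def amdahl_time(total_time, serial_fraction, n_processors):
    """Run time on n processors when only the non-serial part is sped up."""
    serial = total_time * serial_fraction
    parallel = total_time - serial
    return serial + parallel / n_processors

def amdahl_speedup(serial_fraction, n_processors):
    """T(1) / T(N) for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# First root move = 50% of a 20-minute search, searched on one processor.
for n in (1, 2, 4, 1000):
    print(n, amdahl_time(20.0, 0.5, n), round(amdahl_speedup(0.5, n), 3))
```

Under these assumptions the time approaches 10 minutes and the speedup approaches, but never reaches, 2.0 as the processor count grows.
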
Of course the score may change, but by the same logic you also cannot compare scores at the same depth, because the program may change its mind at a later depth.
I don't see where this nonsense comes from. Two moves, different scores, equal depths. I most certainly can compare them. And choose the largest. But if the depths are different, and the scores are different, I can only flip a coin to choose one. Choosing the best is wrong. Choosing the deepest depth is wrong. Choosing one of the two moves with different depths is just a purely random choice.

You do not know the best move and you always guess based on information and hope to be right.
I trust the search to at least search the two moves in an equal way, so that the resulting backed-up scores can be compared. That's what minimax and alpha/beta is all about.

If that is not the case, then you also cannot compare scores at the same depth, because in practice one move is searched to normal depth and another move is searched to a smaller depth because of late move reductions.

It may be bad to compare scores at different depths with a bad evaluation and no reductions, but that is not the practical case (maybe it was the case 20 or 30 years ago).

Uri
It is still the case today. And I am amazed you don't understand why...
1) With late move reductions, even searching to the same depth can in practice mean different depths, so if you cannot compare different depths then you also cannot compare the same depth.

2) I think an interesting experiment that everybody can do is to test A against B, where A searches a fixed number of nodes without splitting at the root (let's say 100,000 nodes per move) and B searches 50,000 nodes after every root move.

B chooses a move simply by comparing the scores, without caring about the fact that the depths are different.

I do not know if A wins or B wins, and it may depend on the program, but I see no reason to assume it is obvious that B cannot win.

Assuming nodes are nearly proportional to time, if B wins then it suggests that splitting at the root can give a speedup of more than 2:1.
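
The two move-selection rules being argued over can be written down side by side. This is only a sketch: the function names and the score table are invented for illustration (the depth 21, +1.3 and depth 19, +2.5 figures come from the quoted post, the rest are made up), and it assumes every move has a recorded score at each depth it has completed.

```python
def pick_by_score(latest):
    """Uri's rule: take the highest score and ignore depth.
    latest maps move -> (score, depth)."""
    return max(latest, key=lambda move: latest[move][0])

def pick_at_common_depth(history):
    """bob's rule: compare only at a depth every move has completed.
    history maps move -> {depth: score}."""
    common = min(max(scores) for scores in history.values())
    return max(history, key=lambda move: history[move][common])

# Illustrative numbers: move A looked good at depth 19, then dropped.
history = {
    "A": {19: 2.8, 20: 2.0, 21: 1.3},
    "B": {19: 2.5},
}
latest = {m: (s[max(s)], max(s)) for m, s in history.items()}

print(pick_by_score(latest))          # B (2.5 > 1.3)
print(pick_at_common_depth(history))  # A (2.8 > 2.5 at depth 19)
```

Which rule wins more games is the empirical question; the sketch only shows that the two can disagree on the same data.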

Uri

Yes, exactly. I could not have said it better. This is even (slightly) better than the experiment I suggested above. Uri, you should build a cluster.

Of course, nobody is arguing that the best algorithm is pure root splitting. It's a very important building block, though. And data on this phenomenon (which I actually don't have) would be very interesting.

Vas

Tony · Wed Mar 11, 2009 8:22 am

bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1) I think that it is impossible to know the algorithm Rybka is using based on output from a single position.

It is possible that something similar, though not exactly the same, is used when some illogical moves that lose the queen are analyzed, but this is not the whole story: the algorithm is based partly on "split only at the root" and partly on another idea.

2) I remember that Vas said 100 Elo based on testing at fast time control, and I suspect that at fast time control you get clearly more than 50 Elo per doubling, so in practice 5 nodes do not give a 4:1 speed improvement but clearly less (maybe 2.5:1).

Uri
The problem is, I am not basing my opinion on a single position. I am basing it on a whole game I watched in the last ACCA online event. Where Rybka was kibitzing PVs that showed _exactly_ how the search was being split and where...
I think that you cannot get conclusions based on pvs that rybka showed.

It is possible that rybka gave only part of the information in the pvs that you saw and the algorithm is based on combination of splitting in the root and something else.

Uri
It is possible Rybka had a GM playing the moves and supplying nonsensical kibitzes as well. But it is not very likely. This is going nowhere. I _did_ an algorithm like that in 1983. I put it together in under 2 weeks and won the 1983 WCCC championship with it. And I _know_ how it works and how the PVs look. There's no doubt in this case as to what is going on. There is no other possible explanation. Feel free to offer _any_ reason how one could get a PV for depth=23, then a PV for depth=19, then a PV for depth=21, then a PV for depth=18, where, if you take the best moves and collect them individually, you can see one cluster node searching a group of moves in the normal way, kibitzing a PV after each iteration, another node doing the same for a different group of moves. Etc. Or offer any explanation of why a program would kibitz (intermingled with the kibitzes of other nodes at different depths) a steady depth progression, with a score of -9.0, because there is only one possible recapture and that move is not being searched on that node.

This is _not_ guesswork. It is a simple statement of what it was doing. And how...

As far as for "why this was done" it is because it is an easy way to get a cluster search to work. But not effective. Not even +20 Elo effective...

I think I gave this algorithm already about a year ago. It seems only 1 person understood.

Without optimizations:

1. Generate all moves at the root.
2. Send every move to a client for a fixed-depth (full-width) search. Keep track of which node searches which move.
3. Calculate the score for this node based on the results returned.
4. Calculate the depth for this node from the reported depth of every move and a depth offset that depends on the difference in score between each move and the best move.
5. Select and make the move that causes the current depth to be limited.
6. Go to 1.
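
Steps 3 to 5 can be sketched as pure functions over the results the clients report. This is one possible reading, not Tony's actual code: the post does not give the offset formula, so `depth_offset` and its two-plies-per-pawn rule are assumptions, as are all names and the sample numbers. Steps 1, 2 and 6 need a real engine and are left out.

```python
def depth_offset(score_gap_cp, plies_per_pawn=2):
    """Assumed rule: credit extra plies to a move in proportion to how many
    whole pawns (100 centipawns) it trails the best move by."""
    return plies_per_pawn * (score_gap_cp // 100)

def effective_depths(results):
    """results maps move -> (score_cp, depth) as reported by the clients,
    scores taken from the point of view of the side to move at this node."""
    best = max(score for score, _ in results.values())
    return {move: depth + depth_offset(best - score)
            for move, (score, depth) in results.items()}

def node_score(results):      # step 3
    return max(score for score, _ in results.values())

def node_depth(results):      # step 4
    return min(effective_depths(results).values())

def limiting_move(results):   # step 5
    eff = effective_depths(results)
    return min(eff, key=eff.get)

# Made-up reports: "c" is far behind but has also been searched far less.
reports = {"a": (120, 20), "b": (20, 19), "c": (-300, 11)}
print(node_score(reports), node_depth(reports), limiting_move(reports))
```

Here the effective depths are 20, 21 and 19, so move "c" is the one holding the node's depth back and would be the next to receive more search.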

- The algorithm works great for "deep analysis" and "cluster searching".

- You can replace 2) with a Monte Carlo search; only 3) needs to be able to convert the score. Therefore an engine that is able to do deep analysis and cluster searching is most likely also able to do Monte Carlo searching.

- One still splits at the root, it's just a variable root. So one can claim splitting at the root without lying.

- What can happen is that the best move scores 1 pawn above move 2 and is therefore searched to, e.g., 21 ply, then fails low, and move 2 becomes best but has (so far) only been searched to 19 ply.

- If needed, one can raise the search depth (rather than growing the in-memory tree): just search the same positions on the same client.

- One could even use a 2nd instance of the engine to walk through the tree, and have it work at a selected position.

Tony

M ANSARI · Wed Mar 11, 2009 8:39 am

Clustering with shared memory will always be the holy grail, but I think with current networking technology the latencies are such that it is practically impossible, or at least somebody has to come up with a revolutionary way to overcome these latencies to make massively parralel clusters on differently endowed hardware work. It is probably possible to make some sort of a hardware equivalent constant (such as CCRL do with equating performance of their 20 40 matches with Athlon 4800 processors) and that might overcome the disparity of hardware ... but man that seems like a very tough nut to crack and to make work. Obviously with longer time controls things would be more robust, but I think the way Vas is doing it seems much simpler and by far the safer option, especially if it is to be used in official matches.

I went through Rybka cluster games with a very highly overclocked Octa (8 cores at 5 GHz) and the cluster definitely finds the stronger moves faster than my machine. I am not sure how much Elo a 5 GHz Octa should be worth over a 4 GHz Octa (which is the lowest common denominator in the cluster), but the cluster seems to easily offset the 20% lower speed.
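As a rough back-of-the-envelope figure, assuming the 50 Elo per speed doubling mentioned earlier in the thread (an assumption, not a measurement; the real value depends on engine and time control), the clock-speed gap can be converted to Elo like this. The function name is hypothetical.

```python
import math

# Assumption: each doubling of speed is worth about 50 Elo.
ELO_PER_DOUBLING = 50.0


def elo_from_speed_ratio(ratio):
    """Elo gain expected from being `ratio` times faster."""
    return ELO_PER_DOUBLING * math.log2(ratio)


# 5 GHz versus 4 GHz on otherwise identical hardware
print(round(elo_from_speed_ratio(5.0 / 4.0), 1))
```

Under that assumption the 5 GHz machine is ahead by roughly 16 Elo, so the cluster only needs a modest effective speedup to offset it.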

I think what has to be recognized is that accuracy of move selection for the cluster nodes to ponder is very critical. I don't think that what works with Rybka will necessarily get equivalent results with another engine. The set of moves chosen for pondering needs to be quite accurate, and Rybka seems much more adept at choosing such moves at lower depth than other engines. More likely it is due to a strong initial evaluation. I would compare it to trying to make a basket from a distance that keeps increasing. The chances of making a basket as you go further and further back are much better if your initial grouping of shots is quite accurate. Rybka seems to discard bad moves at very low ply and thus chances are it would gain more than other engines in such an implementation of a cluster. For sure there will be times when the cluster will miss a move that a shared memory equivalent would not, but I think those will be very rare instances that would have limited effect on performance ... especially at longer time controls.

Gian-Carlo Pascutto · Wed Mar 11, 2009 12:14 pm

Hi guys,

as you know I spent a lot of time on Go programs and part of this involved working with UCT and parallelizing it.

I think few people realize it because there's not many people working on both chess and go, but a strong implementation of UCT and alpha-beta with heavy pruning and LMR are much closer than one would expect, or at least, they are converging.

Very simply explained, UCT switches between exploiting the best move (extending the mainline) and exploring alternates. I've found (and so did some of the other top teams) that UCT is the strongest when you do not explore at all, but just keep hammering on the mainline until the score drops below an alternative. This is a somewhat counterintuitive result. As a result of this, despite the branching factor, top UCT Go programs search as deep as or deeper than top chess programs.

In my opinion, this is almost the same as we ended up with alpha-beta, i.e. when using LMR heavily (e.g. with N=0), you just hammer the mainline until it fails low compared to another move (which might have a lot smaller depth), then you start hammering on that one, etc.

Now, in Beijing one of the top Go teams announced a result which blew my socks off. They were parallelizing UCT in several ways, some of which more fit for clustering than others. My amazement was about the result of 2 approaches:

1) Tree splitting, comparable to YBW/DTS in chess. This is what I had implemented for my Go program, and I was almost certain that this was "obviously" the approach that gave the most speedup.

2) Multiple runs. This exploits the fact that UCT has some randomness. They just ran the entire algorithm several times in parallel (with nothing shared!), and "added" the scores (this is something you cannot do directly with alphabeta, but methods to achieve the effect have been described in this thread).

Now, coming from computer chess, I would never have believed this (2) could work, because searching deep(er than your opponent) is what wins games, also in go, and multiple runs obviously doesn't help you search deeper at all.

Unfortunately for me, the results were the opposite. The multiple runs system exhibits much better speedups (not just in NPS, but in strength) than classical tree parallelization.
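A toy sketch of the "multiple runs" idea (2), reduced to a single root node so it stays short: each run is an independent UCB1 bandit over the root moves with its own random seed, nothing is shared, and the per-move statistics are simply added at the end. The move names, the fixed win rates in `TRUE_WIN_RATE`, and the exploration constant are all hypothetical stand-ins for real playouts; this is not the Go team's implementation.

```python
import math
import random

# Hypothetical playout win rates standing in for real simulations.
TRUE_WIN_RATE = {"a": 0.55, "b": 0.50, "c": 0.40}


def single_run(seed, simulations, c=1.4):
    """One independent run: UCB1 over the root moves, nothing shared."""
    rng = random.Random(seed)
    stats = {m: [0, 0] for m in TRUE_WIN_RATE}  # move -> [wins, visits]
    for t in range(1, simulations + 1):
        def ucb(m):
            wins, visits = stats[m]
            if visits == 0:
                return float("inf")  # try every move once first
            return wins / visits + c * math.sqrt(math.log(t) / visits)

        move = max(stats, key=ucb)
        stats[move][1] += 1
        if rng.random() < TRUE_WIN_RATE[move]:
            stats[move][0] += 1
    return stats


def merge(runs):
    """'Add' the scores: sum wins and visits per move across all runs."""
    total = {m: [0, 0] for m in TRUE_WIN_RATE}
    for stats in runs:
        for move, (wins, visits) in stats.items():
            total[move][0] += wins
            total[move][1] += visits
    return total


def best_move(stats):
    """Pick the move with the highest merged winning rate."""
    return max(stats, key=lambda m: stats[m][0] / max(stats[m][1], 1))


# Four runs that could each live on a separate cluster node.
merged = merge([single_run(seed, 2000) for seed in range(4)])
print(best_move(merged), merged)
```

The only communication needed is the final merge, which is why this variant suits a cluster with slow interconnects.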

For this reason, I have no problem believing that in a program that's pruning as insanely as Rybka, just searching non-PV moves with an open(er) window gives a strength increase much larger than would normally be expected. (cue Bob here with his "we did that 20 years ago and it doesn't work" rant)

The thing is that I think this means our SERIAL systems (LMR and UCT minus exploration) are suboptimal. But how to fix them? Or can we get closer to optimal speedups by making our parallel implementation fundamentally different from our serial ones?

Food for thought!

Gian-Carlo Pascutto · Wed Mar 11, 2009 12:25 pm

Tony wrote: - What can happen is that the best move scores 1 pawn above move 2 and is therefore searched to e.g. 21 ply, then fails low, and move 2 becomes best but has only been searched to 19 ply so far.

This sounds so familiar

Example with UCT (Monte Carlo Tree search):

move a: 10 000 simulations 55% winning rate
move b: 1 000 simulations 50% winning rate

after a while of searching:

move a: 20 000 simulations 49% winning rate
move b: 3 000 simulations 51% winning rate

Which one do you play? Classical UCT says you should play (a), or even better, get a time extension. I try to predict whether (b) will end up above (a) and play the prediction.
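One simple way to see the trend in the numbers above (not necessarily the prediction method meant here) is to look only at the simulations added between the two snapshots. The helper name is hypothetical.

```python
def recent_rate(sims_before, rate_before, sims_after, rate_after):
    """Winning rate of only the simulations added between two snapshots."""
    wins_before = sims_before * rate_before
    wins_after = sims_after * rate_after
    return (wins_after - wins_before) / (sims_after - sims_before)


# move a: 10 000 sims at 55%  ->  20 000 sims at 49%
a_recent = recent_rate(10_000, 0.55, 20_000, 0.49)
# move b:  1 000 sims at 50%  ->   3 000 sims at 51%
b_recent = recent_rate(1_000, 0.50, 3_000, 0.51)

print(f"a: {a_recent:.3f}  b: {b_recent:.3f}")
```

The latest 10 000 simulations of (a) won only about 43%, while the latest 2 000 simulations of (b) won about 51.5%, so the cumulative 49% for (a) still flatters it.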

Gian-Carlo Pascutto · Wed Mar 11, 2009 12:27 pm

Tony wrote: 4. Calculate the depth for this node, calculated from the reported depth of every move and a depth offset depending on the difference in score between each move and the best move.
5. Select and make the move that causes the current depth to be limited.

If you can convert depth to nodes (you can, approximately), and scores to winning odds (you also can), then you can use UCT to select which move to expand.

(5) is then exactly the step of selecting the move with the upper confidence bound
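A minimal sketch of that selection, with every conversion labeled as an assumption: depth is turned into a node count with a fixed effective branching factor of 2, the score is turned into a winning probability with an Elo-style logistic curve (4 pawns per factor of 10 in odds), and the exploration constant is arbitrary. All function names are hypothetical.

```python
import math


def depth_to_nodes(depth, branching=2.0):
    """Assumption: node count grows as branching ** depth."""
    return branching ** depth


def score_to_win_prob(pawns):
    """Assumption: logistic mapping from a score in pawns to winning odds."""
    return 1.0 / (1.0 + 10.0 ** (-pawns / 4.0))


def ucb(pawns, depth, total_nodes, c=0.5):
    """UCB1 value: expected result plus an exploration bonus."""
    nodes = depth_to_nodes(depth)
    return score_to_win_prob(pawns) + c * math.sqrt(math.log(total_nodes) / nodes)


def select_move(moves):
    """moves: list of (name, score_in_pawns, reported_depth) tuples."""
    total = sum(depth_to_nodes(depth) for _, _, depth in moves)
    return max(moves, key=lambda m: ucb(m[1], m[2], total))[0]


# Equal scores: the shallower move has the higher bound and is expanded.
print(select_move([("a", 0.2, 21), ("b", 0.2, 19)]))
# A one-pawn lead: the exploration bonus is tiny, so the mainline is kept.
print(select_move([("a", 1.0, 21), ("b", 0.0, 19)]))
```

Because node counts grow exponentially with depth, the exploration term is very small here, which matches the "keep hammering the mainline" behaviour described earlier.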

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Wed Mar 11, 2009 3:45 pm

Vasik Rajlich wrote: The lack of shared memory imposes new constraints on your search and changes the things you need to do. In particular, without shared memory, you need to split the search tree into chunks which will remain relatively stable from iteration to iteration.

Vas

I don't get why you guys are not doing shared memory on a cluster. You can't distribute all hash entries, but you can share the deep ones, which should be the most important. I'm broadcasting them over UDP... (which means there is no actual remote-lookup-latency on hash probes! of course you want a high-bandwidth network for this)
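A minimal sketch of the idea, assuming a 12-byte entry layout, a depth threshold of 10 ply, and a caller-supplied broadcast address; the layout, threshold, and all names are hypothetical and not the actual wire format. Only entries deep enough to be worth the bandwidth are sent, and receivers store whatever arrives in their local table, so a probe never waits on the network.

```python
import socket
import struct

# Assumed wire format, network byte order, no padding:
# 64-bit key, 16-bit signed score (centipawns), 8-bit depth, 8-bit bound type.
ENTRY_FORMAT = "!QhBB"
MIN_SHARE_DEPTH = 10  # assumption: only entries at least this deep are shared


def pack_entry(key, score, depth, bound):
    """Serialize one hash entry into a fixed-size datagram payload."""
    return struct.pack(ENTRY_FORMAT, key, score, depth, bound)


def unpack_entry(data):
    """Inverse of pack_entry; returns (key, score, depth, bound)."""
    return struct.unpack(ENTRY_FORMAT, data)


def should_share(depth):
    """Shallow entries are cheap to recompute, so they stay local."""
    return depth >= MIN_SHARE_DEPTH


def make_broadcast_socket():
    """UDP socket that is allowed to send to a broadcast address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return sock


def share_entry(sock, addr, key, score, depth, bound):
    """Fire-and-forget broadcast of a deep entry; returns True if it was sent."""
    if not should_share(depth):
        return False
    sock.sendto(pack_entry(key, score, depth, bound), addr)
    return True
```

UDP gives no delivery guarantee, which is acceptable in this sketch because a lost entry only costs a re-search rather than correctness.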
