Clustering etc. thread

bob · Post by **bob** » Wed Mar 11, 2009 5:24 pm

Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
I _did_ an algorithm like that in 1983. I put it together in under 2 weeks and won the 1983 WCCC championship with it. And I _know_ how it works and how the PVs look.

Actually, I'd be curious to hear more about it. How did you split the root moves? Which moves got PV searches and which ones got scout (null-window) searches? What beta did you use for the scout searches?

I couldn't find anything useful in the public domain about splitting work without shared memory, except for the Cluster Toga algorithm, which wasn't really suitable for Rybka (among other things, it doesn't apply to analysis and would be useless for users).

Vas
What we did back then was more dynamic that that. I had the root move list sorted just as I do today (previous iteration node count). Processor 1 searched the first move with a normal aspiration window. Processor 2 did the same. When one of them reported back with a new best move, the alpha/beta value was shipped to the other processor, which would take a short "break", update all the bounds, and then check to see if anything will now cutoff where it didn't when we started. We then searched the root moves one at a time, whenever a processor finished a move it grabbed another (dynamic load balancing) until the list was gone.

We always did an aspiration-type search in CB, where the root window was very narrow, something like +/- 0.25 pawns, for the first move. At the time we were using PVS (this was something from the late 70's and was used by belle and ourselves as well as others) which is (I assume) what you mean by "scout" (which is not quite the same as PVS but close enough for discussion).

I should add that both the univac 1100 and Cray were both shared-memory. I'm looking at cluster-based search right now because of this 70 node 8-core cluster we have. I can search about 20M nodes per sec on one node, which gives a potential for something faster than deep blue (we could possibly hit 1400M nodes per second although there are great issues to overcome. But Hsu and I did talk a lot about his parallel search and I like his two-level approach (split at early plies among the nodes, each node splits at late plies among the local processors)...
Aha, thanks. Like many things, this won't work without shared memory. Without shared memory, if 'processor A' (to use your terminology) takes moves x at some iteration, it needs to take the same move at the next iteration as well.

Basically, without shared memory, you want to have the same units processing the same parts of the tree throughout the entire search. I agree with Hsu's comment - without shared memory, work needs to be split at or near the root.

Vas

ps. I use the term "scout search" to refer to any search with a zero-width window.
That's what I thought (scout search).

The way I am looking at splitting on a cluster is to use a "split hash table". I store the hash signature and node ID when I split, so that the same node searches the same sub-tree each time to take advantage of the hash, killers, etc. There are issues with this as the next iteration can greatly alter the shape of the tree and give one node a set of moves that are much harder to search than others, producing an imbalance. I'm working on sorting out the details for this kind of stuff as the code gets written...

That's right. You'll need to consider whether you want to change the shape of your search tree to maximize the stability of the work distribution (ie. the degree to which the same cluster units handle the same parts of the search tree).

This is where root splitting is special. Root moves have the unique property that they are guaranteed to require searching.

Of course, root splitting alone isn't the answer.

Vas

I'm working on the "two-level" approach, although I am also considering the more complicated issues encountered on a cluster that has NUMA (AMD) nodes. Because you also have issues on a single node dealing with trying to maximize local memory accesses...

I have tried to explain that to others for years (splitting at the root). I split everywhere, and particularly like splitting at the root (once I work my way back to the root) since there is almost zero overhead in such searches (there is always critical hash information that might or might not be shared from one move to the next depending on how things get done in time.

But my current cluster approach (good way from working yet however) is to split everywhere (although not too far from the root due ot the overhead of a cluster split even with infiniband interconnect) but try to repeat with respect to which node searches which position so that at least the hash info they have will be as helpful as possible, even though it is way less efficient than a pure shared-memory "share everything" search.

I have seen reluctance to split at the root for many, because it adds another level of race conditions since the root move has to be updated...

bob · Post by **bob** » Wed Mar 11, 2009 5:32 pm

Tony wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1)I think that it is impossible to know the algorithm rybka is using based on output from a single position.

It is possible that something similiar that is not exactly the same is used
when some illogical moves that lose the queen are analyzed but this is not all the story and the algorithm is based partly on "split only at the root" and partly on another idea.

2)I remember that Vas said 100 elo based on testing at fast time control and I suspect that at fast time control you get clearly more than 50 elo per doubling so practically 5 nodes do not give 4:1 speed improvement but clearly less than it(maybe 2.5:1).

Uri
The problem is, I am not basing my opinion on a single position. I am basing it on a whole game I watched in the last ACCA online event. Where Rybka was kibitzing PVs that showed _exactly_ how the search was being split and where...
I think that you cannot get conclusions based on pvs that rybka showed.

It is possible that rybka gave only part of the information in the pvs that you saw and the algorithm is based on combination of splitting in the root and something else.

Uri
It is possible Rybka had a GM playing the moves and supplying nonsensical kibitzes as well. But it is not very likely. This is going nowhere. I _did_ an algorithm like that in 1983. I put it together in under 2 weeks and won the 1983 WCCC championship with it. And I _know_ how it works and how the PVs look. There's no doubt in this case as to what is going on. There is no other possible explanation. Feel free to offer _any_ reason how one could get a PV for depth-23, then a PV for depth=19, then a PV for depth=21, then a PV for depth=18, were if you take the best moves, and collect them individually, you can see one cluster node searching a group of moves in the normal way, kibitzing a PV after each iteration, another node doing the same for a different group of nodes. Etc. Or offer any explanation of why a program would kibitz (intermingled with the kibitzes of other nodes at different depths) a steady depth progression, with a score of -9.0, because there is only one possible recapture and that move is not being searched on that node.

This is _not_ guesswork. It is a simple statement of what it was doing. And how...

As far as for "why this was done" it is because it is an easy way to get a cluster search to work. But not effective. Not even +20 Elo effective...
I think I gave this algoritm already about a year ago. It seems only 1 person understood

Without optimizations:

1. Generate all moves at the root
2. Send every move to a client for a fixed depth search (full width) Keep track which node search what move.
3. Calculate the score for this node based on the results returned.
4. Calculate the depth for this node, calculated from the reported depth of every move and a depth offset depending on the difference in score between each move and the best move.
5. Select and make the move that causes the current depth to be limited.
6 Goto 1

- Algoritm works great for "Deep analyses" and "Clustersearching"

- You can replace 2) with a monte Carlo search, only 3) needs to be able to convert the score . Therefor a engine that is able to do deep analyses, and cluster searching is most likely also able to do monte carlo searching

- One still splits at the root, it's just a variable root. So one can claim splitting at the root without lying.

- What can happen is that the best move scores 1 pawn above move 2, thererfore searched to fe 21 ply, then fails low and move 2 becomes best but was only seached (yet) to 19 ply.

- If needed, one can raise the searchdepth ( rather than growing the in memory tree), just search the same positions on the same client

- One could even use a 2nd instance of the engine to walk through the tree, and have it work at a selected position.

Tony

I don't see how that addresses all of the following:

move A is searched to depth N. Move B is searched to depth N+2.

(1) A is much better than B (score). Which to play?

(2) B is much better than A (score). Which to play?

(3) A and B are very close. Which to play?

Some seem to think it is easy to handle case (2). I would only offer the case of how many times do you search the move at depth N+2 and get a good score, and then fail high on _another_ move, sometimes a devastating fail-high? (answer is some fraction of 15% since we change our mind on the last iteraton about 15% of the time. The "easy" answer for (2) will miss this condition. So let's change (2) a bit.

(2a) B is much better than A, but B is still a relatively bad score. Which to play?
It is too easy to say "X" is greater than "Y" so just play X. But what if X is not a good score and Y (searched 2 plies deeper) is a winner?

That's why unsynchronized search at the root has never caught on. Monty Newborn tried it with his parallel Ostrich program. Jonathan Schaeffer tried it in Sun Phoenix (cluster based) with a fast tactical searcher always going deeper than the slower positional searcher. And trying to somehow use the tactical searcher to help the positional searcher avoid traps did not work well. If both agree on the best move, you are home-free. But if not...

bob · Post by **bob** » Wed Mar 11, 2009 5:40 pm

Gian-Carlo Pascutto wrote:
Vasik Rajlich wrote: The lack of shared memory imposes new constraints on your search and changes the things you need to do. In particular, without shared memory, you need to split the search tree into chunks which will remain relatively stable from iteration to iteration.

Vas
I don't get why you guys are not doing shared memory on a cluster. You can't distribute all hash entries, but you can share the deep ones, which should be the most important. I'm broadcasting them over UDP... (which means there is no actual remote-lookup-latency on hash probes! of course you want a high-bandwidth network for this)

I had thought of the UDP solution. But there is a problem. On a cluster with very fast nodes (I have a 70 node cluster with dual quad-core 2.33ghz xeons) one can produce a huge volume of traffic. Even with infiniband which just whips gigabit ethernet to pieces, you run into a different issue. You can generate so many broadcasts that (a) you can see system overhead go up. In a worst-case on my cluster here, you can have 560 broadcasts per hash store since every processor will need to broadcast when it updates the hash. And that happens quite often near the tips. That means lots of lost packets also, in addition to the overhead of handling the packets.

There have been lots of attempts to do shared hash in cluster programs. All going to a faster connection fabric does is just push the problem off one ply deeper or so.

At some point I am going to run some tests to see how deeply in the tree the boradcast can be done without impacting everything else. I suspect this will be one of those "tunable" parameters since there are so many ways to connect a cluster with so many different latency values.

bob · Post by **bob** » Wed Mar 11, 2009 5:45 pm

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1)I think that it is impossible to know the algorithm rybka is using based on output from a single position.

It is possible that something similiar that is not exactly the same is used
when some illogical moves that lose the queen are analyzed but this is not all the story and the algorithm is based partly on "split only at the root" and partly on another idea.

2)I remember that Vas said 100 elo based on testing at fast time control and I suspect that at fast time control you get clearly more than 50 elo per doubling so practically 5 nodes do not give 4:1 speed improvement but clearly less than it(maybe 2.5:1).

Uri
The effective speedup is probably somewhere between 2.5:1 and 3:1 for 5 nodes, which is what Lukas had when he tested all of this.

Now he's up to 9 nodes BTW

Vas
Can we stay in the real world? Splitting at the root can not produce a 2.5x speedup, when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone that has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing

Vas
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. More changes is better.

Sure. If you're not changing your mind, it doesn't matter what kind of speedup you have.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.

That's just life without shared memory. Any cluster implementation is going to have problems in a position like that.

I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either.
With infinite # of processors and splitting only at the root, you will get a lot more than 1.5x.

Vas
No program today have search that is good enough but in theory
if you have a good search the speed up may be smaller from splitting at the root for the simple reason that the value of 1.5x speed improvement is bigger than the value of one ply with no pruning.

I think that
some type of bad evaluation also can cause the speed up to be smaller
or even negative.

Imagine that you search with no pruning and extensions and no qsearch and imagine that you have an evaluation that gives no bonus for the side to move so practically you often evaluate even depth as better than odd depths(maybe it should be the opposite and I did not think about it but the idea is the same)

If you split at the root you may get depth 7 for move A and depth 8 for move B and prefer move B not because it is better but only because you searched to even depth.

This problem does not happen without splitting at the root because without splitting at the root you always get the same depth for all moves.

Uri
You are badly misusing terminology.

1. "splitting at the root" does not mean each root move gets searched to different depth. I split at the root in current crafty. that only means that I do a parallel search on the root moves as well as deeper moves, because this is a far more efficient way to search.

2. You are using what is called "unsynchronized search" where each node searched a different move (or set of moves) at the root, and when iteration N is done, N+1 is started without regard for how the other moves are progressing on other nodes.

this is useless.

How can you choose between depth 21, eval=+1.3, and depth 19, eval +2.5?? You can't. This has been tried in the past, by Newborn, by Schaeffer, and by others. There is no way to compute any sort of equivalence function so that you can decide which of the above is better. The depth 19 move might be even higher at depth 21. Or it might be way lower. The only way to discover this is to search both moves to the same depth. Anything else is beyond hopeless and is a coin flip. You can't even choose between depth 21, +1.5, and depth 19, -2.0, because by depth 21 the -2.0 score might be +5.0...
I can certainly choose and the simplest choice is to choose higher score is better.
depth 19 eval=+2.5 is better than depth 21 eval=+1.3 for the simple reason that 2.5>1.3

I believe that with relatively good evaluation it is not so bad choice(not that splitting at the root is good but it may give effective speed improvement of 2 or more than 2 assuming that all root moves are searched).

I may be wrong and the only way to know is by testing.

Uri
There is absolutely no doubt you are wrong. How many times have you seen a program search a move to depth N with a + score, then at depth n+1 the score drops off the map and it switches to a new move? If you believe what you wrote, why not stop your search when the eval gets to a good point? Why continue to search deeper?

As far as a speedup of > 2, it simply will not/can not happen. Amdahl's law says that the overall speedup for an algorithm is:

speedup = time(N) / T(P)

T(P) = sequential_processing_time + parallel_processing_time / N

where N = number of processors.

If the first move takes 50% of the time, and that is the usual case, then you absolutely can not get a speedup > 2.0. It is impossible. Except for those cases where the first move is not best. Now you search the second move. And it takes the same amount of time. So you do get a bigger boost there. But that only happens about 15% of the time. 85% of the time the speedup is going to be way below 2.0 no matter how many CPUs you use. And there is simply no way to escape the basic sequential property of the alpha/beta algorithm.
Of course the score may change but by the same logic you also cannot compare between scores at the same depths because the program may change its mind in later depth.
I don't see where this nonsense comes from. Two moves, different scores, equal depths. I most certainly can compare them. And choose the largest. But if the depths are different, and the scores are different, I can only flip a coin to choose one. Choosing the best is wrong. Choosing the deepest depth is wrong. Choosing one of the two moves with different depths is just a purely random choice.

You do not know the best move and you always guess based on information and hope to be right.
I trust the search to at least search the two moves in an equal way, so that the resulting backed-up scores can be compared. That's what minimax and alpha/beta is all about.

If it is not the case then you cannot also compare scores at the same depth because practically one move is searched to normal depth and another move is searched to smaller depth because of late move reductions.

It may be bad to compare score in different depth with
bad evaluation and no reductions but it is not the practical case(maybe it was the case 20 or 30 years ago).

Uri
It is still the case today. And I am amazed you don't understand why...
1)With late move reductions even searching to the same depth can be practically different depth so if you cannot compare different depth then you also cannot compare the same depth.

All I can say to that is <sigh>

If you don't think we can compare scores from the same depth, then that would mean you do not believe alpha/beta or minimax work, because that is _exactly_ what they are based on. I'm not going to continue with the apples-to-oranges nonsense. If that is your belief, then that is your belief. It is wrong, however.

2)I think that an interesting experiment that everybody can do may be to test A against B when A search at fixed number of nodes without splitting the root(let say 100,000 nodes per move) and B search after every root move at 50,000 nodes per move.

B choose a move simply by comparing the scores without caring about the fact that depths are different.

I do not know if A wins or B wins and it may be dependent on the program but I see no reason to assume that it is obvious that B cannot win.

The reason B loses is intuitively obvious to the casual observer (who understands alpha/beta minimax). But apparently the point is lost here so just carry that belief with you...

Assuming nodes are nearly proportional to time if
B wins then it suggests that splitting at the root can give speed up of more than 2:1.

Uri

And again, I gave you the math. The math is absolutely infallible. And yet you keep saying 2+2 = 5, in spite of any mathematical analysis you are given...

If we can't communicate via accurate math, I can't communicate with "assuming..." and "maybe..." and "it might..." This is an exact science.

bob · Post by **bob** » Wed Mar 11, 2009 5:50 pm

Vasik Rajlich wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1)I think that it is impossible to know the algorithm rybka is using based on output from a single position.

It is possible that something similiar that is not exactly the same is used
when some illogical moves that lose the queen are analyzed but this is not all the story and the algorithm is based partly on "split only at the root" and partly on another idea.

2)I remember that Vas said 100 elo based on testing at fast time control and I suspect that at fast time control you get clearly more than 50 elo per doubling so practically 5 nodes do not give 4:1 speed improvement but clearly less than it(maybe 2.5:1).

Uri
The effective speedup is probably somewhere between 2.5:1 and 3:1 for 5 nodes, which is what Lukas had when he tested all of this.

Now he's up to 9 nodes BTW

Vas
Can we stay in the real world? Splitting at the root can not produce a 2.5x speedup, when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone that has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing

Vas
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. More changes is better.

Sure. If you're not changing your mind, it doesn't matter what kind of speedup you have.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.

That's just life without shared memory. Any cluster implementation is going to have problems in a position like that.

I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either.
With infinite # of processors and splitting only at the root, you will get a lot more than 1.5x.

Vas
No program today have search that is good enough but in theory
if you have a good search the speed up may be smaller from splitting at the root for the simple reason that the value of 1.5x speed improvement is bigger than the value of one ply with no pruning.

I think that
some type of bad evaluation also can cause the speed up to be smaller
or even negative.

Imagine that you search with no pruning and extensions and no qsearch and imagine that you have an evaluation that gives no bonus for the side to move so practically you often evaluate even depth as better than odd depths(maybe it should be the opposite and I did not think about it but the idea is the same)

If you split at the root you may get depth 7 for move A and depth 8 for move B and prefer move B not because it is better but only because you searched to even depth.

This problem does not happen without splitting at the root because without splitting at the root you always get the same depth for all moves.

Uri
You are badly misusing terminology.

1. "splitting at the root" does not mean each root move gets searched to different depth. I split at the root in current crafty. that only means that I do a parallel search on the root moves as well as deeper moves, because this is a far more efficient way to search.

2. You are using what is called "unsynchronized search" where each node searched a different move (or set of moves) at the root, and when iteration N is done, N+1 is started without regard for how the other moves are progressing on other nodes.

this is useless.

How can you choose between depth 21, eval=+1.3, and depth 19, eval +2.5?? You can't. This has been tried in the past, by Newborn, by Schaeffer, and by others. There is no way to compute any sort of equivalence function so that you can decide which of the above is better. The depth 19 move might be even higher at depth 21. Or it might be way lower. The only way to discover this is to search both moves to the same depth. Anything else is beyond hopeless and is a coin flip. You can't even choose between depth 21, +1.5, and depth 19, -2.0, because by depth 21 the -2.0 score might be +5.0...
I can certainly choose and the simplest choice is to choose higher score is better.
depth 19 eval=+2.5 is better than depth 21 eval=+1.3 for the simple reason that 2.5>1.3

I believe that with relatively good evaluation it is not so bad choice(not that splitting at the root is good but it may give effective speed improvement of 2 or more than 2 assuming that all root moves are searched).

I may be wrong and the only way to know is by testing.

Uri
There is absolutely no doubt you are wrong. How many times have you seen a program search a move to depth N with a + score, then at depth n+1 the score drops off the map and it switches to a new move? If you believe what you wrote, why not stop your search when the eval gets to a good point? Why continue to search deeper?

As far as a speedup of > 2, it simply will not/can not happen. Amdahl's law says that the overall speedup for an algorithm is:

speedup = time(N) / T(P)

T(P) = sequential_processing_time + parallel_processing_time / N

where N = number of processors.

If the first move takes 50% of the time, and that is the usual case, then you absolutely can not get a speedup > 2.0. It is impossible. Except for those cases where the first move is not best. Now you search the second move. And it takes the same amount of time. So you do get a bigger boost there. But that only happens about 15% of the time. 85% of the time the speedup is going to be way below 2.0 no matter how many CPUs you use. And there is simply no way to escape the basic sequential property of the alpha/beta algorithm.
Of course the score may change but by the same logic you also cannot compare between scores at the same depths because the program may change its mind in later depth.
I don't see where this nonsense comes from. Two moves, different scores, equal depths. I most certainly can compare them. And choose the largest. But if the depths are different, and the scores are different, I can only flip a coin to choose one. Choosing the best is wrong. Choosing the deepest depth is wrong. Choosing one of the two moves with different depths is just a purely random choice.

You do not know the best move and you always guess based on information and hope to be right.
I trust the search to at least search the two moves in an equal way, so that the resulting backed-up scores can be compared. That's what minimax and alpha/beta is all about.

If it is not the case then you cannot also compare scores at the same depth because practically one move is searched to normal depth and another move is searched to smaller depth because of late move reductions.

It may be bad to compare score in different depth with
bad evaluation and no reductions but it is not the practical case(maybe it was the case 20 or 30 years ago).

Uri
It is still the case today. And I am amazed you don't understand why...
1)With late move reductions even searching to the same depth can be practically different depth so if you cannot compare different depth then you also cannot compare the same depth.

2)I think that an interesting experiment that everybody can do may be to test A against B when A search at fixed number of nodes without splitting the root(let say 100,000 nodes per move) and B search after every root move at 50,000 nodes per move.

B choose a move simply by comparing the scores without caring about the fact that depths are different.

I do not know if A wins or B wins and it may be dependent on the program but I see no reason to assume that it is obvious that B cannot win.

Assuming nodes are nearly proportional to time if
B wins then it suggests that splitting at the root can give speed up of more than 2:1.

Uri
Yes, exactly. I could not have said it better. This is even (slightly) better than the experiment I suggested above. Uri, you should build a cluster.

Of course, nobody is arguing that the best algorithm is pure root splitting. It's a very important building block, though. And data on this phenomenon (which I actually don't have) would be very interesting.

Vas

If you really believe that, then I am convinced that both of you are hopeless with regard to this issue.

The math _is_ simple. All these tests have been done in the past. I even did it on far better hardware than you have (shared memory, not a cluster). You are not going to produce a speedup of > 2.0 except for that rare "super-linear type position" that happens in a normal parallel search. Over the long-haul, a 4 node cluster is going to average a 1.5x speedup. If you go to a relaxed (unsynchronized) model, you can keep all processors busy all the time, which helps a small amount. But then that introduces another random element of comparing scores from different depths, which alpha/beta simply does not do.

I suppose I could run this test easily enough. I could use 8 nodes and only split at the root, and modify how the search works so that each node just keeps searching (unsynchronized) until time expires. But I already know the performance results having done this in the past. And what I get would not be doable on a cluster because I would still have a fully shared hash. But this is a test that really is pointless to run because the results are already known...

bob · Post by **bob** » Wed Mar 11, 2009 5:54 pm

Vasik Rajlich wrote:
bob wrote:
Uri Blass wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1)I think that it is impossible to know the algorithm rybka is using based on output from a single position.

It is possible that something similiar that is not exactly the same is used
when some illogical moves that lose the queen are analyzed but this is not all the story and the algorithm is based partly on "split only at the root" and partly on another idea.

2)I remember that Vas said 100 elo based on testing at fast time control and I suspect that at fast time control you get clearly more than 50 elo per doubling so practically 5 nodes do not give 4:1 speed improvement but clearly less than it(maybe 2.5:1).

Uri
The effective speedup is probably somewhere between 2.5:1 and 3:1 for 5 nodes, which is what Lukas had when he tested all of this.

Now he's up to 9 nodes BTW

Vas
Can we stay in the real world? Splitting at the root can not produce a 2.5x speedup, when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone that has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing

Vas
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. More changes is better.

Sure. If you're not changing your mind, it doesn't matter what kind of speedup you have.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.

That's just life without shared memory. Any cluster implementation is going to have problems in a position like that.

I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either.
With infinite # of processors and splitting only at the root, you will get a lot more than 1.5x.

Vas
No program today have search that is good enough but in theory
if you have a good search the speed up may be smaller from splitting at the root for the simple reason that the value of 1.5x speed improvement is bigger than the value of one ply with no pruning.

I think that
some type of bad evaluation also can cause the speed up to be smaller
or even negative.

Imagine that you search with no pruning and extensions and no qsearch and imagine that you have an evaluation that gives no bonus for the side to move so practically you often evaluate even depth as better than odd depths(maybe it should be the opposite and I did not think about it but the idea is the same)

If you split at the root you may get depth 7 for move A and depth 8 for move B and prefer move B not because it is better but only because you searched to even depth.

This problem does not happen without splitting at the root because without splitting at the root you always get the same depth for all moves.

Uri
You are badly misusing terminology.

1. "splitting at the root" does not mean each root move gets searched to different depth. I split at the root in current crafty. that only means that I do a parallel search on the root moves as well as deeper moves, because this is a far more efficient way to search.

2. You are using what is called "unsynchronized search" where each node searched a different move (or set of moves) at the root, and when iteration N is done, N+1 is started without regard for how the other moves are progressing on other nodes.

this is useless.

How can you choose between depth 21, eval=+1.3, and depth 19, eval +2.5?? You can't. This has been tried in the past, by Newborn, by Schaeffer, and by others. There is no way to compute any sort of equivalence function so that you can decide which of the above is better. The depth 19 move might be even higher at depth 21. Or it might be way lower. The only way to discover this is to search both moves to the same depth. Anything else is beyond hopeless and is a coin flip. You can't even choose between depth 21, +1.5, and depth 19, -2.0, because by depth 21 the -2.0 score might be +5.0...
I use the same terminology that Uri does. Splitting at the root means splitting at the root. One unit handles moves a, b, c, another handles moves d, e, f. What the units do with those moves is an implementation detail.

Vas

The problem is that "splitting at the root" does _not_ imply "unsynchronized splitting at the root".

The word "unsynchronized" has to be included, because I split at the root, and always have, and I do not do any sort of unsynchronized iterative deepening search at all. My search proceeds ply by ply for _all_ root moves.

bob · Post by **bob** » Wed Mar 11, 2009 6:10 pm

Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1)I think that it is impossible to know the algorithm rybka is using based on output from a single position.

It is possible that something similiar that is not exactly the same is used
when some illogical moves that lose the queen are analyzed but this is not all the story and the algorithm is based partly on "split only at the root" and partly on another idea.

2)I remember that Vas said 100 elo based on testing at fast time control and I suspect that at fast time control you get clearly more than 50 elo per doubling so practically 5 nodes do not give 4:1 speed improvement but clearly less than it(maybe 2.5:1).

Uri
The effective speedup is probably somewhere between 2.5:1 and 3:1 for 5 nodes, which is what Lukas had when he tested all of this.

Now he's up to 9 nodes BTW

Vas
Can we stay in the real world? Splitting at the root can not produce a 2.5x speedup, when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone that has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing

Vas
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. More changes is better.

Sure. If you're not changing your mind, it doesn't matter what kind of speedup you have.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.

That's just life without shared memory. Any cluster implementation is going to have problems in a position like that.

I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either.
With infinite # of processors and splitting only at the root, you will get a lot more than 1.5x.

Vas
Sorry, but no you won't. The math is simple and comes directly from Amdahl's law and the original alpha/beta analysis by knuth and moore, followed by the parallel analysis I did in the Journal of parallel computing. The first move takes 50% of the time. And that is going to be done on one processor. if you can somehow solve the "window" problem, so that you can search the remainder of the ply-1 moves in parallel with a correct alpha/beta window, then you can shrink those to the time required to search one of them. But you have no alpha/beta window, so all moves searched in parallel (ignoring the first move) are going to take about as long as the first move (each one will take that long) because you do _not_ yet have the alpha bound from the first move to let you get the quick cutoffs on the rest of the move.

Best case is 2x faster, since you could assume that any root move takes as long as any other when you do not have a good alpha bound. And what you can get "peak" is not going to be what you get "on average". >2x is just not going to happen except for rare cases. I have a position somewhere where you can get a 20x speedup like that. First move takes forever to find a very deep mate. Second move is a shorter mate in 1/20th the total nodes searched. Searching both at the same time finds the second mate before the first has even been "sniffed". But that is simply an exception. For the general case, 2x is the very best you can hope for, and it is not going to happen often...

I can give you a citation for the paper I wrote that uses the math from Knuth / Moore and extends it to cover alpha/beta in parallel. It is easy enough to understand and explains exactly why >2x is a rare case, not the norm...
The point is that that 2x is not the upper bound. You'll have 2x more time to dedicate to searching the best move and maybe 20x more time to dedicate to searching alternative moves.

Let's keep this mathematically valid. You already stated, and I agree, that the first move at the root for iteration N takes about 50% of the total search time. And it is quite often much more than this:

Here are a couple of examplles:

Code: Select all

               20->   3&#58;22  -0.01   14. ... O-O 15. Bb2 f5 16. Bd3 c5 17.
                                    Qe2 cxd4 18. Bxd4 Qe7 19. Bxa7 Bxf3
                                    20. gxf3 Bxh2+ 21. Kxh2 Qh4+ 22. Kg2
                                    Qg5+ 23. Kh1 Qh4+ 24. Kg1 Qg5+ 25.
                                    Kh1 &#40;s=2&#41;
               21     5&#58;01  -0.05   14. ... O-O 15. Bb2 f5 16. Bd3 c5 17.
                                    Qe2 Qe7 18. Rfd1 Rfd8 19. Nd2 cxd4
                                    20. Bxd4 Ne5 21. Nc4 Nxd3 22. Nxd6
                                    Qxd6 <HT>
               21->   5&#58;11  -0.05   14. ... O-O 15. Bb2 f5 16. Bd3 c5 17.
                                    Qe2 Qe7 18. Rfd1 Rfd8 19. Nd2 cxd4
                                    20. Bxd4 Ne5 21. Nc4 Nxd3 22. Nxd6
                                    Qxd6 <HT>

First move at depth 21 took 1:39, rest of the moves took 10 seconds total.

Code: Select all


               20->   1&#58;24  -0.03   14. ... O-O 15. Bd2 c5 16. Bxb7 Rxb7
                                    17. Rfb1 Qc7 18. Rxb7 Qxb7 19. Rc1
                                    Rc8 20. dxc5 Rxc5 21. Rxc5 Bxc5 22.
                                    Bb4 Nf6 23. Qa5 Bxb4 24. axb4
               21     2&#58;40   0.05   14. ... O-O 15. Bd2 c5 16. Bxb7 Rxb7
                                    17. Rfb1 Qc7 18. Rxb7 Qxb7 19. Rc1
                                    Rc8 20. dxc5 Rxc5 21. Rxc5 Bxc5 22.
                                    Bb4 Nf6 23. Qa5 Bxb4 24. axb4 Qe7
               21->   2&#58;55   0.05   14. ... O-O 15. Bd2 c5 16. Bxb7 Rxb7
                                    17. Rfb1 Qc7 18. Rxb7 Qxb7 19. Rc1
                                    Rc8 20. dxc5 Rxc5 21. Rxc5 Bxc5 22.
                                    Bb4 Nf6 23. Qa5 Bxb4 24. axb4 Qe7
               22     4&#58;08   0.02   14. ... O-O 15. Bd2 c5 16. Bxb7 Rxb7

first move at depth 21 took 1:16, rest of the moves took 15 seconds. All you can get there is a speedup on that extra 15 seconds. You will barely run faster at all. Here is a third case where Crafty actually changes its mind (these are from the same game, consecutive moves to be fair. The following case preceeds the previous two.

Code: Select all

               18->   1&#58;01   0.23   13. ... O-O 14. Rb1 Rb8 15. Qa4 f5
                                    16. Bc2 c5 17. dxc5 Bxc5 18. Bb2 Qe7
                                    19. Rfd1 Rfd8 20. e4 Nf6 21. Rxd8+
                                    Rxd8 22. exf5 Bxf3 23. gxf3 &#40;s=2&#41;
               19     1&#58;34   0.31   13. ... O-O 14. Qc2 Nf6 15. Rb1 Ba6
                                    16. Bd3 Bxd3 17. Qxd3 Rb8 18. Rxb8
                                    Qxb8 19. e4 Be7 20. e5 Nd5 21. Ng5
                                    Bxg5 22. Bxg5 Qb5
               19     2&#58;18   0.16   13. ... Rb8 14. Rb1 O-O 15. Qa4 f5
                                    16. Bc2 c5 17. dxc5 Bxc5 18. Bb2 Qe7
                                    19. Rfd1 Nb6 <HT>
               19->   2&#58;18   0.16   13. ... Rb8 14. Rb1 O-O 15. Qa4 f5
                                    16. Bc2 c5 17. dxc5 Bxc5 18. Bb2 Qe7
                                    19. Rfd1 Nb6 <HT> &#40;s=2&#41;

First move at 19 took 33 seconds, next move took 44 seconds, rest of the moves are buried in the time for the second move. Here you might actually get a speedup of almost 2, assuming your split at the root algorithm is good enough to make sure those two moves are given to different clusters (there is no indication that they are the two best moves, the second never appeared in previous searches.)

So > 2x is simply not going to happen except in very rare cases. Most of the time, if you don't change your mind, you are looking at 1.1x or so as the above cases show. I have several long games I can post if someone wants to go thru them as I did above and compute the average speedup for the entire game...

It's not hard to do a simulation. Crafty A always plays its move after 10 seconds of search using his normal algorithm, while Crafty B spends 10 seconds on every single root move and then plays the one with the best score.

I would be curious to know the result here.

Vas

I assume you mean "pure best score" without regard to depth? I can probably pull this off easily enough.

I might even be able to find monty's old test results as he did exactly this on his data general "better-than-a-cluster cluster" years ago (he had a shared I/O bus that was way faster than net-based communication.

I would expect the split at the root version to win, because there is some performance to be gained, although there is some lost as well due to no shared hash. I'd expect the difference to be in the 20-30 Elo range based on the expected speedup. Remember, in many searches the best move is the best move for all searches, and the extra searching on other moves produces zero.

bob · Post by **bob** » Wed Mar 11, 2009 6:19 pm

Uri Blass wrote:
bob wrote:
Uri Blass wrote:I can add that not the same algorithm is used so we do not have a theorethic value for the possible effective speed up and the value may be different for different programs.

Uri
Eh? All that matters is how long does the first move take to search compared to the rest of the moves. I've not seen a program where this is not approximately 1/2 of the total time for the first move. That means max speedup is 2 and actual speedup is going to be more like 1.5x as an optimal number.

Math is simple.

First move takes 50% of time. In 85% of cases the first move is best. So the parallel search time is T/2 for the first move, and T/2N for the rest. For large N, T/2N is near zero, but you still have that T/2 for the first move

speedup = serial_time / parallel_time

optimal speedup = T / (T/2) which looks very much like 2.0 to me. And that's all you can get. Best case.

With 2 processors, you get T / (T/2 + T/4) which is T / (T/.75)) or 1 / .75 = 1.3x

It really is that simple.

And anyone working on a parallel search understands that.

BTW, 5 nodes. optimal speedup = T / (T/2 + T/10) = T / .6 = 1.66x for the Rybka cluster box used for the ACCA event. Close to my 1.5x estimate. Been there. Done that. Got the T-shirt.
I do not know much about parallel search but I see no reason to assume that the same algorithm is used.

cluster is different conditions(worse relative to having the same nodes in normal parallel search) so it is logical to hope that a different algorithm
may be better.

"You can wish in one hand, and crap in the other, and see which one fills up first.." - Burgess Meridith, Grumpy Old Men.

We are using alpha/beta minimax at the present. And that is what I based my analysis on, and I clearly stated that at the beginning of this discussion. I am not "assuming" that some better search algorithm might be developed for use on a cluster. I am not assuming anything other than that we are all using minimax search with alpha/beta pruning. And that fixes the math quite nicely.

If something better than alpha/beta can be done, that would be great. But to date, it has not happened. It has not happened in almost 60 years since Shannon wrote "How to program a computer to play chess." Followed by the Newell, Simon and Shaw paper on alpha-beta pruning. So I am not arguing that a possible future algorithm might produce >2x speedup by splitting at the root. I am discussing _today's_ algorithms as used in all chess-playing programs.

I'd prefer science fact, rather than science fiction, for the present.

Zach Wegner · Post by **Zach Wegner** » Wed Mar 11, 2009 6:24 pm

Tony wrote:I think I gave this algoritm already about a year ago. It seems only 1 person understood

Without optimizations:

1. Generate all moves at the root
2. Send every move to a client for a fixed depth search (full width) Keep track which node search what move.
3. Calculate the score for this node based on the results returned.
4. Calculate the depth for this node, calculated from the reported depth of every move and a depth offset depending on the difference in score between each move and the best move.
5. Select and make the move that causes the current depth to be limited.
6 Goto 1

- Algoritm works great for "Deep analyses" and "Clustersearching"

- You can replace 2) with a monte Carlo search, only 3) needs to be able to convert the score . Therefor a engine that is able to do deep analyses, and cluster searching is most likely also able to do monte carlo searching

- One still splits at the root, it's just a variable root. So one can claim splitting at the root without lying.

- What can happen is that the best move scores 1 pawn above move 2, thererfore searched to fe 21 ply, then fails low and move 2 becomes best but was only seached (yet) to 19 ply.

- If needed, one can raise the searchdepth ( rather than growing the in memory tree), just search the same positions on the same client

- One could even use a 2nd instance of the engine to walk through the tree, and have it work at a selected position.

Tony

Certainly an interesting algorithm, and quite similar to UCT at one node as GCP points out. Have you tested it for pure strength against a normal root search for single processors? Seems like it has the potential to be better, just as a more optimal usage of time (mate scores can have confidence of infinity so aren't searched further, etc).

Rolf · Post by **Rolf** » Wed Mar 11, 2009 6:40 pm

bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:I can add that not the same algorithm is used so we do not have a theorethic value for the possible effective speed up and the value may be different for different programs.

Uri
Eh? All that matters is how long does the first move take to search compared to the rest of the moves. I've not seen a program where this is not approximately 1/2 of the total time for the first move. That means max speedup is 2 and actual speedup is going to be more like 1.5x as an optimal number.

Math is simple.

First move takes 50% of time. In 85% of cases the first move is best. So the parallel search time is T/2 for the first move, and T/2N for the rest. For large N, T/2N is near zero, but you still have that T/2 for the first move

speedup = serial_time / parallel_time

optimal speedup = T / (T/2) which looks very much like 2.0 to me. And that's all you can get. Best case.

With 2 processors, you get T / (T/2 + T/4) which is T / (T/.75)) or 1 / .75 = 1.3x

It really is that simple.

And anyone working on a parallel search understands that.

BTW, 5 nodes. optimal speedup = T / (T/2 + T/10) = T / .6 = 1.66x for the Rybka cluster box used for the ACCA event. Close to my 1.5x estimate. Been there. Done that. Got the T-shirt.
I do not know much about parallel search but I see no reason to assume that the same algorithm is used.

cluster is different conditions(worse relative to having the same nodes in normal parallel search) so it is logical to hope that a different algorithm
may be better.
"You can wish in one hand, and crap in the other, and see which one fills up first.." - Burgess Meridith, Grumpy Old Men.

We are using alpha/beta minimax at the present. And that is what I based my analysis on, and I clearly stated that at the beginning of this discussion. I am not "assuming" that some better search algorithm might be developed for use on a cluster. I am not assuming anything other than that we are all using minimax search with alpha/beta pruning. And that fixes the math quite nicely.

If something better than alpha/beta can be done, that would be great. But to date, it has not happened. It has not happened in almost 60 years since Shannon wrote "How to program a computer to play chess." Followed by the Newell, Simon and Shaw paper on alpha-beta pruning. So I am not arguing that a possible future algorithm might produce >2x speedup by splitting at the root. I am discussing _today's_ algorithms as used in all chess-playing programs.

I'd prefer science fact, rather than science fiction, for the present.

That is in no way something against you in personal or the now modern tricks. If one would follow you progress or other paradigms were absolutely unrealistic, something you cannot say IMO. Treating people with respect, who think ahead or just ask questions, is IMO selfunderstood.

Clustering etc. thread

Re: An idea for a new WCCC format - what do you think?

Re: An idea for a new WCCC format - what do you think?

Re: An idea for a new WCCC format - what do you think?

Re: An idea for a new WCCC format - what do you think?

Re: An idea for a new WCCC format - what do you think?

Re: An idea for a new WCCC format - what do you think?

Re: An idea for a new WCCC format - what do you think?

Re: An idea for a new WCCC format - what do you think?

Re: An idea for a new WCCC format - what do you think?

Re: An idea for a new WCCC format - what do you think?