Clustering etc. thread

Discussion of anything and everything relating to chess playing software and machines.


Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
I _used_ an algorithm like that in 1983. I put it together in under 2 weeks and won the 1983 WCCC championship with it. And I _know_ how it works and how the PVs look.
Actually, I'd be curious to hear more about it. How did you split the root moves? Which moves got PV searches and which ones got scout (null-window) searches? What beta did you use for the scout searches?

I couldn't find anything useful in the public domain about splitting work without shared memory, except for the Cluster Toga algorithm, which wasn't really suitable for Rybka (among other things, it doesn't apply to analysis and would be useless for users).

Vas
What we did back then was more dynamic than that. I had the root move list sorted just as I do today (by previous-iteration node count). Processor 1 searched the first move with a normal aspiration window; processor 2 did the same. When one of them reported back with a new best move, the alpha/beta value was shipped to the other processor, which would take a short "break", update all the bounds, and then check to see if anything would now cut off where it didn't when we started. We then searched the root moves one at a time; whenever a processor finished a move, it grabbed another (dynamic load balancing) until the list was gone.

We always did an aspiration-type search in CB, where the root window was very narrow, something like +/- 0.25 pawns, for the first move. At the time we were using PVS (something from the late '70s, used by Belle and ourselves as well as others), which is (I assume) what you mean by "scout" (not quite the same as PVS, but close enough for discussion).

I should add that the Univac 1100 and the Cray were both shared-memory machines. I'm looking at cluster-based search right now because of this 70-node 8-core cluster we have. I can search about 20M nodes per second on one node, which gives the potential for something faster than Deep Blue (we could possibly hit 1400M nodes per second, although there are great issues to overcome). But Hsu and I did talk a lot about his parallel search, and I like his two-level approach (split at early plies among the nodes, each node splits at late plies among the local processors)...
Aha, thanks. Like many things, this won't work without shared memory. Without shared memory, if 'processor A' (to use your terminology) takes move x at some iteration, it needs to take the same move at the next iteration as well.

Basically, without shared memory, you want to have the same units processing the same parts of the tree throughout the entire search. I agree with Hsu's comment - without shared memory, work needs to be split at or near the root.

Vas

ps. I use the term "scout search" to refer to any search with a zero-width window.
That's what I thought (scout search).

The way I am looking at splitting on a cluster is to use a "split hash table". I store the hash signature and node ID when I split, so that the same node searches the same sub-tree each time to take advantage of the hash, killers, etc. There are issues with this as the next iteration can greatly alter the shape of the tree and give one node a set of moves that are much harder to search than others, producing an imbalance. I'm working on sorting out the details for this kind of stuff as the code gets written...
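
(An illustration of the "split hash table" idea in the paragraph above: a minimal sketch in C, where every name and size is made up for the example rather than taken from Crafty.)

    /* Minimal sketch of a "split hash table": remember which cluster node
       searched the subtree below a given split position, so the same node
       gets it again on the next iteration and can reuse its local hash
       table, killers, etc.  All names and sizes here are illustrative. */
    #define SPLIT_TABLE_SIZE 4096

    typedef struct {
        unsigned long long signature;   /* Zobrist key of the split position */
        int node_id;                    /* cluster node that searched it last */
    } SplitEntry;

    static SplitEntry split_table[SPLIT_TABLE_SIZE];

    /* Reuse the previous assignment for this position if we have one,
       otherwise hand the subtree to an idle node and remember the choice. */
    int assign_split_node(unsigned long long signature, int idle_node)
    {
        SplitEntry *e = &split_table[signature % SPLIT_TABLE_SIZE];
        if (e->signature == signature)
            return e->node_id;          /* same subtree -> same node as before */
        e->signature = signature;
        e->node_id = idle_node;
        return idle_node;
    }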
That's right. You'll need to consider whether you want to change the shape of your search tree to maximize the stability of the work distribution (i.e. the degree to which the same cluster units handle the same parts of the search tree).

This is where root splitting is special. Root moves have the unique property that they are guaranteed to require searching.

Of course, root splitting alone isn't the answer.

Vas

I'm working on the "two-level" approach, although I am also considering the more complicated issues encountered on a cluster that has NUMA (AMD) nodes, because you also have issues on a single node with trying to maximize local memory accesses...
But my current cluster approach (a good way from working yet, however) is to split everywhere (although not too far from the root, due to the overhead of a cluster split even with an InfiniBand interconnect), but to try to be consistent about which node searches which position, so that at least the hash info each node has will be as helpful as possible, even though it is way less efficient than a pure shared-memory "share everything" search.
My problem with this kind of algorithm is that it will start to choke when the search tree starts to change, which is exactly when you want to be efficient.

Vas
Why do you think that? I don't see any reason for it to "change". It won't be nearly as efficient as a shared-memory shared-everything search on a shared-memory box, but I don't see why it would degrade for any reason while the search is in progress, any more than a normal threaded search would degrade, which I don't see happen in my case...

I plan on using the exact same search I currently use. Except that when I split, I look at how close to the root the search currently is, and if within N plies, it will split across nodes, if > N plies, it will do a normal SMP-type split and work on the current node...
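
(A rough sketch of that two-level rule, assuming a fixed ply threshold N; the constant and function names below are placeholders for the example, not anything from Crafty.)

    #include <stdio.h>

    #define CLUSTER_SPLIT_PLIES 3   /* the "N" in the text; value is arbitrary */

    /* Stubs standing in for the real split code. */
    static void split_across_cluster_nodes(int ply) { printf("cluster split at ply %d\n", ply); }
    static void split_across_local_threads(int ply) { printf("SMP split at ply %d\n", ply); }

    /* Two-level split: near the root, split work across cluster nodes;
       deeper in the tree, do a normal shared-memory (YBW-style) split. */
    void try_split(int ply)
    {
        if (ply <= CLUSTER_SPLIT_PLIES)
            split_across_cluster_nodes(ply);
        else
            split_across_local_threads(ply);
    }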
As long as each node works on the same thing from iteration to iteration, you'll be fine. The lack of shared memory will hardly matter - you'll miss a handful of 'remote transpositions', which is no big deal.

As soon as a node takes on new work, you'll get massive inefficiency.

Vas
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

Zach Wegner wrote:
bob wrote:Why do you think that? I don't see any reason for it to "change". It won't be nearly as efficient as a shared-memory shared-everything search on a shared-memory box, but I don't see why it would degrade for any reason while the search is in progress, any more than a normal threaded search would degrade, which I don't see happen in my case...
I can actually see this being a big problem if you don't significantly share the hash. Say one node is searching deep in the tree, having been split off about as far from the root as you will allow. Then, due to a root fail low, the tree changes, and this node ends up searching some completely different tree that has nothing in common with the first one, so it has almost zero hash information to go on. Of course, this also happens on a single processor, just to a lesser extent. I guess the trick is just to share the hash as much as possible.
The problem is that there is no good way to share the hash. Broadcasting a handful of high-depth entries is just a drop in the bucket.

The algorithm needs to be designed around this issue.

Vas
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Gian-Carlo Pascutto wrote:
bob wrote: No it isn't. Here's why. Speedup is the time for the sequential search divided by the time for the parallel search. In the first example above (and let's stop at a fixed depth to make the comparison easy)
Stop right here. It makes the comparison easy, but it also makes it invalid, because you DON'T HAVE equal depths.
Previously mentioned. And unequal depths don't matter, since we have already given up trying to factor "depth" into choosing the best move; we just take the pure best score. In general, the "good moves" take longer to search, period. The bad moves "fly by". So you go deeper on the lousy moves. For example, in the last ACCA event where I noticed what Rybka was doing, Crafty played QxSomething where there was only one way to recapture the queen, and we were seeing much deeper depths from the processors that were not recapturing the queen. And how exactly does that speed up anything?

It should be obvious that having this info:

e4 depth 20 ply score -0.10
d4 depth 23 ply score -0.15
c4 depth 25 ply score -0.20

is more interesting than having this info:

e4 depth 20 ply score -0.10
d4 depth 20 ply score <=-0.10
c4 depth 20 ply score <=-0.10
If that were how it worked, perhaps. But I'd bet e4, d4 and c4 will reach about the _same_ depth, for obvious reasons. You might get a 25-ply result for Qxh7 where the queen is simply lost to Rxh7, but that doesn't help as much. This is more an issue one can see in a multi-PV output such as Crafty's annotate command produces. And you can get a feel for what goes faster and what goes slower. The crap goes fastest. The good stuff is all pretty much equally slow.

BTW I do not quite see how the first case is "more interesting". I see no way to use that information to make an informed decision about which of the three moves is best...
The good moves most of the time take longer to search, but not always.
It is possible to have a move that seems bad at a small depth but whose evaluation changes at a higher depth.

With a normal search you will not reach the high depth, so you are not going to see it.
With a cluster that searches root moves to different depths, you can see it.

Uri
And that does exactly what? We are right back to the case of different depths for different scores. You also have the reverse: a move's high score at depth N would drop at depth N+3, but you don't get to search it that deeply, so you never see the drop. You play this losing move even though another move was actually better at depth N+3; you didn't know it, because that move's deeper (accurate) score was lower than the shallow (inflated) score you used to choose the bad move.

I consider this entire discussion hopeless, as we are flipping a coin or rolling the dice when we mix and match depths vs. scores... We are also discussing typical speedup, not speedup in a bizarre case. The case you mentioned is an atypical one, and I don't really care about the speedup for the oddball examples, just what is going to happen over the course of the game with respect to speedup. And it ain't going to be 2.5x.
The only way to know what happens is by testing.
The case of high score at depth N that drops at depth N+3 can happen but I do not think that it happens often and even if it happens it is no worse than normal search when you probably do not get depth N+3 but only depth N.

Uri
This just needs a test. There is no way to deduce what the speedup is going to be. My intuition tells me that it will be well over 2.

Vas
Tony

Re: An idea for a new WCCC format - what do you think?

Post by Tony »

bob wrote:
Tony wrote: One of us is not getting it.

If at the root there are 30 moves, and after moves 1 and 2 each has again 30 moves, and we send those 88 positions to cluster clients, won't the total search speed up by 2.5 compared to a single client?

Tony
Look at the data I posted. The total parallel search time is simply the longest time required to search any single root move, which in the examples I gave produces a speedup of 1.1x or so for the first two, and between 1.75x and 2.33x for the third, and the third case only happens when you change your mind about the best move in mid-iteration. That is a roughly 15% probability event based on prior analysis.

I don't understand the "88 positions". Splitting at the root means each possible root move is made and that position is sent to a cluster node to search. If you have 30 moves at the root, you can get, at most, 30 cluster nodes busy. And the real issue is the amount of time required to search the _first_ move, which is the majority of the time used. And that limits the speedup drastically...
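
(A back-of-the-envelope illustration of that limit; the numbers here are mine, not taken from Bob's posted data. If the first root move alone takes a fraction f of the sequential search time T, then no matter how many cluster nodes take the remaining moves, the parallel time is at least f*T, so

    speedup = T / T_parallel <= 1/f

With f = 0.9 the ceiling is about 1.11x, and even with f = 0.5 it is only 2x, which is why splitting only at the root caps the speedup so severely.)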
That explains it.

First search the root moves:

move 1: sc=0.20 depth 12
move 2: sc=0.17 depth 12
move 3: sc=-0.10 depth 12
move 4: sc=-0.30 depth 12
...

If root moves 1 and 2 are better than the rest, I can decide to add the children of these moves as well (effectively searching moves 1 and 2 to depth 13) and search these before I decide to deepen root move 3, and then return to moves 1 and 2 again before searching move 4, to give me something like

move 1: sc=0.20 depth 14
move 2: sc=0.17 depth 14
move 3: sc=-0.10 depth 13
move 4: sc=-0.30 depth 12

Now if move 4 suddenly becomes +0.18, I would want to search it to about 14 ply as well.
If however the score had risen to 0.22, you get what you observed: a new best move at a lower depth.

All of this can be done recursively, of course.
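
(For concreteness, a toy sketch in C of the selection rule just described: keep a score and depth per root move, and repeatedly deepen the shallowest move whose score is within some margin of the current best. The margin and all names are illustrative, not from any real engine.)

    #include <limits.h>

    #define MARGIN 10   /* centipawns: what counts as "about the same score" */

    typedef struct {
        int score;      /* score from the move's last search */
        int depth;      /* depth the move has been searched to */
    } RootMove;

    /* Return the index of the next root move to deepen, or -1 if none. */
    int pick_move_to_deepen(const RootMove *moves, int n)
    {
        int best_score = INT_MIN, pick = -1, i;

        for (i = 0; i < n; i++)
            if (moves[i].score > best_score)
                best_score = moves[i].score;

        /* Among the moves close to the best score, prefer the shallowest. */
        for (i = 0; i < n; i++)
            if (moves[i].score >= best_score - MARGIN &&
                (pick < 0 || moves[i].depth < moves[pick].depth))
                pick = i;

        return pick;
    }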

What Gian-Carlo and I observed is that the most interesting/strongest seems to be:

move 1: sc=0.20 depth 18
move 2: sc=0.17 depth 14
move 3: sc=-0.10 depth 13
move 4: sc=-0.30 depth 12

where before, I always tried to limit the depth difference between move 2 and move 1 (since the scores are about the same).
The above, however, is very close to the behaviour of LMR: it will not play crappy moves, but it might miss better ones.

Tony
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

Vasik Rajlich wrote: Ok, I haven't worked on this, but intuitively I would think that the inherent randomness would really help with the scaling. Can't you just do X completely independent MC runs and sum up the data (somehow) at the end?

Vas
Look at the leaf parallelization results in the paper.

It's a bit like trying to optimize your chess program by disabling lazy eval and parallelizing the eval components :)
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Results from UCT parallelization

Post by Gian-Carlo Pascutto »

Vasik Rajlich wrote: Are these figures in your view legit?
I've not verified them, and they're in an academic paper, so they're almost certainly lies :)
Seriously, I've not tried root parallelization yet because I had a very highly optimized version with local mutexes. I will try it though, if only because I have a cluster now and it's obviously much easier to get running.

But I'd be surprised if they dared publish it if NOTHING was true.
Vasik Rajlich wrote: Is the 3.0 speedup a legit indicator that the serial algorithm is suboptimal?
It should be.

I've not seen you really refute Bob's point about the limits of speedups with your algorithm, yet your claims suggest a much higher strength gain than would be expected. I'm inclined to think the above result is an explanation for that, but you're the only one who knows for sure. It's also possible that what Bob thinks you are doing is wrong, but your arguments don't really suggest so.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Results from UCT parallelization

Post by Gian-Carlo Pascutto »

Gian-Carlo Pascutto wrote: I've not seen you really refute Bob's point about the limits of speedups with your algorithm, yet your claims suggest a much higher strength gain than would be expected. I'm inclined to think the above result is an explanation for that, but you're the only one who knows for sure. It's also possible that what Bob thinks you are doing is wrong, but your arguments don't really suggest so.
I think I should clarify this. I do believe it's useful to search alternates more deeply, and I do not have any problems comparing moves from different depths. But the question is what to do with the bounds.

I mean, if you don't have an alpha value yet from your first move, what are you going to do with the others? If you don't kick off searches for alternates until you have an alpha, your speedup will obviously suck donkey balls.

If you don't have an alpha score, I can imagine 2 things:

a) You use the previous alpha. This sucks, because the score does usually change from iteration to iteration, and the score you get for an alternate might be useless.
b) You use a small window around the previous alpha. This sucks, because non-zero window searches suck.
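
(A sketch of what option b) might look like, assuming the engine exposes a search(pos, alpha, beta, depth) routine; the window size and all names are placeholders, not anyone's actual code.)

    typedef struct Position Position;                                  /* engine's position type (assumed) */
    extern int search(Position *pos, int alpha, int beta, int depth);  /* engine's search (assumed) */

    #define WINDOW 25   /* a quarter pawn, in centipawns; arbitrary */

    /* Search an alternate root move before the real alpha is known, using a
       small window centered on the previous iteration's score. */
    int search_alternate(Position *pos, int prev_iter_score, int depth)
    {
        int alpha = prev_iter_score - WINDOW;
        int beta  = prev_iter_score + WINDOW;
        int score = search(pos, alpha, beta, depth);

        /* A fail low or fail high only gives a bound; the move may need a
           re-search once the first move's real score comes in. */
        return score;
    }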
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

bob wrote: I plan on using the exact same search I currently use. Except that when I split, I look at how close to the root the search currently is, and if within N plies, it will split across nodes, if > N plies, it will do a normal SMP-type split and work on the current node...
In normal YBW, you check if there are any CPUs idle before splitting. Do you also do this on a cluster? I always thought this was something that would give scaling problems.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

bob wrote: When all else is equal, I certainly can choose between the two, because I have nothing else to go on. But two different moves, searched to two different depths, is a completely different animal.
I wonder if this can convince you:

If you are hash probing a position, and all successor positions are in the hash table with sufficient depth, but one has a few extra plies (for example due to a transposition), you will take the hash scores. You'll compare the scores and pick the best-scoring one.

I am SURE that when you are hash probing you are not discarding moves with "too much" depth.

So, Crafty is already doing this. And it seems to work, doesn't it?
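
(That probing behaviour, in a minimal sketch; the field and constant names here are illustrative, not Crafty's.)

    typedef struct {
        unsigned long long key;   /* position signature */
        int depth;                /* draft the entry was searched to */
        int score;
        int bound;                /* EXACT, LOWER or UPPER */
    } HashEntry;

    #define EXACT 0

    /* Returns 1 and fills *score if the stored entry is usable here.
       Note the test is depth >= needed_depth: an entry searched deeper
       than necessary is never rejected for having "too much" depth. */
    int hash_probe_ok(const HashEntry *e, unsigned long long key,
                      int needed_depth, int *score)
    {
        if (e->key != key)
            return 0;
        if (e->depth >= needed_depth && e->bound == EXACT) {
            *score = e->score;
            return 1;
        }
        return 0;
    }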
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: An idea for a new WCCC format - what do you think?

Post by Zach Wegner »

Vasik Rajlich wrote:The problem is that there is no good way to share the hash. Broadcasting a handful of high-depth entries is just a drop in the bucket.

The algorithm needs to be designed around this issue.
Pretty much, but I'd generalize it to latency overall. I have quite a few ideas on how to handle this, but I'm just in the design phase now. Generally, I believe that the nodes should be heterogeneous in how they communicate with each other, based not just on split-point locality but also on network latency.

For better or for worse, this thread is making me want to start building the cluster already...