Clustering etc. thread

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Results from UCT parallelization

Post by Zach Wegner »

Gian-Carlo Pascutto wrote:I think I should clarify this. I do believe it's useful to search alternates more deeply, and I do not have any problems comparing moves from different depths. But the question is what to do with the bounds.

I mean, if you don't have an alpha value yet from your first move, what are you going to do with the others? If you don't kick off searches for alternates until you have an alpha, your speedup will obviously suck donkey balls.

If you don't have an alpha score, I can imagine 2 things:

a) You use the previous alpha. This sucks, because the score does usually change from iteration to iteration, and the score you get for an alternate might be useless.
b) You use a small window around the previous alpha. This sucks, because non-zero window searches suck.
One possibility is this (another idea from Tony): use small windows around the previous score for each move. As an advantage, you get multi PV for free.

Another possibility is a mutation of MTD(f). I'm just making this up here, so bear with me. Each move uses its own previous alpha as a starting bound, then does a binary-type search to narrow the score. This part is like Tony's idea except replacing the aspiration window with a series of scout searches. But, you also maintain a global lower bound on the real score that you can pass around to each node. If for move 7 you know that the score is less than +1.00 due to a scout search, and then move 4 returns +1.50, you get a cutoff and can abandon move 7.

Of course, you don't actually get the "previous alpha" with this. You'd have to just use whatever information you have on the move, or you can use the real alpha of the previous iteration. Also, you have to decide what to do if you are doing an initial search around 0.00, and another move comes back as >1.00. Do you abandon the search and start again around 1.01, or hope for a <=0.00 score?
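A rough sketch of the scheme just described, in Python (all names hypothetical; `scout_search` stands in for a real zero-window search, and the shared lower bound that would be passed between cluster nodes is faked with a one-element list):

```python
# Sketch of the MTD(f)-style idea above: narrow each root move's score
# with zero-window scout searches, abandoning the move if its upper
# bound falls below the best lower bound found anywhere so far.

def narrow_move(pos, move, depth, start_guess, global_lower, scout_search):
    """scout_search(pos, move, bound, depth) -> True iff score >= bound.
    global_lower is a 1-element list standing in for the shared bound."""
    lo, hi = -32000, 32000
    guess = start_guess                # e.g. this move's previous alpha
    while lo + 1 < hi:                 # binary-type narrowing
        if hi <= global_lower[0]:      # another move already beat this one
            return None                # abandon without resolving the score
        if scout_search(pos, move, guess, depth):
            lo = guess                 # fail-high: score >= guess
            global_lower[0] = max(global_lower[0], lo)
        else:
            hi = guess                 # fail-low: score < guess
        guess = (lo + hi) // 2
    return lo
```

With centipawn scores the narrowing takes at most ~16 scout searches per move, and a move whose bound window drops entirely below the global lower bound is cut off exactly as in the "+1.50 beats <+1.00" example above.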
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
I _did_ an algorithm like that in 1983. I put it together in under 2 weeks and won the 1983 WCCC championship with it. And I _know_ how it works and how the PVs look.
Actually, I'd be curious to hear more about it. How did you split the root moves? Which moves got PV searches and which ones got scout (null-window) searches? What beta did you use for the scout searches?

I couldn't find anything useful in the public domain about splitting work without shared memory, except for the Cluster Toga algorithm, which wasn't really suitable for Rybka (among other things, it doesn't apply to analysis and would be useless for users).

Vas
What we did back then was more dynamic than that. I had the root move list sorted just as I do today (previous iteration node count). Processor 1 searched the first move with a normal aspiration window. Processor 2 did the same. When one of them reported back with a new best move, the alpha/beta value was shipped to the other processor, which would take a short "break", update all the bounds, and then check to see if anything would now cut off where it didn't when we started. We then searched the root moves one at a time; whenever a processor finished a move it grabbed another (dynamic load balancing) until the list was gone.
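
A toy sketch of that dynamic load-balancing scheme (serialized here for illustration; in the real thing each "grab" happens on a separate processor, and `search` stands in for the actual root-move search with the shipped bound):

```python
# Sketch of dynamic root-move balancing: moves sorted by previous-
# iteration node count, idle workers grab the next unsearched move,
# and a new best score updates the shared alpha bound.
from collections import deque

def split_root(root_moves, search):
    queue = deque(sorted(root_moves, key=lambda m: -m["nodes"]))
    alpha = -32000                     # shared bound shipped to workers
    best = None
    while queue:
        move = queue.popleft()         # idle worker grabs the next move
        score = search(move, alpha)    # search with the current bound
        if score > alpha:              # new best: update shared alpha
            alpha, best = score, move
    return best, alpha
```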

We always did an aspiration-type search in CB, where the root window was very narrow, something like +/- 0.25 pawns, for the first move. At the time we were using PVS (this was something from the late 70's and was used by Belle and ourselves as well as others) which is (I assume) what you mean by "scout" (which is not quite the same as PVS but close enough for discussion).

I should add that the Univac 1100 and the Cray were both shared-memory. I'm looking at cluster-based search right now because of this 70-node, 8-core-per-node cluster we have. I can search about 20M nodes per sec on one node, which gives a potential for something faster than Deep Blue (we could possibly hit 1400M nodes per second, although there are great issues to overcome). But Hsu and I did talk a lot about his parallel search and I like his two-level approach (split at early plies among the nodes; each node splits at late plies among the local processors)...
Aha, thanks. Like many things, this won't work without shared memory. Without shared memory, if 'processor A' (to use your terminology) takes move x at some iteration, it needs to take the same move at the next iteration as well.

Basically, without shared memory, you want to have the same units processing the same parts of the tree throughout the entire search. I agree with Hsu's comment - without shared memory, work needs to be split at or near the root.

Vas

ps. I use the term "scout search" to refer to any search with a zero-width window.
That's what I thought (scout search).

The way I am looking at splitting on a cluster is to use a "split hash table". I store the hash signature and node ID when I split, so that the same node searches the same sub-tree each time to take advantage of the hash, killers, etc. There are issues with this as the next iteration can greatly alter the shape of the tree and give one node a set of moves that are much harder to search than others, producing an imbalance. I'm working on sorting out the details for this kind of stuff as the code gets written...
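
A minimal sketch of such a split hash table (hypothetical class; a real implementation would key on the Zobrist signature of the split position and need a replacement policy):

```python
# Sketch of a "split hash table": remember which cluster node searched
# a split point, so the next iteration sends the same position back to
# the same node, whose hash table and killers are still warm for it.

class SplitHashTable:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.owner = {}                # position signature -> node id
        self.next_node = 0

    def node_for(self, signature):
        """Return the node that should search this position, reusing
        the previous assignment; new splits go round-robin."""
        if signature not in self.owner:
            self.owner[signature] = self.next_node
            self.next_node = (self.next_node + 1) % self.num_nodes
        return self.owner[signature]
```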
That's right. You'll need to consider whether you want to change the shape of your search tree to maximize the stability of the work distribution (ie. the degree to which the same cluster units handle the same parts of the search tree).

This is where root splitting is special. Root moves have the unique property that they are guaranteed to require searching.

Of course, root splitting alone isn't the answer.

Vas

I'm working on the "two-level" approach, although I am also considering the more complicated issues encountered on a cluster that has NUMA (AMD) nodes. Because you also have issues on a single node dealing with trying to maximize local memory accesses...
But my current cluster approach (a good way from working yet, however) is to split everywhere (although not too far from the root, due to the overhead of a cluster split even with an infiniband interconnect) but try to repeat with respect to which node searches which position, so that at least the hash info they have will be as helpful as possible, even though it is way less efficient than a pure shared-memory "share everything" search.
My problem with this kind of algorithm is that it will start to choke when the search tree starts to change, which is exactly when you want to be efficient.

Vas
Why do you think that? I don't see any reason for it to "change". It won't be nearly as efficient as a shared-memory shared-everything search on a shared-memory box, but I don't see why it would degrade for any reason while the search is in progress, any more than a normal threaded search would degrade, which I don't see happen in my case...

I plan on using the exact same search I currently use. Except that when I split, I look at how close to the root the search currently is, and if within N plies, it will split across nodes, if > N plies, it will do a normal SMP-type split and work on the current node...
As long as each node works on the same thing from iteration to iteration, you'll be fine. The lack of shared memory will hardly matter - you'll miss a handful of 'remote transpositions', which is no big deal.

As soon as a node takes on new work, you'll get massive inefficiency.

Vas
I don't agree with that at all. Lots of past experimentation by myself and others (I did a sort-of-cluster Cray Blitz once, but it used Cray's HIPPI channel, which was not exactly high-latency) has shown just how significant the loss of transpositions across the nodes is, and it is non-trivial. And it is never going to come even remotely close to what we can do on a fully-shared-memory system.

There is significant interaction between different root moves, and this continues to grow as the game moves toward the endgame. Fine 70 is probably the ultimate type of position where this shows up best (although anyone can search this to 40+ plies instantly). But searching each root move on a different node greatly delays the depth at which you find the winning move Kb1, usually forcing this all the way out to ply 26 when most programs find it at 18-20 or so.
bob

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote: I plan on using the exact same search I currently use. Except that when I split, I look at how close to the root the search currently is, and if within N plies, it will split across nodes, if > N plies, it will do a normal SMP-type split and work on the current node...
In normal YBW, you check if there are any CPUs idle before splitting. Do you also do this on a cluster? I always thought this was something that would give scaling problems.
I have two plans here. Plan one is that one node will be the "master" and split with other nodes, but those others won't split with anyone. This will be the first cut, as it will be the simplest to implement and test and will offer the fewest potential hiding places for bugs.

Plan 2 is to become recursive as the current search already is, so that the "master" can split with others, and the others can split themselves as well. I will probably start with some sort of hierarchy where the master splits with a, e, i, m. And then a can split with b, c, d, and e splits with f, g, h, etc. This is probably going to be an experimental slugfest to find the best way to both scale reasonably (nothing like the current parallel search scales, but hopefully in a still-linear manner) and actually see the search speed remain proportional to the number of nodes. We can theoretically reach 1.4 billion nodes per second on our current 70-node cluster, since I am doing 20M on a single node at present. I'd like to get some major fraction of that performance. But I'm aware this is going to be fairly quick to implement, but take a _long_ time to work out the tuning details...
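
The hierarchy described (master splits with a, e, i, m; a splits with b, c, d; etc.) can be sketched like this, with nodes numbered instead of lettered (illustrative only):

```python
# Sketch of the two-level split hierarchy: the first node of each
# block of `fanout` nodes reports to the master; the remaining nodes
# in the block report to that block leader.

def build_hierarchy(num_nodes, fanout=4):
    children = {"master": []}
    for leader in range(0, num_nodes, fanout):
        children["master"].append(leader)
        children[leader] = list(range(leader + 1,
                                      min(leader + fanout, num_nodes)))
    return children
```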

I sort of have to do something similar for a couple of NUMA boxes I have tested on, since the latency to reach remote nodes' memory really adds up.
bob

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Vasik Rajlich wrote:
Gian-Carlo Pascutto wrote:
bob wrote:
Gian-Carlo Pascutto wrote: What's the difference between UCT with a tiny exploration term (=current top go programs) and an alpha-beta search with very heavy LMR (=current top chess programs), in terms of nodes explored?
Not quite the same thing IMHO. While we are certainly reducing, we are not doing any sort of MC analysis (which is based on very fast game speeds). (...)
MC analysis is just the evaluation function, or the quiescent search, if you wish. I asked a question about nodes explored and you give a (flawed) argument that the evaluation is not the same.

MC analysis with a varying number of samples is equivalent to an evaluation function with lazy exits, from the point of view of the tree search.
And growing the tree based on a sort of probability model. In chess, at least my program, I am not doing any probabilistic analysis,
Really? What is LMR you think? You're probabilistically reducing moves besides the first (few), on the observation that in general your move ordering is likely to be correct. How is this different from UCT favoring the PV and exploring the alternates less?
I am just using dynamic/static criteria to decide what looks interesting to search and what appears to be almost useless. But rather than giving up on a move as useless, it just gets reduced in depth to decrease the effort required to dismiss it.
Exactly as UCT does.
It's an interesting point. You could say that once the go programmers figured out the right way to evaluate leaf nodes, and once the chess programmers figured out how important the main line really is and how well LMR-type things can work, the two approaches to tree search converged.

Vas
Except for the issues with a best-first vs depth-first search, where the latter is _way_ more efficient, I would agree to an extent. But BF search does have issues with respect to searching stuff alpha/beta would toss out quickly.
bob

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Tony wrote:
bob wrote:
Tony wrote:One of us is not getting it.

If at the root there are 30 moves, and after moves 1 and 2 each has again 30 moves, and we send those 88 positions to cluster clients, will the total search not speed up by 2.5 compared to a single client?

Tony
Look at the data I posted. The total parallel search time is simply the longest time required to search any single root move, which in the examples I gave produces a speedup of 1.1x or so for the first two, and between 1.75x and 2.33x for the third; the latter only happens when you change your mind on the best move in mid-iteration, which is roughly a 15% probability event based on prior analysis.

I don't understand the "88 positions". Splitting at the root means each possible root move is made and that position is sent to a cluster node to search. If you have 30 moves at the root, you can get, at most, 30 cluster nodes busy. And the real issue is the amount of time required to search the _first_ move, which is the majority of the time used. And that limits the speedup drastically...
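
The limit being described takes only a few lines of arithmetic to illustrate (numbers invented for illustration):

```python
# With a synchronized split at the root, the parallel time (even with
# unlimited nodes) is the time of the slowest single root move, while
# the sequential time is the sum over all root moves.

def root_split_speedup(move_times):
    return sum(move_times) / max(move_times)

# If the first move takes 85 units and 29 others take 0.5 each, the
# speedup is capped near 1.17 no matter how many nodes you add.
```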
That explains it.

First search the rootmoves:

move 1: sc=0.20 depth 12
move 2: sc=0.17 depth 12
move 3: sc=-0.10 depth 12
move 4: sc=-0.30 depth 12
...

If root moves 1 and 2 are better than the rest, I can decide to add the children of these moves as well (effectively searching moves 1 and 2 to depth 13) and search these before I decide to deepen root move no. 3, then return to moves 1 and 2 again before searching move 4, to give me something like

move 1: sc=0.20 depth 14
move 2: sc=0.17 depth 14
move 3: sc=-0.10 depth 13
move 4: sc=-0.30 depth 12

Now if move 4 suddenly becomes +0.18, I would want to search it to about 14 plies as well.
If however the score would have risen to 0.22, you get what you observed: A new best move with a lower depth.

All of this can be done recursively, of course.

What Gian-Carlo and I observed is that the most interesting/strongest scheme seems to be:

move 1: sc=0.20 depth 18
move 2: sc=0.17 depth 14
move 3: sc=-0.10 depth 13
move 4: sc=-0.30 depth 12

where before I always tried to limit the depth difference between move 2 and move 1 (since the scores are about the same).
The above, however, is very close to the behaviour of LMR: it will not play crappy moves, but it might miss better ones.
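
A sketch of that selection rule (hypothetical helper; scores in centipawns, margin chosen arbitrarily):

```python
# Selective root deepening: moves whose score is within a margin of
# the best get priority, and among those the shallowest is deepened
# next -- so a laggard whose score jumps is searched deeper to catch up.

def pick_move_to_deepen(root, margin=25):
    """root: list of dicts with 'score' and 'depth' for each root move."""
    best = max(m["score"] for m in root)
    near = [m for m in root if best - m["score"] <= margin]
    return min(near, key=lambda m: m["depth"])
```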

Tony
That's an issue. I find a partial workaround is automatic, because you do spend most of the effort on the first move. And when the best move starts to go south, this triggers a time overflow where you can search deeper; as the best goes south, the rest go north automatically and you may well solve this anyway. But if your program is "twittery", with lots of centipawn PV changes, LMR is going to shut that down pretty effectively, which could be a problem.

And again, when we start talking about best-first search, we lose orders of magnitude with respect to straight alpha/beta...
bob

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote: When all else is equal, I certainly can choose between the two, because I have nothing else to go on. But two different moves, searched to two different depths, is a completely different animal.
I wonder if this can convince you:

If you are hash probing a position, and all successor positions are in the hashtable with sufficient depth, but one with a few extra plies (for example due to a transposition), you will take the hash scores. You'll compare the scores and pick the best scoring one.

I am SURE that when you are hash probing you are not discarding moves with "too much" depth.

So, Crafty is already doing this. And it seems to work, doesn't it?
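
The probing behaviour being pointed at can be sketched as follows (hypothetical helper; `entries` maps each successor move to its stored `(depth, score)` pair):

```python
# A hash entry is usable whenever its stored depth is at least the
# draft we need, even if it is several plies deeper than requested;
# among usable entries we simply take the best score.

def probe_best_successor(entries, needed_depth):
    usable = {m: s for m, (d, s) in entries.items() if d >= needed_depth}
    if not usable:
        return None
    return max(usable, key=usable.get)
```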
I have never said this "won't work." I have been asking the question "what is the gain"? With a normal parallel search, we can measure the gain quite easily as time to depth, time to solution, etc. But with this approach, we spend a lot of effort, to get moves with different depths, and we have to choose one. Which is not an exact science. And where the effort expended is not all "on point" because searching one move deeper than another gives you no useful information in the above case.

I've already said we used split at the root in 1983 to win the WCCC in NYC. But I have also been quite careful to give real speedup numbers, not hypothetical guesses that are really "way out there" with respect to reality. When you go unsynchronized, you are keeping everything busy, but you are doing a _lot_ of work that is absolutely wasted. In the above case, searching one move 5 plies deeper and then taking the result of the other move is just wasted effort, and is _not_ a "parallel speedup" of any kind.

That has been my point all along. Not that it won't/can't work. Just that it won't/can't work very _efficiently_. That's why I have been talking about speedup... And that is the interesting number when discussing parallel search. Any idiot can keep all processors/nodes busy 100% of the time. The difficult part is to keep all that stuff busy doing things that are really useful. This is _not_ one of those cases.
bob

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Gian-Carlo Pascutto wrote:
bob wrote: No it isn't. Here's why. Speedup is the time for the sequential search divided by the time for the parallel search. In the first example above (and let's stop at a fixed depth to make the comparison easy)
Stop right here. It makes the comparison easy, but it also makes it invalid, because you DONT HAVE equal depths.
Previously mentioned. And unequal depths don't matter since we have already given up trying to factor in "depth" to help choose the best move, and we just take the pure best score. In general, the "good moves" take longer to search, period. The bad moves "fly by". So you go deeper on the lousy moves. For example, in the last ACCA event where I noticed what Rybka was doing, Crafty played QxSomething where there was only one way to recapture the queen. And we were seeing much deeper depths from the processors that were not recapturing the queen. And how exactly does that speed up anything?

It should be obvious that having this info:

e4 depth 20 ply score -0.10
d4 depth 23 ply score -0.15
c4 depth 25 ply score -0.20

is more interesting than having this info:

e4 depth 20 ply score -0.10
d4 depth 20 ply score <=-0.10
c4 depth 20 ply score <=-0.10
If that were how it worked, perhaps. But I'd bet e4, d4 and c4 will reach about the _same_ depth for obvious reasons. You might get a 25 ply result for Qxh7 where the queen is simply lost by Rxh7. But that doesn't help as much. This is more an issue one can see in a multi-PV output such as crafty's annotate command produces. And you can get a feel for what goes faster and what goes slower. The crap goes fastest. The good stuff is all pretty much equally slow.

BTW I do not quite see how the first case is "more interesting". I see no way to use that information to make an informed decision about which of the three moves is best...
The good moves most of the time take longer to search but not always.
It is possible to have a move that seems bad at small depth but the evaluation is changed at high depth.

With normal search you will not get the high depth so you are not going to see it.
With a cluster that searches root moves to different depths you can see it.

Uri
And that does exactly what? we are right back to the case of different depths for different scores. You also have the reverse. The high score at depth N drops at depth N+3 but you don't get to search it that deeply. So you play this losing move, when you had a better move at depth N+3 but didn't know it because its actual score was worse than the score we are using to choose the bad move.

I consider this entire discussion hopeless, as we are flipping a coin or rolling the dice when we mix and match depths vs scores... We are also discussing typical speedup, not speedup in a bizarre case. The case you mentioned is an atypical case and I don't really care about the speedup for the oddball examples, just what is going to happen over the course of the game with respect to speedup. And it ain't going to be 2.5x.
The only way to know what happens is by testing.
The case of high score at depth N that drops at depth N+3 can happen but I do not think that it happens often and even if it happens it is no worse than normal search when you probably do not get depth N+3 but only depth N.

Uri
I will make the same statement I have made several times previously. I've not said "this won't work". I did this in 1983 and won the WCCC doing it. But what I _have_ said is that "this won't work efficiently." The entire premise of parallel computing is to get all processors/nodes busy working on stuff that is _useful_. Searching moves to different depths is _not_ useful. It doesn't hurt, I agree. But it also doesn't help at all. You could just as well send a few of those nodes out back and let them stir around in your local septic tank. You will keep them busy. And they will help you play better chess about as well as they will help you fly.

So you can keep them busy doing something useful. Or you can keep them busy doing stuff that helps you not at all. This unsynchronized split at the root is no better than a synchronized split at the root for the most part, and that is a speedup of 1.5x or less on average, period.

All this other discussion is just smoke, noise, and in the case of the septic tank above, something that stinks like hell. If it smells like crap. Looks like crap. Feels like crap. guess what???
bob

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Vasik Rajlich wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Gian-Carlo Pascutto wrote:
bob wrote: No it isn't. Here's why. Speedup is the time for the sequential search divided by the time for the parallel search. In the first example above (and let's stop at a fixed depth to make the comparison easy)
Stop right here. It makes the comparison easy, but it also makes it invalid, because you DONT HAVE equal depths.
Previously mentioned. And unequal depths don't matter since we have already given up trying to factor in "depth" to help choose the best move, and we just take the pure best score. In general, the "good moves" take longer to search, period. The bad moves "fly by". So you go deeper on the lousy moves. For example, in the last ACCA event where I noticed what Rybka was doing, Crafty played QxSomething where there was only one way to recapture the queen. And we were seeing much deeper depths from the processors that were not recapturing the queen. And how exactly does that speed up anything?

It should be obvious that having this info:

e4 depth 20 ply score -0.10
d4 depth 23 ply score -0.15
c4 depth 25 ply score -0.20

is more interesting than having this info:

e4 depth 20 ply score -0.10
d4 depth 20 ply score <=-0.10
c4 depth 20 ply score <=-0.10
If that were how it worked, perhaps. But I'd bet e4, d4 and c4 will reach about the _same_ depth for obvious reasons. You might get a 25 ply result for Qxh7 where the queen is simply lost by Rxh7. But that doesn't help as much. This is more an issue one can see in a multi-PV output such as crafty's annotate command produces. And you can get a feel for what goes faster and what goes slower. The crap goes fastest. The good stuff is all pretty much equally slow.

BTW I do not quite see how the first case is "more interesting". I see no way to use that information to make an informed decision about which of the three moves is best...
The good moves most of the time take longer to search but not always.
It is possible to have a move that seems bad at small depth but the evaluation is changed at high depth.

With normal search you will not get the high depth so you are not going to see it.
With a cluster that searches root moves to different depths you can see it.

Uri
And that does exactly what? we are right back to the case of different depths for different scores. You also have the reverse. The high score at depth N drops at depth N+3 but you don't get to search it that deeply. So you play this losing move, when you had a better move at depth N+3 but didn't know it because its actual score was worse than the score we are using to choose the bad move.

I consider this entire discussion hopeless, as we are flipping a coin or rolling the dice when we mix and match depths vs scores... We are also discussing typical speedup, not speedup in a bizarre case. The case you mentioned is an atypical case and I don't really care about the speedup for the oddball examples, just what is going to happen over the course of the game with respect to speedup. And it ain't going to be 2.5x.
The only way to know what happens is by testing.
The case of high score at depth N that drops at depth N+3 can happen but I do not think that it happens often and even if it happens it is no worse than normal search when you probably do not get depth N+3 but only depth N.

Uri
This just needs a test. There is no way to deduce what the speedup is going to be. My intuition tells me that it will be well over 2.

Vas
My intuition has not changed in 30 years on this topic. You keep all processors busy. Doing what for the most part is wasted effort. If you search move A to depth N, and move B to depth N+2, the work getting to N+2 is not well spent. Particularly when the usual case happens and the depth N move is actually the best one.

That's what this really boils down to. Doing useful or useless work. This approach has, and still does do useless work. And doing useless work does _not_ add to a parallel speedup.
bob

Re: Results from UCT parallelization

Post by bob »

Gian-Carlo Pascutto wrote:
Vasik Rajlich wrote: Are these figures in your view legit?
I've not verified them, and they're in an academic paper, so they're almost certainly lies :)
Seriously, I've not tried root parallelization yet because I had a very highly optimized version with local mutexes. I will try it though, if only because I have a cluster now and it's obviously much easier to get running.

But I'd be surprised if they dared publish it if NOTHING was true.
Is the 3.0 speedup a legit indicator that the serial algorithm is suboptimal?
It should be.

I've not seen you really refute Bob's point about the limits of speedups with your algorithm, yet your claims suggest a much higher strength gain than would be expected. I'm inclined to think the above result is an explanation for that, but you're the only one who knows for sure. It's also possible what Bob thinks you are doing is wrong, but your arguing doesn't really suggest so.
I have seen one particularly egregious type of error, not that it happens here. This error comes from the following poor science:

1. You start with a working sequential code.

2. You modify this extensively to work on a cluster or parallel machine.

3. You run this modified code with 1 node, then with 2, etc., and compute the speedups from this data.

And this is wrong. The parallel data should be compared to the code from (1) above, which was not hacked and made much less efficient in order to use the cluster. Yet this happens regularly.

The correct speedup data is _always_ "best sequential" vs "best parallel". I've always done my testing like that.
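
The right and wrong measurements differ only in which baseline goes in the numerator (numbers invented for illustration):

```python
# Speedup must be measured against the best *sequential* program, not
# against the cluster-hacked code run on one node, which is usually
# slower than the original and inflates the apparent speedup.

def speedup(best_sequential_time, parallel_time):
    return best_sequential_time / parallel_time

# Wrong: cluster code on 1 node takes 120s, on 8 nodes 30s -> "4.0x".
# Right: original sequential code takes 90s, 8 nodes take 30s -> 3.0x.
```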
bob

Re: Results from UCT parallelization

Post by bob »

Zach Wegner wrote:
Gian-Carlo Pascutto wrote:I think I should clarify this. I do believe it's useful to search alternates more deeply, and I do not have any problems comparing moves from different depths. But the question is what to do with the bounds.

I mean, if you don't have an alpha value yet from your first move, what are you going to do with the others? If you don't kick off searches for alternates until you have an alpha, your speedup will obviously suck donkey balls.

If you don't have an alpha score, I can imagine 2 things:

a) You use the previous alpha. This sucks, because the score does usually change from iteration to iteration, and the score you get for an alternate might be useless.
b) You use a small window around the previous alpha. This sucks, because non-zero window searches suck.
One possibility is this (another idea from Tony): use small windows around the previous score for each move. As an advantage, you get multi PV for free.
We apparently have a different concept of the word "free" then. :) This is anything but "free". It is a very expensive "free" since non-zero-window searches, and even worse, searches with the wrong bounds, are very expensive...


Another possibility is a mutation of MTD(f). I'm just making this up here, so bear with me. Each move uses its own previous alpha as a starting bound, then does a binary-type search to narrow the score. This part is like Tony's idea except replacing the aspiration window with a series of scout searches. But, you also maintain a global lower bound on the real score that you can pass around to each node. If for move 7 you know that the score is less than +1.00 due to a scout search, and then move 4 returns +1.50, you get a cutoff and can abandon move 7.

Of course, you don't actually get the "previous alpha" with this. You'd have to just use whatever information you have on the move, or you can use the real alpha of the previous iteration. Also, you have to decide what to do if you are doing an initial search around 0.00, and another move comes back as >1.00. Do you abandon the search and start again around 1.01, or hope for a <=0.00 score?