Clustering etc. thread


Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

bob wrote: It would seem to me that this is the _best_ case for a distributed search. When a bound changes at the root, for example, hash info is invalidated anyway.
Quite a few of the hash/best moves in the subtrees will still be valid, and they might be as important as having the bounds.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Zach Wegner wrote:
bob wrote: I think the truth lies somewhere in the middle. Each individual search thread will broadcast to the other 69 nodes, not 560 as I had originally mentioned. But with 8 threads on a node, the number of sends will still be around 552 per unit of time since each of the 8 threads on a node can broadcast to the other 69 nodes. So it is a high-traffic deal, still...
It seems to be worse than that. 560 cpus (senders) * 69 nodes (receivers)=38640 hash stores for every time unit. You obviously don't want that for every node. Of course I can't imagine anyone would hash qsearch in this way (I keep qsearch hash process-local anyways), so that cuts it down by a factor of 2 or so. The hash sharing clearly needs to be greatly limited on such high numbers of nodes.

I'm also not sure you want to share every node anyways. That effectively means you're using 560 cpus with a hash table of a few GB. If the search is done properly, two random nodes should have very little overlap in their searches, which means hash sharing between them should be useless.
In another post I addressed this. I have considered four options here:

(1) ignore it completely. Easy. :)

(2) broadcast up to some depth limit to limit the number of broadcasts per unit of time. As the depth limit is reduced, the broadcast rate will go down.

(3) only "broadcast" to the nodes working on a common position, which will greatly reduce traffic.

(4) do an asynchronous probe below some depth limit so that any node can ask another node for a hash entry directly (a shared, distributed hash) but not too close to the tips.

I suspect I will end up testing these to decide which is better; at present I like (3) followed by (2), with (3) getting the early nod. But whether that will hold up in testing is unknown...
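
For concreteness, here is a minimal C sketch of how (2) and (3) might combine, assuming an MPI-style cluster. The names (BROADCAST_DEPTH, group[], send_entry) are illustrative only, not Crafty's actual code:

#include <stdint.h>

#define BROADCAST_DEPTH 8            /* only share entries searched at least this deep */
#define MAX_NODES       70

typedef struct {
  uint64_t key;                      /* Zobrist key of the position                    */
  int16_t  score;
  uint8_t  depth;
  uint8_t  bound;                    /* EXACT, LOWER or UPPER                          */
  uint16_t best_move;
} HashEntry;

int num_nodes = MAX_NODES;           /* cluster nodes                                  */
int my_node;                         /* this node's id                                 */
int group[MAX_NODES];                /* group[i] = root move node i is working on      */

/* Stand-in for a non-blocking cluster send (e.g. an MPI_Isend wrapper). */
static void send_entry(int dest, const HashEntry *e) {
  (void)dest; (void)e;               /* network code omitted in this sketch            */
}

/* Called after a local hash store; decides whether the entry is worth sharing. */
void maybe_broadcast(const HashEntry *e) {
  if (e->depth < BROADCAST_DEPTH)              /* option (2): depth filter             */
    return;
  for (int i = 0; i < num_nodes; i++) {
    if (i == my_node)
      continue;
    if (group[i] != group[my_node])            /* option (3): same root move only      */
      continue;
    send_entry(i, e);
  }
}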
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An idea for a new WCCC format - what do you think?

Post by Uri Blass »

bob wrote:
Gian-Carlo Pascutto wrote:
bob wrote: No it isn't. Here's why. Speedup is the time for the sequential search divided by the time for the parallel search. In the first example above (and let's stop at a fixed depth to make the comparison easy)
Stop right here. It makes the comparison easy, but it also makes it invalid, because you DON'T HAVE equal depths.
Previously mentioned. And unequal depths don't matter since we have already given up trying to factor in "depth" to help choose the best move, and we just take the pure best score. In general, the "good moves" take longer to search, period. The bad moves "fly by". So you go deeper on the lousy moves. For example, in the last ACCA event where I noticed what Rybka was doing, Crafty played QxSomething where there was only one way to recapture the queen. And we were seeing much deeper depths from the processors that were not recapturing the queen. And how exactly does that speed up anything?

It should be obvious that having this info:

e4 depth 20 ply score -0.10
d4 depth 23 ply score -0.15
c4 depth 25 ply score -0.20

is more interesting than having this info:

e4 depth 20 ply score -0.10
d4 depth 20 ply score <=-0.10
c4 depth 20 ply score <=-0.10
If that were how it worked, perhaps. But I'd bet e4, d4 and c4 will reach about the _same_ depth for obvious reasons. You might get a 25 ply result for Qxh7 where the queen is simply lost by Rxh7. But that doesn't help as much. This is more an issue one can see in a multi-PV output such as crafty's annotate command produces. And you can get a feel for what goes faster and what goes slower. The crap goes fastest. The good stuff is all pretty much equally slow.

BTW I do not quite see how the first case is "more interesting". I see no way to use that information to make an informed decision about which of the three moves is best...
The good moves take longer to search most of the time, but not always.
It is possible to have a move that looks bad at low depth but whose evaluation changes at high depth.

With a normal search you will not reach that high depth, so you are not going to see it.
With a cluster that searches root moves to different depths, you can see it.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote: It would seem to me that this is the _best_ case for a distributed search. When a bound changes at the root, for example, hash info is invalidated anyway.
Quite a few of the hash/best moves in the subtrees will still be valid, and they might be as important as having the bounds.
I don't doubt that. But in a normal don't-fail-high-or-low-at-the-root search, way more of the entries are perfectly usable as is. Once you fail high or low at the root, one edge of the search space changes and hash entries based on that edge become useless unless they are LOWER entries that at least give us a good first move to try.
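
To make the LOWER-entry point concrete, a minimal C sketch of a probe decision, using the conventional EXACT/LOWER/UPPER bound types; the names and return codes are illustrative, not taken from any particular engine:

enum { BOUND_EXACT, BOUND_LOWER, BOUND_UPPER };
enum { USE_SCORE, USE_MOVE_ONLY, USE_NOTHING };

typedef struct {
  int depth, score, bound, best_move;
} TTEntry;

/* After a fail high/low at the root the (alpha,beta) window shifts, so many stored
 * bounds no longer prove anything.  A LOWER (or EXACT) entry still carries the move
 * that caused the cutoff, which is worth trying first even when its score is useless. */
int probe_usability(const TTEntry *e, int draft, int alpha, int beta) {
  if (e->depth >= draft) {
    if (e->bound == BOUND_EXACT)                      return USE_SCORE;
    if (e->bound == BOUND_LOWER && e->score >= beta)  return USE_SCORE;
    if (e->bound == BOUND_UPPER && e->score <= alpha) return USE_SCORE;
  }
  if (e->bound != BOUND_UPPER && e->best_move != 0)   /* move hint survives the window change */
    return USE_MOVE_ONLY;
  return USE_NOTHING;
}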
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Uri Blass wrote:
bob wrote:
Gian-Carlo Pascutto wrote:
bob wrote: No it isn't. Here's why. Speedup is the time for the sequential search divided by the time for the parallel search. In the first example above (and let's stop at a fixed depth to make the comparison easy)
Stop right here. It makes the comparison easy, but it also makes it invalid, because you DON'T HAVE equal depths.
Previously mentioned. And unequal depths don't matter since we have already given up trying to factor in "depth" to help choose the best move, and we just take the pure best score. In general, the "good moves" take longer to search, period. The bad moves "fly by". So you go deeper on the lousy moves. For example, in the last ACCA event where I noticed what Rybka was doing, Crafty played QxSomething where there was only one way to recapture the queen. And we were seeing much deeper depths from the processors that were not recapturing the queen. And how exactly does that speed up anything?

It should be obvious that having this info:

e4 depth 20 ply score -0.10
d4 depth 23 ply score -0.15
c4 depth 25 ply score -0.20

is more interesting than having this info:

e4 depth 20 ply score -0.10
d4 depth 20 ply score <=-0.10
c4 depth 20 ply score <=-0.10
If that were how it worked, perhaps. But I'd bet e4, d4 and c4 will reach about the _same_ depth for obvious reasons. You might get a 25 ply result for Qxh7 where the queen is simply lost by Rxh7. But that doesn't help as much. This is more an issue one can see in a multi-PV output such as crafty's annotate command produces. And you can get a feel for what goes faster and what goes slower. The crap goes fastest. The good stuff is all pretty much equally slow.

BTW I do not quite see how the first case is "more interesting". I see no way to use that information to make an informed decision about which of the three moves is best...
The good moves take longer to search most of the time, but not always.
It is possible to have a move that looks bad at low depth but whose evaluation changes at high depth.

With a normal search you will not reach that high depth, so you are not going to see it.
With a cluster that searches root moves to different depths, you can see it.

Uri
And that does exactly what? We are right back to the case of different depths for different scores. You also have the reverse case: the high score at depth N drops at depth N+3, but you don't get to search it that deeply. So you play that losing move even though a better move existed at depth N+3, because its accurate score looked worse than the inflated depth-N score you used to choose the bad move.

I consider this entire discussion hopeless, as we are flipping a coin or rolling the dice when we mix and match depths vs. scores... We are also discussing typical speedup, not speedup in a bizarre case. The case you mentioned is atypical, and I don't really care about the speedup for the oddball examples, just what is going to happen over the course of the game with respect to speedup. And it ain't going to be 2.5x.
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An idea for a new WCCC format - what do you think?

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
bob wrote:
Gian-Carlo Pascutto wrote:
bob wrote: No it isn't. Here's why. Speedup is the time for the sequential search divided by the time for the parallel search. In the first example above (and let's stop at a fixed depth to make the comparison easy)
Stop right here. It makes the comparison easy, but it also makes it invalid, because you DON'T HAVE equal depths.
Previously mentioned. And unequal depths don't matter since we have already given up trying to factor in "depth" to help choose the best move, and we just take the pure best score. In general, the "good moves" take longer to search, period. The bad moves "fly by". So you go deeper on the lousy moves. For example, in the last ACCA event where I noticed what Rybka was doing, Crafty played QxSomething where there was only one way to recapture the queen. And we were seeing much deeper depths from the processors that were not recapturing the queen. And how exactly does that speed up anything?

It should be obvious that having this info:

e4 depth 20 ply score -0.10
d4 depth 23 ply score -0.15
c4 depth 25 ply score -0.20

is more interesting than having this info:

e4 depth 20 ply score -0.10
d4 depth 20 ply score <=-0.10
c4 depth 20 ply score <=-0.10
If that were how it worked, perhaps. But I'd bet e4, d4 and c4 will reach about the _same_ depth for obvious reasons. You might get a 25 ply result for Qxh7 where the queen is simply lost by Rxh7. But that doesn't help as much. This is more an issue one can see in a multi-PV output such as crafty's annotate command produces. And you can get a feel for what goes faster and what goes slower. The crap goes fastest. The good stuff is all pretty much equally slow.

BTW I do not quite see how the first case is "more interesting". I see no way to use that information to make an informed decision about which of the three moves is best...
The good moves take longer to search most of the time, but not always.
It is possible to have a move that looks bad at low depth but whose evaluation changes at high depth.

With a normal search you will not reach that high depth, so you are not going to see it.
With a cluster that searches root moves to different depths, you can see it.

Uri
And that does exactly what? We are right back to the case of different depths for different scores. You also have the reverse case: the high score at depth N drops at depth N+3, but you don't get to search it that deeply. So you play that losing move even though a better move existed at depth N+3, because its accurate score looked worse than the inflated depth-N score you used to choose the bad move.

I consider this entire discussion hopeless, as we are flipping a coin or rolling the dice when we mix and match depths vs. scores... We are also discussing typical speedup, not speedup in a bizarre case. The case you mentioned is atypical, and I don't really care about the speedup for the oddball examples, just what is going to happen over the course of the game with respect to speedup. And it ain't going to be 2.5x.
The only way to know what happens is by testing.
The case of a high score at depth N that drops at depth N+3 can happen, but I do not think it happens often, and even when it does, it is no worse than a normal search, where you probably do not reach depth N+3 but only depth N.

Uri
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

Gian-Carlo Pascutto wrote:
bob wrote:
Gian-Carlo Pascutto wrote: What's the difference between UCT with a tiny exploration term (=current top go programs) and an alpha-beta search with very heavy LMR (=current top chess programs), in terms of nodes explored?
Not quite the same thing IMHO. While we are certainly reducing, we are not doing any sort of MC analysis (which is based on very fast game speeds). (...)
MC analysis is just the evaluation function, or the quiescent search, if you wish. I asked a question about nodes explored and you give a (flawed) argument that the evaluation is not the same.

MC analysis with a varying number of samples is equivalent to an evaluation function with lazy exits, from the point of view of the tree search.
And growing the tree based on a sort of probability model. In chess, at least my program, I am not doing any probabilistic analysis,
Really? What do you think LMR is? You're probabilistically reducing moves besides the first (few), on the observation that in general your move ordering is likely to be correct. How is this different from UCT favoring the PV and exploring the alternates less?
I am just using dynamic/static criteria to decide what looks interesting to search and what appears to be almost useless. But rather than giving up on a move as useless, it just gets reduced in depth to decrease the effort required to dismiss it.
Exactly as UCT does.
It's an interesting point. You could say that once the go programmers figured out the right way to evaluate leaf nodes, and once the chess programmers figured out how important the main line really is and how well LMR-type things can work, the two approaches to tree search converged.
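
For readers following along, a minimal C sketch of the LMR being described, assuming a conventional negamax search(); the reduction rule here (reduce late quiet moves by one ply, re-search on a surprise) is a simplified illustration, not any specific engine's formula:

/* Assumed engine plumbing, declared so the sketch stands alone. */
typedef struct Position Position;
typedef int Move;
#define INFINITE 32000

extern int  search(Position *pos, int alpha, int beta, int depth);
extern void make_move(Position *pos, Move m);
extern void unmake_move(Position *pos, Move m);
extern int  is_quiet(Move m);

int search_moves(Position *pos, const Move *moves, int n_moves,
                 int alpha, int beta, int depth) {
  int best = -INFINITE;
  for (int i = 0; i < n_moves; i++) {
    make_move(pos, moves[i]);
    int reduction = 0;
    /* Late quiet moves are unlikely to be best given decent ordering,
     * so search them a ply shallower first...                          */
    if (i >= 3 && depth >= 3 && is_quiet(moves[i]))
      reduction = 1;
    int score = -search(pos, -beta, -alpha, depth - 1 - reduction);
    /* ...and re-search at full depth only if the reduced search surprises us. */
    if (reduction && score > alpha)
      score = -search(pos, -beta, -alpha, depth - 1);
    unmake_move(pos, moves[i]);
    if (score > best) best = score;
    if (best > alpha) alpha = best;
    if (alpha >= beta) break;          /* beta cutoff */
  }
  return best;
}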

Vas
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

Gian-Carlo Pascutto wrote:
M ANSARI wrote: It seems that a VGA GPU would be perfect for generating MC output at a very fast rate ...
Not at all. All current GPUs have bad behaviour with branches. Avoiding branches in chess code is difficult, if not unreasonable. Nobody has a good working implementation so far, and with reason!

I think it's safe to say that with current GPU hardware, nobody is going to be able to produce a strong program. And that's because the hardware sucks.
Vas also mentioned that with MC it is very easy to get good parallelization
It's not. Vas, you're wrong: strong MC (meaning, with tree search) doesn't parallelize any easier than alpha-beta. I've given some hints in my discussion with Bob about why.
Ok, I haven't worked on this, but intuitively I would think that the inherent randomness would really help with the scaling. Can't you just do X completely independent MC runs and sum up the data (somehow) at the end?
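
A minimal C sketch of that "X independent runs, sum at the end" idea (what the Chaslot et al. paper cited below calls root parallelization); run_mc_search() is an assumed black box that reports per-root-move visit counts for one run, and on a cluster each run would execute on its own node:

#define MAX_ROOT_MOVES 256

extern int  num_root_moves;
extern void run_mc_search(unsigned seed, int visits_out[MAX_ROOT_MOVES]);

int pick_move_root_parallel(int num_runs) {
  long total[MAX_ROOT_MOVES] = {0};
  int  visits[MAX_ROOT_MOVES];

  for (int run = 0; run < num_runs; run++) {     /* completely independent searches */
    run_mc_search((unsigned)run + 1, visits);    /* a different seed per run        */
    for (int m = 0; m < num_root_moves; m++)
      total[m] += visits[m];                     /* "sum up the data" at the end    */
  }

  int best = 0;                                  /* play the most-visited root move */
  for (int m = 1; m < num_root_moves; m++)
    if (total[m] > total[best])
      best = m;
  return best;
}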

Vas
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: Results from UCT parallelization

Post by Vasik Rajlich »

Gian-Carlo Pascutto wrote:
Vasik Rajlich wrote:
What were the UCT speedups for tree splitting and multiple runs?

It is quite possible that our serial searches are somehow too "structured" and not soft/stochastic enough. It is a little insane to build alpha-beta trees around 30-move main lines.

Vas
http://www.cs.unimaas.nl/g.chaslot/pape ... elMCTS.pdf
Thanks.

Quoting the relevant table:

---
Table 2. Root parallelization
Threads   Winning %   Games   Confidence interval   GPS speedup   Strength speedup
   1       26.7 %      2000          2.2 %               1              1.0
   2       38.0 %      2000          2.2 %               2              3.0
   4       46.8 %      2000          2.2 %               4              6.5
  16       56.5 %      2000          2.2 %              16             14.9
---

Are these figures in your view legit? Is the 3.0 speedup a legit indicator that the serial algorithm is suboptimal?

Vas
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

Gian-Carlo Pascutto wrote:
Vasik Rajlich wrote: In practice, sharing high-depth entries doesn't really change the situation that much. You want the full search state for whatever part of the tree you are searching or your performance will plummet. Try disabling the lowest-depth 98% of your hash writes in Sjeng and see what happens.

Vas
I do agree there is probably a large performance gain from somehow distributing searches to the nodes that have the most info about them (and root splitting makes this easy).
That's right. Broadcasting some high-depth entries will hardly put a dent in this.
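
Concretely, the suggested experiment amounts to gating hash stores on depth, something like this minimal C sketch (HashStore and its signature are placeholders, not Sjeng's actual code):

/* Assumed existing store routine; the signature is a placeholder. */
extern void HashStore(unsigned long long key, int depth, int score,
                      int bound, int move);

static int min_store_depth = 0;      /* raise this until ~98% of writes are dropped */

void HashStoreFiltered(unsigned long long key, int depth, int score,
                       int bound, int move) {
  if (depth < min_store_depth)       /* skip the shallow (most frequent) stores     */
    return;
  HashStore(key, depth, score, bound, move);
}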

Have you tried playing your cluster with broadcasting of high-depth entries vs your cluster without the broadcasts?

Vas