Clustering etc. thread


Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: An idea for a new WCCC format - what do you think?

Post by Zach Wegner »

bob wrote:Why do you think that? I don't see any reason for it to "change". It won't be nearly as efficient as a shared-memory shared-everything search on a shared-memory box, but I don't see why it would degrade for any reason while the search is in progress, any more than a normal threaded search would degrade, which I don't see happen in my case...
I can actually see this being a big problem if you don't significantly share the hash. Say one node, searching deep in the tree, splitting about as far out as you will allow. Then, due to a root fail low, the tree changes, and this node ends up searching some completely different tree that has nothing in common with the first one, so it has almost zero hash information to go on. Of course, this also happens on a single processor, just to a lesser extent. I guess the trick is just to share hash as much as possible.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1) I think that it is impossible to know the algorithm Rybka is using based on output from a single position.

It is possible that something similar, but not exactly the same, is used when some illogical moves that lose the queen are analyzed, but this is not the whole story, and the algorithm is based partly on "split only at the root" and partly on another idea.

2) I remember that Vas said 100 Elo based on testing at fast time control, and I suspect that at fast time control you get clearly more than 50 Elo per doubling, so practically 5 nodes do not give a 4:1 speed improvement but clearly less than that (maybe 2.5:1).

Uri
The effective speedup is probably somewhere between 2.5:1 and 3:1 for 5 nodes, which is what Lukas had when he tested all of this.

Now he's up to 9 nodes BTW :)

Vas
Can we stay in the real world? Splitting at the root can not produce a 2.5x speedup, when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone that has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing :)

Vas
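A quick sanity check on the numbers Uri and Vas are trading here (just a sketch; the Elo-per-doubling figures are assumptions for illustration, not measurements):

Code: Select all

#include <math.h>
#include <stdio.h>

/* Convert a measured Elo gain into an "effective speedup", i.e. the time
   handicap the original program could give and still score 50%, assuming
   a fixed Elo gain per doubling of search time. */
double effective_speedup(double elo_gain, double elo_per_doubling)
{
    return pow(2.0, elo_gain / elo_per_doubling);
}

int main(void)
{
    /* 100 Elo at ~50 Elo/doubling -> 4.0x; at ~70 Elo/doubling -> ~2.7x */
    printf("%.2f\n", effective_speedup(100.0, 50.0));
    printf("%.2f\n", effective_speedup(100.0, 70.0));
    return 0;
}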
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. More changes is better.
Sure. If you're not changing your mind, it doesn't matter what kind of speedup you have.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.
That's just life without shared memory. Any cluster implementation is going to have problems in a position like that.

I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either. :)
With infinite # of processors and splitting only at the root, you will get a lot more than 1.5x.

Vas
Sorry, but no you won't. The math is simple and comes directly from Amdahl's law and the original alpha/beta analysis by Knuth and Moore, followed by the parallel analysis I did in the Journal of Parallel Computing. The first move takes 50% of the time, and that is going to be done on one processor. If you can somehow solve the "window" problem, so that you can search the remainder of the ply-1 moves in parallel with a correct alpha/beta window, then you can shrink those to the time required to search one of them. But you have no alpha/beta window, so all moves searched in parallel (ignoring the first move) are going to take about as long as the first move (each one will take that long) because you do _not_ yet have the alpha bound from the first move to let you get the quick cutoffs on the rest of the moves.

Best case is 2x faster, since you could assume that any root move takes as long as any other when you do not have a good alpha bound. And what you can get "peak" is not going to be what you get "on average". >2x is just not going to happen except for rare cases. I have a position somewhere where you can get a 20x speedup like that. First move takes forever to find a very deep mate. Second move is a shorter mate in 1/20th the total nodes searched. Searching both at the same time finds the second mate before the first has even been "sniffed". But that is simply an exception. For the general case, 2x is the very best you can hope for, and it is not going to happen often...

I can give you a citation for the paper I wrote that uses the math from Knuth / Moore and extends it to cover alpha/beta in parallel. It is easy enough to understand and explains exactly why >2x is a rare case, not the norm...
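The bound being described works out to a couple of lines of arithmetic (a sketch of the argument, assuming each remaining root move costs about as much as the first when no alpha bound is available):

Code: Select all

#include <stdio.h>

/* Upper bound on the speedup from splitting only at the root.
   t_first = sequential time for the first root move
   t_rest  = sequential time for all remaining root moves
   Without the alpha bound from the first move, each remaining move searched
   in parallel costs roughly t_first on its own node, so the parallel
   iteration cannot finish before t_first does. */
double root_split_speedup_bound(double t_first, double t_rest)
{
    double t_sequential = t_first + t_rest;
    double t_parallel   = t_first;        /* longest single root move */
    return t_sequential / t_parallel;
}

int main(void)
{
    printf("%.2f\n", root_split_speedup_bound(50.0, 50.0)); /* 2.00 */
    printf("%.2f\n", root_split_speedup_bound(90.0, 10.0)); /* 1.11 */
    return 0;
}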
The point is that 2x is not the upper bound. You'll have 2x more time to dedicate to searching the best move and maybe 20x more time to dedicate to searching alternative moves.
Let's keep this mathematically valid. You already stated, and I agree, that the first move at the root for iteration N takes about 50% of the total search time. And it is quite often much more than this:

Here are a couple of examples:

Code: Select all

               20->   3:22  -0.01   14. ... O-O 15. Bb2 f5 16. Bd3 c5 17.
                                    Qe2 cxd4 18. Bxd4 Qe7 19. Bxa7 Bxf3
                                    20. gxf3 Bxh2+ 21. Kxh2 Qh4+ 22. Kg2
                                    Qg5+ 23. Kh1 Qh4+ 24. Kg1 Qg5+ 25.
                                    Kh1 (s=2)
               21     5:01  -0.05   14. ... O-O 15. Bb2 f5 16. Bd3 c5 17.
                                    Qe2 Qe7 18. Rfd1 Rfd8 19. Nd2 cxd4
                                    20. Bxd4 Ne5 21. Nc4 Nxd3 22. Nxd6
                                    Qxd6 <HT>
               21->   5:11  -0.05   14. ... O-O 15. Bb2 f5 16. Bd3 c5 17.
                                    Qe2 Qe7 18. Rfd1 Rfd8 19. Nd2 cxd4
                                    20. Bxd4 Ne5 21. Nc4 Nxd3 22. Nxd6
                                    Qxd6 <HT>
First move at depth 21 took 1:39, rest of the moves took 10 seconds total.

Code: Select all


               20->   1:24  -0.03   14. ... O-O 15. Bd2 c5 16. Bxb7 Rxb7
                                    17. Rfb1 Qc7 18. Rxb7 Qxb7 19. Rc1
                                    Rc8 20. dxc5 Rxc5 21. Rxc5 Bxc5 22.
                                    Bb4 Nf6 23. Qa5 Bxb4 24. axb4
               21     2:40   0.05   14. ... O-O 15. Bd2 c5 16. Bxb7 Rxb7
                                    17. Rfb1 Qc7 18. Rxb7 Qxb7 19. Rc1
                                    Rc8 20. dxc5 Rxc5 21. Rxc5 Bxc5 22.
                                    Bb4 Nf6 23. Qa5 Bxb4 24. axb4 Qe7
               21->   2:55   0.05   14. ... O-O 15. Bd2 c5 16. Bxb7 Rxb7
                                    17. Rfb1 Qc7 18. Rxb7 Qxb7 19. Rc1
                                    Rc8 20. dxc5 Rxc5 21. Rxc5 Bxc5 22.
                                    Bb4 Nf6 23. Qa5 Bxb4 24. axb4 Qe7
               22     4:08   0.02   14. ... O-O 15. Bd2 c5 16. Bxb7 Rxb7
First move at depth 21 took 1:16, rest of the moves took 15 seconds. All you can get there is a speedup on that extra 15 seconds. You will barely run faster at all. Here is a third case where Crafty actually changes its mind (these are from the same game, consecutive moves, to be fair; the following case precedes the previous two).

Code: Select all

               18->   1:01   0.23   13. ... O-O 14. Rb1 Rb8 15. Qa4 f5
                                    16. Bc2 c5 17. dxc5 Bxc5 18. Bb2 Qe7
                                    19. Rfd1 Rfd8 20. e4 Nf6 21. Rxd8+
                                    Rxd8 22. exf5 Bxf3 23. gxf3 (s=2)
               19     1:34   0.31   13. ... O-O 14. Qc2 Nf6 15. Rb1 Ba6
                                    16. Bd3 Bxd3 17. Qxd3 Rb8 18. Rxb8
                                    Qxb8 19. e4 Be7 20. e5 Nd5 21. Ng5
                                    Bxg5 22. Bxg5 Qb5
               19     2:18   0.16   13. ... Rb8 14. Rb1 O-O 15. Qa4 f5
                                    16. Bc2 c5 17. dxc5 Bxc5 18. Bb2 Qe7
                                    19. Rfd1 Nb6 <HT>
               19->   2:18   0.16   13. ... Rb8 14. Rb1 O-O 15. Qa4 f5
                                    16. Bc2 c5 17. dxc5 Bxc5 18. Bb2 Qe7
                                    19. Rfd1 Nb6 <HT> (s=2)
First move at 19 took 33 seconds, the next move took 44 seconds, and the rest of the moves are buried in the time for the second move. Here you might actually get a speedup of almost 2, assuming your split-at-the-root algorithm is good enough to make sure those two moves are given to different cluster nodes (there is no indication that they are the two best moves; the second never appeared in previous searches).

So > 2x is simply not going to happen except in very rare cases. Most of the time, if you don't change your mind, you are looking at 1.1x or so as the above cases show. I have several long games I can post if someone wants to go thru them as I did above and compute the average speedup for the entire game...

It's not hard to do a simulation. Crafty A always plays its move after 10 seconds of search using its normal algorithm, while Crafty B spends 10 seconds on every single root move and then plays the one with the best score.

I would be curious to know the result here.

Vas
I assume you mean "pure best score" without regard to depth? I can probably pull this off easily enough.

I might even be able to find Monty's old test results, as he did exactly this on his Data General "better-than-a-cluster cluster" years ago (he had a shared I/O bus that was way faster than net-based communication).

I would expect the split at the root version to win, because there is some performance to be gained, although there is some lost as well due to no shared hash. I'd expect the difference to be in the 20-30 Elo range based on the expected speedup. Remember, in many searches the best move is the best move for all searches, and the extra searching on other moves produces zero.
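For what it's worth, here is a rough sketch of the experiment proposed above. The engine interface (generate_root_moves, search_root_move, search_all) is hypothetical, not Crafty's actual API; only the control flow matters:

Code: Select all

#define MAX_MOVES 128

typedef struct { int from, to; } Move;

/* Hypothetical engine interface -- placeholders, not Crafty's real API.   */
extern int  generate_root_moves(Move *moves);            /* returns count           */
extern int  search_root_move(Move m, double seconds);    /* score for one root move */
extern Move search_all(double seconds);                  /* normal iterated search  */

/* Crafty A: the normal algorithm, 10 seconds on the whole search. */
Move play_A(void)
{
    return search_all(10.0);
}

/* Crafty B: 10 seconds on every single root move, then play the move with
   the pure best score, ignoring the depths actually reached. */
Move play_B(void)
{
    Move moves[MAX_MOVES];
    int n = generate_root_moves(moves);
    int best_score = -32767, best_index = 0;

    for (int i = 0; i < n; i++) {
        int score = search_root_move(moves[i], 10.0);
        if (score > best_score) {
            best_score = score;
            best_index = i;
        }
    }
    return moves[best_index];
}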
Bob -

yes, you'll want to take the best score without regard to depth.

The math you quote applies to the time available for the first move. That's different than total effective speedup.

Vas
No it isn't. Here's why. Speedup is the time for the sequential search divided by the time for the parallel search. In the first example above (and let's stop at a fixed depth to make the comparison easy) the sequential search for ply 21 was 109 seconds. The time to search the first move was 99 seconds. Since we are talking about splitting at the root, that first move will be given to a single node and it is going to take 99 seconds to search. Even if you can somehow search the remaining moves in parallel on other nodes, with no overhead caused by not knowing the correct alpha/beta bound for those searches since the first move has not yet been searched, the parallel search is going to take exactly 99 seconds, minimum. And if not knowing the bound causes one of the other moves to take even longer, this gets worse. So the maximum parallel speedup you can get for that move is 109 (sequential time) / 99 (time for longest parallel task). Not very good.

I don't see how you can cheat the search space limits and get around that rather stark reality. On the third case, things are better. Sequential time is 77 seconds for last full iteration. move 1 took 33 seconds. Best move took something less than 44 seconds, but I can't tell exactly how much. So parallel time will be 44 seconds or less, although not a lot less. If we use 44, we get a speedup of 77/44. If we guess that the second move took about as long as the first, since the rest of the moves are mixed up in the time to get the second move, we could use 33 seconds and get a speedup of 77/33. Which produces a speedup somewhere between 1.75 and 2.33x. Most of the time the move doesn't change however, so we get stuck closer to that 109/99 (1.1x speedup) over 80% of the time...
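Spelling the arithmetic out, using the times from the three logs posted above (case 2 is not computed explicitly in the text, but follows the same way):

Code: Select all

#include <stdio.h>

int main(void)
{
    /* Case 1: iteration 21 took 109s sequentially, 99s of it on the first
       root move, so the parallel iteration still takes 99s.              */
    printf("case 1: %.2fx\n", 109.0 / 99.0);                  /* ~1.10 */

    /* Case 2: 91s sequentially (1:16 for the first move + 15s for the rest). */
    printf("case 2: %.2fx\n", 91.0 / 76.0);                   /* ~1.20 */

    /* Case 3: 77s sequentially; the slowest root move (the new best one)
       took somewhere between 33s and 44s.                                */
    printf("case 3: %.2fx to %.2fx\n", 77.0 / 44.0, 77.0 / 33.0);
    return 0;
}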
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

bob wrote: No it isn't. There is a big difference between searching deeply to find the best move, and doing a shallow search terminated by a bunch of very fast games to try to predict the quality of that branch. This doesn't probe anywhere near as deeply as the original search.
You're effectively arguing that there is a big difference between evaluating and searching, because doing a shallow search followed by an evaluation to predict the quality of the branch doesn't probe as deeply as the original search.

I agree on that, but it means nothing for the discussion.

Of course a deep search is better than a shallow search. That's why (strong) go programs search deep: searching deeper wins games.
LMR is simply a methodology to not look at some moves as deeply as others. I do zero probability analysis to make the decisions. A move is reduced or not based on a simple static rule. Significantly different IMHO. It is different from the way UCT decides what to explore.
But HOW is it different? Assuming your LMR at least uses move ordering rules, this means the moves that are scored the highest get a deep search and everybody else gets a less deep search. Your move ordering is probably based on pv/hash, killers, captures, etc, and near the root it might be based on nodecounts from previous searches.

What does UCT do? For nodes it doesn't have much data on, it generally uses move ordering based on static rules, for nodes with more data you use the scores from previous searches. Then based on this move ordering you search the highest scoring moves the deepest and the others less deep.

In UCT you can tune this behaviour by fiddling with the exploration constant.

In LMR you can tune this behaviour by reducing more moves and changing the reduction depth.
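To make the comparison concrete, here is a bare sketch of the two "knobs" being compared (simplified: UCT shown with the standard UCB1 selection term, LMR with a generic move-number threshold and reduction; neither is taken from any particular program):

Code: Select all

#include <math.h>

/* UCT: the exploration constant C decides how much effort the search spends
   on less-promising children.  Select the child that maximizes this value
   (standard UCB1 term; assumes parent_visits >= 1). */
double uct_value(double mean_score, int parent_visits, int child_visits,
                 double exploration_c)
{
    return mean_score +
           exploration_c * sqrt(log((double)parent_visits) /
                                (double)(child_visits + 1));
}

/* LMR: the corresponding knobs are how many early (well-ordered) moves
   escape the reduction and how many plies the rest lose. */
int lmr_search_depth(int depth, int move_number, int is_tactical,
                     int late_move_threshold, int reduction)
{
    if (is_tactical || move_number < late_move_threshold)
        return depth - 1;              /* normal search        */
    return depth - 1 - reduction;      /* reduced-depth search */
}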
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Tony wrote:One of us is not getting it.

If at the root there are 30 moves, and after moves 1 and 2 each has again 30 moves, and we send those 88 positions to cluster clients, will the total search not speed up by 2.5 compared to a single client?

Tony
Look at the data I posted. The total parallel search time is simply the longest time required to search any single root move. Which in the examples I gave produce a speedup of 1.1x or so for the first two, and between 1.75 and 2.33x for the third, which only happens when you change your mind on the best move in mid-iteration. That is a roughly 15% probability event based on prior analysis.

I don't understand the "88 positions". Splitting at the root means each possible root move is made and that position is sent to a cluster node to search. If you have 30 moves at the root, you can get, at most, 30 cluster nodes busy. And the real issue is the amount of time required to search the _first_ move, which is the majority of the time used. And that limits the speedup drastically...
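Schematically, "splitting at the root" amounts to the following (the send/collect calls are hypothetical placeholders, not a real cluster API):

Code: Select all

#define MAX_MOVES 128

typedef struct { int from, to; } Move;

/* Hypothetical cluster primitives -- stand-ins, not a real MPI/socket API. */
extern void send_root_move_to_node(int node, Move m);
extern int  collect_score_from_node(int node);  /* blocks until that node finishes */

/* Split only at the root: one root move per cluster node.  With 30 root
   moves, at most 30 nodes are ever busy, and the iteration cannot end
   before the slowest root move (usually the first) has been searched.
   Sketch assumes n_nodes >= n_moves; extra nodes simply sit idle. */
int split_at_root(const Move *moves, int n_moves, int n_nodes)
{
    int used = (n_moves < n_nodes) ? n_moves : n_nodes;
    int best = -32767;

    for (int i = 0; i < used; i++)
        send_root_move_to_node(i, moves[i]);
    for (int i = 0; i < used; i++) {
        int score = collect_score_from_node(i);
        if (score > best)
            best = score;
    }
    return best;
}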
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

bob wrote: No it isn't. Here's why. Speedup is the time for the sequential search divided by the time for the parallel search. In the first example above (and let's stop at a fixed depth to make the comparison easy)
Stop right here. It makes the comparison easy, but it also makes it invalid, because you DON'T HAVE equal depths.

It should be obvious that having this info:

e4 depth 20 ply score -0.10
d4 depth 23 ply score -0.15
c4 depth 25 ply score -0.20

is more interesting than having this info:

e4 depth 20 ply score -0.10
d4 depth 20 ply score <=-0.10
c4 depth 20 ply score <=-0.10
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Parallel randomized best-first minimax

Post by bob »

Gian-Carlo Pascutto wrote:I'm not sure if anyone of you knows this:

http://citeseerx.ist.psu.edu/viewdoc/su ... .1.23.3861

As far as I could gather, this is extremely close to UCT. They have results for parallelized Crafty :)

I never tried this myself, though.
Actually this is an independent driver that uses Crafty as the "endpoint evaluation". It does a shallow search and that is the "evaluation value" for a position. There have been others, although not "UCT-like". APHID was one.
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: An idea for a new WCCC format - what do you think?

Post by Zach Wegner »

bob wrote:I think the truth lies somewhere in the middle. Each individual search thread will broadcast to the other 70 nodes, not 560 as I had originally mentioned. But with 8 threads on a node, the number of sends will still be around 552 per unit of time since each of the 8 threads on a node can broadcast to the other 69 nodes. So it is a high-traffic deal, still...
It seems to be worse than that. 560 CPUs (senders) * 69 nodes (receivers) = 38,640 hash stores for every time unit. You obviously don't want that for every node. Of course I can't imagine anyone would hash qsearch in this way (I keep qsearch hash process-local anyways), so that cuts it down by a factor of 2 or so. The hash sharing clearly needs to be greatly limited on such high numbers of nodes.

I'm also not sure you want to share every node anyways. That effectively means you're using 560 cpus with a hash table of a few GB. If the search is done properly, two random nodes should have very little overlap in their searches, which means hash sharing between them should be useless.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote: No it isn't. Here's why. Speedup is the time for the sequential search divided by the time for the parallel search. In the first example above (and let's stop at a fixed depth to make the comparison easy)
Stop right here. It makes the comparison easy, but it also makes it invalid, because you DON'T HAVE equal depths.
Previously mentioned. And unequal depths don't matter since we have already given up trying to factor in "depth" to help choose the best move, and we just take the pure best score. In general, the "good moves" take longer to search, period. The bad moves "fly by". So you go deeper on the lousy moves. For example, in the last ACCA event where I noticed what Rybka was doing, Crafty played QxSomething where there was only one way to recapture the queen. And we were seeing much deeper depths from the processors that were not recapturing the queen. And how exactly does that speed up anything?

It should be obvious that having this info:

e4 depth 20 ply score -0.10
d4 depth 23 ply score -0.15
c4 depth 25 ply score -0.20

is more interesting than having this info:

e4 depth 20 ply score -0.10
d4 depth 20 ply score <=-0.10
c4 depth 20 ply score <=-0.10
If that were how it worked, perhaps. But I'd bet e4, d4 and c4 will reach about the _same_ depth for obvious reasons. You might get a 25 ply result for Qxh7 where the queen is simply lost by Rxh7. But that doesn't help as much. This is more an issue one can see in a multi-PV output such as crafty's annotate command produces. And you can get a feel for what goes faster and what goes slower. The crap goes fastest. The good stuff is all pretty much equally slow.

BTW I do not quite see how the first case is "more interesting". I see no way to use that information to make an informed decision about which of the three moves is best...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Zach Wegner wrote:
bob wrote:Why do you think that? I don't see any reason for it to "change". It won't be nearly as efficient as a shared-memory shared-everything search on a shared-memory box, but I don't see why it would degrade for any reason while the search is in progress, any more than a normal threaded search would degrade, which I don't see happen in my case...
I can actually see this being a big problem if you don't significantly share the hash. Say one node, searching deep in the tree, splitting about as far out as you will allow. Then, due to a root fail low, the tree changes, and this node ends up searching some completely different tree that has nothing in common with the first one, so it has almost zero hash information to go on. Of course, this also happens on a single processor, just to a lesser extent. I guess the trick is just to share hash as much as possible.
It would seem to me that this is the _best_ case for a distributed search. When a bound changes at the root, for example, hash info is invalidated anyway. And since there is no hash info on the other nodes anyway for the most part, the hit seems to be about the same for cluster and non-cluster. Not having a shared hash is always going to hurt. I'm certainly going to share some of the hash, but there are lots of ideas to try. For example, blasting all reasonable-draft entries to all nodes as they are stored is a possibility. As is just broadcasting the hash stores to nodes that are "helping" on a specific branch, to reduce overall transmission levels (note that this probably won't be a true UDP broadcast unless one can finagle the IP addresses in such a way that this is possible, which might be a pain but doable).
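One minimal version of the "reasonable draft" filter mentioned above (a sketch; the broadcast call and the draft threshold are placeholders, not anything from Crafty):

Code: Select all

#include <stdint.h>

typedef struct {
    uint64_t key;       /* Zobrist signature           */
    int16_t  score;
    uint8_t  draft;     /* remaining depth when stored */
    uint8_t  bound;     /* EXACT / LOWER / UPPER       */
} HashEntry;

#define MIN_BROADCAST_DRAFT 6          /* placeholder threshold */

extern void store_local(const HashEntry *e);           /* normal local store */
extern void broadcast_to_cluster(const HashEntry *e);  /* hypothetical send  */

/* Store everything locally, but only ship deep-draft entries to the other
   nodes, so shallow and q-search entries never touch the network. */
void hash_store(const HashEntry *e)
{
    store_local(e);
    if (e->draft >= MIN_BROADCAST_DRAFT)
        broadcast_to_cluster(e);
}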
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote: No it isn't. There is a big difference between searching deeply to find the best move, and doing a shallow search terminated by a bunch of very fast games to try to predict the quality of that branch. This doesn't probe anywhere near as deeply as the original search.
You're effectively arguing that there is a big difference between evaluating and searching, because doing a shallow search followed by an evaluation to predict the quality of the branch doesn't probe as deeply as the original search.

I agree on that, but it means nothing for the discussion.

Of course a deep search is better than a shallow search. That's why (strong) go programs search deep: searching deeper wins games.
LMR is simply a methodology to not look at some moves as deeply as others. I do zero probability analysis to make the decisions. A move is reduced or not based on a simple static rule. Significantly different IMHO. It is different from the way UCT decides what to explore.
But HOW is it different? Assuming your LMR at least uses move ordering rules, this means the moves that are scored the highest get a deep search and everybody else gets a less deep search. Your move ordering is probably based on pv/hash, killers, captures, etc, and near the root it might be based on nodecounts from previous searches.
My LMR is based on simple static rules. I don't reduce the hash move, any captures, any checks, or killers. Everything else gets reduced, period.

What does UCT do? For nodes it doesn't have much data on, it generally uses move ordering based on static rules, for nodes with more data you use the scores from previous searches. Then based on this move ordering you search the highest scoring moves the deepest and the others less deep.
I understand that. But LMR doesn't use scores, or search results, etc to make any decisions. At least not mine. I have tried all sorts of things, including full static evals, the old history ideas, etc, and none are better in testing (in fact, all are worse than the current simple approach).

In UCT you can tune this behaviour by fiddling with the exploration constant.

In LMR you can tune this behaviour by reducing more moves and changing the reduction depth.
I can't adjust the number of moves to reduce, and the depth is fixed at 1. More is worse in every test I have tried, with one small exception that I am exploring a bit when I have time.
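Put as code, the rule described here is about this simple (a sketch of the stated rule, not Crafty's actual source):

Code: Select all

/* The static rule as described: never reduce the hash move, captures,
   checks, or killers; reduce everything else by exactly one ply. */
int lmr_reduction(int is_hash_move, int is_capture, int gives_check,
                  int is_killer)
{
    if (is_hash_move || is_capture || gives_check || is_killer)
        return 0;      /* searched at full depth */
    return 1;          /* reduced by one ply     */
}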