Vasik Rajlich wrote:
Ok, I haven't worked on this, but intuitively I would think that the inherent randomness would really help with the scaling. Can't you just do X completely independent MC runs and sum up the data (somehow) at the end?
Vas
Look at the leaf parallelization results in the paper.
It's a bit like trying to optimize your chess program by disabling lazy eval and parallelizing the eval components.
It would have to be more granular. The "root" parallelization results in the paper do exploit this randomness, FWIW.
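For reference, the merging step Vas asks about really is trivial at the root: run completely independent searches and add the per-move statistics together at the end. A minimal sketch, assuming a hypothetical run_mc_search() that fills in per-root-move playout counts (array sizes and seeds are placeholders):

    #include <pthread.h>
    #include <string.h>

    #define NUM_MOVES 64          /* upper bound on root moves (placeholder) */
    #define NUM_RUNS  4           /* independent MC searches                 */

    typedef struct {
        long visits[NUM_MOVES];   /* playouts through each root move */
        long wins[NUM_MOVES];     /* playouts that were won          */
    } RootStats;

    /* hypothetical: one complete, independent Monte-Carlo search */
    extern void run_mc_search(RootStats *out, unsigned seed);

    static RootStats runs[NUM_RUNS];

    static void *worker(void *arg) {
        long id = (long)arg;
        run_mc_search(&runs[id], 12345u + (unsigned)id);  /* distinct seeds */
        return NULL;
    }

    int best_root_move(void) {
        pthread_t tid[NUM_RUNS];
        RootStats total;
        int m, best = 0;
        long i;

        for (i = 0; i < NUM_RUNS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (i = 0; i < NUM_RUNS; i++)
            pthread_join(tid[i], NULL);

        memset(&total, 0, sizeof total);       /* "sum up the data":      */
        for (i = 0; i < NUM_RUNS; i++)         /* root stats are additive */
            for (m = 0; m < NUM_MOVES; m++) {
                total.visits[m] += runs[i].visits[m];
                total.wins[m]   += runs[i].wins[m];
            }
        for (m = 1; m < NUM_MOVES; m++)        /* most-visited move wins  */
            if (total.visits[m] > total.visits[best])
                best = m;
        return best;
    }

This is essentially what the paper's "root parallelization" amounts to; the catch raised above is granularity, since the independent runs duplicate a lot of each other's work.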
Vasik Rajlich wrote:
Are these figures in your view legit?
I've not verified them, and they're in an academic paper, so they're almost certainly lies.
Seriously, I've not tried root parallelization yet because I had a very highly optimized version with local mutexes. I will try it though, if only because I have a cluster now and it's obviously much easier to get running.
But I'd be surprised if they dared publish it if NOTHING was true.
Is the 3.0 speedup a legit indicator that the serial algorithm is suboptimal?
It should be.
I've not seen you really refute Bob's point about the limits of speedups with your algorithm, yet your claims suggest a much higher strength gain than would be expected. I'm inclined to think the above result is an explanation for that, but you're the only one who knows for sure. It's also possible that what Bob thinks you are doing is wrong, but your arguments don't really suggest so.
I prefer to keep my algorithm private for now and just discuss the theoretical aspects.
Root splitting has a very limited ability to search the best move more deeply, obviously. It compensates by searching alternative moves relatively more deeply.
The so-called "root" algorithm in the paper you quoted has nothing to do with splitting at the root, as far as I can tell.
Gian-Carlo Pascutto wrote:
I've not seen you really refute Bob's point about the limits of speedups with your algorithm, yet your claims suggest a much higher strength gain than would be expected. I'm inclined to think the above result is an explanation for that, but you're the only one who knows for sure. It's also possible that what Bob thinks you are doing is wrong, but your arguments don't really suggest so.
I think I should clarify this. I do believe it's useful to search alternates more deeply, and I do not have any problems comparing moves from different depths. But the question is what to do with the bounds.
I mean, if you don't have an alpha value yet from your first move, what are you going to do with the others? If you don't kick off searches for alternates until you have an alpha, your speedup will obviously suck donkey balls.
Of course, life is better when you have an alpha. The various fine-grained parallelization techniques like YBW and DTS completely revolve around this point - they always first get an alpha, at any cost.
This cost is sometimes quite high. With DTS, you'll split at points you otherwise would never want to split at, just to avoid having to work without an alpha.
If you don't have an alpha score, I can imagine 2 things:
a) You use the previous alpha. This sucks, because the score does usually change from iteration to iteration, and the score you get for an alternate might be useless.
b) You use a small window around the previous alpha. This sucks, because non-zero window searches suck.
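To make the two options concrete, here's a hedged sketch of the window choice in C, with centipawn scores; prev_alpha, prev_score and the margin are placeholders, not anyone's actual engine code:

    typedef struct { int alpha, beta; } Window;

    #define ASP_MARGIN 25   /* ~quarter pawn, in centipawns (placeholder) */

    /* option a) reuse the previous iteration's alpha as a scout bound;
       worthless if the score drifted between iterations               */
    Window window_prev_alpha(int prev_alpha) {
        Window w = { prev_alpha, prev_alpha + 1 };   /* zero-width */
        return w;
    }

    /* option b) small aspiration window around the previous score;
       a non-zero window, hence the slower search complained about above */
    Window window_aspiration(int prev_score) {
        Window w = { prev_score - ASP_MARGIN, prev_score + ASP_MARGIN };
        return w;
    }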
'Sucks' is a big word. If you really believe this, why don't you use MTD(f) for your serial search?
Vasik Rajlich wrote:
The problem is that there is no good way to share the hash. Broadcasting a handful of high-depth entries is just a drop in the bucket.
The algorithm needs to be designed around this issue.
Pretty much, but I'd generalize it to latency in general. I have quite a few ideas on how to handle this, but I'm just in the design phase now. Generally, I believe the nodes should be heterogeneous in how they communicate with each other, based not just on split-point locality but also on network latency.
For better or for worse, this thread is making me want to start building the cluster already...
There are two things you can do:
1) You can design some complex communication scheme.
2) You can design a work splitting algorithm which requires less communication in the first place.
Gian-Carlo Pascutto wrote:
I think I should clarify this. I do believe it's useful to search alternates more deeply, and I do not have any problems comparing moves from different depths. But the question is what to do with the bounds.
I mean, if you don't have an alpha value yet from your first move, what are you going to do with the others? If you don't kick off searches for alternates until you have an alpha, your speedup will obviously suck donkey balls.
If you don't have an alpha score, I can imagine 2 things:
a) You use the previous alpha. This sucks, because the score does usually change from iteration to iteration, and the score you get for an alternate might be useless.
b) You use a small window around the previous alpha. This sucks, because non-zero window searches suck.
One possibility is this (another idea from Tony): use small windows around the previous score for each move. As an advantage, you get multi-PV for free.
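A hedged sketch of that per-move aspiration idea, assuming each root move carries its own score from the previous iteration (search(), the margins, and the re-search policy are all invented for illustration):

    /* none of this is real engine code: search() is an ordinary
       alpha-beta over one root move, returning a centipawn score */
    extern int search(int move, int alpha, int beta, int depth);

    void iterate_root(int nmoves, int score[], int depth) {
        for (int m = 0; m < nmoves; m++) {     /* one move per cluster unit */
            int a = score[m] - 25, b = score[m] + 25;
            int s = search(m, a, b, depth);
            while (s <= a || s >= b) {         /* fail low/high: widen */
                if (s <= a) a = s - 100;
                if (s >= b) b = s + 100;
                s = search(m, a, b, depth);
            }
            score[m] = s;   /* every move ends with an exact score: multi-PV */
        }
    }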
Another possibility is a mutation of MTD(f). I'm just making this up here, so bear with me. Each move uses its own previous alpha as a starting bound, then does a binary-type search to narrow the score. This part is like Tony's idea, except replacing the aspiration window with a series of scout searches. But you also maintain a global lower bound on the real score that you can pass around to each node. If for move 7 you know that the score is less than +1.00 due to a scout search, and then move 4 returns +1.50, you get a cutoff and can abandon move 7.
Of course, you don't actually get the "previous alpha" with this. You'd have to just use whatever information you have on the move, or you can use the real alpha of the previous iteration. Also, you have to decide what to do if you are doing an initial search around 0.00, and another move comes back as >1.00. Do you abandon the search and start again around 1.01, or hope for a <=0.00 score?
In general - any time you consider a search with a non-zero window, you can replace it with a sequence of scout (zero-window) searches.
One of the differences between the two methods is the order in which you get information.
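A hedged sketch of that MTD(f)-style mutation, with the per-move scout sequence pruned against a shared lower bound (scout(), global_lb and the initial bounds are invented; how the bound actually travels between cluster nodes is exactly the open design question):

    #include <limits.h>

    /* zero-window probe: nonzero iff the move's score is >= gamma + 1 */
    extern int scout(int move, int gamma, int depth);
    extern int global_lb;   /* best score proven by any move so far;
                               on a cluster this gets broadcast somehow */

    /* narrow one root move's score via scout searches, given inclusive
       bounds lo <= score <= hi carried over from earlier information  */
    int narrow_move(int move, int lo, int hi, int depth) {
        while (lo < hi) {
            if (hi <= global_lb)
                return INT_MIN;        /* can't beat the best move: abandon
                                          (the move 7 vs. move 4 cutoff)   */
            if (lo < global_lb)
                lo = global_lb;        /* scores below the bound are moot  */
            int gamma = lo + (hi - lo) / 2;
            if (scout(move, gamma, depth))
                lo = gamma + 1;        /* fail high: score >  gamma */
            else
                hi = gamma;            /* fail low:  score <= gamma */
        }
        if (lo > global_lb)
            global_lb = lo;            /* new best: tell the other nodes */
        return lo;
    }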
Vasik Rajlich wrote:
The so-called "root" algorithm in the paper you quoted has nothing to do with splitting at the root, as far as I can tell.
In my view the similarity is because searching alternate root moves with a more open window is comparable in effect to running multiple UCT searches and relying on their randomness to make them focus on separate PV lines.
In both cases, lines that would have been discarded by LMR get some forced extra exploration.
bob wrote:
I _did_ an algorithm like that in 1983. I put it together in under 2 weeks and won the 1983 WCCC championship with it. And I _know_ how it works and how the PVs look.
Actually, I'd be curious to hear more about it. How did you split the root moves? Which moves got PV searches and which ones got scout (null-window) searches? What beta did you use for the scout searches?
I couldn't find anything useful in the public domain about splitting work without shared memory, except for the Cluster Toga algorithm, which wasn't really suitable for Rybka (among other things, it doesn't apply to analysis and would be useless for users).
Vas
What we did back then was more dynamic than that. I had the root move list sorted just as I do today (previous iteration node count). Processor 1 searched the first move with a normal aspiration window. Processor 2 did the same. When one of them reported back with a new best move, the alpha/beta value was shipped to the other processor, which would take a short "break", update all the bounds, and then check to see if anything would now cut off where it didn't when we started. We then searched the root moves one at a time; whenever a processor finished a move, it grabbed another (dynamic load balancing) until the list was gone.
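A rough reconstruction in C of the scheme just described (nothing here is the actual 1983 code: shared memory is assumed, a mutex stands in for the original machinery, search_root_move() is invented, and the mid-search "break" to re-tighten bounds is elided):

    #include <pthread.h>

    extern int nmoves;                   /* root moves, sorted by previous
                                            iteration's node counts       */
    extern int search_root_move(int move, int alpha, int beta);

    static int next_move  = 0;           /* next unsearched root move */
    static int best_score = -32000;      /* shared bound, updated on
                                            every completed move      */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* one of these runs per processor */
    void *processor(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int move  = (next_move < nmoves) ? next_move++ : -1;
            int alpha = best_score;      /* pick up any bound "shipped"
                                            to us since our last move  */
            pthread_mutex_unlock(&lock);
            if (move < 0)
                return NULL;             /* list is gone */

            int score = search_root_move(move, alpha, 32000);

            pthread_mutex_lock(&lock);
            if (score > best_score)
                best_score = score;      /* new best move: everyone's
                                            alpha improves from here  */
            pthread_mutex_unlock(&lock);
        }
    }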
We always did an aspiration-type search in CB, where the root window was very narrow, something like +/- 0.25 pawns, for the first move. At the time we were using PVS (this was something from the late 70s and was used by Belle and ourselves as well as others), which is (I assume) what you mean by "scout" (which is not quite the same as PVS, but close enough for discussion).
I should add that the Univac 1100 and the Cray were both shared-memory machines. I'm looking at cluster-based search right now because of this 70-node 8-core cluster we have. I can search about 20M nodes per second on one node, which gives a potential for something faster than Deep Blue (we could possibly hit 1400M nodes per second), although there are great issues to overcome. But Hsu and I did talk a lot about his parallel search, and I like his two-level approach (split at early plies among the nodes, each node splits at late plies among the local processors)...
Aha, thanks. Like many things, this won't work without shared memory. Without shared memory, if "processor A" (to use your terminology) takes move x at some iteration, it needs to take the same move at the next iteration as well.
Basically, without shared memory, you want to have the same units processing the same parts of the tree throughout the entire search. I agree with Hsu's comment - without shared memory, work needs to be split at or near the root.
Vas
ps. I use the term "scout search" to refer to any search with a zero-width window.
That's what I thought (scout search).
The way I am looking at splitting on a cluster is to use a "split hash table". I store the hash signature and node ID when I split, so that the same node searches the same sub-tree each time to take advantage of the hash, killers, etc. There are issues with this as the next iteration can greatly alter the shape of the tree and give one node a set of moves that are much harder to search than others, producing an imbalance. I'm working on sorting out the details for this kind of stuff as the code gets written...
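Something like this is how the "split hash table" idea reads; a minimal sketch, with invented sizes and the replacement policy glossed over:

    #include <stdint.h>

    #define SPLIT_BITS 12
    #define SPLIT_SIZE (1u << SPLIT_BITS)

    typedef struct {
        uint64_t sig;      /* Zobrist signature of the split position */
        int      node;     /* cluster node that searched it last time */
    } SplitEntry;

    static SplitEntry split_table[SPLIT_SIZE];

    /* send a split point back to its previous owner so its hash table,
       killers, etc. are already warm; otherwise pick a fresh node      */
    int assign_split(uint64_t sig, int least_loaded_node) {
        SplitEntry *e = &split_table[sig & (SPLIT_SIZE - 1)];
        if (e->sig != sig) {           /* new split point (or collision) */
            e->sig  = sig;
            e->node = least_loaded_node;
        }
        return e->node;
    }

The imbalance Bob mentions shows up exactly here: nothing stops one entry's subtree from exploding at the next iteration while its owner is stuck with it.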
That's right. You'll need to consider whether you want to change the shape of your search tree to maximize the stability of the work distribution (i.e. the degree to which the same cluster units handle the same parts of the search tree).
This is where root splitting is special. Root moves have the unique property that they are guaranteed to require searching.
Of course, root splitting alone isn't the answer.
Vas
I'm working on the "two-level" approach, although I am also considering the more complicated issues encountered on a cluster that has NUMA (AMD) nodes, because you also have issues on a single node with trying to maximize local memory accesses...
But my current cluster approach (a good way from working yet, however) is to split everywhere (although not too far from the root, due to the overhead of a cluster split even with an Infiniband interconnect), but to be consistent about which node searches which position, so that at least the hash info each node has will be as helpful as possible, even though this is way less efficient than a pure shared-memory "share everything" search.
My problem with this kind of algorithm is that it will start to choke when the search tree starts to change, which is exactly when you want to be efficient.
Vas
Why do you think that? I don't see any reason for it to "change". It won't be nearly as efficient as a shared-memory shared-everything search on a shared-memory box, but I don't see why it would degrade for any reason while the search is in progress, any more than a normal threaded search would degrade, which I don't see happen in my case...
I plan on using the exact same search I currently use, except that when I split, I look at how close to the root the search currently is; if within N plies, it will split across nodes, and if more than N plies away, it will do a normal SMP-type split and work on the current node...
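In code, the routing decision is about this small (N and the two split helpers are placeholders for whatever the engine actually does at a split point):

    #define N 3   /* cluster splits only within N plies of the root */

    extern void cluster_split(int ply);  /* message-passing, across nodes */
    extern void smp_split(int ply);      /* shared-memory, local cores    */

    void split(int ply) {
        if (ply <= N)
            cluster_split(ply);   /* expensive, so keep it near the root */
        else
            smp_split(ply);       /* normal SMP split on this node       */
    }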
As long as each node works on the same thing from iteration to iteration, you'll be fine. The lack of shared memory will hardly matter - you'll miss a handful of 'remote transpositions', which is no big deal.
As soon as a node takes on new work, you'll get massive inefficiency.
Vas
I don't agree with that at all. Lots of past experimentation by myself and others (I did a sorta-cluster Cray Blitz once, but it used Cray's HIPPI channel, which was not exactly high-latency) has shown just how significant the loss of transpositions across the nodes is, and it is non-trivial. And it is never going to come even remotely close to what we can do on a fully shared-memory system.
There is significant interaction between different root moves, and this continues to grow as the game moves toward the endgame. Fine 70 is probably the ultimate type of position where this shows up best (although anyone can search this to 40+ plies instantly). But searching each root move on a different node greatly delays the depth at which you find the winning move Kb1, usually forcing this all the way out to ply 26, when most find it at 18-20 or so.
Do you have some exact data about this?
I agree that this issue will get more serious as pieces are traded and that in Fine 70 it will be a big deal.
Vas
I don't at the moment, but hang on for a couple of months. My first cluster approach is not going to deal with hashing; it will first deal with effective splitting approaches. I will be able to run this test then and can report on what I gain, for example, when using 8 threads on a single node (same splitting idea of course, and still using message passing to split, but where I can still share the entire hash table) as opposed to running the same way where I don't share the table...
However, I have looked at lots of such data in the past as various people did distributed search. Schaeffer was one who did this, on Sun workstations. But I don't remember the results, and I lost all of my old data back in 1995 or so...
bob wrote:
Except for the issues with a best-first vs depth-first search, where the latter is _way_ more efficient, I would agree to an extent. But BF search does have issues with respect to searching stuff alpha/beta would toss out quickly.
BF and DF search can be made equivalent in most cases. It's been done for Alpha-Beta (MTD), proof-number-search (df-pn) and UCT.
I don't see how. Alpha/beta does not work the same on best-first. Depth-first has alpha/beta discarding moves for all time once they have been proven worse. In BF they just sit around to see if they begin to look interesting enough to examine further as other good moves begin to go south. BF also has the resulting huge memory requirement for today's search spaces... but it eliminates the need for huge hash tables.
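For reference, the Alpha-Beta case of the equivalence claimed above is MTD(f) (Plaat et al.): a driver that makes only ordinary depth-first zero-window calls, while the transposition table makes the node ordering best-first (MTD(+inf) expands the same nodes as SSS*). A textbook sketch, with search() assumed to be the usual fail-soft alpha-beta over a hash table:

    /* converge on the minimax score using zero-window probes only */
    extern int search(int alpha, int beta, int depth);

    int mtdf(int guess, int depth) {
        int g = guess;
        int lower = -32000, upper = 32000;
        while (lower < upper) {
            int gamma = (g == lower) ? g + 1 : g;  /* probe point      */
            g = search(gamma - 1, gamma, depth);   /* zero-width probe */
            if (g < gamma)
                upper = g;                         /* fail low         */
            else
                lower = g;                         /* fail high        */
        }
        return g;
    }

The memory point doesn't disappear, though: MTD(f) leans on the transposition table to avoid re-expanding the tree on every probe, so it trades BF's explicit tree for a big hash table rather than eliminating the cost.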