Clustering etc. thread

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Results from UCT parallelization

Post by bob »

Gian-Carlo Pascutto wrote:
Vasik Rajlich wrote: The so-called "root" algorithm in the paper you quoted has nothing to do with splitting at the root, as far as I can tell.
In my view the similarity is because searching alternate root moves with a more open window is comparable in effect to running multiple UCT searches and relying on their randomness that they'll focus on seperate PV lines.

In both cases, there is some effect that lines which would have been discarded by LMR get more forced exploration.
The issue I see is that random selection for expansion is what you do when you have no idea how to proceed based on some sort of knowledge, so MC methodology is the next thing that comes to mind. But trying to predict how a random selection compares to a more knowledge-based LMR approach seems futile, because we know random selection is purely random, while with LMR we can reduce that randomness by purposefully excluding some moves based on something a bit more accurate than random selection.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Results from UCT parallelization

Post by Gian-Carlo Pascutto »

bob wrote: The issue I see is that random selection for expansion is what you do when you have no idea how to proceed based on some sort of knowledge, so MC methodology is the next thing that comes to mind. But trying to predict how a random selection compares to a more knowledge-based LMR approach seems futile, because we know random selection is purely random, while with LMR we can reduce that randomness by purposefully excluding some moves based on something a bit more accurate than random selection.
UCT does not use random selection, it uses information known from past searches (the current average score so far). The small variances in average score are enough to sometimes select alternate moves. It won't suddenly start selecting bad moves.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

bob wrote:
Gian-Carlo Pascutto wrote: BF and DF search can be made equivalent in most cases. It's been done for Alpha-Beta (MTD), proof-number-search (df-pn) and UCT.
I don't see how. Alpha/beta does not work the same on best-first. Depth-first has alpha/beta discarding moves for all time once they have been proven worse. In BF they just sit around to see if they begin to look interesting enough to examine further as other good moves begin to go south. BF also has the resulting huge memory requirement for today's search spaces... but it eliminates the need for huge hash tables.
Aw come on. The results for both alpha-beta+TT equivalence with SSS and for proof-number search are both PhD thesi with rigorous mathematical proofs. Maybe the one for UCT is also, but I don't speak Japanese. There's no point in even arguing this one.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote:
Gian-Carlo Pascutto wrote: BF and DF search can be made equivalent in most cases. It's been done for Alpha-Beta (MTD), proof-number-search (df-pn) and UCT.
I don't see how. Alpha/beta does not work the same on best-first. Depth-first has alpha/beta discarding moves for all time once they have been proven worse. In BF they just sit around to see if they begin to look interesting enough to examine further as other good moves begin to go south. BF also has the resulting huge memory requirement for today's search spaces... but it eliminates the need for huge hash tables.
Aw come on. The results for both alpha-beta+TT equivalence with SSS and for proof-number search are both PhD thesi with rigorous mathematical proofs. Maybe the one for UCT is also, but I don't speak Japanese. There's no point in even arguing this one.
I am specifically talking about UCT search, which is a best-first approach. There is no "equivalence" until you make a drastic assumption about the order in which moves are expanded in "best first", which can't be realized in real games... Note that the classical definition of "equivalence" is "searches exactly the same tree nodes" and that won't happen in the real world...

And then there is the alpha/beta issue which doesn't apply to BF anyway.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Results from UCT parallelization

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote: The issue I see is that random selection for expansion is what you do when you have no idea how to proceed based on some sort of knowledge, so MC methodology is the next thing that comes to mind. But trying to predict how a random selection compares to a more knowledge-based LMR approach seems futile, because we know random selection is purely random, while with LMR we can reduce that randomness by purposefully excluding some moves based on something a bit more accurate than random selection.
UCT does not use random selection, it uses information known from past searches (the current average score so far). The small variances in average score are enough to sometimes select alternate moves. It won't suddenly start selecting bad moves.
Perhaps we are not talking about the same UCT implementation then. My info about this is pretty old and I have not worked with it at all... When I first heard of it, one of the issues was a random element to prevent the usual local maxima/minima issue in traditional BFS.
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: Results from UCT parallelization

Post by Vasik Rajlich »

Gian-Carlo Pascutto wrote:
Vasik Rajlich wrote: The so-called "root" algorithm in the paper you quoted has nothing to do with splitting at the root, as far as I can tell.
In my view the similarity is because searching alternate root moves with a more open window is comparable in effect to running multiple UCT searches and relying on their randomness that they'll focus on seperate PV lines.

In both cases, there is some effect that lines which would have been discarded by LMR get more forced exploration.
Aha, I see. It's like cheap splitting, without much redundancy (due to the huge emphasis on the main line).

It would be interesting to try this with chess.

Vas
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
I _did_ an algorithm like that in 1983. I put it together in under 2 weeks and won the 1983 WCCC championship with it. And I _know_ how it works and how the PVs look.
Actually, I'd be curious to hear more about it. How did you split the root moves? Which moves got PV searches and which ones got scout (null-window) searches? What beta did you use for the scout searches?

I couldn't find anything useful in the public domain about splitting work without shared memory, except for the Cluster Toga algorithm, which wasn't really suitable for Rybka (among other things, it doesn't apply to analysis and would be useless for users).

Vas
What we did back then was more dynamic that that. I had the root move list sorted just as I do today (previous iteration node count). Processor 1 searched the first move with a normal aspiration window. Processor 2 did the same. When one of them reported back with a new best move, the alpha/beta value was shipped to the other processor, which would take a short "break", update all the bounds, and then check to see if anything will now cutoff where it didn't when we started. We then searched the root moves one at a time, whenever a processor finished a move it grabbed another (dynamic load balancing) until the list was gone.

We always did an aspiration-type search in CB, where the root window was very narrow, something like +/- 0.25 pawns, for the first move. At the time we were using PVS (this was something from the late 70's and was used by belle and ourselves as well as others) which is (I assume) what you mean by "scout" (which is not quite the same as PVS but close enough for discussion).

I should add that both the univac 1100 and Cray were both shared-memory. I'm looking at cluster-based search right now because of this 70 node 8-core cluster we have. I can search about 20M nodes per sec on one node, which gives a potential for something faster than deep blue (we could possibly hit 1400M nodes per second although there are great issues to overcome. But Hsu and I did talk a lot about his parallel search and I like his two-level approach (split at early plies among the nodes, each node splits at late plies among the local processors)...
Aha, thanks. Like many things, this won't work without shared memory. Without shared memory, if 'processor A' (to use your terminology) takes moves x at some iteration, it needs to take the same move at the next iteration as well.

Basically, without shared memory, you want to have the same units processing the same parts of the tree throughout the entire search. I agree with Hsu's comment - without shared memory, work needs to be split at or near the root.

Vas

ps. I use the term "scout search" to refer to any search with a zero-width window.
That's what I thought (scout search).

The way I am looking at splitting on a cluster is to use a "split hash table". I store the hash signature and node ID when I split, so that the same node searches the same sub-tree each time to take advantage of the hash, killers, etc. There are issues with this as the next iteration can greatly alter the shape of the tree and give one node a set of moves that are much harder to search than others, producing an imbalance. I'm working on sorting out the details for this kind of stuff as the code gets written...
That's right. You'll need to consider whether you want to change the shape of your search tree to maximize the stability of the work distribution (ie. the degree to which the same cluster units handle the same parts of the search tree).

This is where root splitting is special. Root moves have the unique property that they are guaranteed to require searching.

Of course, root splitting alone isn't the answer.

Vas

I'm working on the "two-level" approach, although I am also considering the more complicated issues encountered on a cluster that has NUMA (AMD) nodes. Because you also have issues on a single node dealing with trying to maximize local memory accesses...
But my current cluster approach (good way from working yet however) is to split everywhere (although not too far from the root due ot the overhead of a cluster split even with infiniband interconnect) but try to repeat with respect to which node searches which position so that at least the hash info they have will be as helpful as possible, even though it is way less efficient than a pure shared-memory "share everything" search.
My problem with this kind of algorithm is that it will start to choke when the search tree starts to change, which is exactly when you want to be efficient.

Vas
Why do you think that? I don't see any reason for it to "change". It won't be nearly as efficient as a shared-memory shared-everything search on a shared-memory box, but I don't see why it would degrade for any reason while the search is in progress, any more than a normal threaded search would degrade, which I don't see happen in my case...

I plan on using the exact same search I currently use. Except that when I split, I look at how close to the root the search currently is, and if within N plies, it will split across nodes, if > N plies, it will do a normal SMP-type split and work on the current node...
As long as each node works on the same thing from iteration to iteration, you'll be fine. The lack of shared memory will hardly matter - you'll miss a handful of 'remote transpositions', which is no big deal.

As soon as a node takes on new work, you'll get massive inefficiency.

Vas
I don't agree with that at all. Lots of past experimentation by myself and others ( did a sorta-cluster Cray Blitz once but it used Cray's HIPPI channel which was not exactly high-latency) has shown just how significant the loss of transpositions across the nodes is, and it is non-trivial. And is never goine to come even remotely close to what we can do on a fully-shared memory system.

There is significant interaction between different root moves, and this continues to grow as the game moves toward the endgame. Fine 70 is probably the ultimate type position where this shows up best (although anyone can search this to 40+ plies instantly). But searching each root move on a different node greatly delays the depth at which you find the winning move Kb1, usually forcing this all the way out to ply 26 when most find it at 18-20 or so.
Do you have some exact data about this?

I agree that this issue will get more serious as pieces are traded and that in Fine 70 it will be a big deal.

Vas
I don't at the moment, but hang on for a couple of months. My first cluster approach is not going to deal with hashing, it will first deal with effective splitting approaches. I will be able to run this test then and can report on what I gain, for example, when using 8 threads on a single node (same splitting idea of course, and still using message passing to split, but where I can still share the entire hash table) as opposed to running the same way where I don't share the table...

However, I have looked at lots of such data in the past as various people did distributed search. Schaeffer was one that did this on Sun workstations. But I don't remember the results and I lost all of my old data back in 1995 or so..
I'd like to see the different hashing effects somehow disentangled (ie. hash moves, transpositions close in the tree, and transpositions far in the tree).

Vas
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Results from UCT parallelization

Post by Gian-Carlo Pascutto »

Code: Select all

  F6 ->  244374 (U: 48.49%) (R: 49.03%:  257089) PV: F6 C4 D4 C3 C5 B5 C6 G7 F7
B6 C7 G6 G4 H4 H5 G5 H3 F4 G3 F5 F3 H6 J4 F8 E8
  D6 ->   70733 (U: 48.54%) (R: 49.02%:  133645) PV: D6 E7 F6 E3 C7 F7 G7 G8 H8
H7 G6 H9 J7 J8 J6
  C6 ->   10788 (U: 48.06%) (R: 50.59%:   48039) PV: C6 C7 D6 E7 B7 B8 B6
  E7 ->    8861 (U: 48.16%) (R: 49.32%:   76092) PV: E7 D6 D5 E6 F6 F5 G5 F4 C5

  C5 ->    3515 (U: 47.79%) (R: 49.85%:   35438) PV: C5 D6 F7 D5 D4
  G3 ->     363 (U: 46.74%) (R: 49.51%:   91917) PV: G3 F6 E7
====================================
244374 visits, score 48.49% (from 48.45%) PV: F6 C4 D4 C3 C5 B5 C6 G7 F7 B6 C7 G
6 G4 H4 H5 G5 H3 F4 G3 F5 F3 H6 J4 F8 E8

339277 visits, 445900 nodes, 15331 vps
This is some example output after a search. The first column is the move, the second one nodes searched, the third column the score, you can ignore the rest. The last is the mainline.

As you can see, the moves with the best scores (the mainlines) get a huge amount of search effort thrown at them, much more so than the alternates (there are more moves searched shown here, these are just the top scoring ones). But even the bad ones get some effort. You can say that UCT distributes search effort according to goodness.

If you watch closely, you'll also see that the second best move has a better score than the first move...this is exactly the case we discussed before where a move with low depth gets a better score than a move with much more depth. If I hadn't interrupted the search now, most effort would have been directed to D6, because that's the new best scoring move...like what happens if a move that was LMR'ed fails high :)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Results from UCT parallelization

Post by bob »

Gian-Carlo Pascutto wrote:

Code: Select all

  F6 ->  244374 (U: 48.49%) (R: 49.03%:  257089) PV: F6 C4 D4 C3 C5 B5 C6 G7 F7
B6 C7 G6 G4 H4 H5 G5 H3 F4 G3 F5 F3 H6 J4 F8 E8
  D6 ->   70733 (U: 48.54%) (R: 49.02%:  133645) PV: D6 E7 F6 E3 C7 F7 G7 G8 H8
H7 G6 H9 J7 J8 J6
  C6 ->   10788 (U: 48.06%) (R: 50.59%:   48039) PV: C6 C7 D6 E7 B7 B8 B6
  E7 ->    8861 (U: 48.16%) (R: 49.32%:   76092) PV: E7 D6 D5 E6 F6 F5 G5 F4 C5

  C5 ->    3515 (U: 47.79%) (R: 49.85%:   35438) PV: C5 D6 F7 D5 D4
  G3 ->     363 (U: 46.74%) (R: 49.51%:   91917) PV: G3 F6 E7
====================================
244374 visits, score 48.49% (from 48.45%) PV: F6 C4 D4 C3 C5 B5 C6 G7 F7 B6 C7 G
6 G4 H4 H5 G5 H3 F4 G3 F5 F3 H6 J4 F8 E8

339277 visits, 445900 nodes, 15331 vps
This is some example output after a search. The first column is the move, the second one nodes searched, the third column the score, you can ignore the rest. The last is the mainline.

As you can see, the moves with the best scores (the mainlines) get a huge amount of search effort thrown at them, much more so than the alternates (there are more moves searched shown here, these are just the top scoring ones). But even the bad ones get some effort. You can say that UCT distributes search effort according to goodness.

If you watch closely, you'll also see that the second best move has a better score than the first move...this is exactly the case we discussed before where a move with low depth gets a better score than a move with much more depth. If I hadn't interrupted the search now, most effort would have been directed to D6, because that's the new best scoring move...like what happens if a move that was LMR'ed fails high :)
I understand how it works. I've used BFS in the past for other games. But we were on this "equivalence" topic and that's not what happens. That is, BFS and DFS are not "equivalent" except under impossibly tightly constrained conditions that are artificial with respect to real games.

That might well be the best answer for a big cluster search for all I know. I've not considered any alternatives to plain-old alpha/beta, because so far (excluding clusters) nothing has ever beat alpha/beta with respect to chess.

If I run into a buzzsaw on this cluster search, I may well give your idea a try to see what happens...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
Vasik Rajlich wrote:
bob wrote:
I _did_ an algorithm like that in 1983. I put it together in under 2 weeks and won the 1983 WCCC championship with it. And I _know_ how it works and how the PVs look.
Actually, I'd be curious to hear more about it. How did you split the root moves? Which moves got PV searches and which ones got scout (null-window) searches? What beta did you use for the scout searches?

I couldn't find anything useful in the public domain about splitting work without shared memory, except for the Cluster Toga algorithm, which wasn't really suitable for Rybka (among other things, it doesn't apply to analysis and would be useless for users).

Vas
What we did back then was more dynamic that that. I had the root move list sorted just as I do today (previous iteration node count). Processor 1 searched the first move with a normal aspiration window. Processor 2 did the same. When one of them reported back with a new best move, the alpha/beta value was shipped to the other processor, which would take a short "break", update all the bounds, and then check to see if anything will now cutoff where it didn't when we started. We then searched the root moves one at a time, whenever a processor finished a move it grabbed another (dynamic load balancing) until the list was gone.

We always did an aspiration-type search in CB, where the root window was very narrow, something like +/- 0.25 pawns, for the first move. At the time we were using PVS (this was something from the late 70's and was used by belle and ourselves as well as others) which is (I assume) what you mean by "scout" (which is not quite the same as PVS but close enough for discussion).

I should add that both the univac 1100 and Cray were both shared-memory. I'm looking at cluster-based search right now because of this 70 node 8-core cluster we have. I can search about 20M nodes per sec on one node, which gives a potential for something faster than deep blue (we could possibly hit 1400M nodes per second although there are great issues to overcome. But Hsu and I did talk a lot about his parallel search and I like his two-level approach (split at early plies among the nodes, each node splits at late plies among the local processors)...
Aha, thanks. Like many things, this won't work without shared memory. Without shared memory, if 'processor A' (to use your terminology) takes moves x at some iteration, it needs to take the same move at the next iteration as well.

Basically, without shared memory, you want to have the same units processing the same parts of the tree throughout the entire search. I agree with Hsu's comment - without shared memory, work needs to be split at or near the root.

Vas

ps. I use the term "scout search" to refer to any search with a zero-width window.
That's what I thought (scout search).

The way I am looking at splitting on a cluster is to use a "split hash table". I store the hash signature and node ID when I split, so that the same node searches the same sub-tree each time to take advantage of the hash, killers, etc. There are issues with this as the next iteration can greatly alter the shape of the tree and give one node a set of moves that are much harder to search than others, producing an imbalance. I'm working on sorting out the details for this kind of stuff as the code gets written...
That's right. You'll need to consider whether you want to change the shape of your search tree to maximize the stability of the work distribution (ie. the degree to which the same cluster units handle the same parts of the search tree).

This is where root splitting is special. Root moves have the unique property that they are guaranteed to require searching.

Of course, root splitting alone isn't the answer.

Vas

I'm working on the "two-level" approach, although I am also considering the more complicated issues encountered on a cluster that has NUMA (AMD) nodes. Because you also have issues on a single node dealing with trying to maximize local memory accesses...
But my current cluster approach (good way from working yet however) is to split everywhere (although not too far from the root due ot the overhead of a cluster split even with infiniband interconnect) but try to repeat with respect to which node searches which position so that at least the hash info they have will be as helpful as possible, even though it is way less efficient than a pure shared-memory "share everything" search.
My problem with this kind of algorithm is that it will start to choke when the search tree starts to change, which is exactly when you want to be efficient.

Vas
Why do you think that? I don't see any reason for it to "change". It won't be nearly as efficient as a shared-memory shared-everything search on a shared-memory box, but I don't see why it would degrade for any reason while the search is in progress, any more than a normal threaded search would degrade, which I don't see happen in my case...

I plan on using the exact same search I currently use. Except that when I split, I look at how close to the root the search currently is, and if within N plies, it will split across nodes, if > N plies, it will do a normal SMP-type split and work on the current node...
As long as each node works on the same thing from iteration to iteration, you'll be fine. The lack of shared memory will hardly matter - you'll miss a handful of 'remote transpositions', which is no big deal.

As soon as a node takes on new work, you'll get massive inefficiency.

Vas
I don't agree with that at all. Lots of past experimentation by myself and others ( did a sorta-cluster Cray Blitz once but it used Cray's HIPPI channel which was not exactly high-latency) has shown just how significant the loss of transpositions across the nodes is, and it is non-trivial. And is never goine to come even remotely close to what we can do on a fully-shared memory system.

There is significant interaction between different root moves, and this continues to grow as the game moves toward the endgame. Fine 70 is probably the ultimate type position where this shows up best (although anyone can search this to 40+ plies instantly). But searching each root move on a different node greatly delays the depth at which you find the winning move Kb1, usually forcing this all the way out to ply 26 when most find it at 18-20 or so.
Do you have some exact data about this?

I agree that this issue will get more serious as pieces are traded and that in Fine 70 it will be a big deal.

Vas
I don't at the moment, but hang on for a couple of months. My first cluster approach is not going to deal with hashing, it will first deal with effective splitting approaches. I will be able to run this test then and can report on what I gain, for example, when using 8 threads on a single node (same splitting idea of course, and still using message passing to split, but where I can still share the entire hash table) as opposed to running the same way where I don't share the table...

However, I have looked at lots of such data in the past as various people did distributed search. Schaeffer was one that did this on Sun workstations. But I don't remember the results and I lost all of my old data back in 1995 or so..
I'd like to see the different hashing effects somehow disentangled (ie. hash moves, transpositions close in the tree, and transpositions far in the tree).

Vas
I'm not quite sure I am interpreting that correctly. Do you mean test with a global hash for best move, but not use scores/bounds? and then complete hash entries, but only available either (a) near the root or (b) everywhere???