Clustering etc. thread

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

bob wrote: In a worst-case on my cluster here, you can have 560 broadcasts per hash store since every processor will need to broadcast when it updates the hash.
I don't understand how you get to 560.

I do agree keeping system overhead low is a problem. At some point you see the OS networking stack using CPU...would be a nice thing for hyperthreading CPUs :)
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

bob wrote: UCT is an interesting idea, (...) Extending that to chess, however, is a different issue entirely and whether it would work here or not is not a given.

However, in the current context, it is irrelevant to the current discussion anyway.
If you claim that, then you can certainly answer the following question:

What's the difference between UCT with a tiny exploration term (=current top go programs) and an alpha-beta search with very heavy LMR (=current top chess programs), in terms of nodes explored?
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

Zach Wegner wrote:Well it actually can if you use what GCP suggested.

Code: Select all

nodes = 1 << depth; // rough approximation of effort, based on ebf=2
prob = 1 / (1 + pow(10, -score/4)); // floating point score, see http://chessprogramming.wikispaces.com/Pawn+Advantage%2C+Win+Percentage%2C+and+ELO
ucb = prob + k*sqrt(log(total_nodes)/nodes); // UCB1: win probability plus exploration term
best_score = max(ucb);
The last line (choosing) can use some modification, but that's just a basic formula. GCP is the real expert there...
UCB only tells you what move to explore next (=to search deeper). There's no (mathematical) rule to pick the final move. But as hinted in an earlier post, you can do:

a) Pick the most explored (=deepest searched) move. This works because UCB explores the best scoring moves most.
b) Pick the best scoring move. This works because UCB explores good scoring moves a lot, so they quickly get a lot of depth if their score stays good.


The annoying case is the one I posted earlier, but you get a strong program no matter which of the above you do.

You should not just play the move with the best UCB; that's mathematically guaranteed to be a blunder once in a while.

Now, in the case of chess, if you split the set of moves equally between CPUs, each move is guaranteed to get pretty much the same effort spent on it. This might result in different depths, but who cares if the effort spent is equal? So, in the clustering-root-unsynchronised case, it seems totally obvious to me that you would just pick the best scoring move, ignoring depth totally.
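
As a minimal sketch of the two selection rules (the struct fields and function names are made up for illustration, not taken from any particular program):

Code: Select all

/* Minimal sketch of the two final-move selection rules described above.
   All names here are hypothetical, not from any particular engine. */
typedef struct {
    int    move;    /* the root move */
    double score;   /* score (or win probability) from its deepest search */
    long   nodes;   /* effort spent on this move, a proxy for depth */
} root_move_t;

/* (a) most explored: pick the move that received the most effort */
int pick_most_explored(const root_move_t *rm, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (rm[i].nodes > rm[best].nodes)
            best = i;
    return rm[best].move;
}

/* (b) best scoring: pick the move with the best score, ignoring effort/depth */
int pick_best_score(const root_move_t *rm, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (rm[i].score > rm[best].score)
            best = i;
    return rm[best].move;
}

Either rule works in practice; the important point is that the exploration bonus used to guide the search is dropped when the final move is chosen.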
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

Zach Wegner wrote:I personally don't think there's that much potential in it, since you would always want to be searching the best evaluated (by UCB) move, not the "best out of the set of moves that we chose for this node at iteration 1 so we can have some hash table consistency".
Correct. I do agree with Bob that just splitting at the root looks like a dead end. I'm not terribly surprised Vasik is claiming better results than could be naively expected, because changing a null-window to an open window search can have some nice side effects, which was also observed by the guys parallelizing UCT. I wonder if Vasik wants to tell us what exactly he does with the windows when distributing moves.

But in the grand scheme of things it (=root splitting) looks like a dead end.
And in a lot of cases you want multiple nodes searching the same root move. I'm going the YBW-like route, though with DTS it gets _extremely_ complicated.
My cluster program also just splits the tree.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Gian-Carlo Pascutto wrote:
Zach Wegner wrote:Well it actually can if you use what GCP suggested.

Code: Select all

nodes = 1 << depth; // rough approximation of effort, based on ebf=2
prob = 1 / (1 + pow(10, -score/4)); // floating point score, see http://chessprogramming.wikispaces.com/Pawn+Advantage%2C+Win+Percentage%2C+and+ELO
ucb = prob + k*sqrt(log(total_nodes)/nodes); // UCB1: win probability plus exploration term
best_score = max(ucb);
The last line (choosing) can use some modification, but that's just a basic formula. GCP is the real expert there...
UCB only tells you what move to explore next (=to search deeper). There's no (mathematical) rule to pick the final move. But as hinted in an earlier post, you can do:

a) Pick the most explored (=deepest searched) move. This works because UCB explores the best scoring moves most.
b) Pick the best scoring move. This works because UCB explores good scoring moves a lot, so they quickly get a lot of depth if their score stays good.


The annoying case is the one I posted earlier, but you get a strong program no matter which of the above you do.

You should not just play the move with the best UCB; that's mathematically guaranteed to be a blunder once in a while.

Now, in the case of chess, if you split the set of moves equally between CPUs, each move is guaranteed to get pretty much the same effort spent on it. This might result in different depths, but who cares if the effort spent is equal? So, in the clustering-root-unsynchronised case, it seems totally obvious to me that you would just pick the best scoring move, ignoring depth totally.
OK, let's analyze that to see the implication from a speed / skill point of view.

(1) shallow search produces a good score, deep search produces a bad score. You take the move with the good score, which means the deeper searches are just thrown away and helped not at all.

(2) shallow search produces a bad score, deeper search produces a good score. It's safe to play the better score, but the move searched less deeply might well be better. We'll never know, but we did get a good move.

(3) What about the case where the shallow search has a score that is steadily dropping from iteration to iteration, but it is slightly better than the score from the deeper search that is rising iteration to iteration?

My interest in this is to get _all_ the nodes to help with the search so that _all_ moves are searched more deeply than normal, which means we waste much less time and get a much better speedup in terms of search space / time. A real parallel search avoids all this junk about different depths and having shallow and deep search scores to compare, because alpha/beta handles that by itself when it is done reasonably. And the speedup will be far better, as will the scalability. This root-splitting approach is totally pointless on either of my clusters here: one has 128 nodes, the other has 70 nodes, and we rarely have that many legal moves at the root, so we could never use the entire box, whereas a real parallel search could. What kind of speedup I will see is unknown, but if I can get 35 out of 70 I'll be more than happy, and I believe that is doable, if not even better. Splitting at the root will rarely reach 2x faster.
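
To make that last point concrete, here is a toy calculation of why root splitting is bounded near 2x; the 50% share for the first root move is an assumed figure, typical for a well-ordered alpha/beta tree, not a measurement:

Code: Select all

#include <stdio.h>

/* Toy illustration, not engine code: with good move ordering the first root move
   takes roughly half the iteration time, and the remaining moves share the rest.
   Even with one CPU per root move, wall-clock time is bounded by the first move. */
int main(void) {
    double first_move = 0.5;               /* assumed fraction of iteration spent on move 1 */
    double all_others = 1.0 - first_move;  /* remaining root moves combined */
    double serial     = first_move + all_others;   /* = 1.0 */
    double parallel   = first_move;                /* others finish on their own CPUs */
    printf("root-splitting speedup bound: ~%.1fx\n", serial / parallel);  /* ~2.0x */
    return 0;
}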
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote: UCT is an interesting idea, (...) Extending that to chess, however, is a different issue entirely and whether it would work here or not is not a given.

However, in the current context, it is irrelevant to the current discussion anyway.
If you claim that, then you can certainly answer the following question:

What's the difference between UCT with a tiny exploration term (=current top go programs) and an alpha-beta search with very heavy LMR (=current top chess programs), in terms of nodes explored?
Not quite the same thing IMHO. While we are certainly reducing, we are not doing any sort of MC analysis (which is based on very fast game speeds). IMHO, UCT has always been applicable to those types of games (go comes to mind) where it is impossible to reach enough depth with traditional AB minimax search to play with the best humans. So you make up for the depth by using the MC approach to try to figure out which move has the most potential for winning games based on very fast search times, and growing the tree based on a sort of probability model. In chess, at least in my program, I am not doing any probabilistic analysis, I am just using dynamic/static criteria to decide what looks interesting to search and what appears to be almost useless. But rather than giving up on a move as useless, it just gets reduced in depth to decrease the effort required to dismiss it.

I've been involved in more than a few MC-type experiments, although only one was related to chess and automatic eval tuning. I never saw any useful results in the chess world, possibly because Go is such a different game, with no different piece types and far more move choices...

Doesn't mean there is no way it can work, I just haven't seen such a case yet. And in the current discussion it is moot anyway as this is not what is being done...

I've been thinking about Vas' suggestion of 10 secs for one program used in a normal search, vs 10 secs per root move. I can even clear the hash table and killers so that there is no unintentional communication between moves, to see just how much better the latter would be. And I am sure it would be better. But this is going to assume N processors where N is a very large number. With Crafty's current search, doubling the processors adds a ply, and should add about 70 Elo per doubling or so. This hardware "simulation" should produce at least 5 doublings, assuming 32 moves on average at the root to use 32 cpus. That should be 350 Elo better. I suspect we are talking about 60-70 at most, and that is probably generous. I'll try to write the code to do this and we'll see...
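
Just to spell out the arithmetic, a toy sketch using the rule-of-thumb numbers above (nothing here is Crafty code):

Code: Select all

#include <math.h>
#include <stdio.h>

/* Back-of-the-envelope figures from the post: doubling the CPUs adds a ply,
   worth roughly 70 Elo, and 32 CPUs (one per root move) is 5 doublings. */
int main(void) {
    int    cpus           = 32;
    double doublings      = log2((double)cpus);  /* = 5 */
    double elo_per_double = 70.0;
    printf("ideal parallel search gain: ~%.0f Elo\n", doublings * elo_per_double);  /* ~350 */
    return 0;
}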
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote: In a worst-case on my cluster here, you can have 560 broadcasts per hash store since every processor will need to broadcast when it updates the hash.
I don't understand how you get to 560.

I do agree keeping system overhead low is a problem. At some point you see the OS networking stack using CPU...would be a nice thing for hyperthreading CPUs :)
I have 70 nodes, so I have 70 low-level splitters running at the same time. On each node, I have 8 cpus, and each one of those will be storing stuff into the hash, many at low levels. And they will need to broadcast. 70 x 8 = 560. yes, the traditional cluster has one cpu per node, but this is not a traditional cluster. I've run some simulations while planning and writing code and broadcasts were an issue. I had decided that lost packets were irrelevant and did not try to do a time-out / retry in the test code. And I could not determine how deep into the tree I can go before I turn broadcasting off. That will have to wait until I have working code, since it is impossible to estimate for the general case how many hash stores per second happen at depth N in the tree... and I suspect this is going to have to be dynamically adjusted during the game anyway as the tree goes from shallow/bushy to deep/narrow.

I have not decided that this is the best way either. The alternative is to do asynchronous hash table probes, again limited by depth so that nodes near the tips avoid driving the network traffic to insane levels. That is, do a probe, and while waiting on the result either (a) try to fire off the probe as soon as physically possible, such as right after a MakeMove() at the previous ply or (b) start the search here as though it were a miss and if we get a success packet back later, take whatever action is necessary to use that information then.

Again, both of those are possibilities, with a couple of others thrown in for good measure. Others have tried the idea of having a single "probe" thread on all nodes that handle the remote communication, and then truly distributing the hash table by using some set of bits in the signature for the node address where that entry gets stored.
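
A rough sketch of that last scheme (hash entries owned by signature bits, plus a depth cutoff on remote traffic); the node count, cutoff value, and helper routines are placeholders for illustration, not anything from Crafty:

Code: Select all

#include <stdint.h>

#define NUM_NODES         70   /* cluster nodes; illustrative */
#define REMOTE_MIN_DEPTH   6   /* below this remaining depth, keep entries local only */

/* Assumed helpers: a local hash store and an asynchronous network send. */
void store_local(uint64_t signature, int depth, int score);
void send_store_message(int node, uint64_t signature, int depth, int score);

/* Which cluster node "owns" this hash signature. */
static int owner_node(uint64_t signature) {
    return (int)(signature % NUM_NODES);
}

/* Always store locally; forward to the owning node only for deep entries,
   so entries near the tips never hit the network. */
void hash_store(uint64_t signature, int depth, int score, int self_node) {
    store_local(signature, depth, score);
    if (depth >= REMOTE_MIN_DEPTH) {
        int owner = owner_node(signature);
        if (owner != self_node)
            send_store_message(owner, signature, depth, score);
    }
}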

Since I haven't done any of that in the past, it is going to be a time for pure experimental science it seems...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

<big snip, post is too long to follow>
Uri Blass wrote:


1)Maybe there is misunderstanding.
I claim that, based on the same logic that says you cannot compare moves at different depths, you also cannot compare moves at the same depth.
I did not agree with the logic that you cannot compare moves at different depths.

2)depth 19, move A, score = -1.2; depth 21, move B, score = +.7

I prefer move B because the score is better.
The point is this. If you prefer B, what did you get from searching A and spending all that effort on it? In a normal alpha/beta search, you would have spent almost no time on A if B is better, which means you spend the effort on the move that counts, using _all_ processors.

Of course if I can search move A to depth 21 it is better information but I assume that I have no time for it.

There are cases when the score for move A is going to be 0.8 at depth 21 but I cannot know it.

I can also ask you what you prefer: depth 19 move A score = -1.2, or depth 19 move B score = +0.7. If you say move B, I can show you cases where the picture is different if you search both to depth 21.

It is going to prove nothing.

3)I did not claim that unsynchronized splitting at the root is good relative to other parallel search techniques.

If you get an effective speed improvement of 2.5 by splitting at the root with 100 processors (so you can use one processor for every root move) and get an effective speed improvement of 50 by parallel search of today with 100 processors, then of course parallel search of today is better.
First, you are not going to get 2.5. Again, there is some very detailed math in a paper I wrote, vetted by some people who know this kind of stuff inside/out. In this particular approach, the term "speedup" is almost meaningless, since there is no sensible way to measure speedup in an unsynchronized search. Perhaps in test positions where there is _one_ solution that takes a pretty deep search to find. And in that case, this approach is simply going to suck. And suck very badly. Compared to even the most elementary parallel search approach.

My only claim is that it is not obvious that you cannot get an effective speed improvement of more than 2 by unsynchronized splitting at the root, and you cannot trust results of Cray Blitz when modern programs use different search and clearly search deeper.
Here's the test. Find a group of positions that have a single key move which requires a fairly deep search to find. Run the program on one CPU to see how long it takes to find that move. Then run it on N (where N is a large enough number to search each root move on a different processor) and again measure the time to find that move.

It is _not_ going to average 2.5x faster. Or 2.0x faster. Or even 1.5x faster. Because of the basic math of alpha/beta. Lukas has continually refused to run "problems" on his cluster, saying something like "the cluster is for playing games, not solving problems..." Which tells me it does poorly at these kinds of tests. And yet nearly every chess position is just another problem position where we need to find the best move...
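
One way to phrase that test as code; solve_time() is a hypothetical wrapper that runs the engine on a position and returns the time until the key move is found:

Code: Select all

/* Sketch of the proposed measurement: average time-to-solution speedup over a
   set of single-key-move positions. solve_time() is assumed, not real code. */
double solve_time(const char *fen, int ncpus);   /* seconds until the key move is found */

double average_speedup(const char **fens, int count, int ncpus) {
    double sum = 0.0;
    for (int i = 0; i < count; i++)
        sum += solve_time(fens[i], 1) / solve_time(fens[i], ncpus);
    return sum / count;
}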



Maybe comparing odd and even plies with Cray Blitz caused problems, but today, when programs get higher depths with all the reductions and extensions, you often get odd and even plies even if you search to the same depth, and I do not understand your comment about having 2 different searches for Crafty without check extensions.

My claim was that comparing depth A with depth A for default Crafty is equivalent to comparing depth A and depth A+1 for modified Crafty with no check extensions.
Here is the point. For a single iteration in Crafty, everything is searched the same way, same extensions, same reductions, same null-move searches, same evaluation, etc. Every move searched at the root gets extended and reduced in different ways. I agree with that. But the same rules apply to all moves. If I suddenly just search one of the moves to an extra ply, first I burn a significant amount of extra time doing so, and then the question is do I get anything back as a result. The answer is nothing useful within the alpha/beta framework. So even though every move is likely going to reach many different depths at different nodes, all are subject to the same strategy and overall things are reasonably equal in terms of what is done, how deep the branches go, etc.


Saying that comparing depth A and depth A+1 does not make sense for default Crafty while saying that the same idea is logical for modified Crafty does not seem logical to me.

Uri
User avatar
M ANSARI
Posts: 3707
Joined: Thu Mar 16, 2006 7:10 pm

Re: An idea for a new WCCC format - what do you think?

Post by M ANSARI »

bob wrote:
Gian-Carlo Pascutto wrote:
bob wrote: In a worst-case on my cluster here, you can have 560 broadcasts per hash store since every processor will need to broadcast when it updates the hash.
I don't understand how you get to 560.

I do agree keeping system overhead low is a problem. At some point you see the OS networking stack using CPU...would be a nice thing for hyperthreading CPUs :)
I have 70 nodes, so I have 70 low-level splitters running at the same time. On each node, I have 8 cpus, and each one of those will be storing stuff into the hash, many at low levels. And they will need to broadcast. 70 x 8 = 560. yes, the traditional cluster has one cpu per node, but this is not a traditional cluster. I've run some simulations while planning and writing code and broadcasts were an issue. I had decided that lost packets were irrelevant and did not try to do a time-out / retry in the test code. And I could not determine how deep into the tree I can go before I turn broadcasting off. That will have to wait until I have working code, since it is impossible to estimate for the general case how many hash stores per second happen at depth N in the tree... and I suspect this is going to have to be dynamically adjusted during the game anyway as the tree goes from shallow/bushy to deep/narrow.

I have not decided that this is the best way either. The alternative is to do asynchronous hash table probes, again limited by depth so that nodes near the tips avoid driving the network traffic to insane levels. That is, do a probe, and while waiting on the result either (a) try to fire off the probe as soon as physically possible, such as right after a MakeMove() at the previous ply or (b) start the search here as though it were a miss and if we get a success packet back later, take whatever action is necessary to use that information then.

Again, both of those are possibilities, with a couple of others thrown in for good measure. Others have tried the idea of having a single "probe" thread on all nodes that handle the remote communication, and then truly distributing the hash table by using some set of bits in the signature for the node address where that entry gets stored.

Since I haven't done any of that in the past, it is going to be a time for pure experimental science it seems...

Bob, let me ask you: have you tried the idea of using a "daughter" card to generate Monte Carlo data on the fly and thus add possible moves for the nodes to ponder? It seems that a GPU would be perfect for generating MC output at a very fast rate ... especially if, say, 4 super fast VGA cards are in SLI mode. I think a non-integer setup can be made to work very quickly for MC, unlike for real chess. I have been playing around with MC to analyze some positions and found it amazingly useful in some difficult positions. When I analyze positions I do notice that sometimes, when I force an engine to play a certain line that feels strong, it will very quickly find the correct strong moves once the engine is nudged in that direction ... if left on its own it will take too long to be practical. My thinking is that generating MC data and dedicating, say, one or two nodes in a cluster setup to it could improve the choice of moves for the cluster to ponder.

Vas also mentioned that with MC it is very easy to get good parallelization across distributed hardware of unequal strength, since the depth can be set to be easily reachable even for the weakest hardware. So if you have, say, 1000 computers connected online, maybe code could be written to send out 1000 different MC tasks to those computers, and the output data could then be used to choose one or two moves on a many-node cluster.

It just seems that as a cluster gets a huge number of nodes, many of those nodes become redundant and are not used optimally. MC sure seems like a shortcut to nudge a cluster to probe obscure yet powerful moves.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

M ANSARI wrote:
bob wrote:
Gian-Carlo Pascutto wrote:
bob wrote: In a worst-case on my cluster here, you can have 560 broadcasts per hash store since every processor will need to broadcast when it updates the hash.
I don't understand how you get to 560.

I do agree keeping system overhead low is a problem. At some point you see the OS networking stack using CPU...would be a nice thing for hyperthreading CPUs :)
I have 70 nodes, so I have 70 low-level splitters running at the same time. On each node, I have 8 cpus, and each one of those will be storing stuff into the hash, many at low levels. And they will need to broadcast. 70 x 8 = 560. yes, the traditional cluster has one cpu per node, but this is not a traditional cluster. I've run some simulations while planning and writing code and broadcasts were an issue. I had decided that lost packets were irrelevant and did not try to do a time-out / retry in the test code. And I could not determine how deep into the tree I can go before I turn broadcasting off. That will have to wait until I have working code, since it is impossible to estimate for the general case how many hash stores per second happen at depth N in the tree... and I suspect this is going to have to be dynamically adjusted during the game anyway as the tree goes from shallow/bushy to deep/narrow.

I have not decided that this is the best way either. The alternative is to do asynchronous hash table probes, again limited by depth so that nodes near the tips avoid driving the network traffic to insane levels. That is, do a probe, and while waiting on the result either (a) try to fire off the probe as soon as physically possible, such as right after a MakeMove() at the previous ply or (b) start the search here as though it were a miss and if we get a success packet back later, take whatever action is necessary to use that information then.

Again, both of those are possibilities, with a couple of others thrown in for good measure. Others have tried the idea of having a single "probe" thread on all nodes that handle the remote communication, and then truly distributing the hash table by using some set of bits in the signature for the node address where that entry gets stored.

Since I haven't done any of that in the past, it is going to be a time for pure experimental science it seems...

Bob, let me ask you: have you tried the idea of using a "daughter" card to generate Monte Carlo data on the fly and thus add possible moves for the nodes to ponder? It seems that a GPU would be perfect for generating MC output at a very fast rate ... especially if, say, 4 super fast VGA cards are in SLI mode. I think a non-integer setup can be made to work very quickly for MC, unlike for real chess. I have been playing around with MC to analyze some positions and found it amazingly useful in some difficult positions. When I analyze positions I do notice that sometimes, when I force an engine to play a certain line that feels strong, it will very quickly find the correct strong moves once the engine is nudged in that direction ... if left on its own it will take too long to be practical. My thinking is that generating MC data and dedicating, say, one or two nodes in a cluster setup to it could improve the choice of moves for the cluster to ponder.

Vas also mentioned that with MC it is very easy to get good parallelization across distributed hardware of unequal strength, since the depth can be set to be easily reachable even for the weakest hardware. So if you have, say, 1000 computers connected online, maybe code could be written to send out 1000 different MC tasks to those computers, and the output data could then be used to choose one or two moves on a many-node cluster.

It just seems that as a cluster gets a huge number of nodes, many of those nodes become redundant and are not used optimally. MC sure seems like a shortcut to nudge a cluster to probe obscure yet powerful moves.
I've not paid a lot of attention to GPUs, for the very reason that I decided to get out of the exotic hardware stuff back in 1995. We have one GPU project here with a couple of students working with folks out at Sandia lab to build an ultra-high-performance RAID device based on a GPU. They've already produced over 1 gigabyte per second including the redundancy calculations, orders of magnitude cheaper than the ultra-high-performance custom RAID stuff you can buy.

MC may well become a good approach for large clusters. But I believe that for clusters like our 70 node x 8 core cluster, one can harness all of that into a normal alpha/beta search with much more exact results...