I'm not sure if anyone of you knows this:
http://citeseerx.ist.psu.edu/viewdoc/su ... .1.23.3861
As far as I could gather, this is extremely close to UCT. They have results for parallelized Crafty
I never tried this myself, though.
Clustering etc. thread
Moderators: hgm, Rebel, chrisw
-
- Posts: 1243
- Joined: Sat Dec 13, 2008 7:00 pm
-
- Posts: 1243
- Joined: Sat Dec 13, 2008 7:00 pm
Re: An idea for a new WCCC format - what do you think?
I think you forgot that the bunches of 8 cpus share memory. So for one store, that's only 70 broadcasts. One to each shared memory. You do have 560 search threads broadcasting to 70 clients each.bob wrote: I have 70 nodes, so I have 70 low-level splitters running at the same time. On each node, I have 8 cpus, and each one of those will be storing stuff into the hash, many at low levels. And they will need to broadcast. 70 x 8 = 560.
Nobody said scaling was gonna be easy
Yes. My initial implementations had problems in endgames...you can guess why.and I suspect this is going to have to be dynamically adjusted during the game anyway as the tree goes from shallow/bushy to deep/narrow.
Isn't that the most fun part?Since I haven't done any of that in the past, it is going to be a time for pure experimental science it seems...
-
- Posts: 1243
- Joined: Sat Dec 13, 2008 7:00 pm
Re: An idea for a new WCCC format - what do you think?
Not at all. All current GPU's have bad behaviour with branches. Avoiding branches in chess code is difficult if not unreasonable. Nobody has a good working implementation so far, and with reason!M ANSARI wrote: It seems that VGA GPU would be perfect for generating MC output at a very fast rate ...
I think it's safe to say that with current GPU hardware, nobody is going to be able to produce a strong program. And that's because the hardware sucks.
It's not. Vas, you're wrong: strong MC (meaning, with tree search) doesn't parallelize any easier than alpha-beta. I've given some hints in my discussion with Bob about why.Vas also mentioned that MC is very easy to get good parralellization
Last edited by Gian-Carlo Pascutto on Thu Mar 12, 2009 11:11 am, edited 1 time in total.
-
- Posts: 1243
- Joined: Sat Dec 13, 2008 7:00 pm
Re: Results from UCT parallelization
http://www.cs.unimaas.nl/g.chaslot/pape ... elMCTS.pdfVasik Rajlich wrote:
What were the UCT speedups for tree splitting and multiple runs?
It is quite possible that our serial searches are somehow too "structured" and not soft/stochastic enough. It is a little insane to build alpha-beta trees around 30-move main lines.
Vas
-
- Posts: 1243
- Joined: Sat Dec 13, 2008 7:00 pm
Re: An idea for a new WCCC format - what do you think?
That's not a fair test. When you disable those hash stores, it also affects the searches on the local nodes. I'm sure the performance loss is an order of magnitude then, no need to test. But that's not the same as not having all info from the other nodes available.Vasik Rajlich wrote: In practice, sharing high-depth entries doesn't really change the situation that much. You want the full search state for whatever part of the tree you are searching or your performance will plummet. Try disabling the lowest-depth 98% of your hash writes in Sjeng and see what happens.
Vas
I do agree there is probably a large performance gain for somehow distributing searches to nodes that have most info about them (and root splitting make this easy). My current implementation doesn't do this, distributing searches to nodes is essentially on an as-they-come-available basis.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An idea for a new WCCC format - what do you think?
Why do you think that? I don't see any reason for it to "change". It won't be nearly as efficient as a shared-memory shared-everything search on a shared-memory box, but I don't see why it would degrade for any reason while the search is in progress, any more than a normal threaded search would degrade, which I don't see happen in my case...Vasik Rajlich wrote:My problem with this kind of algorithm is that it will start to choke when the search tree starts to change, which is exactly when you want to be efficient.bob wrote:But my current cluster approach (good way from working yet however) is to split everywhere (although not too far from the root due ot the overhead of a cluster split even with infiniband interconnect) but try to repeat with respect to which node searches which position so that at least the hash info they have will be as helpful as possible, even though it is way less efficient than a pure shared-memory "share everything" search.Vasik Rajlich wrote:That's right. You'll need to consider whether you want to change the shape of your search tree to maximize the stability of the work distribution (ie. the degree to which the same cluster units handle the same parts of the search tree).bob wrote:That's what I thought (scout search).Vasik Rajlich wrote:Aha, thanks. Like many things, this won't work without shared memory. Without shared memory, if 'processor A' (to use your terminology) takes moves x at some iteration, it needs to take the same move at the next iteration as well.bob wrote:What we did back then was more dynamic that that. I had the root move list sorted just as I do today (previous iteration node count). Processor 1 searched the first move with a normal aspiration window. Processor 2 did the same. When one of them reported back with a new best move, the alpha/beta value was shipped to the other processor, which would take a short "break", update all the bounds, and then check to see if anything will now cutoff where it didn't when we started. We then searched the root moves one at a time, whenever a processor finished a move it grabbed another (dynamic load balancing) until the list was gone.Vasik Rajlich wrote:Actually, I'd be curious to hear more about it. How did you split the root moves? Which moves got PV searches and which ones got scout (null-window) searches? What beta did you use for the scout searches?bob wrote:
I _did_ an algorithm like that in 1983. I put it together in under 2 weeks and won the 1983 WCCC championship with it. And I _know_ how it works and how the PVs look.
I couldn't find anything useful in the public domain about splitting work without shared memory, except for the Cluster Toga algorithm, which wasn't really suitable for Rybka (among other things, it doesn't apply to analysis and would be useless for users).
Vas
We always did an aspiration-type search in CB, where the root window was very narrow, something like +/- 0.25 pawns, for the first move. At the time we were using PVS (this was something from the late 70's and was used by belle and ourselves as well as others) which is (I assume) what you mean by "scout" (which is not quite the same as PVS but close enough for discussion).
I should add that both the univac 1100 and Cray were both shared-memory. I'm looking at cluster-based search right now because of this 70 node 8-core cluster we have. I can search about 20M nodes per sec on one node, which gives a potential for something faster than deep blue (we could possibly hit 1400M nodes per second although there are great issues to overcome. But Hsu and I did talk a lot about his parallel search and I like his two-level approach (split at early plies among the nodes, each node splits at late plies among the local processors)...
Basically, without shared memory, you want to have the same units processing the same parts of the tree throughout the entire search. I agree with Hsu's comment - without shared memory, work needs to be split at or near the root.
Vas
ps. I use the term "scout search" to refer to any search with a zero-width window.
The way I am looking at splitting on a cluster is to use a "split hash table". I store the hash signature and node ID when I split, so that the same node searches the same sub-tree each time to take advantage of the hash, killers, etc. There are issues with this as the next iteration can greatly alter the shape of the tree and give one node a set of moves that are much harder to search than others, producing an imbalance. I'm working on sorting out the details for this kind of stuff as the code gets written...
This is where root splitting is special. Root moves have the unique property that they are guaranteed to require searching.
Of course, root splitting alone isn't the answer.
Vas
I'm working on the "two-level" approach, although I am also considering the more complicated issues encountered on a cluster that has NUMA (AMD) nodes. Because you also have issues on a single node dealing with trying to maximize local memory accesses...
Vas
I plan on using the exact same search I currently use. Except that when I split, I look at how close to the root the search currently is, and if within N plies, it will split across nodes, if > N plies, it will do a normal SMP-type split and work on the current node...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An idea for a new WCCC format - what do you think?
No it isn't. There is a big difference between searching deeply to find the best move, and doing a shallow search terminated by a bunch of very fast games to try to predict the quality of that branch. This doesn't probe anywhere near as deeply as the original search the original search.Gian-Carlo Pascutto wrote:MC analysis is just the evaluation function, or the quiescent search, if you wish. I asked a question about nodes explored and you give a (flawed) argument that the evaluation is not the same.bob wrote:Not quite the same thing IMHO. While we are certainly reducing, we are not doing any sort of MC analysis (which is based on very fast game speeds). (...)Gian-Carlo Pascutto wrote: What's the difference between UCT with a tiny exploration term (=current top go programs) and an alpha-beta search with very heavy LMR (=current top chess programs), in terms of nodes explored?
MC analysis with a varying number of samples is equivalent to an evaluation function with lazy exits, from the point of view of the tree search.
Really? What is LMR you think? You're probabilistically reducing moves besides the first (few), on the observation that in general your move ordering is likely to be correct. How is this different from UCT favoring the PV and exploring the alternates less?And growing the tree based on a sort of probability model. In chess, at least my program, I am not doing any probabilistic analysis,
LMR is simply a methodology to not look at some moves as deeply as others. I do zero probability analysis to make the decisions. A move is reduced or not based on a simple static rule. Significantly different IMHO. It is different based on the way UCT decides what to explore.
Exactly as UCT does.I am just dynamic/static criteria to decide what looks interesting to search and what appears to be almost useless. But rather than giving up on a move as useless, it just gets reduced in depth to decrease effort required to dismiss it.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An idea for a new WCCC format - what do you think?
I think the truth lies somewhere in the middle. Each individual search thread will broadcast to the other 70 nodes, not 560 as I had originally mentioned. But with 8 threads on a node, the number of sends will still be around 552 per unit of time since each of the 8 threads on a node can broadcast to the other 69 nodes. So it is a high-traffic deal, still...Gian-Carlo Pascutto wrote:I think you forgot that the bunches of 8 cpus share memory. So for one store, that's only 70 broadcasts. One to each shared memory. You do have 560 search threads broadcasting to 70 clients each.bob wrote: I have 70 nodes, so I have 70 low-level splitters running at the same time. On each node, I have 8 cpus, and each one of those will be storing stuff into the hash, many at low levels. And they will need to broadcast. 70 x 8 = 560.
Nobody said scaling was gonna be easy
Yes. My initial implementations had problems in endgames...you can guess why.and I suspect this is going to have to be dynamically adjusted during the game anyway as the tree goes from shallow/bushy to deep/narrow.
Isn't that the most fun part?Since I haven't done any of that in the past, it is going to be a time for pure experimental science it seems...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An idea for a new WCCC format - what do you think?
I have that code written already. I use hash signature/node-id pairs, so that when I have something to split/distribute, I know where it was sent the last time and I send it back there again to hope that there is good info in the hash still (not a given as that node could have searched other stuff that blew the info out.Gian-Carlo Pascutto wrote:That's not a fair test. When you disable those hash stores, it also affects the searches on the local nodes. I'm sure the performance loss is an order of magnitude then, no need to test. But that's not the same as not having all info from the other nodes available.Vasik Rajlich wrote: In practice, sharing high-depth entries doesn't really change the situation that much. You want the full search state for whatever part of the tree you are searching or your performance will plummet. Try disabling the lowest-depth 98% of your hash writes in Sjeng and see what happens.
Vas
I do agree there is probably a large performance gain for somehow distributing searches to nodes that have most info about them (and root splitting make this easy). My current implementation doesn't do this, distributing searches to nodes is essentially on an as-they-come-available basis.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An idea for a new WCCC format - what do you think?
I'm thinking 8 nodes to try to control the elapsed time of the game so I can play enough of them to draw a reasonable conclusion.Vasik Rajlich wrote:Yes, that would be really good. You don't need 8 nodes, just do the root moves in sequence on one node.bob wrote:I suppose I could run this test easily enough. I could use 8 nodes and only split at the root, and modify how the search works so that each node just keeps searching (unsynchronized) until time expires.Vasik Rajlich wrote:Yes, exactly. I could not have said it better. This is even (slightly) better than the experiment I suggested above. Uri, you should build a cluster.Uri Blass wrote:1)With late move reductions even searching to the same depth can be practically different depth so if you cannot compare different depth then you also cannot compare the same depth.bob wrote:I don't see where this nonsense comes from. Two moves, different scores, equal depths. I most certainly can compare them. And choose the largest. But if the depths are different, and the scores are different, I can only flip a coin to choose one. Choosing the best is wrong. Choosing the deepest depth is wrong. Choosing one of the two moves with different depths is just a purely random choice.Uri Blass wrote:Of course the score may change but by the same logic you also cannot compare between scores at the same depths because the program may change its mind in later depth.bob wrote:There is absolutely no doubt you are wrong. How many times have you seen a program search a move to depth N with a + score, then at depth n+1 the score drops off the map and it switches to a new move? If you believe what you wrote, why not stop your search when the eval gets to a good point? Why continue to search deeper?Uri Blass wrote:I can certainly choose and the simplest choice is to choose higher score is better.bob wrote:You are badly misusing terminology.Uri Blass wrote:No program today have search that is good enough but in theoryVasik Rajlich wrote:Sure. If you're not changing your mind, it doesn't matter what kind of speedup you have.bob wrote:Here's what you need to make that happen.Vasik Rajlich wrote:By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doingbob wrote:Can we stay in the real world? Splitting at the root can not produce a 2.5x speedup, when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone that has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)Vasik Rajlich wrote:The effective speedup is probably somewhere between 2.5:1 and 3:1 for 5 nodes, which is what Lukas had when he tested all of this.Uri Blass wrote:I read this post and I can say 2 things.Dirt wrote:There is something of an explanation here.Vasik Rajlich wrote:Where did that come from ??bob wrote:I don't buy the "this hurts Rybka" idea, because the cluster rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Vas
1)I think that it is impossible to know the algorithm rybka is using based on output from a single position.
It is possible that something similiar that is not exactly the same is used
when some illogical moves that lose the queen are analyzed but this is not all the story and the algorithm is based partly on "split only at the root" and partly on another idea.
2)I remember that Vas said 100 elo based on testing at fast time control and I suspect that at fast time control you get clearly more than 50 elo per doubling so practically 5 nodes do not give 4:1 speed improvement but clearly less than it(maybe 2.5:1).
Uri
Now he's up to 9 nodes BTW
Vas
Vas
(1) you need to change your mind at least once at the root during the last couple of iterations. More changes is better.
That's just life without shared memory. Any cluster implementation is going to have problems in a position like that.
(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.
With infinite # of processors and splitting only at the root, you will get a lot more than 1.5x.
I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either.
Vas
if you have a good search the speed up may be smaller from splitting at the root for the simple reason that the value of 1.5x speed improvement is bigger than the value of one ply with no pruning.
I think that
some type of bad evaluation also can cause the speed up to be smaller
or even negative.
Imagine that you search with no pruning and extensions and no qsearch and imagine that you have an evaluation that gives no bonus for the side to move so practically you often evaluate even depth as better than odd depths(maybe it should be the opposite and I did not think about it but the idea is the same)
If you split at the root you may get depth 7 for move A and depth 8 for move B and prefer move B not because it is better but only because you searched to even depth.
This problem does not happen without splitting at the root because without splitting at the root you always get the same depth for all moves.
Uri
1. "splitting at the root" does not mean each root move gets searched to different depth. I split at the root in current crafty. that only means that I do a parallel search on the root moves as well as deeper moves, because this is a far more efficient way to search.
2. You are using what is called "unsynchronized search" where each node searched a different move (or set of moves) at the root, and when iteration N is done, N+1 is started without regard for how the other moves are progressing on other nodes.
this is useless.
How can you choose between depth 21, eval=+1.3, and depth 19, eval +2.5?? You can't. This has been tried in the past, by Newborn, by Schaeffer, and by others. There is no way to compute any sort of equivalence function so that you can decide which of the above is better. The depth 19 move might be even higher at depth 21. Or it might be way lower. The only way to discover this is to search both moves to the same depth. Anything else is beyond hopeless and is a coin flip. You can't even choose between depth 21, +1.5, and depth 19, -2.0, because by depth 21 the -2.0 score might be +5.0...
depth 19 eval=+2.5 is better than depth 21 eval=+1.3 for the simple reason that 2.5>1.3
I believe that with relatively good evaluation it is not so bad choice(not that splitting at the root is good but it may give effective speed improvement of 2 or more than 2 assuming that all root moves are searched).
I may be wrong and the only way to know is by testing.
Uri
As far as a speedup of > 2, it simply will not/can not happen. Amdahl's law says that the overall speedup for an algorithm is:
speedup = time(N) / T(P)
T(P) = sequential_processing_time + parallel_processing_time / N
where N = number of processors.
If the first move takes 50% of the time, and that is the usual case, then you absolutely can not get a speedup > 2.0. It is impossible. Except for those cases where the first move is not best. Now you search the second move. And it takes the same amount of time. So you do get a bigger boost there. But that only happens about 15% of the time. 85% of the time the speedup is going to be way below 2.0 no matter how many CPUs you use. And there is simply no way to escape the basic sequential property of the alpha/beta algorithm.
I trust the search to at least search the two moves in an equal way, so that the resulting backed-up scores can be compared. That's what minimax and alpha/beta is all about.
You do not know the best move and you always guess based on information and hope to be right.
It is still the case today. And I am amazed you don't understand why...
If it is not the case then you cannot also compare scores at the same depth because practically one move is searched to normal depth and another move is searched to smaller depth because of late move reductions.
It may be bad to compare score in different depth with
bad evaluation and no reductions but it is not the practical case(maybe it was the case 20 or 30 years ago).
Uri
2)I think that an interesting experiment that everybody can do may be to test A against B when A search at fixed number of nodes without splitting the root(let say 100,000 nodes per move) and B search after every root move at 50,000 nodes per move.
B choose a move simply by comparing the scores without caring about the fact that depths are different.
I do not know if A wins or B wins and it may be dependent on the program but I see no reason to assume that it is obvious that B cannot win.
Assuming nodes are nearly proportional to time if
B wins then it suggests that splitting at the root can give speed up of more than 2:1.
Uri
Of course, nobody is arguing that the best algorithm is pure root splitting. It's a very important building block, though. And data on this phenomenon (which I actually don't have) would be very interesting.
Vas
This is on my to-do list when the current cluster testing is completed.Just clear the hash between root move searches.
And what I get would not be doable on a cluster because I would still have a fully shared hash.
Even better, try it both ways (with hash clearing & without). The difference will also be interesting.
Vas