parallel search (technical)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

parallel search (technical)

Post by bob »

I've been looking at one specific issue and thought I would solicit other opinions. Not the "your search sucks" kind of nonsense, so if that is your inclination, please don't participate. You know who you are. :)

Here is the issue:

I had started to work on an "auto-tune" facility for Crafty's parallel search. There are some tunable parameters, such as how close to the depth=0 nodes I will allow a split (the farther out you go, the more splits you do, but there is less work for each processor to do, so overhead goes up with respect to actual search work done). I had found a fairly reasonable number that has worked pretty well on every box I have tried, but I had also noticed recently that in some endgames this number was not very good, and I sometimes saw parallel search speed (NPS, not speedup) drop drastically in some pathological positions (without using any EGTBs either).

I had originally thought that I might do something like the "bench" command, so that one could start Crafty on some arbitrary machine, set the max threads to the appropriate value, then enter "smptune" and wait for a few minutes. Crafty would vary the tuning parameters and, once it found the best ones, would add them to the front of the .craftyrc/crafty.rc file.
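To make that concrete, the driver I have in mind would look roughly like this (just a sketch to show the shape of the loop; BenchSearch() and the option name are made up, this is not actual Crafty code):

Code:

/* Sketch of the "smptune" idea -- not actual Crafty code.  BenchSearch()
   is a hypothetical driver that runs the bench positions with the given
   mindepth and returns elapsed seconds.  Sweep the workable range, keep
   the fastest setting, and write it to the rc file (really it would go at
   the front of the file, not the end). */
#include <stdio.h>

extern double BenchSearch(int mindepth);

void SmpTune(void) {
  int best_mindepth = 4;
  double best_time = 1.0e30;
  int mindepth;

  for (mindepth = 4; mindepth <= 8; mindepth++) {
    double t = BenchSearch(mindepth);
    printf("mindepth=%d  time=%.2f\n", mindepth, t);
    if (t < best_time) {
      best_time = t;
      best_mindepth = mindepth;
    }
  }
  FILE *rc = fopen(".craftyrc", "a");
  if (rc) {
    fprintf(rc, "smp_min_depth=%d\n", best_mindepth);  /* made-up option name */
    fclose(rc);
  }
}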

However, as I was playing around with the code I needed some test positions, and I decided to first take two that give my parallel search the most problems: one middlegame position with lots of best-move changes, and one very narrow endgame position. I soon discovered that there was no "one size fits all" for these parameters, so I decided instead to try to do this "on the fly", so that the search tunes itself as it runs and adjusts the parameters automatically as the game heads toward an endgame, letting the endgame be handled more efficiently.

But I am not having a lot of luck in finding a way to do this that works across a range of machines.

Here are two simple examples. I am giving one line of output for each test run, applied to two positions. Position 1 is a middlegame position. This was run on a dual quad-core 2.33GHz Core 2 box.

Code:

mindepth = 4  nodes/split =  1566  time=22.45    nps=16.4M
mindepth = 4  nodes/split =  1684  time=56.77    nps=16.5M
mindepth = 4  nodes/split =  1883  time=41.44    nps=16.6M
mindepth = 4  nodes/split =  1948  time=37.42    nps=16.7M
mindepth = 5  nodes/split =  4247  time=59.01    nps=17.2M
mindepth = 5  nodes/split =  4283  time=51.26    nps=17.1M
mindepth = 5  nodes/split =  4267  time=1:12    nps=17.2M
mindepth = 5  nodes/split =  3816  time=1:11    nps=17.1M
mindepth = 6  nodes/split =  9374  time=33.51    nps=16.9M
mindepth = 6  nodes/split =  6376  time=49.37    nps=16.6M
mindepth = 6  nodes/split =  8151  time=32.65    nps=16.8M
mindepth = 6  nodes/split =  9280  time=1:07    nps=17.4M
'mindepth' is the value I am testing here. I had determined that 4, 5 and 6 are workable. Less than 4 and overhead goes up too high, more than 6 and processors start sitting idle too much.

The two metrics of interest, in order of importance, are "time" and then "nps".

This position is unstable, so I ran each different value of mindepth 4 times to get a rough idea of the average. If you look at the above, 4 looks best, 6 is next, and 5 is last. More runs might change that a bit. Next is the NPS, which for the above is pretty close across all the runs. Below 4 or above 6 the NPS starts to drop and the time starts to climb, so I am showing the "sweet spot".

The "nodes/split" is simply the number of splits (and splits is defined differently here than in past versions of Crafty, in an attempt to normalize this to the number of CPUs on the box being tested). Here, if I have 4 idle processors and split the tree at some point, I add 4 to splits, because I do the work of allocating 4 split blocks, copying the necessary data to those blocks, then later merging things back into the parent split block to continue the search. I had hoped that this would let me use a similar algorithm for 2 cpu boxes or 16 cpu boxes, but it doesn't appear to be so good.

Here are the same tests run on my office dual PIV/2.8ghz box:

Code:

mindepth = 4  nodes/split = 87246  time=45.62    nps=1.7M
mindepth = 4  nodes/split = 61456  time=39.42    nps=1.8M
mindepth = 4  nodes/split = 62085  time=28.39    nps=1.8M
mindepth = 4  nodes/split = 165044  time=1:16    nps=1.7M
mindepth = 5  nodes/split = 229514  time=1:17    nps=1.8M
mindepth = 5  nodes/split = 103200  time=46.45    nps=1.7M
mindepth = 5  nodes/split = 69432  time=25.39    nps=1.7M
mindepth = 5  nodes/split = 191910  time=1:29    nps=1.8M
mindepth = 6  nodes/split = 155973  time=45.22    nps=1.7M
mindepth = 6  nodes/split = 141028  time=46.59    nps=1.7M
mindepth = 6  nodes/split = 260110  time=1:12    nps=1.6M
mindepth = 6  nodes/split = 150657  time=45.34    nps=1.8M
The main point is that the "nodes per split" value is much higher on the dual than on the 8-way box. I had hoped to use that as a starting point and say "if nodes/split is too low, increase mindepth to do fewer splits; if it is too high, decrease mindepth to do more splits." But as you can see, there is no way to compare the two boxes, which means total nodes divided by total splits is not a useful metric.

Before I go on, here is the endgame position that is more trouble, first on the 8-core box:

Code:

mindepth = 10  nodes/split =   142  time=9.00    nps=6.5M
mindepth = 10  nodes/split =   209  time=27.99    nps=8.0M
mindepth = 10  nodes/split =   193  time=14.21    nps=7.6M
mindepth = 10  nodes/split =   186  time=24.61    nps=7.6M
mindepth = 14  nodes/split =  1360  time=6.69    nps=9.6M
mindepth = 14  nodes/split =  1315  time=5.88    nps=9.6M
mindepth = 14  nodes/split =  1265  time=5.22    nps=9.3M
mindepth = 14  nodes/split =  1719  time=10.18    nps=10.3M
mindepth = 18  nodes/split =  7458  time=6.44    nps=8.4M
mindepth = 18  nodes/split = 15357  time=20.36    nps=9.6M
mindepth = 18  nodes/split =  9637  time=9.65    nps=8.1M
mindepth = 18  nodes/split = 10594  time=8.24    nps=8.9M
mindepth = 22  nodes/split = 110309  time=18.56    nps=6.6M
mindepth = 22  nodes/split = 89241  time=11.09    nps=7.4M
mindepth = 22  nodes/split = 95615  time=15.97    nps=8.6M
mindepth = 22  nodes/split = 83646  time=16.57    nps=7.3M
Again, I am just showing the "sweet spot". mindepth=4 drops the NPS to 3.0M or so, which is really bad. The best value is around 14-18, with a pretty good NPS and a short search time. But notice that the nodes-per-split values are way higher than the optimal middlegame values...

On my dual PIV:

Code:

mindepth = 10  nodes/split = 15620  time=9.96    nps=2.8M
mindepth = 10  nodes/split = 14749  time=9.50    nps=2.7M
mindepth = 10  nodes/split = 17587  time=12.74    nps=2.8M
mindepth = 10  nodes/split = 20643  time=12.85    nps=2.9M
mindepth = 14  nodes/split = 67004  time=17.72    nps=2.9M
mindepth = 14  nodes/split = 36086  time=11.20    nps=2.7M
mindepth = 14  nodes/split = 29248  time=9.13    nps=2.8M
mindepth = 14  nodes/split = 33755  time=10.47    nps=2.7M
mindepth = 18  nodes/split = 223892  time=17.36    nps=2.7M
mindepth = 18  nodes/split = 120295  time=11.48    nps=2.6M
mindepth = 18  nodes/split = 249113  time=25.99    nps=2.7M
mindepth = 18  nodes/split = 117509  time=16.36    nps=2.5M
mindepth = 22  nodes/split = 440457  time=18.59    nps=2.1M
mindepth = 22  nodes/split = 808569  time=30.93    nps=2.5M
mindepth = 22  nodes/split = 737384  time=30.79    nps=2.4M
mindepth = 22  nodes/split = 502934  time=18.88    nps=2.3M
Again the nodes per split are way higher than for the 8-core box, and the "sweet spot" is similar.

I am trying to find some "metric" that I can use across both these machines (plus a couple of quads I have as well) so that this can "self-tune" as the search runs.

But so far I have not found a good "metric" that can be normalized across processors so that I can "bracket" it in a window and adjust mindepth whenever the metric goes outside that window. There is another parameter or two that I also want to adjust, but the effect is somewhat similar and I have omitted that data to keep this simple: I can limit how many processors split at any one point. I have been using 4, because that works well on most machines and is particularly effective on quad-core boxes, since all cores may share some cache or, in the case of AMD, local memory that is faster to access than remote memory.

The question is, what sort of measure might work across different CPU speeds and different numbers of CPUs, in a uniform way? I had hoped that nodes per split would work, but that looks almost useless. I had toyed with normalizing this to 1M nodes per second, so that if you go 10x faster you can get away with 1/10th the nodes per split, but I could not come up with any rational explanation for why that looks decent with the current data, so I have no confidence that it would work for all machines.
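The controller itself is trivial; the metric is the part I can't pin down. Something like this is the shape of it (a rough sketch with made-up names, not actual Crafty code):

Code:

/* Sketch of the "bracket" idea: between iterations, compute some normalized
   metric and nudge mindepth one step if it falls outside a target window.
   Which metric and which window are the open questions; names are made up. */
void AdjustMinDepth(double metric, double low, double high,
                    int *mindepth, int dmin, int dmax) {
  if (metric < low && *mindepth < dmax)
    (*mindepth)++;      /* too many cheap splits -> split farther from tips */
  else if (metric > high && *mindepth > dmin)
    (*mindepth)--;      /* too few splits, processors idle -> split more    */
}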

Any ideas???

One reason for not splitting very close to the tips is that, in addition to doing way more splits and incurring that overhead (however small it may be), I would also incur the overhead of worse move ordering, since positions right at the tips likely have no hash entries to help with move ordering, and things are less stable near the leaves anyway; as move ordering goes bad, parallel search overhead goes up... That is one reason I split at the root after the parallel search has slowly worked its way back to the root and reported a real root score: at the root there is _zero_ overhead to speak of, and all we lose out on is a few hash hits here and there...

One idea I have been looking at is to simply try to find the lowest setting that produces a good NPS value, as that would sort of do the trick. But it is not that easy to detect that between iterations without stepping "over" the sweet spot, something I don't want to do.

However, perhaps that might well be the best answer for this.
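Between iterations it would look something like this, one cautious step at a time so I never end up more than one notch past the sweet spot (again a rough sketch with made-up names, not actual Crafty code):

Code:

/* Sketch of "find the lowest mindepth that still gives good NPS": after
   each iteration, try one step lower; if NPS then drops more than a few
   percent below the best seen so far, step back and stop probing.
   Made-up names, not real Crafty code. */
void TuneOnNps(double nps_this_iteration, int *mindepth, int dmin) {
  static double best_nps = 0.0;
  static int locked = 0;

  if (nps_this_iteration > best_nps)
    best_nps = nps_this_iteration;
  if (locked)
    return;
  if (nps_this_iteration < 0.95 * best_nps) {
    (*mindepth)++;      /* last step hurt the NPS: back off and lock in */
    locked = 1;
  } else if (*mindepth > dmin) {
    (*mindepth)--;      /* NPS still good: probe one step lower */
  }
}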
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: parallel search (technical)

Post by diep »

Hi Bob,

This is actually not a bad idea seen from a generic viewpoint. You'll need a lot more statistics first. One important statistic is a graph of how many splits you get per second against the total search time.

The real problem in Diep on a supercomputer, which is probably comparable to your problem on a few cores, is that in the first few seconds of search a HUGE number of splits occur.

So I worried back in 2002/2003 about how to solve this problem.

Get a machine where you feel you have done a good job with the splitting depthleft.
Then make a graph of splits per second against search time.
Then see whether automatically applying that graph on other hardware benefits Crafty.

Note that Diep also has different minimum splitdepths, ranging from 2 (default) to 10 or so.

But I'm doing it totally differently than you.

Vincent
krazyken

Re: parallel search (technical)

Post by krazyken »

The first thing I notice is that there appears to be high variation in nodes, and it seems that splits are tightly correlated with nodes, as well as with time (of course more data would be useful to verify), so I don't think nodes per split will give you more info than you get from NPS.

I don't believe you stated the conditions of your test. Was it fixed depth, or something else?

I would suspect the value you would want to minimize is time to solution.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search (technical)

Post by bob »

diep wrote:Hi Bob,

This is actually not a bad idea seen from a generic viewpoint. You'll need a lot more statistics first. One important statistic is a graph of how many splits you get per second against the total search time.

The real problem in Diep on a supercomputer, which is probably comparable to your problem on a few cores, is that in the first few seconds of search a HUGE number of splits occur.
I don't have this problem at all, which is a problem in itself, as I can't really tell much about the search overhead for the first 10 plies or so, and once I get beyond that point, it is difficult to adjust things quickly enough. I have something (now) that works if we start at the beginning of the game, because as things slow down in endings, it adjusts upward just fine. But if I start off in an endgame, it is more problematic because it could be that the search controls need to go up or down, and once I get into "real searches" I don't have a lot of time to "hunt around" without killing things while I do so...



So I worried back in 2002/2003 about how to solve this problem.

Get a machine where you feel you have done a good job with the splitting depthleft.
Then make a graph of splits per second against search time.
Then see whether automatically applying that graph on other hardware benefits Crafty.
Already done that, and the answer is "no". Which is a problem in that I want this to work across any platform without requiring platform specific information for tuning.

Note that Diep also has different minimum splitdepths, ranging from 2 (default) to 10 or so.

But I'm doing it totally differently than you.

Vincent
Mine vary from 4 to even 20+ in positions like fine70...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search (technical)

Post by bob »

krazyken wrote:The first thing I notice is that there appears to be high variation in nodes, and it seems that splits are tightly correlated with nodes, as well as with time (of course more data would be useful to verify), so I don't think nodes per split will give you more info than you get from NPS.

I don't believe you stated the conditions of your test. Was it fixed depth, or something else?

I would suspect the value you would want to minimize is time to solution.
First, yes these were fixed depth runs.

Time to solution is easy to optimize by hand, but I am trying to make this happen as a game progresses so that the search tunes itself during the game as we slowly transition from bushy shallow trees to narrow deep trees. And in a game I have no way to measure "time to solution". :)
krazyken

Re: parallel search (technical)

Post by krazyken »

bob wrote:
krazyken wrote:The first thing I notice is that there appears to be high variation in nodes, and it seems that splits are tightly correlated with nodes, as well as with time (of course more data would be useful to verify), so I don't think nodes per split will give you more info than you get from NPS.

I don't believe you stated the conditions of your test. Was it fixed depth, or something else?

I would suspect the value you would want to minimize is time to solution.
First, yes these were fixed depth runs.

Time to solution is easy to optimize by hand, but I am trying to make this happen as a game progresses so that the search tunes itself during the game as we slowly transition from bushy shallow trees to narrow deep trees. And in a game I have no way to measure "time to solution". :)
I didn't say it had to be the best solution; when playing the game you do come up with a "solution". :) In this case it might be time to depth that you want to minimize. Interesting that the average time to depth for your PIV is nearly the same as for your dual-quad. Maybe you aren't going deep enough.
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: parallel search (technical)

Post by jwes »

Could you use the data from earlier iterations to calculate the numbers for later iterations for each search? That would automatically tune for different computers and positions.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search (technical)

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:The first thing I notice is that there appears to be high variation in nodes, and it seems that splits are tightly correlated with nodes, as well as with time (of course more data would be useful to verify), so I don't think nodes per split will give you more info than you get from NPS.

I don't believe you stated the conditions of your test. Was it fixed depth, or something else?

I would suspect the value you would want to minimize is time to solution.
First, yes these were fixed depth runs.

Time to solution is easy to optimize by hand, but I am trying to make this happen as a game progresses so that the search tunes itself during the game as we slowly transition from bushy shallow trees to narrow deep trees. And in a game I have no way to measure "time to solution". :)
I didn't say it had to be the best solution; when playing the game you do come up with a "solution". :) In this case it might be time to depth that you want to minimize. Interesting that the average time to depth for your PIV is nearly the same as for your dual-quad. Maybe you aren't going deep enough.
I don't see how I could do that unless I re-search the same position and adjust the search controls each time. Different positions have significantly different "time to depth". And what I need to do is adjust the search _between_ iterations, where I have no idea what the final depth would be, nor whether what I actually reach is optimal or not...

Tuning on-the-fly while a game is in progress is the goal, and it is an interesting one. I have something that is beginning to work; the only thing I am still tweaking is that I need to occasionally "change something" just to see whether the change is better or worse, so that I can tell whether adjusting things would make the search run faster.
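Roughly, the shape of it is: every few iterations, bump mindepth one notch, compare that iteration's NPS against the previous one, and undo the change if it made things worse (a sketch with made-up names and a made-up probe period, not actual Crafty code):

Code:

/* Sketch of the "occasionally change something" idea: every Nth iteration,
   perturb mindepth by one step, then revert at the next iteration if the
   NPS got worse.  The probe period, the +/- choice, and the names are all
   made up for illustration. */
#include <stdlib.h>

void ProbeMinDepth(int iteration, double nps, int *mindepth,
                   int dmin, int dmax) {
  static double prev_nps = 0.0;
  static int last_delta = 0;

  if (last_delta && nps < prev_nps) {
    *mindepth -= last_delta;          /* last probe made the NPS worse */
    last_delta = 0;
  } else if (iteration % 4 == 0) {    /* probe every 4th iteration */
    int delta = (rand() & 1) ? 1 : -1;
    if (*mindepth + delta >= dmin && *mindepth + delta <= dmax) {
      *mindepth += delta;
      last_delta = delta;
    }
  } else {
    last_delta = 0;                   /* last probe helped: keep it */
  }
  prev_nps = nps;
}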
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search (technical)

Post by bob »

jwes wrote:Could you use the data from earlier iterations to calculate the numbers for later iterations for each search? That would automatically tune for different computers and positions.
That is what I am trying to do. But it is not so easy to decide, between iterations, whether to change something, and then to determine whether the change is good or bad. NPS is not the perfect evaluation number. If you look at the data I posted, NPS might be higher but search speed (time to some depth) could be worse... And during the search I can't measure time to depth since I can't take the time to re-search the same position several times...
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: parallel search (technical)

Post by diep »

bob wrote:
diep wrote:Hi Bob,

This is actually not a bad idea seen from a generic viewpoint. You'll need a lot more statistics first. One important statistic is a graph of how many splits you get per second against the total search time.

The real problem in Diep on a supercomputer, which is probably comparable to your problem on a few cores, is that in the first few seconds of search a HUGE number of splits occur.
I don't have this problem at all, which is a problem in itself, as I can't really tell much about the search overhead for the first 10 plies or so, and once I get beyond that point, it is difficult to adjust things quickly enough. I have something (now) that works if we start at the beginning of the game, because as things slow down in endings, it adjusts upward just fine. But if I start off in an endgame, it is more problematic because it could be that the search controls need to go up or down, and once I get into "real searches" I don't have a lot of time to "hunt around" without killing things while I do so...



So I worried back in 2002/2003 about how to solve this problem.

Get a machine where you feel you have done a good job with the splitting depthleft.
Then make a graph of splits per second against search time.
Then see whether automatically applying that graph on other hardware benefits Crafty.
Already done that, and the answer is "no". Which is a problem in that I want this to work across any platform without requiring platform specific information for tuning.

Note that Diep also has different minimum splitdepths, ranging from 2 (default) to 10 or so.

But I'm doing it totally differently than you.

Vincent
Mine vary from 4 to even 20+ in positions like fine70...
Basically you make two statements here that are relevant:

a) I assume we both want to look beyond ICGA limits and scale to more than a handful of cores, which benefits the Skulltrail machine, a mainboard designed to hold a lot of video cards for GPGPU number crunching; that is the funny thing. You clearly mention that you want it to work on different hardware, and I read that as: more cores.

b) You claim you don't suffer from a split problem.
Diep is different from Crafty; it scales better. I have put a lot of effort into that. The first problem to solve is to scale well. When searching on a big multiprocessor (NUMA) machine, say 64 cores or so (a Beckton quad-socket box, for example), Diep scales with the hardware. So limiting the total number of splits per second is less relevant for Diep than it is for Crafty, I'd assume. Since Diep scales that well, it is enough to have a model that just takes local core considerations into account.

Vincent