parallel search (technical)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search (technical)

Post by bob »

diep wrote:
bob wrote:
diep wrote:Hi Bob,

This is actually not a bad idea seen from a generic viewpoint. You'll need a lot more statistics first, though. One important statistic is a graph of how many splits you get per second against the total search time.

The real problem in Diep on a supercomputer, which is probably comparable to your problem on a few cores, is that in the first few seconds of search a HUGE number of splits occur.
I don't have this problem at all, which is a problem in itself, as I can't really tell much about the search overhead for the first 10 plies or so, and once I get beyond that point it is difficult to adjust things quickly enough. I have something (now) that works if we start at the beginning of the game, because as things slow down in endings it adjusts upward just fine. But if I start off in an endgame it is more problematic, because the search controls might need to go up or down, and once I get into "real searches" I don't have a lot of time to "hunt around" without killing things while I do so...



So back in 2002/2003 I worried about how to solve this problem.

Get a machine where you feel you did a good job tuning the splitting depthleft.
Then make a graph of splits per second against search time.
Then see whether automatically applying that graph on other hardware benefits Crafty.
Already done that, and the answer is "no". That is a problem, in that I want this to work across any platform without requiring platform-specific information for tuning.

Note that Diep also has different minimum split depths, ranging from 2 (the default) to 10 or so.

But I'm doing it totally differently than you.

Vincent
Mine vary from 4 to 20+ in positions like Fine #70...
Basically you make two statements here that are relevant:

a) I assume we both want to look beyond ICGA limits and scale to more than a handful of cores, which benefits the Skulltrail machine (funnily enough, a mainboard designed to hold a lot of video cards for GPGPU number crunching). You clearly mention that you want it to work on different hardware; I read that as: more cores.
Think "broader". I am talking both about the number of processors (cores) but also about the speed of them. On slow processors I can split nearer to the tips, while on faster processors I back off of that a bit to improve efficiency. So it is about both, and I have yet to find a "one size fits all" scheme to tune this.

I had started working on a new feature, somewhat like "bench", which would be used to tune the SMP search to a specific platform and then save the tuning parameters in the .craftyrc file, so that the search would always be optimized. But I discovered it was not that easy, because ideally the parameters would vary as the characteristics of the tree change from bushy/shallow to narrow/deep. And doing that automatically is a bit of a problem... I want the search to get better as it tunes, not get much worse before it figures out it is making the wrong adjustment and backs up...
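A minimal sketch of that bench-like tuner (the .craftyrc name is from the text above; timed_bench_run() and the "smpmin" option name are assumptions, not Crafty's actual interface):

#include <stdio.h>

extern double timed_bench_run(int min_split_depth);  /* time-to-depth in
                                                        seconds, assumed */

void tune_and_save(void) {
  int best_depth = 4;
  double best_time = 1e30;
  for (int d = 2; d <= 12; d++) {       /* try each candidate setting */
    double t = timed_bench_run(d);
    if (t < best_time) { best_time = t; best_depth = d; }
  }
  FILE *f = fopen(".craftyrc", "a");    /* persist the winner */
  if (f) {
    fprintf(f, "smpmin %d\n", best_depth);
    fclose(f);
  }
}

The weakness described above is exactly that this measures one tree shape at bench time and applies it to all of them.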


b) You claim you don't suffer from a split problem.
I claim I don't suffer from a split problem at very shallow searches, which I don't. You can easily verify that for yourself if you want. My problem happens as the search goes deeper, and for any tree there is a perfect "spot": for optimal performance you split above the spot, but never below it. However, the "spot" is a shifty little SOB, and I am trying to determine where it is, on the fly, for any hardware speed, number of cores, and time-per-move limit. It seems somewhat akin to wanting to know both the speed of an electron and its physical location, if you are familiar with the concept...
Diep is different from Crafty; it scales better. I have put a lot of effort into that. The first problem to solve is to scale well. When searching on a big multiprocessor (NUMA) machine, say 64 cores (a quad-socket Beckton box, for example), Diep scales with the hardware. So limiting the total number of splits per second is less relevant than it is for Crafty, I'd assume. Since Diep scales that well, it is enough to have a model that takes only local core considerations into account.

Vincent
I have run extensively on up to 32 cores, with good success. I ran some on 64 cores, but it was an Itanium box (a first-generation Itanium at that), and it didn't do so well. Even there the scaling was > 50% of optimal, so it wasn't horrible; the architecture was just bad (and this was prior to the lockless-hashing days, so the locks on the hash table back then were probably murdering it).

But I am not nearly as concerned about the "big end" as I am about simply running optimally on realistic hardware... and that's what I am currently trying to solve so that it works for all platforms, not just the ones I have tested and tuned on...

Once or twice a year I get into a long discussion with someone who tries to run Crafty on something more exotic than usual. The last case was an interesting multi-CPU, multi-core Opteron box. It took some testing and tweaking to get the numbers up to where they belonged, and the final settings would not run worth a flip on my dual Xeon and were worse than the defaults on the 8-core Xeon cluster here.

I want to make that problem go away if possible.
Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: parallel search (technical)

Post by Allard Siemelink »

Perhaps the solution is to speed up the split operation?
In Bright, splitting is cheap and I always allow splits, even at the leaf nodes (though not in qsearch).
On my dual core, I get ~40 nodes/split for Fine #70.
Yet increasing the minimum split depth from 1 to something higher only slows down the search...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search (technical)

Post by bob »

Allard Siemelink wrote:Perhaps the solution is to speed up the split operation?
In Bright, splitting is cheap and I always allow splits, even at the leaf nodes (though not in qsearch).
On my dual core, I get ~40 nodes/split for Fine #70.
Yet increasing the minimum split depth from 1 to something higher only slows down the search...
I suppose it depends on how fast you are searching nodes vs. how fast you can split. I have a reasonable amount of data to copy, and on 8 cores things get fairly hectic splitting near the tips. I know my program can't justify a split for just 40 nodes; I can search 40 nodes far quicker than I can invoke a parallel search...
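The trade-off reduces to a back-of-the-envelope break-even count (names invented, not Crafty's internals): a split only pays off if the subtree to be shared is larger than the number of nodes the owner could have searched in the time the split itself costs.

#include <stdint.h>

/* e.g. 2M nodes/sec and a 20-microsecond split give a 40-node floor:
   anything smaller is finished faster serially. */
static inline uint64_t min_nodes_to_split(double nps,
                                          double split_cost_sec) {
  return (uint64_t)(nps * split_cost_sec);
}

On that reading, Bright's ~40 nodes/split is only sensible because its split operation is so light.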
krazyken

Re: parallel search (technical)

Post by krazyken »

bob wrote:I don't see how I could do that unless I re-search the same position and adjust the search controls each time. Different positions have significantly different "time to depth". And what I need to do is adjust the search _between_ iterations, where I have no idea what the final depth would be, nor whether what I actually reach is optimal or not...

Tuning on-the-fly while a game is in progress is the goal, and it is an interesting one. I have something that is beginning to work; the only thing I am tweaking is that I need to occasionally "change something" just to see if the change is better or worse, so that I can tell whether I need to make a change to make things run faster.
A general theory that may be practical:

If positions can be classified into groups, then you can do a combo of "on the fly" and your bench-like command. You run the bench command first to set baselines for the hardware, testing several positions from each group, storing in a settings table what the numbers look like for each set of settings, and marking which setting is optimal. Then in the game you start with a certain set of settings, and you have history data you can look up in the table to compare against the results you are getting with those settings, to find out whether they are close to optimal.

I suppose the hardest part of this might be finding out whether a position is in a particular group. The definition I am using is that a group is a set of positions that have the same "result statistics" when run with the same settings. There would need to be a lot of stat collection from lots of positions to see whether results are clumpy enough to form groups. Your nodes/split does seem clumpy from the little data shown so far.
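A sketch of that table with invented names (the grouping key is left as a single statistic; bob settles on the EBF for this below):

typedef struct {
  double stat_low, stat_high;     /* which positions fall in this group */
  int    min_split_depth;         /* the stored optimum for the group   */
  int    max_split_threads;
  double baseline_time_to_depth;  /* from the bench run                 */
} group_settings;

#define NGROUPS 8
group_settings table[NGROUPS];

/* During the game: classify the position, then compare the observed
   time-to-depth against the stored baseline for its group. */
group_settings *classify(double stat) {
  for (int i = 0; i < NGROUPS; i++)
    if (stat >= table[i].stat_low && stat < table[i].stat_high)
      return &table[i];
  return &table[NGROUPS - 1];
}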
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search (technical)

Post by bob »

krazyken wrote:
bob wrote:I don't see how I could do that unless I re-search the same position and adjust the search controls each time. Different positions have significantly different "time to depth". And what I need to do is adjust the search _between_ iterations, where I have no idea what the final depth would be, nor whether what I actually reach is optimal or not...

Tuning on-the-fly while a game is in progress is the goal, and it is an interesting one. I have something that is beginning to work; the only thing I am tweaking is that I need to occasionally "change something" just to see if the change is better or worse, so that I can tell whether I need to make a change to make things run faster.
A general theory that may be practical:

If positions can be classified into groups, then you can do a combo of "on the fly" and your bench-like command. You run the bench command first to set baselines for the hardware, testing several positions from each group, storing in a settings table what the numbers look like for each set of settings, and marking which setting is optimal. Then in the game you start with a certain set of settings, and you have history data you can look up in the table to compare against the results you are getting with those settings, to find out whether they are close to optimal.

I suppose the hardest part of this might be finding out whether a position is in a particular group. The definition I am using is that a group is a set of positions that have the same "result statistics" when run with the same settings. There would need to be a lot of stat collection from lots of positions to see whether results are clumpy enough to form groups. Your nodes/split does seem clumpy from the little data shown so far.
It is quite difficult to define the "groups", however. But I am playing with something dealing with the effective branching factor of the tree, which is roughly nodes searched divided by search depth. This is actually pretty close to the real performance issue, since the narrower the tree, the farther from the root I want to split... Going to work on it some more tonight when I get home...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search (technical)

Post by bob »

bob wrote:
krazyken wrote:
bob wrote:I don't see how I could do that unless I re-search the same position and adjust the search controls each time. Different positions have significantly different "time to depth". And what I need to do is adjust the search _between_ iterations, where I have no idea what the final depth would be, nor whether what I actually reach is optimal or not...

Tuning on-the-fly while a game is in progress is the goal, and it is an interesting one. I have something that is beginning to work; the only thing I am tweaking is that I need to occasionally "change something" just to see if the change is better or worse, so that I can tell whether I need to make a change to make things run faster.
A general theory that may be practical:

If positions can be classified into groups, then you can do a combo of "on the fly" and your bench-like command. You run the bench command first to set baselines for the hardware, testing several positions from each group, storing in a settings table what the numbers look like for each set of settings, and marking which setting is optimal. Then in the game you start with a certain set of settings, and you have history data you can look up in the table to compare against the results you are getting with those settings, to find out whether they are close to optimal.

I suppose the hardest part of this might be finding out whether a position is in a particular group. The definition I am using is that a group is a set of positions that have the same "result statistics" when run with the same settings. There would need to be a lot of stat collection from lots of positions to see whether results are clumpy enough to form groups. Your nodes/split does seem clumpy from the little data shown so far.
It is quite difficult to define the "groups", however. But I am playing with something dealing with the effective branching factor of the tree, which is roughly nodes searched divided by search depth. This is actually pretty close to the real performance issue, since the narrower the tree, the farther from the root I want to split... Going to work on it some more tonight when I get home...
Found the beginning of a useful piece of data. In the usual alpha/beta formula,

nodes = 2 * W ^ (D/2)

Once I finish a search, I know nodes and D. So I am computing W (the effective branching factor) as

W = (nodes / 2) ^ (2 / D)
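In C this is a one-liner (nothing assumed beyond pow() from the standard library):

#include <math.h>

/* W = (nodes / 2) ^ (2 / D) */
double effective_width(double nodes, double depth) {
  return pow(nodes / 2.0, 2.0 / depth);
}

For example, a 20-ply search that visited 100M nodes gives W = (5e7)^(1/10), or roughly 5.9.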

W gives me an idea of whether the tree is MG-type bushy/shallow or EG-type narrow/deep.

I'm having Crafty compute this number for every search done, and I am going to see whether, if I manually tune the parallel stuff based on this number for a given position, I can get close to the optimal values...

So far the idea shows promise; whether it will actually work out or not is unknown. More later this weekend.

Here are three sample positions. The first is a middlegame, the second is a fairly deep endgame, and the final one is Fine #70. The EBF for each, in order, is 9.10, 3.28, and 1.84...

Note that the numbers don't make any real sense in terms of absolute value, because the nodes searched include a ton of q-search nodes that the alpha/beta formula does not count, not to mention search extensions and repeated searches across multiple iterations. So the absolute numbers are pretty meaningless. But comparing them gives me a pretty good indication (so far) of how bushy/narrow the tree is, which is the trigger for controlling how I split the tree. So maybe some progress here. As it stands, I might simply create a vector of length 10, limit the EBF to lie in the interval (0, 9), and use it to index the tuning parameters. But I need a good bit of data to make sure there are not some ugly exceptions that will break this... although the current fixed approach certainly breaks on a few positions...
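A sketch of that vector idea (smp_params and its fields are placeholders, not Crafty's actual knobs):

typedef struct {
  int min_split_depth;
  int max_threads_per_split;
} smp_params;

smp_params tuning[10];  /* filled by hand, or later by a bench-like command */

/* Clamp the computed EBF into [0, 9] and use the integer part
   to select a row of tuning parameters. */
smp_params select_params(double ebf) {
  int i = (int) ebf;
  if (i < 0) i = 0;
  if (i > 9) i = 9;
  return tuning[i];
}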
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: parallel search (technical)

Post by Gian-Carlo Pascutto »

Can't you achieve what you want by requiring that the first node searched (which is always non-split) take at least x nodes before allowing a split?
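One way that suggestion could look (all names invented; a depth-proportional threshold is shown because the reply below finds a fixed x insensitive to depth):

#include <stdint.h>

static inline int split_allowed(uint64_t first_move_nodes,
                                int iteration_depth) {
  /* A fixed threshold (e.g. 2000) is insensitive to depth, so scale
     it with the current iteration instead; the 200 is invented. */
  uint64_t threshold = 200ULL * (uint64_t) iteration_depth;
  return first_move_nodes >= threshold;
}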
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search (technical)

Post by bob »

Gian-Carlo Pascutto wrote:Can't you achieve what you want by requiring that the first node searched (which is always non-split) take at least x nodes before allowing a split?
I tried exactly that, among other node-count-based things, but it didn't work out well and seemed sort of "dead" with respect to the effect that changing N (the number of nodes) had. There are other things to tune as well, which makes this "EBF approximation" look a bit better, because as the tree gets narrower I want to allow fewer processors to split at any one point, while when the tree is bushy I can allow more... However, that might work in concert with the current EBF stuff to provide more information for the auto-tune code...

One other issue I had forgotten about until I looked at the results from testing the above is that a raw node count is insensitive to depth. What is optimal for searches to depth=20 is not so optimal for depth=14 bullet games. The previous approach was proportional to the depth of the current iteration, but even that was not good enough when the trees got very narrow...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search (technical)

Post by bob »

Just to finish this up, I have found what I believe is a workable solution for most cases. The NUMA case is still going to be a hand-tuned deal once we really have 8 cores per node and beyond, because that will leave me with 1, 2, 4, 8 and N cores per node. I've not found a good "on-the-fly" way to tune this yet. I am probably going to resort to a command like "bench" for NUMA boxes that tries various tuning options and measures performance to find the best settings.

I've run a lot of tests and am not seeing the significant NPS drop in certain kinds of endgames that I was seeing before. I'm working with someone at Intel on a 16-core box (4 chips) that ought to prove whether this works better or not. More on that a little later...