Here is the issue:
I had started to work on an "auto-tune" facility for Crafty's parallel search. There are some tunable parameters, such as how close to the depth=0 nodes I will allow a split (the farther out you go, the more splits you do, but there is less work for each processor to do, so overhead goes up relative to actual search work done). I had found a fairly reasonable value that has worked pretty well on every box I have tried. But I had also noticed recently that in some endgames this value was not very good, and I sometimes saw parallel search speed (nps, not speedup) drop drastically in some pathological positions (and without using any EGTBs either).
I had originally thought that I might do something like the "bench" command, so that one could start crafty on some arbitrary machine, set the max threads to the appropriate value, then enter "smptune" and wait a few minutes. Crafty would vary the tuning parameters, and once it found the best ones, would add them to the front of the .craftyrc/crafty.rc file.
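To make the idea concrete, here is a minimal sketch of what such a loop might look like. This is illustrative only: RunBenchPosition() and AppendToRcFile() are hypothetical names, not actual Crafty internals, and the parameter range is made up.
Code:
/* Hypothetical sketch of an "smptune" loop.  RunBenchPosition() and
   AppendToRcFile() are illustrative names, not actual Crafty functions. */
double RunBenchPosition(int mindepth);       /* search a fixed position,
                                                return elapsed seconds  */
void AppendToRcFile(const char *key, int value);

void SmpTune(void) {
  int best_mindepth = 0;
  double best_time = 1.0e30;
  int mindepth, run;
  for (mindepth = 2; mindepth <= 8; mindepth++) {
    double total = 0.0;
    for (run = 0; run < 4; run++)            /* average several runs,   */
      total += RunBenchPosition(mindepth);   /* SMP timings are noisy   */
    if (total / 4.0 < best_time) {
      best_time = total / 4.0;
      best_mindepth = mindepth;
    }
  }
  AppendToRcFile("smpmin", best_mindepth);   /* record in crafty.rc     */
}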
However, as I was playing around with the code, I needed some test positions, so I first took the two that give my parallel search the most problems: one with lots of best-move changes, and a very narrow endgame position. I soon discovered that there was no "one size fits all" value for these parameters, so I decided instead to try to do this "on the fly": as the search progresses toward an endgame, it automatically adjusts the parameters so that the endgame is handled more efficiently.
But I am not having a lot of luck in finding a way to do this that works across a range of machines.
Here are two simple examples. I am giving one line of output for each test run against two positions. Position 1 is a middlegame position. This was run on a dual quad-core 2.33GHz Core 2 box (8 cores total).
Code:
mindepth = 4 nodes/split = 1566 time=22.45 nps=16.4M
mindepth = 4 nodes/split = 1684 time=56.77 nps=16.5M
mindepth = 4 nodes/split = 1883 time=41.44 nps=16.6M
mindepth = 4 nodes/split = 1948 time=37.42 nps=16.7M
mindepth = 5 nodes/split = 4247 time=59.01 nps=17.2M
mindepth = 5 nodes/split = 4283 time=51.26 nps=17.1M
mindepth = 5 nodes/split = 4267 time=1:12 nps=17.2M
mindepth = 5 nodes/split = 3816 time=1:11 nps=17.1M
mindepth = 6 nodes/split = 9374 time=33.51 nps=16.9M
mindepth = 6 nodes/split = 6376 time=49.37 nps=16.6M
mindepth = 6 nodes/split = 8151 time=32.65 nps=16.8M
mindepth = 6 nodes/split = 9280 time=1:07 nps=17.4M
The two metrics of interest, in order of importance, are "time" and then "nps".
This position is unstable, so I ran each value of mindepth 4 times to get a rough idea of the average. If you look at the above, 4 looks best, 6 is next, and 5 is last; more runs might change that a bit. Next is the NPS, which for the above is pretty uniform. Below 4 or above 6, the NPS starts to drop and the time starts to climb, so I am showing the "sweet spot".
The "nodes/split" is simply the number of splits (and splits is defined differently here than in past versions of Crafty, in an attempt to normalize this to the number of CPUs on the box being tested). Here, if I have 4 idle processors and split the tree at some point, I add 4 to splits, because I do the work of allocating 4 split blocks, copying the necessary data to those blocks, then later merging things back into the parent split block to continue the search. I had hoped that this would let me use a similar algorithm for 2 cpu boxes or 16 cpu boxes, but it doesn't appear to be so good.
Here are the same tests run on my office dual PIV/2.8GHz box:
Code:
mindepth = 4 nodes/split = 87246 time=45.62 nps=1.7M
mindepth = 4 nodes/split = 61456 time=39.42 nps=1.8M
mindepth = 4 nodes/split = 62085 time=28.39 nps=1.8M
mindepth = 4 nodes/split = 165044 time=1:16 nps=1.7M
mindepth = 5 nodes/split = 229514 time=1:17 nps=1.8M
mindepth = 5 nodes/split = 103200 time=46.45 nps=1.7M
mindepth = 5 nodes/split = 69432 time=25.39 nps=1.7M
mindepth = 5 nodes/split = 191910 time=1:29 nps=1.8M
mindepth = 6 nodes/split = 155973 time=45.22 nps=1.7M
mindepth = 6 nodes/split = 141028 time=46.59 nps=1.7M
mindepth = 6 nodes/split = 260110 time=1:12 nps=1.6M
mindepth = 6 nodes/split = 150657 time=45.34 nps=1.8M
Before I go on, here is the endgame position that causes more trouble, first on the 8-core box:
Code:
mindepth = 10 nodes/split = 142 time=9.00 nps=6.5M
mindepth = 10 nodes/split = 209 time=27.99 nps=8.0M
mindepth = 10 nodes/split = 193 time=14.21 nps=7.6M
mindepth = 10 nodes/split = 186 time=24.61 nps=7.6M
mindepth = 14 nodes/split = 1360 time=6.69 nps=9.6M
mindepth = 14 nodes/split = 1315 time=5.88 nps=9.6M
mindepth = 14 nodes/split = 1265 time=5.22 nps=9.3M
mindepth = 14 nodes/split = 1719 time=10.18 nps=10.3M
mindepth = 18 nodes/split = 7458 time=6.44 nps=8.4M
mindepth = 18 nodes/split = 15357 time=20.36 nps=9.6M
mindepth = 18 nodes/split = 9637 time=9.65 nps=8.1M
mindepth = 18 nodes/split = 10594 time=8.24 nps=8.9M
mindepth = 22 nodes/split = 110309 time=18.56 nps=6.6M
mindepth = 22 nodes/split = 89241 time=11.09 nps=7.4M
mindepth = 22 nodes/split = 95615 time=15.97 nps=8.6M
mindepth = 22 nodes/split = 83646 time=16.57 nps=7.3M
On my dual PIV:
Code:
mindepth = 10 nodes/split = 15620 time=9.96 nps=2.8M
mindepth = 10 nodes/split = 14749 time=9.50 nps=2.7M
mindepth = 10 nodes/split = 17587 time=12.74 nps=2.8M
mindepth = 10 nodes/split = 20643 time=12.85 nps=2.9M
mindepth = 14 nodes/split = 67004 time=17.72 nps=2.9M
mindepth = 14 nodes/split = 36086 time=11.20 nps=2.7M
mindepth = 14 nodes/split = 29248 time=9.13 nps=2.8M
mindepth = 14 nodes/split = 33755 time=10.47 nps=2.7M
mindepth = 18 nodes/split = 223892 time=17.36 nps=2.7M
mindepth = 18 nodes/split = 120295 time=11.48 nps=2.6M
mindepth = 18 nodes/split = 249113 time=25.99 nps=2.7M
mindepth = 18 nodes/split = 117509 time=16.36 nps=2.5M
mindepth = 22 nodes/split = 440457 time=18.59 nps=2.1M
mindepth = 22 nodes/split = 808569 time=30.93 nps=2.5M
mindepth = 22 nodes/split = 737384 time=30.79 nps=2.4M
mindepth = 22 nodes/split = 502934 time=18.88 nps=2.3M
I am trying to find some "metric" that I can use across both these machines (plus a couple of quads I have as well) so that this can "self-tune" as the search runs.
But so far I have not found a good "metric" that can be normalized across processors so that I can "bracket" it with a window and, whenever the metric goes outside that window, adjust mindepth accordingly. There are another parameter or two that I also want to adjust, but the effect is somewhat similar, so I have omitted that data to keep this simple. For example, I can limit how many processors split at any one point. I have been using 4, because that works well on most machines and is particularly effective on quad-core boxes, since all cores may share some cache, or, in the case of AMD, local memory that is faster to access than remote memory.
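In pseudo-C, the bracketing idea would amount to something like the sketch below, assuming a usable metric existed. The window bounds and names here are hypothetical, and choosing them is exactly the unsolved part:
Code:
#define METRIC_LO  1000.0    /* hypothetical lower edge of the window */
#define METRIC_HI 10000.0    /* hypothetical upper edge of the window */

extern double NodesPerSplit(void);  /* or whatever metric survives    */
extern int smp_min_depth;           /* illustrative name for mindepth */

void TuneCheck(void) {
  double metric = NodesPerSplit();
  if (metric < METRIC_LO)
    smp_min_depth++;  /* too little work per split; keep splits deeper  */
  else if (metric > METRIC_HI)
    smp_min_depth--;  /* splits too coarse; allow them nearer the tips  */
}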
The question is, what sort of measure might work uniformly across different CPU speeds and different numbers of CPUs? I had hoped that nodes per split would work, but that looks almost useless. I had toyed with normalizing this to 1M nodes per second, so that if you go 10x faster, you can get away with 1/10th the nodes per split. But I could not come up with any rational explanation for why that looks decent with the current data, so I have no confidence that it would work on all machines.
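As a formula, that normalization would look like the following; BASE_NODES_PER_SPLIT is a hypothetical anchor value at 1M nps, not a number derived from the data above:
Code:
#define BASE_NPS             1.0e6    /* reference speed: 1M nodes/sec */
#define BASE_NODES_PER_SPLIT 30000.0  /* hypothetical target at 1M nps */

/* 10x faster machine => 1/10th the nodes-per-split target */
double TargetNodesPerSplit(double nps) {
  return BASE_NODES_PER_SPLIT * (BASE_NPS / nps);
}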
Any ideas???
One reason for not splitting very close to the tips is that, in addition to doing far more splits and incurring that overhead (however small it may be), I would also incur the overhead of worse move ordering: positions right at the tips likely have no hash entries to help with move ordering, things are less stable near the leaves anyway, and as move ordering goes bad, parallel search overhead goes up... That is one reason I split at the root after the parallel search has slowly worked its way back to the root and reported a real root score: at the root there is _zero_ overhead to speak of, and all we lose out on is a few hash hits here and there...
One idea I have been looking at is to simply find the lowest setting that produces a good NPS value, as that would sort of do the trick. But it is not easy to detect that between iterations without stepping "over" the sweet spot, something I don't want to do.
However, perhaps that might well be the best answer for this.
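A naive version of that probe, called once per iteration, might look like the sketch below (all names hypothetical). It also shows the difficulty: the only way to detect that NPS has dropped is to take the one step past the sweet spot that causes the drop, then back off.
Code:
extern int smp_min_depth;   /* illustrative name for mindepth */

/* Called once per iteration with the NPS just observed. */
void ProbeMindepth(double nps_this_iter, double *best_nps, int *probing) {
  if (!*probing)
    return;
  if (nps_this_iter >= 0.97 * *best_nps) {  /* NPS still healthy?     */
    if (nps_this_iter > *best_nps)
      *best_nps = nps_this_iter;
    smp_min_depth--;                        /* probe one notch lower  */
  } else {
    smp_min_depth++;                        /* stepped over; back off */
    *probing = 0;                           /* lock in the setting    */
  }
}
The caller would initialize *best_nps from the first iteration's NPS and set *probing to 1 at the start of each new search.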