Here are the first three runs I have tried. The only significant thing is that these positions still produce results that are sane with respect to SD overlap. These runs use almost exactly 1/2 of the initial positions: I simply skipped every other one to cut the workload in half. I have a total of 4 runs set up, and three have completed. BTW, these are all the same version. I have to use different "names" so that each match produces a directory with that name containing all the PGN, which I am keeping for the moment.
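For anyone who wants to try the same reduction, skipping every other position amounts to a strided slice over the position list. This is only a sketch: the `subsample` helper and the integer position list are hypothetical stand-ins, not the actual test harness.

```python
def subsample(positions, step=2, offset=0):
    """Keep every `step`-th position, starting at index `offset`.

    step=2 keeps about 1/2 of the positions, step=4 about 1/4.
    """
    return positions[offset::step]


if __name__ == "__main__":
    # Hypothetical stand-in for the real list of starting positions.
    all_positions = list(range(1, 21))
    half = subsample(all_positions, step=2)
    quarter = subsample(all_positions, step=4)
    print(len(all_positions), len(half), len(quarter))
```

Changing `offset` picks the complementary set, which would let a later run use the positions skipped here.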
The next thing I am interested in is the original positions again, but playing only one game each. It is hard in the current testing approach to alternate colors at one game per position, so I'll have to modify the referee if I decide to try that. But I am going to make 2 runs with Crafty as white in every position, then two more with Crafty as black, which will be interesting data as well. I'll run those once the current test is finished. Here are the partial results so far (three total runs):
The error bar increased by roughly +/- 1 for 1/2 the work, and again the results look very stable. I still plan on another couple of tests. The first is two rounds just playing white in all positions, then two rounds just playing black, which will be interesting to look at. Then another round with 1/4 the work rather than 1/2, to see how 4 runs like that will look...
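As a rough sanity check on how the error bar should move with the workload: if the games are treated as independent, the margin shrinks as 1/sqrt(games), so halving the games should widen it by a factor of about 1.41 and quartering them by a factor of 2. The sketch below uses a simple normal approximation. The game counts, the 50% score, and the 30% draw ratio are made-up assumptions, and the rating tool that produced the numbers above may well compute its error bars differently.

```python
import math


def elo_margin(games, score=0.5, draw_ratio=0.3, z=1.96):
    """Approximate error margin in Elo for a match of `games` games.

    Assumes independent games and a normal approximation.
    `score` is the expected score per game; `draw_ratio` is the draw fraction.
    """
    win = score - draw_ratio / 2
    loss = 1.0 - score - draw_ratio / 2
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = (win * (1.0 - score) ** 2
           + draw_ratio * (0.5 - score) ** 2
           + loss * (0.0 - score) ** 2)
    std_err = math.sqrt(var / games)
    # Slope of Elo = 400 * log10(s / (1 - s)) with respect to the score s.
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * std_err * slope


if __name__ == "__main__":
    full = 40000  # hypothetical game count for a full run
    for n in (full, full // 2, full // 4):
        ratio = elo_margin(n) / elo_margin(full)
        print(f"{n:6d} games: +/- {elo_margin(n):.2f} Elo ({ratio:.2f}x the full run)")
```

The ratio between runs depends only on the game counts, not on the assumed score or draw ratio, so the 1.41x and 2x factors hold for any choice of those two parameters.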