New testing thread

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

Karl Juhnke wrote:I was unguarded in what I wrote to you by e-mail, Dr. Hyatt, but here in the public forum I hope to be a model of civility. Shame on me if I raise the emotional temperature of the discussion instead of discussing calmly and rationally.

It seems that some of what you posted from my letter was not sufficiently clear, based on the immediate responses it drew:
Uri Blass wrote:Correlation between 2 variables that get 1 with probability of 1 is not defined
and
H. G. Muller wrote:Note that statistically speaking, such games are not dependent or correlated at all. A sampling that always returns exactly the same value obeys all the laws for independent sampling, with respect to the standard deviation and the central limit theorem.
To clarify, consider the situation where the same starting positions are used repeatedly with fixed opponents and fixed node counts, guaranteeing the outcomes are the same in every run. The results of the first trial run are indeed correlated with the results of every subsequent trial run. For simplicity, let me assume a position set of 40, only one opponent, only one color, and no draws. The results of the first trial run might be

X = 1110010001000011011111000010011111111101.

Then the result of the second trial run will be the same

Y = 1110010001000011011111000010011111111101.

The sample coefficient of correlation is well-defined, and may be calculated (using the formula on Wikipedia) as

(N * sum(x_i * y_i) - sum(x_i)*sum(y_i)) / (sqrt(N * sum(x_i^2) - sum(x_i)^2) * sqrt(N * sum(y_i^2) - sum(y_i)^2))
= (40*23 - 23*23) / (sqrt(40*23 - 23*23) * sqrt(40*23 - 23*23))
= (40*23 - 23*23) / (40*23 - 23*23)
= 1
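For anyone who wants to check the arithmetic, here is a small Python sketch of the same calculation (my own illustration, using the example result string above; any two identical, non-constant result strings give r = 1):

Code: Select all

# Sample correlation coefficient of two identical 40-game result strings.
x = [int(c) for c in "1110010001000011011111000010011111111101"]
y = x[:]  # the second trial run: identical outcomes

n = len(x)
sxy = sum(a * b for a, b in zip(x, y))
sx, sy = sum(x), sum(y)
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

r = (n * sxy - sx * sy) / (((n * sxx - sx * sx) ** 0.5) * ((n * syy - sy * sy) ** 0.5))
print(r)  # 1.0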

Thus the coefficient of correlation between the first run and the second is unity, corresponding to the intuitive understanding that the two trials are perfectly correlated. These repeated trials do not, in fact, obey all the laws of independent sampling. In particular, there is no guarantee that the sample mean will converge to the true mean of the random variable X as the sample size goes to infinity. I took these numbers from a random number generator that was supposed to produce 50% zeros and 50% ones. Intuitively, 23 of 40 is not an alarming deviation from the true mean, but if we repeat the trial a thousand times, 23000 of 40000 is statistically highly improbable. For the mathematically inclined, we can make precise the claim that repeated trials provide "no new information".

For our random variable X taking values 1 and 0 with assumed mean 0.5, each trial has variance 0.5*(1 - 0.5)^2 + 0.5*(0 - 0.5)^2 = 0.25. The variance of the sum of the first forty trials, since they are independent with no covariance, is simply the sum of the variances, i.e. 40*0.25 = 10. The variance of the mean is 10/(40^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.

Now we add in random variable Y. The covariance of X and Y is E(XY) - E(X)E(Y) = 0.5 - 0.5*0.5 = 0.25. When adding random variables, the variance of the sum is the sum of the variances plus twice the sum of the pairwise covariances. Thus the variance of the sum of our eighty scores will be 40 * 0.25 + 2 * 40 * 0.25 + 40 * 0.25 = 40. The variance of the mean will be 40/(80^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.

If we have M perfectly correlated trial runs, there will be covariance between each pair of runs. Since there are M(M-1)/2 pairs of runs, and each pair contributes 40 covariance terms of 0.25, the variance of the sum will be M * 40 * 0.25 + 2 * 40 * 0.25 * M(M-1)/2 = 10M + 10(M^2 - M) = 10M^2. The variance of the mean will be 10(M^2)/((40M)^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.
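To make this concrete, here is a rough Monte Carlo sketch (my own construction, not part of the original argument) that repeats a 40-game trial M times with identical outcomes and confirms that the standard deviation of the mean stays near sqrt(1/160) ~ 0.079 no matter how large M gets:

Code: Select all

import random

def sd_of_mean(m, trials=5000):
    # Each trial: one random 40-game run, duplicated m times (perfect correlation).
    means = []
    for _ in range(trials):
        base = [random.randint(0, 1) for _ in range(40)]
        games = base * m
        means.append(sum(games) / len(games))
    mu = sum(means) / len(means)
    return (sum((x - mu) ** 2 for x in means) / len(means)) ** 0.5

for m in (1, 10, 100):
    print(m, round(sd_of_mean(m), 3))  # ~0.079 for every m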

No matter how large the test size, we should expect results of 50% plus or minus 7.9%. The central limit theorem implies (among other things) that the standard deviation of the sample mean goes to zero as the sample size grows, which isn't happening here. I apologize if the detail seems pedantic; it is in service of the clarity that my original letter obviously did not provide.
So far, so good. The results of the 40-game matches are not correlated, as they are the same every time. But game N and N+40 are perfectly correlated if you consider them as individual games within the total set of games.
The second point I would like to make regards the "randomness" of our testing. Randomness has been presented as both the friend and the foe of accurate measurement. In fact, both intuitions are correct in different contexts.

If we want to measure how well engine A plays relative to how well engine A' plays, then we want to give them exactly the same test. Random changes between the first test and the second one will only add noise to our signal. In particular, if A is playing against opponents limited by one node count and A' is playing opponents limited by a different node count, then they are not taking the same test. By the same token, if both A and A' play opponents at the same time control, but small clock fluctuations change the node counts from the opponent A played to the opponent A' played, we can expect that to add noise to our signal. The fact that the two engines are playing slightly different opposition makes the difference between A and A' slightly harder to detect.


If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use exactly the same node count for the same game when A' plays in our comparison game.
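As a sketch of what saving the randomized node counts could look like (the function and parameter names below are made up for illustration, not taken from any existing test harness), one could derive the per-game node counts from a stored seed so that the comparison run for A' replays exactly the same schedule:

Code: Select all

import random

def node_schedule(seed, n_games, base_nodes=10_000_000, jitter=0.05):
    # The same seed always produces the same schedule, so A' can be given
    # exactly the opposition node counts that A faced.
    rng = random.Random(seed)
    return [int(base_nodes * rng.uniform(1 - jitter, 1 + jitter))
            for _ in range(n_games)]

schedule_for_A = node_schedule(seed=12345, n_games=80)
schedule_for_A_prime = node_schedule(seed=12345, n_games=80)
assert schedule_for_A == schedule_for_A_prime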

I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison between engines).

I've rambled quite a bit already, but I want to make two final points before wrapping up. First, the only way I see to achieve "good randomness" is by not re-using positions at all, as several people have suggested. Even playing the same position against five different opponents twice each will introduce correlations between the ten games of each position, and keep the standard deviation in measured performance higher than it would be for perfectly independent results.
Watch out here! What standard deviation are you talking about? A standard deviation is defined as a sum over events. What events do you sum over? If you are talking about the (hypothetical) sampling of all possible starting positions played by all possible opponents, you would be right that a sampling procedure limiting the samples to a single position (or just a few), itself randomly chosen, will drive up the standard deviation of the match results by correlating the games. So if procedure A is to randomly select a position and then play 80 games with that position, repeating the full procedure (including selection of a new random position) a number of times might give you a larger SD of the 80-game match results than when you would independently select the position for each individual game at random (procedure B).

OTOH, if you would randomly select a position only once (procedure C) and then repeatedly use that random position in several 80-game matches (perhaps differing in opponents, number of nodes, time jitter or whatever), the standard deviation of the 80-game match results will be smaller than when using the totally random sampling, because you correlate games between the different samples you are calculating the SD over, while in procedure A you were correlating only the games within a single match, keeping different matches uncorrelated.
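A toy simulation may make the ordering of the three procedures easier to see. The model below is my own construction (each position is simply assigned an arbitrary "true" win probability), so only the relative sizes of the three standard deviations matter, not the numbers themselves:

Code: Select all

import random

POSITIONS = [random.random() for _ in range(1000)]  # "true" win probability per position

def match_score(position_probs):
    # one 80-game match: one Bernoulli game per listed position probability
    return sum(random.random() < p for p in position_probs) / len(position_probs)

def sd(xs):
    mu = sum(xs) / len(xs)
    return (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5

fixed = random.choice(POSITIONS)
A = [match_score([random.choice(POSITIONS)] * 80) for _ in range(5000)]                # new position per match
B = [match_score([random.choice(POSITIONS) for _ in range(80)]) for _ in range(5000)]  # new position per game
C = [match_score([fixed] * 80) for _ in range(5000)]                                   # one position reused everywhere
print("A:", round(sd(A), 3))  # largest SD of match results
print("B:", round(sd(B), 3))  # smaller
print("C:", round(sd(C), 3))  # no larger than B, smaller for most positions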
Second, although I am not intimately familiar with BayesElo, I am quite confident that all the error bounds it gives are calculated on the assumption that the input games are independent. If the inputs to BayesElo are dependent, it will necessarily fool BayesElo into giving confidence intervals that are too narrow.
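A quick numeric illustration of why that matters (this is not BayesElo's code, only the textbook standard error of a roughly 50% score): if every game in the input were secretly duplicated M times, a tool that assumes independence would report an error bar that is too narrow by a factor of sqrt(M):

Code: Select all

n_real = 800                 # genuinely independent games
m = 32                       # hypothetical hidden duplication factor
n_seen = n_real * m          # what the rating tool is fed

reported_se = 0.5 / n_seen ** 0.5  # standard error assuming independence
true_se = 0.5 / n_real ** 0.5      # only the independent games carry information
print(round(reported_se, 4), round(true_se, 4), round(true_se / reported_se, 2))
# 0.0031 0.0177 5.66  -> the reported interval is sqrt(32) times too narrow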

Thank you for allowing me to participate in this discussion.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

xsadar wrote:snipped
The second point I would like to make regards the "randomness" of our testing. Randomness has been presented as both the friend and the
foe of accurate measurement. In fact, both intuitions are correct in different contexts.

If we want to measure how well engine A plays relative to how well engine A' plays, then we want to give them exactly the same test.
Random changes between the first test and the second one will only add noise to our signal. In particular, if A is playing against
opponents limited by one node count and A' is playing opponents limited by a different node count, then they are not taking the same
test. By the same token, if both A and A' play opponents at the same time control, but small clock fluctuations change the node counts
from the opponent A played to the opponent A' played, we can expect that to add noise to our signal. The fact that the two engines are
playing slightly different opposition makes the difference between A and A' slightly harder to detect.


If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between
measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games
that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games
that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use
exactly the same node count for the same game when A' plays in our comparison game.

I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the
bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving
less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison
between engines).

I've rambled quite a bit already, but I want to make two final points before wrapping up. First, the only way I see to achieve "good
randomness" is by not re-using positions at all, as several people have suggested. Even playing the same position against five different
opponents twice each will introduce correlations between the ten games of each position, and keep the standard deviation in measured
performance higher than it would be for perfectly independent results.
This point makes sense, but I don't like it. That means we would need 10 times as many positions as I originally thought for an ideal test. Also, if we don't play the positions as both white and black, it seems (to me at least) to make it even more important that the positions be about equal for white and black. I hope you're kind enough, Bob, to make the positions you finally settle on (however many that may be) available to the rest of us.
Of course. However, once I run the first run (cluster is a bit busy but I just started a test so results will start to trickle in) I would like to start a discussion about how to select the positions.

To continue the discussion above, the idea of using a different position for each opponent seems reasonable. Not playing black and white from each, however, becomes more problematic, because then we have to be sure that the positions are all relatively balanced, which is not exactly an easy task when we start talking about 10K (or more) positions.

What I have done is to take the PGN collection we use for our normal wide book (good-quality games) and then just modify the book-creation procedure so that on wtm move 11 (both sides have played 10 moves) I write out the FEN. The good side is that probably most of these positions are decent. The down side is that this does not cover unusual openings very well, and might not cover some at all. So you might find out your program does well in normal positions but not know that it handles off-the-wall positions poorly...
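For anyone who wants to try the same extraction, here is a rough sketch using the python-chess library; it is my own illustration, not the actual book-creation code (which is C), and it simply takes the position after both sides have played 10 moves:

Code: Select all

import chess.pgn

def positions_after_10_moves(pgn_path):
    fens = []
    with open(pgn_path) as f:
        while True:
            game = chess.pgn.read_game(f)
            if game is None:
                break
            board = game.board()
            moves = list(game.mainline_moves())
            if len(moves) < 20:       # need 10 full moves by each side
                continue
            for move in moves[:20]:
                board.push(move)
            fens.append(board.fen())  # white to move, move number 11
    return fens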

So there is a ton of room for further discussion. But let me get some hard Elo data from these positions. I want to run them 4 times so that I get 4 sets of Elo data, which will hopefully be very close to the same each time...

At first I thought it wouldn't work so well if the different opponents all played from different positions, or if the positions weren't played from each color's perspective, but if you're only testing to see if engine A' is better or worse than engine A, I don't think it matters as long as A is tested identically to A'.
Of course, if you played A against A' to see which is better, then they would each need to play white and black for each position they play against each other, but then count each position as a data point rather than each game.
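As a sketch of that "one data point per position" bookkeeping (my own illustration, with made-up example scores), the two games of a position are averaged first and the error bar is computed over positions, not games:

Code: Select all

def per_position_scores(results):
    # results: list of (score_with_white, score_with_black) for engine A vs A',
    # where 1 = win, 0.5 = draw, 0 = loss
    return [(w + b) / 2 for w, b in results]

scores = per_position_scores([(1, 0.5), (0, 0), (1, 1), (0.5, 0.5)])
n = len(scores)                                   # 4 positions, not 8 games
mean = sum(scores) / n
var = sum((s - mean) ** 2 for s in scores) / (n - 1)
se = (var / n) ** 0.5
print(round(mean, 3), round(se, 3))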
Second, although I am not intimately familiar with BayesElo, I am quite confident that all the error bounds it gives are calculated on
the assumption that the input games are independent. If the inputs to BayesElo are dependent, it will necessarily fool BayesElo into
giving confidence intervals that are too narrow.

Thank you for allowing me to participate in this discussion.
--Karl Juhnke
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Correlated data discussion

Post by Zach Wegner »

Karl wrote:If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between
measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games
that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games
that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use
exactly the same node count for the same game when A' plays in our comparison game.

I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the
bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving
less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison
between engines).
Isn't this pretty much exactly what I've been saying?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Uri Blass wrote:I am going to read more and respond later but I understand the source of the misunderstanding.

I thought about correlation between the result of a game from position 1 and the result of a game from position 2, whereas Karl looked at different variables, namely results of the same position in different matches.

I agree that correlation of results from the same position may reduce the variance, but if this is the only correlation and this correlation coefficient is positive, then it can only reduce the total variance, and I was looking for an explanation of the big variance in the results.

Edit: From a quick look at Karl's post, he explains why repeating the same experiment is a bad idea for finding small changes, but it does not explain the results.

There are basically 2 questions:
1) Is it good to test the way Hyatt tests?
2) Do you expect to get the same rating in different tests?

In the case that you always get 23-17 between equal programs, you may get a wrong rating, but your rating is always going to be the same.

In the case of Crafty the rating was not always the same, and the only logical reason is simply that Bob Hyatt did not repeat the same experiment.
You keep saying that and I will keep saying it is _wrong_. Perhaps, one day, you will get off that horse, because it won't ride.

Even if the cluster became on average 0.1% slower after many games, that means the experiment is different.

Uri
That is about the silliest thing I have read. Exactly what is going to make a crystal oscillator drift .1%??? How could we have TV/radio broadcasts, etc.? It doesn't happen. Unfortunately it is not something that is "accessible", so it is not measurable anyway. And it is also something _everyone_ would then suffer from, which means it is an uncontrollable and therefore moot issue. The test just has to be able to withstand that tiny variation by playing enough games that such a tiny change would not matter. And then there is the issue of how this tiny change is going to somehow make games "dependent", which has been a continual claim here each time testing comes up... Since the games are run in a particular order, but which node they run on is random, I don't see how this could be measured, controlled, or an issue unique to my test platform.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

hgm wrote:
bob wrote:At times, discussing things with you is like discussing the _same_ technical ideas with a 6-year-old.
I am not surprised. You seem to learn as much from any discussion as from talking to a brick wall...
"Ton" == "Too much to be useful". That has been what this discussion has been about from post #1. Never changed. Never will. Do you think you can follow that simple idea for a while? I want to be able to run the fastest possible test and measure minor improvements in a program.
Well, if you were able to compute a square root, you would have known that 800 games gives you a standard error of 1.7% ~ 12 Elo, so if your idea of a small change is <20 Elo, that is too much to be useful. I would have thought a 6-year-old could understand the difference between smaller than 20 and larger than 20.
Believe I got that 55 years ago or so. The point being addressed was how something with an SD of 12 (or less) still produces two runs where the results are well outside that range. Get that? I just reduced the number of positions so that I could run the various tests being suggested, and not make the discussion drag out over months. I fully understand that 25,000 games produces a smaller SD than 800. But, unlike yourself, I also fully understand that running 800 games takes a lot less time, and if you recognize that the SD goes up as the size goes down, the discussion can _still_ reach some sort of conclusion that could be verified on the bigger run if necessary.
That's all I have been trying to discover how to do since my very first post. And while this data might satisfy your "well within statistical bounds", I can't use it for the purpose I defined.
Yes, that was obvious from the beginning. So why do you persistently try? Doesn't seem very useful to me...
Because _everybody_ is using some sort of a vs b testing to either measure rating differences to see who is better, or whether a change was good. No idea how you do your testing and draw your conclusions, and I don't really care. But I am addressing what _most_ are doing. And things are progressing in spite of your many non-contributions, thank you.
So is it _possible_ that we stay on _that_ train of thought for a while, and stop going off into the twilight zone most of the time. I want to measure small improvements. That data won't do so.
Play enough independent games, then. If you want to reliably see a difference of 7 Elo, 800 will never be enough, as computing a square root will tell you. And no number of games will ever be enough if you use only 40 positions or 5 opponents. The sum of the statistical noise caused by opponent sampling, position sampling and game sampling will have to be well below 1% (scorewise). As I told you ages ago.
OK, we are playing with number of positions. I am up to almost 4,000 and am testing this. How many opponents? 4,000 needed there too? If so, I can actually play 16,000,000 games. But can anyone else? Didn't think so, so we need something that is both (a) useful and (b) doable. So far you are providing _neither_ while at least others are making suggestions that can be tested.

BTW your "most frequent suggestion, by far" has not been to use more positions or more opponents. 99% of your posts are "stampee feet, testing flawed, stampee feet, cluster is broken, stampee feet, there are dependencies between the games, stampee feet, stampee feet." None of which is useful. I had already pointed out that the _only_ dependencies present were tied to same opponents and same positions. But not where one game influences another in any possible way. But we don't seem to be able to get away from that. Meanwhile, in spite of all the noise, there is a small and steadily helpful signal buried in here that others are contributing. And I am willing to test 'em all without dismissing _anything_ outright. Unlike yourself.
Fritzlein

Re: Correlated data discussion

Post by Fritzlein »

Zach Wegner wrote:Isn't this pretty much exactly what I've been saying?
Yes it is. I apologize for not quoting you directly; before my account was activated it was difficult to respond to everyone at once, but I agreed with the distinction you were making, and just wanted to try to say it clearly myself.
Uri Blass
Posts: 10902
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Correlated data discussion

Post by Uri Blass »

bob wrote:
Uri Blass wrote:I am going to read more and respond later but I understand the source of the misunderstanding.

I thought about correlation between the result of a game from position 1 and the result of a game from position 2, whereas Karl looked at different variables, namely results of the same position in different matches.

I agree that correlation of results from the same position may reduce the variance, but if this is the only correlation and this correlation coefficient is positive, then it can only reduce the total variance, and I was looking for an explanation of the big variance in the results.

Edit: From a quick look at Karl's post, he explains why repeating the same experiment is a bad idea for finding small changes, but it does not explain the results.

There are basically 2 questions:
1) Is it good to test the way Hyatt tests?
2) Do you expect to get the same rating in different tests?

In the case that you always get 23-17 between equal programs, you may get a wrong rating, but your rating is always going to be the same.

In the case of Crafty the rating was not always the same, and the only logical reason is simply that Bob Hyatt did not repeat the same experiment.
You keep saying that and I will keep saying it is _wrong_. Perhaps, one day, you will get off that horse, because it won't ride.

Even if the cluster became on average 0.1% slower after many games, that means the experiment is different.

Uri
That is about the silliest thing I have read. Exactly what is going to make a crystal oscillator drift .1%??? How could we have TV/radio broadcasts, etc.? It doesn't happen. Unfortunately it is not something that is "accessible", so it is not measurable anyway. And it is also something _everyone_ would then suffer from, which means it is an uncontrollable and therefore moot issue. The test just has to be able to withstand that tiny variation by playing enough games that such a tiny change would not matter. And then there is the issue of how this tiny change is going to somehow make games "dependent", which has been a continual claim here each time testing comes up... Since the games are run in a particular order, but which node they run on is random, I don't see how this could be measured, controlled, or an issue unique to my test platform.
Note that what I say here is not important for testing small changes, but the point is that there must be something different between the first set of 25,000 games and the second set to explain the results, and I see no reason to insist that there is no difference.

If you play with 10,000,000 nodes per move and later with 10,010,000 nodes per move then you do not repeat the same experiment.

The same holds if you play first with 10,000,000+noise nodes per move and later with 10,010,000+noise nodes per move, where the noise is taken from the same distribution.

I do not know how the cluster works, and I do not know if it can become slightly slower or if it can slightly change the way it measures time, but it is clear that something was at least slightly changed during the games.

If you take games from the same distribution then there can be no big difference in the results (it can lead to misleading results where you think that one program is better when it is not, but that is a different point).

Uri
Uri Blass
Posts: 10902
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: 4 sets of data

Post by Uri Blass »

bob wrote:
hgm wrote:
bob wrote:At times, discussing things with you is like discussing the _same_ technical ideas with a 6-year-old.
I am not surprised. You seem to learn as much from any discussion as from talking to a brick wall...
"Ton" == "Too much to be useful". That has been what this discussion has been about from post #1. Never changed. Never will. Do you think you can follow that simple idea for a while? I want to be able to run the fastest possible test and measure minor improvements in a program.
Well, if you were able to compute a square root, you would have known that 800 games gives you a standard error of 1.7% ~ 12 Elo, so if your idea of a small change is <20 Elo, that is too much to be useful. I would have thought a 6-year-old could understand the difference between smaller than 20 and larger than 20.
Believe I got that 55 years ago or so. The point being addressed was how something with an SD of 12 (or less) still produces two runs where the results are well outside that range. Get that? I just reduced the number of positions so that I could run the various tests being suggested, and not make the discussion drag out over months. I fully understand that 25,000 games produces a smaller SD than 800. But, unlike yourself, I also fully understand that running 800 games takes a lot less time, and if you recognize that the SD goes up as the size goes down, the discussion can _still_ reach some sort of conclusion that could be verified on the bigger run if necessary.
That's all I have been trying to discover how to do since my very first post. And while this data might satisfy your "well within statistical bounds", I can't use it for the purpose I defined.
Yes, that was obvious from the beginning. So why do you persistently try? Doesn't seem very useful to me...
Because _everybody_ is using some sort of a vs b testing to either measure rating differences to see who is better, or whether a change was good. No idea how you do your testing and draw your conclusions, and I don't really care. But I am addressing what _most_ are doing. And things are progressing in spite of your many non-contributions, thank you.
So is it _possible_ that we stay on _that_ train of thought for a while, and stop going off into the twilight zone most of the time. I want to measure small improvements. That data won't do so.
Play enough independent games, then. If you want to reliably see a difference of 7 Elo, 800 will never be enough, as computing a square root will tell you. And no number of games will ever be enough if you use only 40 positions or 5 opponents. The sum of the statistical noise caused by opponent sampling, position sampling and game sampling will have to be well below 1% (scorewise). As I told you ages ago.
OK, we are playing with number of positions. I am up to almost 4,000 and am testing this. How many opponents? 4,000 needed there too? If so, I can actually play 16,000,000 games. But can anyone else? Didn't think so, so we need something that is both (a) useful and (b) doable. So far you are providing _neither_ while at least others are making suggestions that can be tested.

BTW your "most frequent suggestion, by far" has not been to use more positions or more opponents. 99% of your posts are "stampee feet, testing flawed, stampee feet, cluster is broken, stampee feet, there are dependencies between the games, stampee feet, stampee feet." None of which is useful. I had already pointed out that the _only_ dependencies present were tied to same opponents and same positions. But not where one game influences another in any possible way. But we don't seem to be able to get away from that. Meanwhile, in spite of all the noise, there is a small and steadily helpful signal buried in here that others are contributing. And I am willing to test 'em all without dismissing _anything_ outright. Unlike yourself.
I think that in theory H.G.Muller is right that 5 opponents may not be enough but in practice 5 opponents are enough.

You can use 100 opponents when you have Crafty play against each of them from different positions, but I think that it is probably a waste of time, and if you use free source-code programs then you may be unable to find 100 opponents that are strong enough to be competitive with Crafty.

Uri
Fritzlein

Re: Correlated data discussion

Post by Fritzlein »

Uri Blass wrote:I agree that correlation of results from the same position may reduce the variance, but if this is the only correlation and this correlation coefficient is positive, then it can only reduce the total variance, and I was looking for an explanation of the big variance in the results.
Yes, in addition to the portion of my letter you saw quoted, I was also trying to give a plausible explanation of the big difference between the two trial runs that kicked off this discussion. The increased variance within the results of the first 25600 games seems very plausible to me, but it doesn't by itself explain why the second set of 25600 games had such a different result.

What correlation of games within a test set does explain is this: IF something changed between the first set and second set, the thing that changed does not have to be systematically in favor of a particular engine to explain the huge difference in results. It also does not have to be something that makes the results of one playout causally dependent on any previous playout. I am explicitly suggesting that neither of these types of changes happened. I expect rather that it was a random change with a random effect on which side wins from each position. But if the first 25600 games are internally correlated and the second 25600 games are also internally correlated, then any random change between the two sets can cause unexpectedly large swings in results.
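A toy model may help quantify this; the sketch below is my own construction. Within a run, every game from the same position gives the same result, and the unspecified change between runs simply re-randomizes each position's result. The gap between two runs is then governed by the number of positions, not by the number of games:

Code: Select all

import random

N_POSITIONS = 40      # positions that actually carry independent information
GAMES_PER_POS = 640   # 40 * 640 = 25600 games per run

def run_score():
    # one deterministic result per position, repeated for every game of it;
    # each fresh call models a run after some random change re-randomized them
    outcomes = [random.randint(0, 1) for _ in range(N_POSITIONS)]
    return sum(outcomes) * GAMES_PER_POS / (N_POSITIONS * GAMES_PER_POS)

gaps = [abs(run_score() - run_score()) for _ in range(5000)]
print(round(sum(gaps) / len(gaps), 3))
# typical gap ~0.09 (about 9%), versus ~0.004 expected from 25600 independent games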

I do not know enough about computer science to know what might possibly have changed, but it could be anything and still have this effect. As it happens, because I suggested that something did change between the two runs, this portion of my letter was summarized by Dr. Hyatt as "stampee foot". ;)
Fritzlein

Re: Correlated data discussion

Post by Fritzlein »

hgm wrote:Watch out here! What standard deviation are you talking about? A standard deviation is defined as a sum over events. What events do you sum over? If you are talking about the (hypothetical) sampling of all possible starting positions played by all possible opponents, you would be right that a sampling procedure limiting the samples to a single position (or just a few), itself randomly chosen, will drive up the standard deviation of the match results by correlating the games. So if procedure A is to randomly select a position and then play 80 games with that position, repeating the full procedure (including selection of a new random position) a number of times might give you a larger SD of the 80-game match results than when you would independently select the position for each individual game at random (procedure B).

OTOH, if you would randomly select a position only once (procedure C) and then repeatedly use that random position in several 80-game matches (perhaps differing in opponents, number of nodes, time jitter or whatever), the standard deviation of the 80-game match results will be smaller than when using the totally random sampling, because you correlate games between the different samples you are calculating the SD over, while in procedure A you were correlating only the games within a single match, keeping different matches uncorrelated.
I was talking about the standard deviation of the mean of test results in comparison to the true mean winning percentage. This seems to me the relevant number: we want to know what the expected measurement error of our test is, and we want to drive this measurement error as near to zero as possible.

I was not talking about the standard deviation between a test run and an identical or nearly-identical test run. We can make that variation exactly zero if we want to. Big deal. Who wants to get the wrong answer again and again with high precision? (Well maybe it is important to know somehow whether or not _every_ variable has been controlled, but that's a practical question, not a math question.)

I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision. I'll let the computer scientists duke it out; to me it is frankly less interesting, because the mathematical mystery is gone. What I have explained (to my own satisfaction at least) is why each of the two 25600-game test runs can be expected to have a large error relative to Crafty's true strength, and thus I only need any old random change between the two big runs to complete the less interesting (to me) explanation as well.