Karl Juhnke wrote:

I was unguarded in what I wrote to you by e-mail, Dr. Hyatt, but here in the public forum I hope to be a model of civility. Shame on me if I raise the emotional temperature of the discussion instead of discussing calmly and rationally.
It seems that some of what you posted from my letter was not sufficiently clear, based on the immediate responses it drew:
H. G. Muller wrote:

Note that statistically speaking, such games are not dependent or correlated at all. A sampling that returns always exactly the same value obeys all the laws for independent sampling, with respect to the standard deviation and the central limit theorem.

and

Uri Blass wrote:

Correlation between 2 variables that get 1 with probability of 1 is not defined.

To clarify, consider the situation where the same starting positions are used repeatedly with fixed opponents and fixed node counts, guaranteeing that the outcomes are the same in every run. The results of the first trial run are indeed correlated with the results of every subsequent trial run. For simplicity let me assume a position set of 40, only one opponent, only one color, and no draws. The results of the first trial run might be
X = 1110010001000011011111000010011111111101.
Then the result of the second trial run will be the same
Y = 1110010001000011011111000010011111111101.
The sample coefficient of correlation is well-defined, and may be calculated (using the formula on Wikipedia) as
(N * sum(x_i * y_i) - sum(x_i) * sum(y_i)) / (sqrt(N * sum(x_i^2) - (sum(x_i))^2) * sqrt(N * sum(y_i^2) - (sum(y_i))^2))
= (40*23 - 23*23) / (sqrt(40*23 - 23*23) * sqrt(40*23 - 23*23))
= (40*23 - 23*23) / (40*23 - 23*23)
= 1
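For anyone who would rather let a computer grind through that arithmetic, here is a minimal Python sketch (my own illustration, not part of any testing framework) that plugs the two runs into the same formula:

Code:

from math import sqrt

# The two trial runs from above; y is identical to x by construction.
x = [int(c) for c in "1110010001000011011111000010011111111101"]
y = list(x)
n = len(x)  # 40 games

num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = (sqrt(n * sum(a * a for a in x) - sum(x) ** 2)
       * sqrt(n * sum(b * b for b in y) - sum(y) ** 2))
print(num / den)  # 1.0, up to floating-point rounding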
Thus the coefficient of correlation between the first run and the second is unity, corresponding to the intuitive understanding that the two trials are perfectly correlated. These repeated trials do not, in fact, obey all the laws of independent sampling. In particular, there is no guarantee that the sample mean will converge to the true mean of the random variable X as the sample size goes to infinity. I took these numbers from a random number generator which was supposed to produce 50% zeros and 50% ones. Intuitively, 23 out of 40 is not an alarming deviation from the true mean, but if we repeat the trial a thousand times, 23000 out of 40000 is statistically highly improbable. For the mathematically inclined, we can make precise the claim that repeated trials provide "no new information".
For our random variable X taking values 1 and 0 with assumed mean 0.5, each trial has variance 0.5*(1 - 0.5)^2 + 0.5*(0 - 0.5)^2 = 0.25. The variance of the sum of the first forty trials, since they are independent with no covariance, is simply the sum of the variances, i.e. 40*0.25 = 10. The variance of the mean is 10/(40^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.
Now we add in random variable Y. The covariance of X and Y is E(XY) - E(X)E(Y) = 0.5 - 0.5*0.5 = 0.25. When adding two random variables, the variance of the sum is the sum of the variances plus twice the sum of the covariances. Thus the variance of the sum of our eighty scores will be 40*0.25 + 2*40*0.25 + 40*0.25 = 40. The variance of the mean will be 40/(80^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.
If we have M perfectly correlated trial runs, there will be covariance between each pair of runs. Since there are M(M-1)/2 pairs of runs, there will be this many covariance terms in the formula for the variance of the sum. Thus the variance of the sum will be M*40*0.25 + 2*40*0.25*M(M-1)/2 = 10M + 10(M^2 - M) = 10M^2. The variance of the mean will be 10M^2/((40M)^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.
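Those calculations can also be checked empirically. The following Monte Carlo sketch (again my own illustration, in plain Python) repeats a single 40-game run M times and shows that the spread of the overall mean refuses to shrink:

Code:

import random
from statistics import pstdev

def mean_of_repeated_runs(m, games=40):
    run = [random.randint(0, 1) for _ in range(games)]  # one genuine 40-game run
    repeated = run * m   # M perfectly correlated copies of that run
    return sum(repeated) / len(repeated)

random.seed(1)
for m in (1, 10, 100):
    means = [mean_of_repeated_runs(m) for _ in range(20000)]
    print(m, round(pstdev(means), 3))  # ~0.079 for every M, as derived above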
No matter how large the test size, we should expect results of 50% plus or minus 7.9%. The central limit theorem implies (among other things) that the standard deviation of the sample mean goes to zero as the sample size grows, which isn't happening here. I apologize if the detail seems pedantic; it is in service of the clarity that my original letter obviously did not provide.
H. G. Muller wrote:

So far, so good. The results of the 40-game matches are not correlated, as they are the same every time. But games N and N+40 are perfectly correlated if you consider them as individual games within the total set of games.

Watch out here! What standard deviation are you talking about? A standard deviation is defined as a sum over events. Which events do you sum over? If you are talking about the (hypothetical) sampling of all possible starting positions played by all possible opponents, you would be right that a sampling procedure limiting the samples to a single position (or just a few), itself randomly chosen, will drive up the standard deviation of the match results by correlating the games. So procedure A, which randomly selects a position, plays 80 games with that position, and then repeats the full procedure (including selection of a new random position) a number of times, might give you a larger SD of the 80-game match results than procedure B, which independently selects a random position for each individual game.

Karl Juhnke wrote:

The second point I would like to make regards the "randomness" of our testing. Randomness has been presented as both the friend and the foe of accurate measurement. In fact, both intuitions are correct in different contexts.
If we want to measure how well engine A plays relative to how well engine A' plays, then we want to give them exactly the same test. Random changes between the first test and the second will only add noise to our signal. In particular, if A is playing against opponents limited by one node count and A' is playing opponents limited by a different node count, then they are not taking the same test. By the same token, if both A and A' play opponents at the same time control, but small clock fluctuations change the node counts between the opponent A played and the opponent A' played, we can expect that to add noise to our signal. The fact that the two engines are playing slightly different opposition makes the difference between A and A' slightly harder to detect.
If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely, independence between measurements) is a good thing. We want to do everything possible to kill off correlations among the games that A plays. This can include setting the opponent's node count to a random number, so that there is less correlation between games that reuse the same opponent. That said, if we randomize the opposing node count, we should save the node count we used, so that we can apply exactly the same node count to the corresponding game when A' plays in our comparison.
I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison between engines).
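As a concrete sketch of the bookkeeping suggested above (every name here is hypothetical; I am not describing any existing tool), one could draw a random node budget per game once, store it, and replay the identical schedule for A':

Code:

import json
import random

def make_schedule(positions, opponents, base_nodes=1_000_000, jitter=0.10, seed=42):
    """Draw one random opponent node budget per (position, opponent) game."""
    rng = random.Random(seed)  # fixed seed: the schedule itself is reproducible
    lo, hi = int(base_nodes * (1 - jitter)), int(base_nodes * (1 + jitter))
    return {f"{pos}|{opp}": rng.randint(lo, hi)
            for pos in positions for opp in opponents}

# Generate once, save, and reuse the same file when A' takes the test,
# so both versions face exactly the same opposition.
schedule = make_schedule([f"pos{i:02d}" for i in range(40)], ["opp1", "opp2"])
with open("node_schedule.json", "w") as f:
    json.dump(schedule, f)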
I've rambled quite a bit already, but I want to make two final points before wrapping up. First, the only way I see to achieve "good randomness" is by not re-using positions at all, as several people have suggested. Even playing the same position against five different opponents twice each will introduce correlations among the ten games from that position, and keep the standard deviation of the measured performance higher than it would be for perfectly independent results.
H. G. Muller wrote:

OTOH, if you would randomly select a position only once (procedure C) and then repeatedly use that random position in several 80-game matches (perhaps differing in opponents, number of nodes, time jitter or whatever), the standard deviation of the 80-game match results will be smaller than when using the totally random sampling, because you correlate games between the different samples you are calculating the SD over, while in procedure A you were correlating only the games within a single match, keeping different matches uncorrelated.
Karl Juhnke wrote:

Second, although I am not intimately familiar with BayesElo, I am quite confident that the error bounds it gives are calculated on the assumption that the input games are independent. If the inputs to BayesElo are dependent, that will necessarily fool BayesElo into giving confidence intervals that are too narrow.
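The size of the effect follows from the arithmetic earlier in this letter. The sketch below is generic (it does not reproduce BayesElo's actual internals): a tool assuming independence divides by the full game count, while the true standard error of M repeated 40-game runs never improves.

Code:

from math import sqrt

for m in (1, 10, 100):
    games = 40 * m
    naive = sqrt(0.25 / games)  # error bar under an independence assumption
    actual = sqrt(1 / 160)      # from the covariance calculation above
    print(f"M={m:3d}: naive +/- {naive:.3f}, actual +/- {actual:.3f}")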
Thank you for allowing me to participate in this discussion.