hgm wrote:Wrong!
1) We don't get the same thing. I say that you need 4000 games (2000 A1-B and 2000 A2-B) to get the same accuracy for the difference between the A versions as with 1000 A1-A2 games. You say you need only 2000.
Correct, it is 4x.
The absolute error of the self-testing result (Es) is proportional to the square root of the number of games (n).
Es = k * sqrt(n)
The fractional error (divide by n) is
Es/n = k/sqrt(n)
Now, let's run two gauntlets, one for version 1 and one for version 2.
If the total number of games in each gauntlet is m, then
E1 = k * sqrt(m) // error of the relative Elo value V1 in the gauntlet for the first version
E2 = k * sqrt(m) // error of the relative Elo value V2 in the gauntlet for the second version
The Elo difference calculated from the two gauntlets is V2-V1. Its variance is the sum of the two variances (the variance is the square of the error, so Variance = Ec^2):
Variance = Ec^2 = E1^2 + E2^2
Ec^2 = k^2 * m + k^2 * m
Ec^2 = 2 * k^2 * m
Hence, the total error of the difference V2-V1 is
Ec = k * sqrt(2) * sqrt(m)
So, the fractional error is
Ec/m = k * sqrt(2) / sqrt(m)
When are the fractional errors for self testing and for the two gauntlets the same?
Ec/m = Es/n when
k / sqrt(n) = sqrt(2) * k / sqrt(m)
sqrt(m)/sqrt(n) = sqrt(2)
m/n = 2
The number of games per gauntlet should be double, but since we have to run two gauntlets, the total number of games needed is 4x.
For instance, if we run 1000 games in self testing, we need to run 2000 games in gauntlet 1 and 2000 games in gauntlet 2.
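A quick numerical check of the arithmetic above, as a minimal sketch: the function names are made up for illustration, k is set to 1 (it cancels anyway), and the same k is assumed for self testing and for both gauntlets, exactly as in the derivation.

```python
import math

def self_test_fractional_error(n, k=1.0):
    """Es/n = k/sqrt(n) for n self-test games."""
    return k * math.sqrt(n) / n

def gauntlet_fractional_error(m, k=1.0):
    """Ec/m for the difference V2-V1, with m games in each of two gauntlets."""
    e1 = k * math.sqrt(m)            # error of V1
    e2 = k * math.sqrt(m)            # error of V2
    ec = math.sqrt(e1**2 + e2**2)    # independent errors add in quadrature
    return ec / m

n = 1000                 # self-test games
m = 2 * n                # games per gauntlet, from m/n = 2
es = self_test_fractional_error(n)
ec = gauntlet_fractional_error(m)
total_gauntlet_games = 2 * m

print(f"self test, n={n}: fractional error {es:.5f}")
print(f"two gauntlets, m={m} each: fractional error {ec:.5f}")
print(f"total gauntlet games: {total_gauntlet_games} ({total_gauntlet_games // n}x)")
```

With m = n instead of 2n, the gauntlet error comes out larger by a factor sqrt(2), which is why each gauntlet has to be doubled.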
Miguel
2) The ratings of A1 and A2 calculated by the rating program are not independent, but 100% anti-correlated. So the errors add directly in their difference, and not through the sqrt stuff. You say the quoted data is synthetic, and I really wonder if BayesElo would report it like that. I think it should report 9 Elo error bars in the A1-A2 case, not 18, so that the error bar in the difference will be 18.
In any case, using the error bars reported by rating programs is tricky with few players, as the condition that the average rating is zero creates a correlation between the ratings. So the simple sqrt addition no longer applies, and you have to take account of the covariance as well as the variances.
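To illustrate the covariance point, here is a small sketch using the standard formula Var(A1-A2) = Var(A1) + Var(A2) - 2*Cov(A1,A2). The function name is hypothetical, and the 9 Elo error bars are the figure quoted above, not output from any rating program.

```python
import math

def difference_error(sd1, sd2, corr):
    """Standard error of (A1 - A2) given the two error bars and their correlation."""
    var = sd1**2 + sd2**2 - 2.0 * corr * sd1 * sd2
    return math.sqrt(max(var, 0.0))  # guard against tiny negative rounding

# Independent ratings: simple sqrt addition.
independent = difference_error(9.0, 9.0, 0.0)

# Two players with the average rating pinned to zero: A2 = -A1, correlation -1.
anti_correlated = difference_error(9.0, 9.0, -1.0)

print(f"independent:     {independent:.2f} Elo")
print(f"anti-correlated: {anti_correlated:.2f} Elo")
```

With correlation -1 the two 9 Elo error bars add directly to 18 Elo, while treating them as independent would give only about 12.7 Elo.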