Hello Larry:
lkaufman wrote:My exact result was 1847 wins, 1566 losses, 5127 draws, so somewhat more games than I recalled. I show the error margin for this as 4.8 elo (that's supposed to be 95% confidence). Do you agree?
Larry
Thanks for sharing the results. 1847 + 1566 + 5127 = 8540 games, more than the 6000 or 7000 games I assumed when reading your previous post. The draw ratio is 5127/8540 ~ 60.04%, which is higher than I expected: had I realised from the very first moment that it was a self test, I would have assumed a draw ratio near 60%, but I misunderstood and assumed less. With the model I use, both more games and a higher draw ratio narrow the error bars. Furthermore, your figure of 4.8 is closer to 5 than to 4. All these misunderstandings added up.
For your result of +1847 -1566 =5127, I get more or less +11.44 ± 4.66 Elo with 95% confidence (LOS ~ 100%). If the score is near 50%-50%, the error bars can be estimated as ±[800*z*sqrt(1 - D)]/[ln(10)*sqrt(games)] ~ ±347.44*z*sqrt[(1 - D)/games], where z is the z-score of a normal distribution (z ~ 1.96 for 95% confidence) and D is the draw ratio. As you can see, this estimate depends strongly on the draw ratio. But when is a score close enough to 50%-50%? At first glance I would say when 1/[4*score*(1 - score)] < 1.02 (nothing empirical, of course!), that is, when the score lies in ]0.43, 0.57[ (a maximum gap of 49 Elo, more or less). In this case the score is ~ 51.65% (inside the interval ]43%, 57%[) and D ~ 60%, so: ±800*1.96*sqrt[(1 - 0.6)/8540]/ln(10) ~ ± 4.66 Elo (practically the same as the result I gave above).
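The estimate above can be checked with a few lines of Python. This is only a sketch: the function name elo_and_margin is made up for illustration, and the Elo difference is taken from the usual logistic formula 400*log10(score/(1 - score)), which is an assumption because the post does not spell it out. The error bar is the near-50% approximation quoted above, not my full model.

```python
import math

def elo_and_margin(wins, losses, draws, z=1.96):
    """Return (Elo difference, error bar) for a match result.

    The Elo difference uses the logistic formula (an assumption here);
    the error bar uses the near-50% estimate
    800*z*sqrt(1 - D)/(ln(10)*sqrt(games)).
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    draw_ratio = draws / games
    elo = 400.0 * math.log10(score / (1.0 - score))
    margin = 800.0 * z * math.sqrt((1.0 - draw_ratio) / games) / math.log(10)
    return elo, margin

# Larry's result: +1847 -1566 =5127
elo, margin = elo_and_margin(1847, 1566, 5127)
print(f"{elo:+.2f} +/- {margin:.2f} Elo")
```

The approximation should only be trusted while the score stays inside ]43%, 57%[, as discussed above.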
------------------------
Your rule-of-thumb formula can be applied here because the score lies between 43% and 57% and because the draw ratio is close to 60% (Larry's result). With the model I use, the draw ratio plays an important role in the error bars.
------------------------
In your example of +1600 -1400 =4000, I get more or less +9.93 ± 5.33 Elo with 95% confidence (LOS ~ 99.99%). As a first approximation, it looks like you used 3-sigma confidence (~ 99.73%): 8.15/(5.33/1.96) = 1.96*(8.15/5.33) ~ 3.
I do not know exactly which numbers you are using in the second example.
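The 3-sigma guess for the +1600 -1400 =4000 example can be reproduced as follows. Again a sketch only: implied_z and two_sided_confidence are hypothetical helper names, and the calculation assumes the reported margin scales linearly with z, as in a normal approximation.

```python
import math

def implied_z(reported_margin, margin_95, z_95=1.96):
    """z-score that would turn the 95% margin into the reported margin."""
    return z_95 * reported_margin / margin_95

def two_sided_confidence(z):
    """Two-sided confidence level of a normal distribution for a given z."""
    return math.erf(z / math.sqrt(2.0))

z = implied_z(8.15, 5.33)          # 8.15 reported vs. 5.33 at 95% confidence
print(f"z ~ {z:.2f}, confidence ~ {100.0 * two_sided_confidence(z):.2f}%")
```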
------------------------
Uri Blass wrote:I guess that the main point is that there are more draws in a match against a previous version and this is the reason for smaller error.
The stockfish team get more than 64% draws in the games against a previous version and I believe that hyatt clearly get less draws in his games.
with less draws when they tested against very old version the stockfish team found higher possible error for 20,000 games and 3.3>2.8
Here are 2 tests of the stockfish team
ELO: 56.66 +-3.3 (95%) LOS: 100.0%(candidate version for stockfish 4 against stockfish3)
Total: 20000 W: 6221 L: 2988 D: 10791
regression test of stockfish developement against stockfish4
ELO: 24.34 +-2.8 (95%) LOS: 100.0%
Total: 20000 W: 4224 L: 2825 D: 12951
You got the point!

My previous answer to you in this post explains that.
------------------------
Uri Blass wrote:I do not see how you get error bar of 8 elo for 7000 games and I think that it is 4-5 elo.
You have 2.8 error bar after 20,000 games
see for example the regression of latest stockfish
http://tests.stockfishchess.org/tests/v ... 63f25cba49
you should have 2.8*sqrt(20,000/7000) after 7000 games that is between 4 elo and 5 elo.
Another answer for you, Uri.

Please see my answer to Larry at the beginning of this post: for some reason I assumed that Larry's test was not a self test, so I expected a lower draw ratio of, let's say, 40%. My wrong assumption changed the estimate a lot! The number of games was also quite different (8540 real games against the 6000 or 7000 games I first assumed).
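To see how much the two wrong assumptions matter, here is a small sketch with the near-50% estimate ±800*z*sqrt[(1 - D)/games]/ln(10). The function name margin_near_even is made up for illustration, and the pair of 7000 games with 40% draws is just the guess mentioned above, not a measured result.

```python
import math

def margin_near_even(games, draw_ratio, z=1.96):
    """Near-50% error bar: 800*z*sqrt((1 - D)/games)/ln(10)."""
    return 800.0 * z * math.sqrt((1.0 - draw_ratio) / games) / math.log(10)

assumed = margin_near_even(7000, 0.40)   # what I wrongly assumed
actual = margin_near_even(8540, 0.60)    # Larry's real numbers, rounded D
print(f"assumed: +/- {assumed:.2f} Elo, actual: +/- {actual:.2f} Elo")
```

Fewer games and a lower draw ratio both widen the error bars, and the two effects multiply.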
------------------------
bob wrote:Here's some serious numbers:
2 Crafty-23.6-2 2640 4 4 30080 65% 2519 24%
3 Crafty-23.6-1 2639 4 4 30080 65% 2519 25%
4 Crafty-23.7R02-50 2636 4 4 30080 64% 2519 24%
5 Crafty-23.7R03-1 2633 4 4 30080 64% 2519 25%
30,080 games => +/- 4 Elo using BayesElo.
As a first approximation, for a score of 64.5% and a draw ratio of 24.5% I would say: ±800*1.96*sqrt[1/(4*0.645*0.355) - 0.245]/[ln(10)*sqrt(30080)] ~ ± 3.61 Elo.
Now, if I take +15717 -6993 =7370 (score ~ 64.5% and draw ratio ~ 24.5%; number of games: 30080), I obtain error bars of ± 3.51 Elo, more or less (using my own model). There is a 2.77% difference between the estimate and the true error bar (once again: always with my model, which is not perfect).
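Both numbers can be compared in a short sketch. The first function is the estimate written above. The second one is NOT necessarily my model: it is a standard delta-method calculation on the win/draw/loss frequencies with the logistic Elo formula, included here only as an assumption for comparison. Both function names are made up for illustration.

```python
import math

def estimate_margin(score, draw_ratio, games, z=1.96):
    """Estimate from the post: 800*z*sqrt(1/(4*s*(1 - s)) - D)/(ln(10)*sqrt(games))."""
    inner = 1.0 / (4.0 * score * (1.0 - score)) - draw_ratio
    return 800.0 * z * math.sqrt(inner) / (math.log(10) * math.sqrt(games))

def delta_method_margin(wins, losses, draws, z=1.96):
    """Delta-method error bar (an assumption, not necessarily the author's model)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the score, with outcomes 1, 0.5 and 0.
    variance = wins / games + 0.25 * draws / games - score ** 2
    std_error = math.sqrt(variance / games)
    # Derivative of 400*log10(s/(1 - s)) with respect to s.
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * slope * std_error

approx = estimate_margin(0.645, 0.245, 30080)
delta = delta_method_margin(15717, 6993, 7370)
print(f"estimate: +/- {approx:.2f} Elo, delta method: +/- {delta:.2f} Elo")
```

With these inputs the estimate comes out slightly wider than the delta-method value, so it errs on the cautious side at this score.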
------------------------
Sorry for all this technical/mathematical stuff in the General Topics section.
Regards from Spain.
Ajedrecista.