I think the rule-of-thumb Error = 40%/sqrt(numberOfGames) is accurate enough in practice, for scores in the 65%-35% range. (This is for the 1-sigma or 84% confidence level; for 95% confidence, double it.)
H.G. Muller posted this very important formula in one thread and I just want to make sure I got it right. So let's take just one example.
Match Result:
A - B: 460 - 440
Score percentage for A: 460 / 900 = 51.1%.
Error margin: 40% / sqrt(900) = 1.3%. Now where should I apply this error margin? Is it applied directly to the score percentage?
So the correct result is 51.1% +- 1.3% (with 84% confidence). Did I get this right?
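As a sanity check, the arithmetic above can be reproduced in a few lines of Python (the variable names are just for illustration):

```python
import math

# Match result: engine A scored 460 points out of 900 games
points, games = 460.0, 900

score_pct = 100.0 * points / games   # observed score percentage
error_pct = 40.0 / math.sqrt(games)  # rule-of-thumb 1-sigma error

print(f"{score_pct:.1f}% +- {error_pct:.1f}%")  # 51.1% +- 1.3%
```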
Now if we are improving the engine through self-play, the truly interesting question is
"With the given match result 460-440, what is the confidence level that the true score percentage is >=50%?". I know there can't be such an easy rule-of-thumb formula here, but if someone has already figured out the more complicated one, please post it here.
Correct, the error applies to the score percentage.
Beware with the confidence level; the 84% that I quoted for 1-sigma error bars was actually the one-sided confidence. So there is an 84% likelihood that the actual score percentage of the engine (i.e. the one you would get after infinitely many games) is between 51.1%-1.3% and 100% (or between 0% and 51.1%+1.3%). The confidence you can have that the true score percentage will be between 51.1%-1.3% and 51.1%+1.3% (the two-sided confidence) is only 68%. So 68% of the results is normally between -sigma and +sigma, and 16% of the results on either side beyond sigma.
For 2-sigma (1.96-sigma, for the purists) the two-sided confidence is 95%, the one-sided confidence 97.5%. (I was a bit sloppy on this in my earlier remarks.)
The >=50% question asks for a one-sided confidence: you want to know how likely it is that the true score is between 50% and 100%. To know the exact confidence, you would need a table of the 'error function'.
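In practice no table is needed, since the error function is in most standard libraries. A sketch of the >=50% computation for the 460-440 example, using the normal approximation (the standard normal CDF is obtained from math.erf via the usual identity):

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p     = 460.0 / 900.0          # observed score fraction (51.1%)
sigma = 0.40 / math.sqrt(900)  # rule-of-thumb standard deviation (1.33%)

# One-sided confidence that the true score percentage is >= 50%
z = (p - 0.5) / sigma
confidence = normal_cdf(z)
print(f"{100 * confidence:.0f}%")  # roughly 80%
```

So a 460-440 result only gives about 80% confidence that the change is not a regression.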
I think the rule-of-thumb Error = 40%/sqrt(numberOfGames) is accurate enough in practice
Nice approximation!
Exact value means replacing the 40% by 41% if the draw ratio is 1/3, and by 35% if the draw ratio is 1/2, so 40% is good enough, especially for an error formula.
Just multiply by 7, and you have the error in Elo points...
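Both numbers can be double-checked. With draw ratio d and wins and losses equally likely, the exact per-game standard deviation of the score is sqrt(1-d)/2, and the "multiply by 7" factor is the slope of the Elo curve at a 50% score, 400/(ln 10 * 0.25) ≈ 695 Elo per unit of score, i.e. about 7 Elo per percent. A small sketch (function name is just for illustration):

```python
import math

def sigma_coeff(draw_ratio):
    """Per-game standard deviation of the score, assuming a 50% mean
    score with wins and losses equally likely: sqrt(1 - d) / 2."""
    return math.sqrt(1.0 - draw_ratio) / 2.0

print(f"{100 * sigma_coeff(1/3):.0f}%")  # 41%
print(f"{100 * sigma_coeff(1/2):.0f}%")  # 35%

# Elo per 1% of score: derivative of Elo = -400*log10(1/p - 1) at p = 0.5
elo_per_percent = 400.0 / (math.log(10.0) * 0.25) / 100.0
print(f"{elo_per_percent:.2f}")  # about 6.95, hence "multiply by 7"
```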
For a reference, I am seeing drawing rates just under 30% in my longer-game cluster testing. For very fast games it drops to 22-23%.
hgm wrote:Correct, the error applies to the score percentage. [...]
There is an additional subtlety if the score is very one sided, say 90:10. The 68% confidence intervals will be +X% -Y%, with X<Y, and 95% confidence intervals will deviate from +2X% -2Y%.