I would like to know the formula for the error in estimating the ELO difference with 95% confidence as commonly used by chess engine authors.
If I play 10 games between 2 engines and get a 200 ELO difference, what is the likely error? And what about 100 games?
Estimating ELO difference
Moderators: hgm, Rebel, chrisw
-
- Posts: 98
- Joined: Sat Jul 31, 2010 8:48 pm
- Full name: Ubaldo Andrea Farina
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Estimating ELO difference
Roughly it is 560/sqrt(N), if N is the total number of games, and the result is in the 25-75% range.
For extreme scores, it becomes dependent on the exact Elo model you use, but the above formula can still be used as a rough estimate when you take for N the number of non-wins or non-losses (whichever is smaller). So if an engine scores 3 out of 100, the error is as if you played about 3 games.
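As a rough illustration of the rule of thumb above, here is a minimal Python sketch. The function names are made up for this example, and the "effective N" helper is only one reading of the non-wins/non-losses suggestion, not a precise model.

```python
import math

def rule_of_thumb_error(n_games):
    """Rough 95% error in Elo, meant for scores in the 25-75% range."""
    return 560.0 / math.sqrt(n_games)

def effective_games(n_games, wins, losses):
    """For extreme scores: the smaller of non-wins and non-losses,
    to be used as N in the formula above (an interpretation, not an exact model)."""
    return min(n_games - wins, n_games - losses)

for n in (10, 100, 1000):
    print(n, "games: about +/-", round(rule_of_thumb_error(n)), "Elo")

# Example of an extreme score: 3 wins and 97 losses out of 100 games
print(effective_games(100, wins=3, losses=97))
```

For 10 games this gives about 177 Elo, and for 100 games about 56 Elo.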
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Estimating ELO difference
This is how I would think about it. ELO difference is a number D such that E:=1/(1+10^(-D/400)) is the number of points one player is expected to get when playing the other.
We can concentrate on computing E instead of D. We are trying to estimate E, using the results of some games as evidence. Bayes's formula is the right thing to use, and this requires that we have a prior distribution for E. If we don't know anything else, a uniform distribution in [0,1] is a natural prior to use.
It turns out that after W wins and L losses, the posterior distribution for E follows a beta distribution with parameters W+1 and L+1.
The mean of a beta(W+1,L+1) distribution is mu:=(W+1)/(W+L+2) and its standard deviation is sigma:=sqrt((W+1)*(L+1)/((W+L+2)^2*(W+L+3))). You can try to convert the mean to an ELO score like this:
D = 400*log(mu/(1-mu))/log(10)
What to do about the standard deviation is trickier, but if you use a linear approximation to this formula (which should work well if the standard deviation is small), you just have to multiply sigma times the derivative of D as a function of mu. I get
sigma * 400/(mu*(1-mu)*log(10))
In order to get a simpler formula, you can assume W and L are similar and large, and then I get
(1600/log(10))/sqrt(N) ~= 695/sqrt(N)
When N is the total number of games played.
It's not too too far from what hgm posted, and it is likely that I made a mistake somewhere. Does anyone have a derivation of the 560/sqrt(N) formula?
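The steps in this post can be tried out numerically. The sketch below is only an illustration under the same assumptions (uniform prior, wins and losses only, draws ignored). The function names are invented here, and the conversion to Elo uses the inverse of E = 1/(1+10^(-D/400)), i.e. D = 400*log10(mu/(1-mu)).

```python
import math

def beta_posterior(wins, losses):
    """Mean and standard deviation of beta(W+1, L+1)."""
    a, b = wins + 1, losses + 1
    mu = a / (a + b)
    sigma = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mu, sigma

def elo_from_score(mu):
    """Invert E = 1/(1+10^(-D/400)) to get D from an expected score."""
    return 400.0 * math.log10(mu / (1.0 - mu))

def elo_sigma_linear(mu, sigma):
    """Linear approximation: sigma times |dD/dmu|."""
    return sigma * 400.0 / (math.log(10) * mu * (1.0 - mu))

mu, sigma = beta_posterior(50, 50)
print("Elo estimate:", elo_from_score(mu))
print("one-sigma error in Elo:", elo_sigma_linear(mu, sigma))
```

With W = L = 50 the one-sigma error comes out at about 34 Elo, so a two-sigma interval would be roughly twice that.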
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Estimating ELO difference
I actually see I made a mistake, and now I get a different result, where the final approximation ends up being (800/log(10))/sqrt(N), i.e. half of what I computed earlier. I'll redo things more carefully tonight and post what I get.
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Estimating ELO difference
I did not really calculate it from any model, I just used the rule of thumb that an excess score of 1% corresponds to 7 Elo, and that with a (quite typical) draw rate of 32% the standard deviation is 40%/sqrt(N). And that a 95% interval is about 2 sigma wide. So the 560 came about as 40 x 2 x 7. So it's all very coarse estimates.
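The three ingredients of this rule of thumb can be checked with a few lines of Python. This is just a sanity check, assuming a 50% score and the logistic Elo formula E = 1/(1+10^(-D/400)) quoted earlier in the thread; the variable names are made up for the example.

```python
import math

draw_rate = 0.32
win_rate = (1.0 - draw_rate) / 2.0  # assumes a 50% score, so wins = losses

# Per-game standard deviation of the score (win = 1, draw = 0.5, loss = 0)
per_game_sd = math.sqrt(win_rate * 1.0 + draw_rate * 0.25 - 0.5 ** 2)

# Slope of the Elo curve at a 50% score, expressed per 1% of score
elo_per_percent = 400.0 / (math.log(10) * 0.5 * 0.5) / 100.0

print("per-game standard deviation:", per_game_sd)
print("Elo per 1% score:", elo_per_percent)
print("unrounded constant:", 2 * per_game_sd * 100 * elo_per_percent)
print("rounded constant:", 40 * 2 * 7)
```

The unrounded values are about 41% and 6.95 Elo per 1%, which gives a constant near 573 instead of 560, well within the coarseness of the estimate.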
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Estimating ELO difference
AlvaroBegue wrote:
(1600/log(10))/sqrt(N) ~= 695/sqrt(N)
When N is the total number of games played.
It's not too too far from what hgm posted, and it is likely that I made a mistake somewhere. Does anyone have a derivation of the 560/sqrt(N) formula?
It's very close. For 95.45% confidence (2 standard deviations), one has
(1600/ln(10))/sqrt(N) ~= 695/sqrt(N) times
sqrt(4*score*(1-score) - DrawFraction)
score = number of points / N
DrawFraction = number of draws / N
The final formula is
Error (2SD) in Elo points = 695 * sqrt(4*score*(1-score) - DrawFraction) / sqrt(N)
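For convenience, the final formula can be wrapped in a small Python function. This is only a sketch: the function name and the win/draw/loss arguments are chosen for this example, and the constant 695 corresponds to the slope of the Elo curve at a 50% score, so it should be treated as approximate for lopsided results.

```python
import math

def elo_error_2sd(wins, draws, losses):
    """Error (2SD) in Elo points = 695 * sqrt(4*score*(1-score) - DrawFraction) / sqrt(N)."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    draw_fraction = draws / n
    # max() only guards against tiny negative values from rounding
    spread = max(0.0, 4.0 * score * (1.0 - score) - draw_fraction)
    return 695.0 * math.sqrt(spread) / math.sqrt(n)

print(elo_error_2sd(35, 30, 35))  # 100 games, 30% draws
print(elo_error_2sd(50, 0, 50))   # 100 games, no draws
```

With 35 wins, 30 draws and 35 losses this gives roughly 58 Elo; with no draws at all it gives 69.5 Elo.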
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Estimating ELO difference
And that is very close to what I had, as for Chess I assume a draw fraction of 1/3. So there would be a multiplier sqrt(0.66) ~ 0.8, which times 695 is about 560.
-
- Posts: 2041
- Joined: Wed Mar 08, 2006 8:30 pm
Re: Estimating ELO difference
hgm wrote:
as for Chess I assume a draw fraction of 1/3.
That's the major error term of your formula: in engine-engine tests you can often see 2/3 draws.
Then you have to divide your number by sqrt(2)
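The effect of the draw fraction on the constant can be seen directly from the formula posted above, 695 * sqrt(4*score*(1-score) - DrawFraction). A minimal sketch, assuming a 50% score (the helper name is invented for this example):

```python
import math

def constant_at_even_score(draw_fraction):
    """Numerator of the 2SD error at a 50% score: 695 * sqrt(1 - DrawFraction)."""
    return 695.0 * math.sqrt(1.0 - draw_fraction)

one_third = constant_at_even_score(1.0 / 3.0)
two_thirds = constant_at_even_score(2.0 / 3.0)

print("draw fraction 1/3:", one_third)
print("draw fraction 2/3:", two_thirds)
print("ratio:", one_third / two_thirds)
```

The ratio between the two constants is sqrt(2), so with 2/3 draws the numerator drops from roughly 567 to roughly 401.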
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Estimating ELO difference
I have never seen such a high draw fraction in any engine-engine testing I did. For Chess. (In Shogi the draw fraction is of course close to 0%.)