Estimating ELO difference

mihaiv · Post by **mihaiv** » Wed Nov 24, 2010 10:48 am

I would like to know the formula for the error in estimating the ELO difference with 95% confidence, as commonly used by chess engine authors.
If I have 10 games between 2 engines and get a 200 ELO difference, what is the likely error? And for 100 games?

uaf · Post by **uaf** » Wed Nov 24, 2010 11:59 am

http://www.mizarchessengine.com/columns ... ss-engine/
http://www.ascotti.org/programming/ches ... ngines.htm

hgm · Post by **hgm** » Wed Nov 24, 2010 5:35 pm

Roughly, it is 560/sqrt(N), if N is the total number of games and the score is in the 25-75% range.

For extreme scores, it becomes dependent on the exact Elo model you use, but the above formula can still be used as a rough estimate when you take for N the number of non-wins or non-losses (whichever is smaller). So if an engine scores 3 out of 100, the error is as if you played about 3 games.
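A minimal Python sketch of this rule of thumb, for illustration only. The function names are hypothetical, and the extreme-score variant simply takes N as the smaller of non-wins and non-losses, as described above.

```python
import math


def hgm_error_95(n_games):
    """Rough 95% error bar in Elo: 560/sqrt(N), meant for scores of 25-75%."""
    if n_games <= 0:
        raise ValueError("need at least one game")
    return 560.0 / math.sqrt(n_games)


def hgm_error_95_extreme(wins, draws, losses):
    """Extreme scores: use the smaller of non-wins and non-losses as N."""
    non_wins = draws + losses
    non_losses = wins + draws
    return hgm_error_95(min(non_wins, non_losses))


print(round(hgm_error_95(10)))                 # 10 games
print(round(hgm_error_95(100)))                # 100 games
print(round(hgm_error_95_extreme(3, 0, 97)))   # 3 wins, 97 losses
```

Under this rule, 10 games give an error of roughly 177 Elo and 100 games roughly 56 Elo, so a 200 Elo difference measured over 10 games is only loosely determined.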

AlvaroBegue · Post by **AlvaroBegue** » Wed Nov 24, 2010 8:12 pm

This is how I would think about it. The ELO difference is a number D such that E:=1/(1+10^(-D/400)) is the expected score (points per game) of one player when playing the other.
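The logistic model above is easy to tabulate. This is a small sketch; `expected_score` is a hypothetical helper name.

```python
def expected_score(d):
    """Expected points per game for the player rated d Elo higher:
    E = 1 / (1 + 10^(-D/400))."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))


for d in (0, 100, 200, 400):
    print(d, round(expected_score(d), 3))
```

For example, a 200 Elo gap, as in the original question, corresponds to an expected score of roughly 0.76.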

We can concentrate on computing E instead of D. We are trying to estimate E, using the results of some games as evidence. Bayes's formula is the right thing to use, and this requires that we have a prior distribution for E. If we don't know anything else, a uniform distribution in [0,1] is a natural prior to use.

It turns out that after W wins and L losses, the posterior distribution for E follows a beta distribution with parameters W+1 and L+1.

The mean of a beta(W+1,L+1) distribution is mu:=(W+1)/(W+L+2) and its standard deviation is sigma:=sqrt((W+1)*(L+1)/((W+L+2)^2*(W+L+3))). You can try to convert the mean to an ELO score like this:

D = 400*log(mu/(1-mu))/log(10)
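A minimal sketch of the posterior summary and the point estimate, assuming draws are ignored as in the post. The Elo conversion uses D = 400*log10(mu/(1-mu)), the algebraic inverse of E = 1/(1+10^(-D/400)). Function names are hypothetical.

```python
import math


def beta_posterior(wins, losses):
    """Mean and standard deviation of Beta(W+1, L+1), the posterior for E
    under a uniform prior on [0, 1]."""
    a, b = wins + 1, losses + 1
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd


def elo_from_score(mu):
    """Convert an expected score 0 < mu < 1 to an Elo difference."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must be strictly between 0 and 1")
    return 400.0 * math.log10(mu / (1.0 - mu))


mu, sigma = beta_posterior(8, 2)      # 8 wins, 2 losses
print(mu, sigma, elo_from_score(mu))
```

With 8 wins and 2 losses the posterior mean is 0.75 (standard deviation about 0.12), which converts to about 191 Elo.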

What to do about the standard deviation is trickier, but if you use a linear approximation to this formula (which should work well if the standard deviation is small), you just have to multiply sigma by the derivative of D with respect to mu. I get

sigma * 400/(mu*(1-mu)*log(10))
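The linear (delta-method) approximation can be sketched as follows. The slope 400/(mu*(1-mu)*ln 10) is equal in magnitude to the derivative given above, and a finite-difference check is included as a sanity test. Helper names are hypothetical.

```python
import math


def elo_from_score(mu):
    return 400.0 * math.log10(mu / (1.0 - mu))


def elo_sigma_delta(wins, losses):
    """Delta-method standard deviation of D from the Beta(W+1, L+1) posterior."""
    a, b = wins + 1, losses + 1
    mu = a / (a + b)
    sigma = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    # |dD/dmu| = 400 / (mu * (1 - mu) * ln 10)
    slope = 400.0 / (mu * (1.0 - mu) * math.log(10))
    return sigma * slope


# Sanity check of the slope against a central finite difference at mu = 0.6.
h = 1e-6
numeric = (elo_from_score(0.6 + h) - elo_from_score(0.6 - h)) / (2 * h)
analytic = 400.0 / (0.6 * 0.4 * math.log(10))
print(numeric, analytic)
print(elo_sigma_delta(50, 50))  # one standard deviation, 50 wins and 50 losses
```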

In order to get a simpler formula, you can assume W and L are similar and large, and then I get

(1600/log(10))/sqrt(N) ~= 695/sqrt(N)

where N is the total number of games played.

It's not too far from what hgm posted, and it is likely that I made a mistake somewhere. Does anyone have a derivation of the 560/sqrt(N) formula?

AlvaroBegue · Post by **AlvaroBegue** » Wed Nov 24, 2010 8:29 pm

I actually see I made a mistake, and now I get a different result, where the final approximation ends up being (800/log(10))/sqrt(N), i.e. half of what I computed earlier. I'll redo things more carefully tonight and post what I get.
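A quick numerical check of the corrected constant, as a sketch: with W = L = N/2 and N large, mu is about 1/2, sigma is about 1/(2*sqrt(N)) and the slope is 1600/ln 10, so one standard deviation is about (800/ln 10)/sqrt(N), roughly 347/sqrt(N). Doubling it for a roughly 95% interval gives about 695/sqrt(N). The helper names are hypothetical.

```python
import math


def delta_sigma(wins, losses):
    # Beta(W+1, L+1) standard deviation times |dD/dmu|.
    a, b = wins + 1, losses + 1
    mu = a / (a + b)
    sigma = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return sigma * 400.0 / (mu * (1.0 - mu) * math.log(10))


def approx_sigma(n_games):
    # W ~= L ~= N/2: sigma ~= 1/(2*sqrt(N)), slope ~= 1600/ln 10.
    return (800.0 / math.log(10)) / math.sqrt(n_games)


for n in (100, 1000, 10000):
    print(n, round(delta_sigma(n // 2, n // 2), 2), round(approx_sigma(n), 2))
```

The two columns converge as N grows, since the exact posterior variance has W+L+3 where the approximation uses N.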

hgm · Post by **hgm** » Wed Nov 24, 2010 10:50 pm

I did not really calculate it from any model. I just used the rule of thumb that a 1% excess score corresponds to 7 Elo, that with a (quite typical) draw rate of 32% the standard deviation is 40%/sqrt(N), and that a 95% interval is about 2 sigma wide. So the 560 came about as 40 x 2 x 7. These are all very coarse estimates.
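The ingredients of this rule of thumb can be reproduced, assuming the match score is near 50%. At that score a win or loss deviates from the mean by 0.5 and a draw by 0, so the per-game variance is (1 - d)/4 for draw fraction d. The variable names are illustrative.

```python
import math

# Elo per 1% of score near 50%: dD/dE at E = 0.5 is 1600/ln 10 Elo per unit score.
elo_per_percent = 1600.0 / math.log(10) / 100.0   # about 6.95 ("7 Elo")

# Per-game standard deviation of the score at 50% with draw fraction d.
draw_rate = 0.32
sd_per_game = math.sqrt((1.0 - draw_rate) / 4.0)  # about 0.41 ("40%")

# A 95% interval is about 2 sigma, giving the constant in C/sqrt(N).
constant = 2 * sd_per_game * 100 * elo_per_percent
print(round(elo_per_percent, 2), round(sd_per_game, 3), round(constant))
```

Without the rounding to 7 and 40%, the constant comes out near 573 rather than 560.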

Laskos · Post by **Laskos** » Thu Nov 25, 2010 12:31 am

AlvaroBegue wrote:
(1600/log(10))/sqrt(N) ~= 695/sqrt(N)

When N is the total number of games played.

It's not too too far from what hgm posted, and it is likely that I made a mistake somewhere. Does anyone have a derivation of the 560/sqrt(N) formula?

It's very close. For 95.45% confidence (2 standard deviations), one has

(1600/ln(10))/sqrt(N) ~= 695/sqrt(N) times
sqrt(4*score*(1-score) - DrawFraction)

score = number of points / N
DrawFraction = number of draws / N

The final formula is

Error (2SD) in Elo points = 695 * sqrt(4*score*(1-score) - DrawFraction) / sqrt(N)
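A direct transcription of this formula, as a sketch with a hypothetical function name. Since 695 is the slope of the Elo curve at a 50% score, the formula, like the 560/sqrt(N) rule, is most accurate for scores not far from 50%.

```python
import math


def laskos_error_2sd(wins, draws, losses):
    """Approximate 2-standard-deviation Elo error:
    695 * sqrt(4*score*(1-score) - DrawFraction) / sqrt(N), with 695 ~= 1600/ln 10."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    draw_fraction = draws / n
    factor = math.sqrt(4 * score * (1 - score) - draw_fraction)
    return (1600.0 / math.log(10)) * factor / math.sqrt(n)


print(laskos_error_2sd(50, 0, 50))   # 100 games, no draws
print(laskos_error_2sd(35, 30, 35))  # 100 games, 30% draws
```

Draws reduce the per-game variance, so the same number of games gives a tighter error bar when more of them are drawn.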

hgm · Post by **hgm** » Thu Nov 25, 2010 9:20 am

And that is very close to what I had, as for Chess I assume a draw fraction of 1/3. So there would be a multiplier sqrt(0.66) ~ 0.8, which times 695 is about 560.

ernest · Post by **ernest** » Fri Nov 26, 2010 2:00 pm

hgm wrote: as for Chess I assume a draw fraction of 1/3.

That's the major error term in your formula: in engine-engine tests you can often see 2/3 draws.
Then you have to divide your number by sqrt(2).
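The draw-fraction multiplier at a 50% score can be compared for the two assumptions in this exchange, 1/3 draws versus 2/3 draws. This is a small sketch; `draw_multiplier` is a hypothetical name.

```python
import math


def draw_multiplier(draw_fraction, score=0.5):
    """The sqrt(4*score*(1-score) - DrawFraction) factor of the 2SD formula."""
    return math.sqrt(4 * score * (1 - score) - draw_fraction)


one_third = draw_multiplier(1 / 3)   # about 0.816; times 695 gives about 567
two_thirds = draw_multiplier(2 / 3)  # about 0.577
ratio = two_thirds / one_third       # about 0.707, i.e. 1/sqrt(2)
print(round(695 * one_third), round(ratio, 4))
```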

hgm · Post by **hgm** » Fri Nov 26, 2010 3:28 pm

I have never seen such a high draw fraction in any engine-engine testing I did for Chess. (In Shogi the draw fraction is of course close to 0%.)
