Elo difference and statistical confidence for matches

omid_dt · Post by **omid_dt** » Thu Feb 07, 2008 11:42 pm

Assume we have two programs playing a total of N games against each other, and the final result (from the first program's point of view) is:

W wins, L losses, and D draws (W + L + D = N total games).

Based on these, we would like to calculate the statistical confidence intervals for 68%, 95%, and 99.7% (corresponding to 1, 2, and 3 standard deviations).

The following is how I calculated it. My results are different from those displayed by Fritz interface when running two engine matches. I'd appreciate it if you take a look at what I did and see if there are any errors.

m = (W + D/2) / N

stdev = sqrt( (W*(1-m)^2 + L*m^2 + D*(0.5-m)^2) / (N - 1) )

In the following calculation, for 68% confidence (1 standard deviation), k=1, for 95% k=2, and for 99.7% k=3:

min = m - k * stdev / sqrt(N);
max = m + k * stdev / sqrt(N);

finally, translate these min, max intervals to Elo points (in the following formula log base is 10):

min_elo = -400 * log(1/min - 1)
max_elo = -400 * log(1/max - 1)

the mean elo is:

tp = -400 * log(1/m - 1)

My TP calculation matches that of Fritz, so it should be correct. However, the Elo ranges differ. For example, for an 840-game match with the following result:

+231 -301 =308

Fritz interface gives the following:

TP = -29 Elo
68% -> [-40, -20]
95% -> [-51, -11]
99.7% -> [-62, -3]

(shouldn't the TP value be at the middle of the [min, max] range? -29 is not the mean of the above ranges...)

For the above match, this is what my calculation produces:

TP = -29 Elo
68% -> [-39, -19]
95% -> [-48, -10]
99.7% -> [-58, -0]
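
For readers who want to check the numbers, here is a minimal Python sketch of the normal-approximation calculation described above. The function names `score_to_elo` and `elo_interval` are made up for this illustration; the formulas are the ones given in the post, with the sum of squared deviations divided by N - 1 inside the square root.

```python
import math

def score_to_elo(p):
    """Convert a score fraction p (0 < p < 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_interval(wins, losses, draws, k):
    """Return (min_elo, tp, max_elo) at k standard deviations."""
    n = wins + losses + draws
    m = (wins + 0.5 * draws) / n
    # Sample variance of the per-game results (1, 0.5 or 0 points).
    var = (wins * (1 - m) ** 2
           + losses * m ** 2
           + draws * (0.5 - m) ** 2) / (n - 1)
    stderr = math.sqrt(var) / math.sqrt(n)
    lo, hi = m - k * stderr, m + k * stderr
    return score_to_elo(lo), score_to_elo(m), score_to_elo(hi)

# The 840-game match from the post: +231 -301 =308
for k in (1, 2, 3):
    lo, tp, hi = elo_interval(231, 301, 308, k)
    print(f"k={k}: TP={tp:.1f}  [{lo:.1f}, {hi:.1f}]")
```

Rounded to whole Elo points, this gives the same 68% and 95% ranges as listed above. Because the score-to-Elo mapping is nonlinear, TP is not exactly at the midpoint of the Elo range even though m is at the midpoint of the score range.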

George Tsavdaris · Post by **George Tsavdaris** » Fri Feb 08, 2008 1:42 am

omid_dt wrote: For the above match, this is what my calculation produces:

TP = -29 Elo
68% -> [-39, -19]
95% -> [-48, -10]
99.7% -> [-58, -0]

I get EXACTLY the same ranges as you with my calculations (the methods are probably the same; I didn't look at yours since it's late here now), with one small exception:
99.7% -> [-58, -1] instead of your 99.7% -> [-58, -0].

I have heard people in the past call Fritz's calculations crap.

Since we don't know how they do it, I guess we should ignore them....

hgm · Post by **hgm** » Fri Feb 08, 2008 9:07 am

It might be that Fritz uses a BayesElo-like technique for calculating the confidence intervals (but not for the difference). What was N in this example? Does the effect get larger when you take very small N? (e.g. a 75% score out of only 2 games compared to a 75% score out of 200 games)?

omid_dt · Post by **omid_dt** » Fri Feb 08, 2008 2:11 pm

hgm wrote:It might be that Fritz uses a BayesElo-like technique for calculating the confidence intervals (but not for the difference). What was N in this example? Does the effect get larger when you take very small N? (e.g. a 75% score out of only 2 games compared to a 75% score out of 200 games)?

N = total number of games. 840 in my example.

For very small N values, Fritz does give an output, while my method does not, since the standard deviation is too large.

For example, for the following 6 games:
+1 -4 = 1

At 1 standard deviation (68%), my calculation gives:
68% -> [-426, -56]

but at 2 standard deviations (95%):

min = m - 2 * stdev / sqrt(n); // n = 6
max = m + 2 * stdev / sqrt(n); // n = 6

the result is min=-0.091 and max=0.59

which cannot be used for calculating respective Elo intervals, since the winning rate should be between 0 and 1.

For an even smaller number of games, e.g.:
+1 -0 =1

my calculation can't produce a confidence interval even for 1stdev (68%).
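
A small Python sketch of why this breaks down, assuming the same formulas as in my first post (the helper names `score_interval` and `to_elo_or_none` are invented for this example). The score bounds are computed first, and the Elo conversion is refused whenever a bound falls outside the open interval (0, 1), where the logarithm is undefined.

```python
import math

def score_interval(wins, losses, draws, k):
    """Score-fraction bounds (min, max) at k standard deviations."""
    n = wins + losses + draws
    m = (wins + 0.5 * draws) / n
    var = (wins * (1 - m) ** 2
           + losses * m ** 2
           + draws * (0.5 - m) ** 2) / (n - 1)
    half_width = k * math.sqrt(var / n)
    return m - half_width, m + half_width

def to_elo_or_none(p):
    """Elo difference for score fraction p, or None if p is not in (0, 1)."""
    if p <= 0.0 or p >= 1.0:
        return None
    return -400.0 * math.log10(1.0 / p - 1.0)

# +1 -4 =1 at 2 standard deviations: the lower bound is negative.
lo, hi = score_interval(1, 4, 1, 2)
print(lo, hi, to_elo_or_none(lo), to_elo_or_none(hi))

# +1 -0 =1 at 1 standard deviation: the upper bound reaches 1.
lo, hi = score_interval(1, 0, 1, 1)
print(lo, hi, to_elo_or_none(lo), to_elo_or_none(hi))
```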

omid_dt · Post by **omid_dt** » Fri Feb 08, 2008 2:22 pm

Another example from a match I'm running right now. The results so far, after 134 games:

+34 -40 =60

and the following are the outputs:

Fritz:

TP = -15 Elo
68% -> [-44, 3]
95% -> [-74, 22]
99.7% -> [-104, 41]

My calc:

TP = -15.6 Elo
68% -> [-38, 7]
95% -> [-61, 29]
99.7% -> [-84, 52]

Uri Blass · Post by **Uri Blass** » Fri Feb 08, 2008 2:56 pm

omid_dt wrote:Assume we have two programs playing a total of N games against each other, and the final result (from the first program's point of view) is:

W wins, L losses, and D draws (W + L + D = N total games).

Based on these, we would like to calculate the statistical confidence intervals for 68%, 95%, and 99.7% (corresponding to 1, 2, and 3 standard deviations).

The following is how I calculated it. My results are different from those displayed by Fritz interface when running two engine matches. I'd appreciate it if you take a look at what I did and see if there are any errors.

m = (W + D/2) / N

stdev = sqrt( (W*(1-m)^2 + L*m^2 + D*(0.5-m)^2) / (N - 1) )

In the following calculation, for 68% confidence (1 standard deviation), k=1, for 95% k=2, and for 99.7% k=3:

min = m - k * stdev / sqrt(N);
max = m + k * stdev / sqrt(N);

finally, translate these min, max intervals to Elo points (in the following formula log base is 10):

min_elo = -400 * log(1/min - 1)
max_elo = -400 * log(1/max - 1)

the mean elo is:

tp = -400 * log(1/m - 1)

My TP calculation matches that of Fritz, so it should be correct. However, the Elo ranges differ. For example, for an 840-game match with the following result:

+231 -301 =308

Fritz interface gives the following:

TP = -29 Elo
68% -> [-40, -20]
95% -> [-51, -11]
99.7% -> [-62, -3]

(shouldn't the TP value be at the middle of the [min, max] range? -29 is not the mean of the above ranges...)

For the above match, this is what my calculation produces:

TP = -29 Elo
68% -> [-39, -19]
95% -> [-48, -10]
99.7% -> [-58, -0]

I do not know how Fritz calculates it, but it is clear that your calculation is not exact, because you assume that the result has a normal distribution, and that is only an approximation.

Note that in practice the probabilities with white may differ from those with black, so you may have 6 different probabilities:

1)probability of engine A to win with white
2)probability of engine B to win with white
3)probability of engine A to draw with white
4)probability of engine B to draw with white
5)probability of engine A to lose with white
6)probability of engine B to lose with white

This is not the only problem.

If you try to be more accurate and, instead of playing random positions, give both programs white and black in the same specific positions, then the probabilities are not the same in all games.

With random positions the probability of a draw may be 40% when engine A is white, but when engine A plays some gambit it may be only 20%.

In practice I see no way to calculate a confidence interval without making some assumptions that are simply wrong.

Note also that your confidence interval does not tell you the probability that the rating is in the interval, because you build it on the assumption that you know the rating (otherwise you might get a different interval).

The calculation of stdev is based on specific probabilities of W, L, and D, but the true expected result may be different.

As an extreme example, imagine that you play 4 games and all of them are draws: D=4, W=0, L=0, N=4.

You are going to get stdev=0, but that is clearly not correct.
You cannot conclude that you have an accurate rating, because in practice you do not know that every game is going to be a draw.
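
The all-draws case can be checked with a few lines of Python. This is only an illustration of the stdev formula quoted above (the function name `sample_stdev` is invented here), comparing four draws with a +1 -1 =2 result that has the same 50% score.

```python
import math

def sample_stdev(wins, losses, draws):
    """Sample standard deviation of per-game results (1, 0.5 or 0 points)."""
    n = wins + losses + draws
    m = (wins + 0.5 * draws) / n
    sum_sq = (wins * (1 - m) ** 2
              + losses * m ** 2
              + draws * (0.5 - m) ** 2)
    return math.sqrt(sum_sq / (n - 1))

print(sample_stdev(0, 0, 4))  # four draws: every game equals the mean
print(sample_stdev(1, 1, 2))  # same 50% score, but a nonzero spread
```

With four draws every term in the sum is zero, so the interval collapses to a single point at any k, although four games cannot pin down a rating that precisely.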

This mistake gets smaller with a bigger number of games, but it still exists.
I am not proposing a correct way to solve these problems; it is clearly easier to show other people why their solution is not exactly correct than to suggest a correct solution.

Uri

omid_dt · Post by **omid_dt** » Fri Feb 08, 2008 3:27 pm

Assuming normal distribution may not be totally accurate, but it is widely used in many fields (and has resulted in some horrible results, e.g., the economic model of LTCM hedge fund, which lost billions).

But anyway, in our case, is there a better alternative?

hgm · Post by **hgm** » Fri Feb 08, 2008 3:37 pm

omid_dt wrote:For very small N values, Fritz does give an output, while my method does not, since the standard deviation is too large.

This suggests then that Fritz does not use a normal approximation of the score result, but uses the binomial distribution (or actually tri-nomial, since there are draws) instead. For small N the tails of this distribution can deviate strongly from a Gaussian (as evidenced by the fact that they will never extend beyond the limits 0% or 100%, while a Gaussian in principle is non-zero all the way to infinity). For small N a binomial distribution is also significantly asymmetric.

Best way to handle small number of games remains BayesElo, although there are some issues there too. In particular, when applied to a bunch of players differing wildly in strength, using the default prior of 2 leads to large artifacts and a much compressed rating scale, and it is better to set the prior to zero. But for round-robins between BayesElo with prior=2 is almost perfect.

omid_dt · Post by **omid_dt** » Fri Feb 08, 2008 4:04 pm

Are you aware of a reference that provides an example of applying binomial distribution to this problem?

hgm · Post by **hgm** » Fri Feb 08, 2008 4:59 pm

No, but if I had to do it, I would simply use the binomial formula

P(W wins out of N) = N!/(W! * L!) * P_win^W * P_loss^L

to calculate the probability of W wins and L losses (W+L=N) for each W, with P_win and P_loss derived from the actual result. I would then use those numbers to tabulate the cumulative distribution and interpolate it linearly, and find the 16% and 84% points (or 2.5% and 97.5%, or ...) on the W axis.

Or, to allow for draws, use

P(W,D,L) = N!/(W!*D!*L!) * P_win^W * P_draw^D * P_loss^L

for all possible combinations of W, D, L with W+D+L=N, add up those combinations that result in the same score S = W+D/2, and calculate the cumulative probability as a function of S.

If the number of games is not extremely small, linear interpolation would be good enough.
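
As a rough illustration of the trinomial recipe above, here is a Python sketch. It assumes, as described, that P_win, P_draw and P_loss are taken as the observed frequencies W/N, D/N and L/N; the function names and the exact interpolation details are choices made for this example, not a reference implementation. Log-gamma is used for the factorials so that larger N does not overflow.

```python
import math
from collections import defaultdict

def _log_pow(count, p):
    """log(p**count), treating p**0 as 1 even when p == 0."""
    if count == 0:
        return 0.0
    return count * math.log(p) if p > 0.0 else -math.inf

def score_distribution(n, p_win, p_draw, p_loss):
    """Exact distribution of S = W + D/2 over n games, sorted by score."""
    dist = defaultdict(float)
    for w in range(n + 1):
        for d in range(n - w + 1):
            l = n - w - d
            log_prob = (math.lgamma(n + 1) - math.lgamma(w + 1)
                        - math.lgamma(d + 1) - math.lgamma(l + 1)
                        + _log_pow(w, p_win) + _log_pow(d, p_draw)
                        + _log_pow(l, p_loss))
            dist[2 * w + d] += math.exp(log_prob)  # key is 2*S (an integer)
    return [(key / 2.0, dist[key]) for key in sorted(dist)]

def score_quantile(dist, q):
    """Score at cumulative probability q, interpolating the CDF linearly."""
    prev_s, prev_cum, cum = dist[0][0], 0.0, 0.0
    for s, p in dist:
        cum += p
        if cum >= q:
            if cum == prev_cum:
                return s
            frac = (q - prev_cum) / (cum - prev_cum)
            return prev_s + frac * (s - prev_s)
        prev_s, prev_cum = s, cum
    return dist[-1][0]

def score_to_elo(p):
    """Elo difference for a score fraction p with 0 < p < 1."""
    return -400.0 * math.log10(1.0 / p - 1.0)

# The 134-game example from earlier in the thread: +34 -40 =60
w, l, d = 34, 40, 60
n = w + l + d
dist = score_distribution(n, w / n, d / n, l / n)
lo = score_quantile(dist, 0.16)
hi = score_quantile(dist, 0.84)
print(score_to_elo(lo / n), score_to_elo(hi / n))
```

The double loop visits roughly N^2/2 combinations, which is fine for a few hundred games. Note that this only describes the spread of scores under the assumption that the observed frequencies are the true probabilities.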
