Error margin

Rebel · Post by **Rebel** » Fri Nov 02, 2012 11:37 pm

I did some research into how Elostat and Bayeselo calculate the Elo error margin. They use about the same formula. Some sample values:

Code: Select all

GAMES  ERROR
       MARGIN
1000     17
2000     12
4000      9
5000      8
10000     5
15000     4
20000     4
25000     3
30000     3
50000     3
100000    2

Laskos · Post by **Laskos** » Sat Nov 03, 2012 12:18 am

Just use 560/sqrt(N_games) for 2SD Elo margins. Or, if you want to be more precise, 700*sqrt(4*win_ratio*(1-win_ratio) - draw_ratio)/sqrt(N_games).

Kai
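Kai's two formulas can be sketched in Python as follows. This is only an illustration: the function names are made up here, and `score_ratio` assumes that "win_ratio" means the overall score fraction (wins plus half the draws, divided by games).

```python
import math


def elo_margin_simple(n_games):
    """Rule of thumb: 2SD Elo margin of about 560/sqrt(N)."""
    return 560.0 / math.sqrt(n_games)


def elo_margin_precise(n_games, score_ratio, draw_ratio):
    """2SD Elo margin: 700*sqrt(4*s*(1-s) - d)/sqrt(N).

    score_ratio (s) is assumed to be (wins + draws/2) / games,
    draw_ratio (d) is draws / games.
    """
    spread = 4.0 * score_ratio * (1.0 - score_ratio) - draw_ratio
    return 700.0 * math.sqrt(spread) / math.sqrt(n_games)


# Margins for a few match lengths, using the simple rule.
for n in (1000, 10000, 100000):
    print(n, round(elo_margin_simple(n), 1))
```

For a 7999-game match with 4105 points and 2452 draws, `elo_margin_precise(7999, 4105 / 7999, 2452 / 7999)` gives a margin of roughly 6.5 Elo.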

Ajedrecista · Post by **Ajedrecista** » Sat Nov 03, 2012 1:34 pm

Hi Ed:

Kai's approximation is very good. I calculated error bars using my method, but with simplifications: I fixed the draw ratio at 32% and assumed a score of 50% for each engine. I wrote this Fortran 95 code to calculate error bars from 1000 games up to 200000 in steps of 100 games; I calculated error bars for several confidence levels (95%, 98%, 99%, 99.5%, 99.8% and 99.9%), and each calculation plus writing the output to a text file (Notepad) took around 125 ms. Here is the code:

Code: Select all

program Error_bars

implicit none

integer, parameter :: parts = 2000, iterations = 60
integer :: n, i, j
real(KIND=3) :: sigma, S1, S2, x, t0, t1
real(KIND=3) :: error, three_sqrt_of_two_pi, confidence
real(KIND=3) :: a, b, h_a, h2_a, h_b, h2_b, z, h_z, h2_z, S_a, S_b, S_z, S1z, S2z, function_a, function_b, function_z

write(*,*)
write(*,*) 'Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):'
write(*,*)
read(*,*) confidence  ! Confidence for a two-sided test.
write(*,*)

t0=cpu_clock@()  ! cpu_clock@ is a compiler extension (not standard Fortran 95).

confidence = 1d-2*nint(1d2*confidence,KIND=3)  ! Rounded to 0.01%.

if ((confidence < 6.5d1) .or. (confidence > 9.99001d1)) then
  write(*,'(A)') 'LOS_and_Elo_uncertainties_calculator will not work with a confidence level outside a range of 65% - 99.9%'
  write(*,*)
  write(*,'(A)') 'Please close and try again. Press Enter to exit.'
  read(*,'()')
  stop
end if

! Pick a narrow starting bracket [a, b] for z, so that fewer iterations are needed later.
if (confidence < 7d1) then
  a = 9.345d-1; b = 1.0365d0
else if ((confidence >= 7d1) .and. (confidence < 7.5d1)) then
  a = 1.0364d0; b = 1.1504d0
else if ((confidence >= 7.5d1) .and. (confidence < 8d1)) then
  a = 1.1503d0; b = 1.2816d0
else if ((confidence >= 8d1) .and. (confidence < 8.5d1)) then
  a = 1.2815d0; b = 1.4396d0
else if ((confidence >= 8.5d1) .and. (confidence < 9d1)) then
  a = 1.4395d0; b = 1.6449d0
else if ((confidence >= 9d1) .and. (confidence < 9.25d1)) then
  a = 1.6448d0; b = 1.7805d0
else if ((confidence >= 9.25d1) .and. (confidence < 9.5d1)) then
  a = 1.7804d0; b = 1.96d0
else if ((confidence >= 9.5d1) .and. (confidence < 9.75d1)) then
  a = 1.9599d0; b = 2.2415d0
else if ((confidence >= 9.75d1) .and. (confidence < 9.9d1)) then
  a = 2.2414d0; b = 2.5759d0
else if ((confidence >= 9.9d1) .and. (confidence < 9.95d1)) then
  a = 2.5758d0; b = 2.8071d0
else if ((confidence >= 9.95d1) .and. (confidence < 9.975d1)) then
  a = 2.807d0; b = 3.0234d0
else if ((confidence >= 9.975d1) .and. (confidence < 9.984d1)) then
  a = 3.0233d0; b = 3.156d0
else if (confidence >= 9.984d1) then
  a = 3.1559d0; b = 3.2906d0
end if

three_sqrt_of_two_pi = 3d0*sqrt(2d0*acos(-1d0))

! Simpson's rule for the normal density on [0, a].
S_a = 0d0
h_a = a/parts
h2_a = h_a + h_a

x = -h_a
S1 = 0d0
do i = 1, parts-1, 2
  x = x + h2_a
  S1 = S1 + exp(-5d-1*x*x)
end do

x = 0d0
S2 = 0d0
do i = 2, parts-2, 2
  x = x + h2_a
  S2 = S2 + exp(-5d-1*x*x)
end do

S_a = h2_a*(1d0 + 4d0*S1 + 2d0*S2 + exp(-5d-1*a*a))/three_sqrt_of_two_pi  ! This line prepares a two-sided test.
function_a = S_a - 1d-2*confidence

! Simpson's rule for the normal density on [0, b].
S_b = 0d0
h_b = b/parts
h2_b = h_b + h_b

x = -h_b
S1 = 0d0
do i = 1, parts-1, 2
  x = x + h2_b
  S1 = S1 + exp(-5d-1*x*x)
end do

x = 0d0
S2 = 0d0
do i = 2, parts-2, 2
  x = x + h2_b
  S2 = S2 + exp(-5d-1*x*x)
end do

S_b = h2_b*(1d0 + 4d0*S1 + 2d0*S2 + exp(-5d-1*b*b))/three_sqrt_of_two_pi  ! This line prepares a two-sided test.
function_b = S_b - 1d-2*confidence

do j = 1, iterations  ! Solve for the parameter z by the Regula Falsi method:
  z = a - (b - a)*function_a/(function_b - function_a)
  ! The following is the original:
  ! z = a + (b - a)*abs(function_a)/(abs(function_a) + abs(function_b))
  ! But given that function_a < 0 and function_b > 0, the other form of z needs fewer operations, so it is faster.
  S_z = 0d0
  h_z = z/parts
  h2_z = h_z + h_z

  x = -h_z
  S1z = 0d0
  do i = 1, parts-1, 2
    x = x + h2_z
    S1z = S1z + exp(-5d-1*x*x)
  end do

  x = 0d0
  S2z = 0d0
  do i = 2, parts-2, 2
    x = x + h2_z
    S2z = S2z + exp(-5d-1*x*x)
  end do

  S_z = h2_z*(1d0 + 4d0*S1z + 2d0*S2z + exp(-5d-1*z*z))/three_sqrt_of_two_pi  ! This line prepares a two-sided test.
  function_z = S_z - 1d-2*confidence

  if (function_a*function_z < 0d0) then
    b = z
    function_b = function_z
  else if (function_b*function_z < 0d0) then
    a = z
    function_a = function_z
  end if
end do

open(unit=111, file='log.txt', status='unknown', action='write')
write(111,'(A,F5.2,A)') 'Confidence level: ', confidence, '% confidence.'
write(111,'(A)') 'Draw ratio: 32% (fixed).'
write(111,'(A)') 'Simplification: score = 50%.'
write(111,*)
write(111,'(A)') 'Games:      Error bar:'
write(111,*)

do n = 1000, 200000, 100
  sigma = sqrt(1.7d-1/n)  ! Draw ratio = 32%; score = 50%.
  ! The formula for calculating sigma was taken from this thread (first seen in post #22):
  ! http://immortalchess.net/forum/showthread.php?t=2237

  error = 4d2*log10((5d-1 + z*sigma)/(5d-1 - z*sigma))
  error = 1d-2*nint(1d2*error,KIND=3)  ! Rounded to 0.01 Elo.

  write(111,'(I6,A,F5.2,A,F6.2)') n, '     ± ', error, ' Elo'
end do

close(111)

t1=cpu_clock@()

write(*,'(A,I3,A)') 'End of the calculations. Time: ', nint((t1-t0)/3d6,KIND=3), ' ms.'  ! 3 GHz in my PC.
write(*,*)

end program Error_bars

I took a lot of code from another programme (LOS_and_Elo_uncertainties_calculator) that I wrote some months ago.
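For readers without a Fortran compiler, the same error bars can be sketched in Python. This is not a port of the programme above: it swaps the Simpson integration and Regula Falsi search for the standard library's inverse normal CDF, and the function names are chosen here for illustration. The simplifications are the same (score fixed at 50%, draw ratio 32% by default).

```python
import math
from statistics import NormalDist


def z_two_sided(confidence_percent):
    """z such that P(-z < N(0,1) < z) equals the given confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence_percent / 200.0)


def error_bar(n_games, confidence_percent=95.0, draw_ratio=0.32):
    """Elo error bar for a 50% score and a fixed draw ratio."""
    z = z_two_sided(confidence_percent)
    # Per-game variance of the score at 50%: 0.25 - draw_ratio/4 (0.17 for 32% draws).
    sigma = math.sqrt((0.25 - draw_ratio / 4.0) / n_games)
    return 400.0 * math.log10((0.5 + z * sigma) / (0.5 - z * sigma))


for n in (1000, 10000, 100000):
    print(n, round(error_bar(n), 2))
```

At 95% confidence this gives about 17.8 Elo for 1000 games and about 5.6 Elo for 10000 games, close to 560/sqrt(N).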

My results are very similar to yours and also to Kai's results (I rounded them to 0.01 Elo):

Error_bars_with_simplifications (0.02 MB)

Regards from Spain.

Ajedrecista.

Rebel · Post by **Rebel** » Sun Nov 04, 2012 12:23 am

Thanks for the help. I also think Kai's formula is very good, at least good enough for its purpose, which is the MATCH utility I am working on.

There is a new version that, besides the LOS, now also displays the Elo error margin and an estimated Elo performance.

Example:

Code: Select all

 2879-2452-2668 (7999)  match score 4105.0 - 3894.0 (51.3%)

 Won-loss 2879-2668 = 211 (7999 games) draws 30.7%

 LOS = 99.8%  Elo Error Margin +6 -6

 Engine MICKEY (elo 2500) vs Engine MOUSE (elo 2500) estimated TPR 2508 (+8)

Download with source code at: http://www.top-5000.nl/match.htm

lucasart · Post by **lucasart** » Sun Nov 04, 2012 10:45 am

I don't agree with your calculation. From these results, the (unbiased) empirical mean and stdev are:
* mu = E_hat(Xi) = (X1+...+Xn)/n = (#win + #draw/2) / n = 0.52669083635454
* V_hat(Xi) = [#win*(1-mu)^2 + #loss*(0-mu)^2 + #draw*(.5-mu)^2] / (n-1) = 0.16592291903455
* sigma = stdev(E_hat(Xi)) = sqrt(V_hat(Xi) / n) = 0.00455444373651
* LOS = P(N(mu,sigma) > .5) = P(N(0,1) > (.5-mu)/sigma) = P(N(0,1) < (mu-.5)/sigma) = 0.99999999769115. Don't trust too many decimals there; I used the function normsdist(x) = P(N(0,1)<x) from the Gnumeric spreadsheet program. But it is still more than 99.8%.

Note that I'm using the Gaussian approximation of the real distribution (which is a trinomial, rescaled). I haven't calculated the exact multinomial, but I doubt it would be very different.
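The four steps above can be sketched in Python. The function name and argument order are made up for this illustration, and the standard normal CDF is computed with `math.erf` rather than a spreadsheet function. Which of the three counts is the number of draws depends on how the result line "2879-2452-2668" is read, so the example takes wins, draws and losses as explicit arguments.

```python
import math


def gaussian_los(wins, draws, losses):
    """Return (mu, sigma, LOS) using the Gaussian approximation."""
    n = wins + draws + losses
    mu = (wins + 0.5 * draws) / n
    # Unbiased sample variance of the per-game score (1, 0.5 or 0).
    var = (wins * (1.0 - mu) ** 2
           + losses * (0.0 - mu) ** 2
           + draws * (0.5 - mu) ** 2) / (n - 1)
    sigma = math.sqrt(var / n)  # standard deviation of the mean score
    z = (mu - 0.5) / sigma
    los = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # P(N(0,1) < z)
    return mu, sigma, los


# Reading the line as wins-losses-draws (2668 draws), as in the figures above:
print(gaussian_los(wins=2879, draws=2668, losses=2452))
# Reading it as wins-draws-losses (2452 draws):
print(gaussian_los(wins=2879, draws=2452, losses=2668))
```

The first reading reproduces mu = 0.5267 and sigma = 0.00455; the second gives a mean score of about 0.5132 and a LOS of about 99.8%.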

Ajedrecista · Post by **Ajedrecista** » Sun Nov 04, 2012 11:09 am

Hello Ed and Lucas:

Please note that Ed wrote wins, draws and losses, not wins, losses and draws as you thought. I can tell from the extra information of 4105 points out of 7999: draws = 2*(4105 - 2879) = 2452. Furthermore, Ed wrote:

Code: Select all

Won-loss 2879-2668 = 211 (7999 games) draws 30.7%
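The arithmetic used to tell the draws from the losses can be checked with a few lines of Python. The helper name is hypothetical; the numbers are the ones from Ed's output.

```python
def draws_from_points(points, wins):
    """Each win is worth 1 point and each draw 0.5, so draws = 2*(points - wins)."""
    return round(2 * (points - wins))


games, wins, points = 7999, 2879, 4105.0
draws = draws_from_points(points, wins)
losses = games - wins - draws
print(wins, draws, losses)
print(round(100.0 * draws / games, 1))  # draw percentage
```

This gives 2452 draws and 2668 losses, consistent with the "Won-loss 2879-2668" line and the 30.7% draw ratio.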

Running LOS_and_Elo_uncertainties_calculator with 95% confidence:

Code: Select all

LOS_and_Elo_uncertainties_calculator, ® 2012.

----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------

(The input and output data is referred to the first engine).

Please write down non-negative integers.

Maximum number of games supported: 2147483647.

Write down the number of wins (up to 1825361100):

2879

Write down the number of loses (up to 1825361100):

2668

Write down the number of draws (up to 2147478100):

2452

 Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):

95

Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:

3

---------------------------------------
Elo interval for 95.00 % confidence:

Elo rating difference:      9.17 Elo

Lower rating difference:    2.83 Elo
Upper rating difference:   15.51 Elo

Lower bound uncertainty:   -6.34 Elo
Upper bound uncertainty:    6.35 Elo
Average error:        +/-   6.34 Elo

K = (average error)*[sqrt(n)] =  567.24

Elo interval: ]   2.83,   15.51[
---------------------------------------

Number of games of the match:      7999
Score: 51.32 %
Elo rating difference:    9.17 Elo
Draw ratio: 30.65 %

*********************************************************
Standard deviation:  0.9120 % of the points of the match.
*********************************************************

 Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.

-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------

LOS (taking into account draws) is always calculated, if possible.

LOS (not taking into account draws) is only calculated if wins + loses < 16001.

LOS (average value) is calculated only when LOS (not taking into account draws) is calculated.
______________________________________________

LOS:  99.77 % (taking into account draws).
LOS:  99.77 % (not taking into account draws).
LOS:  99.77 % (average value).
______________________________________________

These values of LOS are rounded up to 0.01%

End of the calculations. Approximated elapsed time:   85 ms.

Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.

Your calculations are right for the other case (+2879 -2452 =2668), which is not the case that Ed presented (+2879 -2668 =2452). Please note that when my programme prints 'standard deviation', it is referring to z*sigma, in this case more or less 1.96*sigma.
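The Elo interval in the output above can be roughly reproduced with a short Python sketch. It is an assumption that LOS_and_Elo_uncertainties_calculator works this way internally; the sketch simply takes the mean score plus or minus z*sigma and converts both ends to Elo with the usual logistic formula. Function names are invented for the example.

```python
import math
from statistics import NormalDist


def elo_from_score(score):
    """Elo difference that corresponds to an expected score fraction."""
    return 400.0 * math.log10(score / (1.0 - score))


def elo_interval(wins, losses, draws, confidence=0.95):
    """Return (elo, lower, upper) for a two-sided confidence interval."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2
           + losses * score ** 2
           + draws * (0.5 - score) ** 2) / (n - 1)
    sigma = math.sqrt(var / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    return (elo_from_score(score),
            elo_from_score(score - z * sigma),
            elo_from_score(score + z * sigma))


elo, lower, upper = elo_interval(wins=2879, losses=2668, draws=2452)
print(round(elo, 2), round(lower, 2), round(upper, 2))
```

For Ed's match this lands within a few hundredths of an Elo point of the printed 9.17, 2.83 and 15.51.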

I think that Ed's calculations are good.

Regards from Spain.

Ajedrecista.

Rebel · Post by **Rebel** » Sun Nov 04, 2012 5:07 pm

Note that the LOS is calculated without draws. This may not work well after 10 games, or 100 games for that matter, but the same is true of the match score. In the end, volume is supposed to weed out most of the randomness, and that also goes for the draw ratio, which becomes almost irrelevant.
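A LOS that ignores draws can be sketched with the common normal approximation below. Whether the MATCH utility uses exactly this formula is an assumption; the function name is made up for the example.

```python
import math


def los_without_draws(wins, losses):
    """LOS from wins and losses only: Phi((W - L) / sqrt(W + L))."""
    z_over_sqrt2 = (wins - losses) / math.sqrt(2.0 * (wins + losses))
    return 0.5 * (1.0 + math.erf(z_over_sqrt2))


print(round(100.0 * los_without_draws(2879, 2668), 1))  # LOS in percent
```

For 2879 wins and 2668 losses this evaluates to about 99.8%, in line with the LOS shown in the MATCH output.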

ernest · Post by **ernest** » Thu Nov 08, 2012 2:11 am

Laskos wrote: for 2SD Elo margins...
700*sqrt(4*win_ratio*(1-win_ratio) - draw_ratio)/sqrt(N_games).

Hi Kai,

I prefer the term "score_ratio" to win_ratio (which could be ambiguous).

Note that when the engines are of similar strength (score_ratio close to 0.5, or say 40% to 60%),
4*score_ratio*(1-score_ratio) is close to 1 and you get:

2SD_ratio = sqrt(1 - draw_ratio)/sqrt(N_games)
and of course
2SD Elo margin = 700*sqrt(1 - draw_ratio)/sqrt(N_games)

...hence your 560/sqrt(N_games) corresponds to draw_ratios close to 36%
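The step from 700*sqrt(1 - draw_ratio) to the constant 560 can be checked numerically. A small Python sketch, with function names chosen only for this example:

```python
import math


def two_sd_elo_similar_strength(n_games, draw_ratio):
    """2SD Elo margin when score_ratio is close to 0.5."""
    return 700.0 * math.sqrt(1.0 - draw_ratio) / math.sqrt(n_games)


def implied_draw_ratio(k):
    """Draw ratio for which 700*sqrt(1 - draw_ratio) equals the constant k."""
    return 1.0 - (k / 700.0) ** 2


print(implied_draw_ratio(560.0))  # draw ratio behind 560/sqrt(N)

# How close 4*s*(1-s) stays to 1 for scores between 40% and 60%.
for s in (0.40, 0.45, 0.50, 0.55, 0.60):
    print(s, round(4.0 * s * (1.0 - s), 2))
```

The factor 4*s*(1-s) only drops to 0.96 at a 40% or 60% score, which is why the simplification holds for engines of similar strength.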

hgm · Post by **hgm** » Thu Nov 08, 2012 11:03 am

Yes, that is a good approximation. I always use it too, in the form SD = 40%/sqrt(N), which amounts to the same thing once you realize that 1% = 7 Elo and the 95% confidence interval is 2*SD (40*2*7 = 560). I usually think in terms of percentages rather than Elo, because the percentage is what I see directly in the match results, and I usually want to know how many points of that match result are uncertainty (0.4*sqrt(N)).
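The percentage form of the rule converts to points and to Elo as in this minimal Python sketch. The function names are invented here, and the factor of 7 Elo per 1% is the approximation quoted above.

```python
import math


def sd_percent(n_games):
    """Standard deviation of the match score, in percent: 40%/sqrt(N)."""
    return 40.0 / math.sqrt(n_games)


def sd_points(n_games):
    """The same standard deviation expressed in game points: 0.4*sqrt(N)."""
    return 0.4 * math.sqrt(n_games)


def elo_95(n_games, elo_per_percent=7.0):
    """95% margin in Elo: 2*SD, with 1% taken as about 7 Elo."""
    return 2.0 * sd_percent(n_games) * elo_per_percent


n = 10000
print(sd_percent(n), sd_points(n), elo_95(n))
```

For 10000 games the standard deviation is 0.4% of the score, or 40 points, and the 95% margin is 5.6 Elo, the same as 560/sqrt(N).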
