Code: Select all
GAMES  ERROR MARGIN
1000 17
2000 12
4000 9
5000 8
10000 5
15000 4
20000 4
25000 3
30000 3
50000 3
100000 2
Rebel wrote:
I did some research with Elostat and Bayeselo into how they calculate the Elo error margin. They use about the same formula. Some snippets.
Just use 560/sqrt(N_games) for 2SD Elo margins. Or, if you want to be more precise, 700*sqrt(4*win_ratio*(1-win_ratio) - draw_ratio)/sqrt(N_games).
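For anyone who wants to try the two 2SD formulas quoted in this thread, here is a minimal Python sketch. It assumes that win_ratio means the score fraction, (wins + draws/2)/N_games, since that is the reading under which 4*win_ratio*(1-win_ratio) - draw_ratio equals four times the per-game variance. The function names are made up for illustration.

```python
import math


def elo_margin_rule_of_thumb(n_games):
    """2SD Elo margin from the short formula: 560/sqrt(N_games)."""
    return 560.0 / math.sqrt(n_games)


def elo_margin_2sd(score, draw_ratio, n_games):
    """2SD Elo margin from the longer formula.

    score      -- assumed to be (wins + draws/2) / n_games
    draw_ratio -- draws / n_games
    """
    spread = 4.0 * score * (1.0 - score) - draw_ratio
    return 700.0 * math.sqrt(spread) / math.sqrt(n_games)


if __name__ == "__main__":
    # Compare both formulas for an even match with 32% draws.
    for n in (1000, 10000, 100000):
        short = elo_margin_rule_of_thumb(n)
        precise = elo_margin_2sd(0.5, 0.32, n)
        print(f"{n:7d} games: {short:5.2f} (short)  {precise:5.2f} (precise)")
```

With a 50% score, the two formulas agree when the draw ratio is about 36%, because 700*sqrt(1 - 0.36) = 560.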
Code: Select all
program Error_bars
implicit none
integer, parameter :: parts = 2000, iterations = 60
integer :: n, i, j
real(KIND=3) :: sigma, S1, S2, x, t0, t1
real(KIND=3) :: error, three_sqrt_of_two_pi, confidence
real(KIND=3) :: a, b, h_a, h2_a, h_b, h2_b, z, h_z, h2_z, S_a, S_b, S_z, S1z, S2z, function_a, function_b, function_z
write(*,*)
write(*,*) 'Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):'
write(*,*)
read(*,*) confidence ! Confidence for a two-sided test.
write(*,*)
t0=cpu_clock@()
confidence = 1d-2*nint(1d2*confidence,KIND=3) ! Rounded up to 0.01%.
if ((confidence < 6.5d1) .or. (confidence > 9.99001d1)) then
write(*,'(A)') 'LOS_and_Elo_uncertainties_calculator will not work with a confidence level outside a range of 65% - 99.9%'
write(*,*)
write(*,'(A)') 'Please close and try again. Press Enter to exit.'
read(*,'()')
stop
end if
if (confidence < 7d1) then ! It splits into smaller intervals for later doing less iterations; less time is consumed.
a = 9.345d-1; b = 1.0365d0
else if ((confidence >= 7d1) .and. (confidence < 7.5d1)) then
a = 1.0364d0; b = 1.1504d0
else if ((confidence >= 7.5d1) .and. (confidence < 8d1)) then
a = 1.1503d0; b = 1.2816d0
else if ((confidence >= 8d1) .and. (confidence < 8.5d1)) then
a = 1.2815d0; b = 1.4396d0
else if ((confidence >= 8.5d1) .and. (confidence < 9d1)) then
a = 1.4395d0; b = 1.6449d0
else if ((confidence >= 9d1) .and. (confidence < 9.25d1)) then
a = 1.6448d0; b = 1.7805d0
else if ((confidence >= 9.25d1) .and. (confidence < 9.5d1)) then
a = 1.7804d0; b = 1.96d0
else if ((confidence >= 9.5d1) .and. (confidence < 9.75d1)) then
a = 1.9599d0; b = 2.2415d0
else if ((confidence >= 9.75d1) .and. (confidence < 9.9d1)) then
a = 2.2414d0; b = 2.5759d0
else if ((confidence >= 9.9d1) .and. (confidence < 9.95d1)) then
a = 2.5758d0; b = 2.8071d0
else if ((confidence >= 9.95d1) .and. (confidence < 9.975d1)) then
a = 2.807d0; b = 3.0234d0
else if ((confidence >= 9.975d1) .and. (confidence < 9.984d1)) then
a = 3.0233d0; b = 3.156d0
else if (confidence >= 9.984d1) then
a = 3.1559d0; b = 3.2906d0
end if
three_sqrt_of_two_pi = 3d0*sqrt(2d0*acos(-1d0))
S_a = 0d0
h_a = a/parts
h2_a = h_a + h_a
x = -h_a
S1 = 0d0
do i = 1, parts-1, 2
x = x + h2_a
S1 = S1 + exp(-5d-1*x*x)
end do
x = 0d0
S2 = 0d0
do i = 2, parts-2, 2
x = x + h2_a
S2 = S2 + exp(-5d-1*x*x)
end do
S_a = h2_a*(1d0 + 4d0*S1 + 2d0*S2 + exp(-5d-1*a*a))/three_sqrt_of_two_pi ! This line prepares a two-sided test.
function_a = S_a - 1d-2*confidence
S_b = 0d0
h_b = b/parts
h2_b = h_b + h_b
x = -h_b
S1 = 0d0
do i = 1, parts-1, 2
x = x + h2_b
S1 = S1 + exp(-5d-1*x*x)
end do
x = 0d0
S2 = 0d0
do i = 2, parts-2, 2
x = x + h2_b
S2 = S2 + exp(-5d-1*x*x)
end do
S_b = h2_b*(1d0 + 4d0*S1 + 2d0*S2 + exp(-5d-1*b*b))/three_sqrt_of_two_pi ! This line prepares a two-sided test.
function_b = S_b - 1d-2*confidence
do j = 1, iterations ! Solve the parameter z by Regula Falsi method:
z = a - (b - a)*function_a/(function_b - function_a)
! The following is the original:
! z = a + (b - a)*abs(function_a)/(abs(function_a) + abs(function_b))
! But given the fact that function_a < 0 and function_b > 0, the other form to calculate z requires less operations, so is faster.
S_z = 0d0
h_z = z/parts
h2_z = h_z + h_z
x = -h_z
S1z = 0d0
do i = 1, parts-1, 2
x = x + h2_z
S1z = S1z + exp(-5d-1*x*x)
end do
x = 0d0
S2z = 0d0
do i = 2, parts-2, 2
x = x + h2_z
S2z = S2z + exp(-5d-1*x*x)
end do
S_z = h2_z*(1d0 + 4d0*S1z + 2d0*S2z + exp(-5d-1*z*z))/three_sqrt_of_two_pi ! This line prepares a two-sided test.
function_z = S_z - 1d-2*confidence
if (function_a*function_z < 0d0) then
b = z
function_b = function_z
else if (function_b*function_z < 0d0) then
a = z
function_a = function_z
end if
end do
open(unit=111, file='log.txt', status='unknown', action='write')
write(111,'(A,F5.2,A)') 'Confidence level: ', confidence, '% confidence.'
write(111,'(A)') 'Draw ratio: 32% (fixed).'
write(111,'(A)') 'Simplification: score = 50%.'
write(111,*)
write(111,'(A)') 'Games: Error bar:'
write(111,*)
do n = 1000, 200000, 100
sigma = sqrt(1.7d-1/n) ! Draw ratio = 32%; score = 50%.
! The formula for calculating sigma was taken from this thread (first seen in post #22):
! http://immortalchess.net/forum/showthread.php?t=2237
error = 4d2*log10((5d-1 + z*sigma)/(5d-1 - z*sigma))
error = 1d-2*nint(1d2*error,KIND=3) ! Rounded up to 0.01 Elo.
write(111,'(I6,A,F5.2,A,F6.2)') n, ' ± ', error, ' Elo'
end do
close(111)
t1=cpu_clock@()
write(*,'(A,I3,A)') 'End of the calculations. Time: ', nint((t1-t0)/3d6,KIND=3), ' ms.' ! 3 GHz in my PC.
write(*,*)
end program Error_bars
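The same calculation can be sketched in a few lines of Python. This is not a port of the program above: instead of Simpson integration plus regula falsi, it takes the two-sided z value from the standard library's NormalDist.inv_cdf, which is a swapped-in technique. The function names and the default draw ratio of 32% are assumptions chosen to mirror the Fortran code (score fixed at 50%).

```python
import math
from statistics import NormalDist


def z_for_confidence(confidence_pct):
    """z such that P(|Z| < z) = confidence for a standard normal Z (two-sided)."""
    return NormalDist().inv_cdf(0.5 + confidence_pct / 200.0)


def error_bar(n_games, confidence_pct=95.0, draw_ratio=0.32):
    """Elo error bar for a 50% score, as in the Fortran program above."""
    z = z_for_confidence(confidence_pct)
    # Per-game variance at 50% score: 0.25 - draw_ratio/4 (0.17 for 32% draws).
    sigma = math.sqrt((0.25 - draw_ratio / 4.0) / n_games)
    return 400.0 * math.log10((0.5 + z * sigma) / (0.5 - z * sigma))


if __name__ == "__main__":
    for n in (1000, 5000, 10000, 50000, 100000):
        print(f"{n:6d} games: +/- {error_bar(n):.2f} Elo")
```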
Rebel wrote:
Thanks for the help. I also think Kai's formula is very good, at least good enough for its purpose, which is the MATCH utility I am working on.
There is a new version that, besides the LOS, now also displays the Elo error margin and an estimated Elo performance.
Example:
Code: Select all
2879-2452-2668 (7999) match score 4105.0 - 3894.0 (51.3%)
Won-loss 2879-2668 = 211 (7999 games) draws 30.7%
LOS = 99.8% Elo Error Margin +6 -6
Engine MICKEY (elo 2500) vs Engine MOUSE (elo 2500) estimated TPR 2508 (+8)
Download with source code at: http://www.top-5000.nl/match.htm
lucasart wrote:
I don't agree with your calculation. From these results, the (unbiased) empirical mean and stdev are:
* mu = E_hat(Xi) = (X1+...+Xn)/n = (#win + #draw/2) / n = 0.52669083635454
* V_hat(Xi) = [#win.(1-mu)^2 + #loss.(0-mu)^2 + #draw.(.5-mu)^2] / (n-1) = 0.16592291903455
* sigma = stdev(E_hat(Xi)) = sqrt(V_hat(Xi) / n) = 0.00455444373651
* LOS = P(N(mu,sigma) > .5) = P(N(0,1) > (.5-mu)/sigma) = P(N(0,1) < (mu-.5)/sigma) = 0.99999999769115. Don't trust too many decimals there, I used the function normsdist(x) = P(N(0,1) < x) from the Gnumeric spreadsheet program. But still more than 99.8%.
Note that I'm using the Gaussian approximation of the real distribution (which is a trinomial, rescaled). I haven't calculated the exact multinomial, but I doubt it would be very different.
Please note that Ed wrote wins, draws and losses instead of wins, losses and draws, as you thought. I know it due to the extra info of 4105 points out of 7999: draws = 2*(4105 - 2879) = 2452. Furthermore, Ed wrote:
Code: Select all
Won-loss 2879-2668 = 211 (7999 games) draws 30.7%
Code: Select all
LOS_and_Elo_uncertainties_calculator, ® 2012.
----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------
(The input and output data is referred to the first engine).
Please write down non-negative integers.
Maximum number of games supported: 2147483647.
Write down the number of wins (up to 1825361100):
2879
Write down the number of loses (up to 1825361100):
2668
Write down the number of draws (up to 2147478100):
2452
Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):
95
Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:
3
---------------------------------------
Elo interval for 95.00 % confidence:
Elo rating difference: 9.17 Elo
Lower rating difference: 2.83 Elo
Upper rating difference: 15.51 Elo
Lower bound uncertainty: -6.34 Elo
Upper bound uncertainty: 6.35 Elo
Average error: +/- 6.34 Elo
K = (average error)*[sqrt(n)] = 567.24
Elo interval: ] 2.83, 15.51[
---------------------------------------
Number of games of the match: 7999
Score: 51.32 %
Elo rating difference: 9.17 Elo
Draw ratio: 30.65 %
*********************************************************
Standard deviation: 0.9120 % of the points of the match.
*********************************************************
Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.
-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------
LOS (taking into account draws) is always calculated, if possible.
LOS (not taking into account draws) is only calculated if wins + loses < 16001.
LOS (average value) is calculated only when LOS (not taking into account draws) is calculated.
______________________________________________
LOS: 99.77 % (taking into account draws).
LOS: 99.77 % (not taking into account draws).
LOS: 99.77 % (average value).
______________________________________________
These values of LOS are rounded up to 0.01%
End of the calculations. Approximated elapsed time: 85 ms.
Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.
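The Gaussian estimate discussed in this thread (empirical mean, unbiased variance, LOS, and an Elo interval) can be sketched in Python as follows. This is only an illustration under the normal approximation; the function name and the dictionary keys are made up, and it is not the code of either utility mentioned here. Arguments are in the order wins, draws, losses.

```python
import math
from statistics import NormalDist


def match_stats(wins, draws, losses, confidence_pct=95.0):
    """Score, LOS and Elo interval under the Gaussian approximation."""
    n = wins + draws + losses
    mu = (wins + 0.5 * draws) / n
    # Unbiased empirical variance of a single game result (1, 0.5 or 0).
    var = (wins * (1.0 - mu) ** 2
           + draws * (0.5 - mu) ** 2
           + losses * (0.0 - mu) ** 2) / (n - 1)
    sigma = math.sqrt(var / n)  # standard deviation of the mean score
    los = NormalDist().cdf((mu - 0.5) / sigma)
    z = NormalDist().inv_cdf(0.5 + confidence_pct / 200.0)

    def elo(score):
        return 400.0 * math.log10(score / (1.0 - score))

    return {
        "mu": mu,
        "sigma": sigma,
        "los": los,
        "elo": elo(mu),
        "elo_low": elo(mu - z * sigma),
        "elo_high": elo(mu + z * sigma),
    }


if __name__ == "__main__":
    # Ed's result read as wins-draws-losses = 2879-2452-2668.
    for key, value in match_stats(2879, 2452, 2668).items():
        print(f"{key:8s} {value:.5f}")
```

Read as wins-draws-losses, Ed's result should land close to the 9.17 Elo difference and 99.77 % LOS printed by the calculator above. Swapping the draw and loss counts gives the higher mean of about 0.5267.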
lucasart wrote:
I don't agree with your calculation.
Note that the LOS is calculated without draws. This might not work well after 10 games, or 100 games for that matter, but the same is true of the match score. In the end, volume is supposed to weed out most of the randomness, and this also holds for the draw ratio, as it becomes almost irrelevant.
Laskos wrote:
for 2SD Elo margins...
700*sqrt(4*win_ratio*(1-win_ratio) - draw_ratio)/sqrt(N_games).
Hi Kai,