Elo uncertainties calculator.

gladius · Post by **gladius** » Tue Sep 04, 2012 6:33 pm

Ajedrecista wrote:Where 'score' is the number you call 'win' (not in percentage) and 'n' is the number of games. We agree that a 99% confidence interval in a normal distribution is more or less mu ± 2.5758*sigma; OTOH a 95% confidence interval is more or less mu ± 1.96*sigma, and here is the mistake that I report: I did the calculations with z ~ 1.96 and got different results from yours... but I find that you are using z ~ 1.69, which looks like a typo made when you wrote your programme. Please revise it or let me know if I am wrong elsewhere. I compared my results (using the draw ratio) with Marco's and they are very similar (I guess that his confidence interval was 95%), so everything seems OK.

Great catch! It was indeed wrong. I had just hacked up a little Python script to show the error bars from cutechess and, as you said, typoed the 95% confidence.
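As a quick check of the multipliers discussed above, here is a minimal sketch that uses Python's standard `statistics.NormalDist`. The function names are made up for illustration and this is not the script from the thread. It converts a two-sided confidence level into z, and a given z back into the coverage of mu ± z*sigma.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided z multiplier for a confidence level given as a fraction (0-1)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def confidence_for_z(z):
    """Two-sided coverage of mu ± z*sigma under a normal distribution."""
    return 2.0 * NormalDist().cdf(z) - 1.0


print(round(z_for_confidence(0.95), 4))  # multiplier for 95%
print(round(z_for_confidence(0.99), 4))  # multiplier for 99%
print(round(confidence_for_z(1.69), 3))  # coverage actually obtained with the typo
```

Under this model, the mistyped z ~ 1.69 covers only about 91% instead of the intended 95%.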

Ajedrecista · Post by **Ajedrecista** » Sat Sep 15, 2012 12:02 pm

Hello:

I see new posts at Pull 22 and I completely agree with Marco about the number of games... some thousands are needed because the expected Elo gain is small. I am not an expert on the issue, but I would like to say to Ryan Taker that LOS values of 85% or 90% seem low. If I am not wrong, LOS is a one-sided test that gives the probability that one engine is better than the other given the accumulated data of the match (wins, draws and losses). Given a LOS value in percentage, the probability of being wrong in that assumption is min(LOS, 100% - LOS). A LOS of 85% means that you will be wrong 15% of the time, while a LOS of 90% means that you will be wrong one time out of ten! That is too much IMHO.

I understand error bars (± Elo) as a two-sided test; I can be wrong, but I match LOS and confidence of error bars as follows:

Code: Select all

In percentage:

Confidence = 2*LOS - 100.
LOS = 50 + confidence/2.

So, LOS = 90% is like 80% confidence and LOS = 85% is like 70% confidence (1-sigma confidence ~ 68.27% confidence in a normal distribution, which is fairly low for engine testing purposes).
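The two relations above can be sketched as a pair of small helpers. The function names are hypothetical and all values are percentages, as in the post.

```python
def los_to_confidence(los):
    """Two-sided confidence (in %) matched to a one-sided LOS (in %)."""
    return 2.0 * los - 100.0


def confidence_to_los(confidence):
    """One-sided LOS (in %) matched to a two-sided confidence (in %)."""
    return 50.0 + confidence / 2.0


for los in (85.0, 90.0, 97.5):
    print(los, "->", los_to_confidence(los))
```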

The typical 95% confidence value would correspond to LOS = 97.5%. I wrote a Fortran programme some months ago: if you input the number of games, the draw ratio and the desired LOS value, it calculates the minimum score using a model based on the mean and standard deviation of a normal distribution. In view of lots of self-tests between SF versions on GitHub, I will suppose a draw ratio of 64%. Each calculation took around 17 ms or 18 ms on my computer (indeed, it takes more time, a few seconds, to start the programme and input the data than to do the calculations themselves). I hope there are no typos:

Code: Select all

                                 Draw ratio = 64%.

                 MINIMUM NUMBER OF POINTS FOR THE IMPROVED ENGINE:

Games:     LOS = 97.5%:     LOS = 99%:     LOS = 99.5%:     LOS = 99.9%:
------     ------------     ----------     ------------     ------------
 5000         2542            2549.5          2555             2565.5
 6000         3046            3054.5          3060             3072
 7000         3549.5          3558.5          3565             3578
 8000         4053            4062.5          4069.5           4083
 9000         4556            4566.5          4573.5           4588
10000         5059            5070            5077.5           5093
11000         5562            5573.5          5581.5           5597.5
12000         6064.5          6076.5          6085             6102
13000         6567.5          6580            6588.5           6606
14000         7070            7083            7091.5           7110
15000         7572.5          7585.5          7595             7614
16000         8074.5          8088.5          8098             8117.5
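For readers who want to experiment with tables like the one above, here is a rough Python sketch. It is not the Fortran programme. It assumes a normal approximation in which both engines are equally strong under the null hypothesis, so the per-game standard deviation is sqrt(1 - draw_ratio)/2. The function name `minimum_points` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def minimum_points(games, draw_ratio, los_percent):
    """Approximate points needed to reach a given one-sided LOS.

    Assumption: equal-strength null hypothesis, so each game has mean 0.5
    and variance (1 - draw_ratio) / 4.
    """
    z = NormalDist().inv_cdf(los_percent / 100.0)
    sigma_per_game = sqrt(1.0 - draw_ratio) / 2.0
    return games / 2.0 + z * sigma_per_game * sqrt(games)


for games in (5000, 10000, 16000):
    print(games, round(minimum_points(games, 0.64, 97.5), 2))
```

The unrounded results land close to the table entries (for example about 2541.6 for 5000 games at LOS = 97.5%); a difference of half a point can come from rounding or from details of the original model that are not stated in the post.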

I hope that this info will be helpful. Marco said that he will release a new official version if the current development version is improved in 10 Elo or more. Good luck!

Regards from Spain.

Ajedrecista.

Ajedrecista · Post by **Ajedrecista** » Wed Sep 19, 2012 7:21 pm

Hello:


Sorry for this long post. Taking a look at Pull 23 on GitHub, I think that the results of the last test (+2359 -2138 =8887) are better than Gary thinks. Gary computes standard deviations without the draw ratio, which enlarges the error bars, so his error bars are more conservative and IMHO force the improved engine to score better than needed. Let me explain a little more, starting from the relation in my previous post:

Code: Select all

In percentage: 

Confidence = 2*LOS - 100. 
LOS = 50 + confidence/2.

It is my own assumption (of course I can be wrong!): I call confidence the result of a two-sided test, while LOS is a one-sided test. Gary and I agree on the LOS for this test (LOS ~ 99.95%). According to my reasoning, a confidence level of 2*99.95 - 100 = 99.9% should give an Elo interval of more or less [0, 2*|error bar|] in favour of the best engine. I take my error bars within this interval, although Gary's error bars are easily related to mine:

Code: Select all

(For a given confidence interval).

If the match is near 50%-50% and lots of games are played (small standard deviations):

|My error bars| ~  sqrt(1 - draw_ratio)*|Gary's error bars|

I do the attempt, writing 99.9% confidence in my programme:

Code: Select all

LOS_and_Elo_uncertainties_calculator, ® 2012.

----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------

(The input and output data is referred to the first engine).

Please write down non-negative integers.

Maximum number of games supported: 2147483647.

Write down the number of wins (up to 1825361100):

2359

Write down the number of loses (up to 1825361100):

2138

Write down the number of draws (up to 2147479150):

8887

 Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):

99.9

Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:

3

---------------------------------------
Elo interval for 99.90 % confidence:

Elo rating difference:      5.74 Elo

Lower rating difference:    0.01 Elo
Upper rating difference:   11.47 Elo

Lower bound uncertainty:   -5.73 Elo
Upper bound uncertainty:    5.73 Elo
Average error:        +/-   5.73 Elo

K = (average error)*[sqrt(n)] =  662.66

Elo interval: ]   0.01,   11.47[
---------------------------------------

Number of games of the match:     13384
Score: 50.83 %
Elo rating difference:    5.74 Elo
Draw ratio: 66.40 %

*********************************************************
Standard deviation:  0.8240 % of the points of the match.
*********************************************************

 Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.

-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------

LOS (taking into account draws) is always calculated, if possible.

LOS (not taking into account draws) is only calculated if wins + loses < 16001.

LOS (average value) is calculated only when LOS (not taking into account draws) is calculated.
______________________________________________

LOS:  99.95 % (taking into account draws).
LOS:  99.95 % (not taking into account draws).
LOS:  99.95 % (average value).
______________________________________________

These values of LOS are rounded up to 0.01%

End of the calculations. Approximated elapsed time:   79 ms.

Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.

In fact, 0.01 is almost 0 (the difference is due to rounding, because LOS is not exactly 99.95% and I approximate the error function with the composite Simpson's rule) and 11.47/5.73 ~ 2 (for the same reason). With 95% confidence, I get ~ ± 3.41 Elo and an Elo interval of ~ (2.33, 9.15).
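The interval quoted above can be approximated with a short Python sketch. It is an assumption about the model, not the actual programme: the score is treated as normally distributed with a per-game variance that takes draws into account, and the score bounds are mapped to Elo with the logistic formula. All function names are illustrative.

```python
from math import log10, sqrt
from statistics import NormalDist


def score_to_elo(score):
    """Logistic Elo difference for an expected score strictly between 0 and 1."""
    return 400.0 * log10(score / (1.0 - score))


def elo_interval(wins, losses, draws, confidence):
    """Return (lower, central, upper) Elo for a two-sided confidence (fraction)."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance with draws counted as half points.
    variance = (wins + 0.25 * draws) / n - score ** 2
    stderr = sqrt(variance / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return (score_to_elo(score - z * stderr),
            score_to_elo(score),
            score_to_elo(score + z * stderr))


low, mid, high = elo_interval(2359, 2138, 8887, 0.95)
print(f"{mid:.2f} Elo, interval ({low:.2f}, {high:.2f})")
```

With the match data from this post, the sketch gives values in line with the 5.74 Elo difference and the ~ (2.33, 9.15) interval mentioned above.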

glinscott wrote:I'll let the modified version keep running for a while, so far it's still within error bars:
Wins: 2359 Losses: 2138 Draws: 8887
LOS: 99.950931%
ELO: 5.711500 +- 99%: 7.753670 95%: 5.889452
Win%: 50.821877 +- 99%: 1.114947 95%: 0.847014

Indeed, my ± 3.41 Elo at 95% is consistent with Gary's ± 5.89 Elo: 3.41/sqrt(1 - draw_ratio) ~ 3.41/sqrt(1 - 8887/13384) ~ 3.41/sqrt(1 - 0.664) ~ 5.88.
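The sqrt(1 - draw_ratio) factor can be checked numerically. The sketch below rests on an assumption about what "without the draw ratio" means: each game is treated as a win/loss trial with success probability equal to the score. The draw-aware version counts draws as half points. Function names are hypothetical.

```python
from math import sqrt


def stderr_with_draws(wins, losses, draws):
    """Standard error of the score when draws count as half points."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    return sqrt(((wins + 0.25 * draws) / n - score ** 2) / n)


def stderr_without_draws(wins, losses, draws):
    """Standard error of the score under a plain binomial model (assumed)."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    return sqrt(score * (1.0 - score) / n)


w, l, d = 2359, 2138, 8887
ratio = stderr_with_draws(w, l, d) / stderr_without_draws(w, l, d)
print(ratio, sqrt(1.0 - d / (w + l + d)))
```

For a match close to 50%-50%, the two printed numbers should nearly coincide, which is the relation used above.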

IMHO this test says that the improved version is at least 2.33 Elo better than the other version with 95% confidence under those conditions (self-test, time control...). I honestly think that LOS in a match between two engines is a very important value. Again according to my reasoning, a LOS value of 99.95% means that the probability of the weaker engine in the match actually being the better one is min(99.95%, 100% - 99.95%) ~ 0.05%, or around one chance in two thousand, if the match is trustworthy.
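For reference, a widely used approximation of LOS that ignores draws can be written in a few lines of Python. This is a sketch of that common formula, not necessarily what either programme in this thread computes.

```python
from math import erf, sqrt


def los(wins, losses):
    """Likelihood of superiority from decisive games only (normal approximation)."""
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))


print(f"LOS: {100.0 * los(2359, 2138):.2f} %")
```

For +2359 -2138 this gives a value that rounds to the 99.95% reported above.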

The best version scored 2359 + 8887/2 = 6802.5 points in 2359 + 2138 + 8887 = 13384 games. If I run my programme Minimum_score_for_no_regression with 13384 games and a draw ratio of 66.4%:

Code: Select all

                                Draw ratio = 66.4%. 

                 MINIMUM NUMBER OF POINTS FOR THE IMPROVED ENGINE: 

Games:     LOS = 97.5%:     LOS = 99%:     LOS = 99.5%:     LOS = 99.9%: 
------     ------------     ----------     ------------     ------------ 
13384          6758            6770           6778.5            6796

It is only my model, although I think that it is accurate enough (without wanting to sound arrogant, of course!). In these cases, LOS = {97.5%, 99%, 99.5%, 99.9%} is equivalent to confidence = {95%, 98%, 99%, 99.8%}. After all, IMHO Gary's test shows an Elo gain greater than the Elo uncertainty under my model, except for extremely high confidence levels, obviously. I guess that the changes made in recent pulls are good and bring a few Elo, which are always welcome. Please keep up your good work! SF 2.2.2 has a rating of 2972 at IPON, so I expect that the next version of SF (the last release 2.3 or a future 2.3.1 including all these changes) will reach at least 2985 (of course keeping Shredder 12's rating at 2800). Good luck!

Regards from Spain.

Ajedrecista.
