Hello:
Ajedrecista wrote:
Hello:
I see new posts at Pull 22, and I completely agree with Marco about the number of games: several thousand are needed because the expected Elo gain is small. I am not an expert on the issue, but I would like to say to Ryan Taker that LOS values of 85% or 90% seem low. If I am not wrong, LOS is a one-sided test that gives the probability that one engine is better than the other, given the accumulated data of the match (wins, draws and losses). For a known LOS value, the probability (in percentage) of being wrong in that assumption is min(LOS, 100% - LOS). A LOS value of 85% means that you will be wrong 15% of the time, while a LOS value of 90% means that you will be wrong one time out of ten! That is too much, IMHO.
I understand error bars (± Elo) as a two-sided test; I may be wrong, but I relate LOS to the confidence level of the error bars as follows:
Code:
In percentage:
Confidence = 2*LOS - 100.
LOS = 50 + confidence/2.
So, LOS = 90% is like 80% confidence and LOS = 85% is like 70% confidence (1-sigma confidence ~ 68.27% confidence in a normal distribution, which is fairly low for engine testing purposes).
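The conversion above can be written as a tiny Python sketch. The function names are my own, and `z_from_los` assumes the normal model mentioned in the post; it only shows how many standard deviations a given LOS corresponds to.

```python
from statistics import NormalDist


def confidence_from_los(los_pct: float) -> float:
    """Two-sided confidence (%) matched to a one-sided LOS (%)."""
    return 2.0 * los_pct - 100.0


def los_from_confidence(confidence_pct: float) -> float:
    """One-sided LOS (%) matched to a two-sided confidence (%)."""
    return 50.0 + confidence_pct / 2.0


def z_from_los(los_pct: float) -> float:
    """Standard deviations whose one-sided normal area equals the LOS."""
    return NormalDist().inv_cdf(los_pct / 100.0)


if __name__ == "__main__":
    for los in (85.0, 90.0, 97.5):
        print(los, confidence_from_los(los), round(z_from_los(los), 3))
```

With LOS = 97.5% the z value is about 1.96, the same multiplier normally used for two-sided 95% error bars.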
The typical 95% confidence value would correspond to LOS = 97.5%. I wrote a Fortran programme some months ago: if you input the number of games, the draw ratio and the desired LOS value, it calculates the minimum score using a model based on the mean and standard deviation of a normal distribution. In view of lots of self-tests between SF versions on GitHub, I will suppose a draw ratio of 64%. Each calculation took around 17 ms or 18 ms on my computer (indeed, it takes more time, a few seconds, to start the programme and input the data than to do the calculations themselves). I hope there are no typos:
Code:
Draw ratio = 64%.
MINIMUM NUMBER OF POINTS FOR THE IMPROVED ENGINE:
Games:   LOS = 97.5%:   LOS = 99%:   LOS = 99.5%:   LOS = 99.9%:
------   ------------   ----------   ------------   ------------
5000     2542           2549.5       2555           2565.5
6000     3046           3054.5       3060           3072
7000     3549.5         3558.5       3565           3578
8000     4053           4062.5       4069.5         4083
9000     4556           4566.5       4573.5         4588
10000    5059           5070         5077.5         5093
11000    5562           5573.5       5581.5         5597.5
12000    6064.5         6076.5       6085           6102
13000    6567.5         6580         6588.5         6606
14000    7070           7083         7091.5         7110
15000    7572.5         7585.5       7595           7614
16000    8074.5         8088.5       8098           8117.5
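For readers who want to experiment with such thresholds, here is a minimal Python sketch. It is not the Fortran programme; it is my own guess at a comparable normal model. The assumptions are that the per-game variance is score*(1 - score) - draw_ratio/4, and that the answer is the smallest total, in half-point steps, whose z-score reaches the requested LOS. The name `min_points` is hypothetical.

```python
import math
from statistics import NormalDist


def min_points(games: int, draw_ratio: float, los_pct: float) -> float:
    """Smallest total score (in half-point steps) reaching the given LOS.

    Assumed model: the score fraction is normally distributed with
    per-game variance score*(1 - score) - draw_ratio/4.
    """
    z = NormalDist().inv_cdf(los_pct / 100.0)
    half_points = games  # start at a 50% score, counted in half-points
    while True:
        score = half_points / (2.0 * games)
        if score > 1.0 - draw_ratio / 2.0:
            # A higher score is impossible with this many draws.
            raise ValueError("no reachable score gives the requested LOS")
        variance = score * (1.0 - score) - draw_ratio / 4.0
        if variance > 0.0 and (score - 0.5) * math.sqrt(games / variance) >= z:
            return half_points / 2.0
        half_points += 1


if __name__ == "__main__":
    for n in (5000, 10000, 16000):
        print(n, [min_points(n, 0.64, los) for los in (97.5, 99.0, 99.5, 99.9)])
```

Under these assumptions the sketch gives 2542 for 5000 games at LOS = 97.5% and 5070 for 10000 games at LOS = 99%, the same as the table; other cells are worth checking before relying on it.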
I hope that this info will be helpful. Marco said that he will release a new official version if the current development version improves by 10 Elo or more. Good luck!
Regards from Spain.
Ajedrecista.
Sorry for this long post. Taking a look at Pull 23 on GitHub, I think that the results of the last test (+2359 -2138 =8887) are better than Gary thinks. Gary computes standard deviations without the draw ratio, and this choice enlarges the error bars, so his error bars are more conservative; IMHO this forces the improved engine to score better than needed. Let me explain a little more, looking at my quote:
Code:
In percentage:
Confidence = 2*LOS - 100.
LOS = 50 + confidence/2.
This is my own assumption (of course I can be wrong!): I call "confidence" the result of a two-sided test, while LOS is a one-sided test. Gary and I agree on the LOS for this test (LOS ~ 99.95%). According to my reasoning, a confidence level of 2*99.95 - 100 = 99.9% should give an Elo interval of more or less [0, 2*|error bar|] in favour of the better engine. I take my error bars from this interval, although Gary's error bars are easily related to mine:
Code:
(For a given confidence interval).
If the match is near 50%-50% and lots of games are played (small standard deviations):
|My error bars| ~ sqrt(1 - draw_ratio)*|Gary's error bars|
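Where the sqrt(1 - draw_ratio) factor comes from can be sketched in Python. The assumption here is that the per-game variance is score*(1 - score) - draw_ratio/4 when draws are counted, and score*(1 - score) when every game is treated as a win or a loss; the function names are hypothetical.

```python
import math


def per_game_sd(score: float, draw_ratio: float) -> float:
    """Per-game standard deviation of the result (win = 1, draw = 0.5, loss = 0)."""
    return math.sqrt(score * (1.0 - score) - draw_ratio / 4.0)


def error_bar_ratio(score: float, draw_ratio: float) -> float:
    """Error bar with draws divided by the error bar that ignores draws."""
    return per_game_sd(score, draw_ratio) / per_game_sd(score, 0.0)


if __name__ == "__main__":
    # At exactly 50% the ratio is sqrt((0.25 - d/4) / 0.25) = sqrt(1 - d).
    print(error_bar_ratio(0.5, 0.664), math.sqrt(1.0 - 0.664))
    # Slightly away from 50% the ratio changes only a little.
    print(error_bar_ratio(0.5083, 0.664))
```

The ratio is exact only at a 50% score, which is why the relation is stated for matches near 50%-50%.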
I try it, entering 99.9% confidence in my programme:
Code:
LOS_and_Elo_uncertainties_calculator, ® 2012.
----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------
(The input and output data is referred to the first engine).
Please write down non-negative integers.
Maximum number of games supported: 2147483647.
Write down the number of wins (up to 1825361100):
2359
Write down the number of loses (up to 1825361100):
2138
Write down the number of draws (up to 2147479150):
8887
Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):
99.9
Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:
3
---------------------------------------
Elo interval for 99.90 % confidence:
Elo rating difference: 5.74 Elo
Lower rating difference: 0.01 Elo
Upper rating difference: 11.47 Elo
Lower bound uncertainty: -5.73 Elo
Upper bound uncertainty: 5.73 Elo
Average error: +/- 5.73 Elo
K = (average error)*[sqrt(n)] = 662.66
Elo interval: ] 0.01, 11.47[
---------------------------------------
Number of games of the match: 13384
Score: 50.83 %
Elo rating difference: 5.74 Elo
Draw ratio: 66.40 %
*********************************************************
Standard deviation: 0.8240 % of the points of the match.
*********************************************************
Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.
-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------
LOS (taking into account draws) is always calculated, if possible.
LOS (not taking into account draws) is only calculated if wins + loses < 16001.
LOS (average value) is calculated only when LOS (not taking into account draws) is calculated.
______________________________________________
LOS: 99.95 % (taking into account draws).
LOS: 99.95 % (not taking into account draws).
LOS: 99.95 % (average value).
______________________________________________
These values of LOS are rounded up to 0.01%
End of the calculations. Approximated elapsed time: 79 ms.
Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.
In fact, 0.01 is almost 0 (the difference is due to rounding, because LOS is not exactly 99.95% and I must approximate the error function with the composite Simpson's rule) and 11.47/5.73 ~ 2 (the same explanation as before). With 95% confidence, I get ~ ± 3.41 Elo and an Elo interval of ~ (2.33, 9.15).
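A minimal Python sketch of such an interval follows. It is not the programme whose output is shown; it assumes a normal model with per-game variance score*(1 - score) - draw_ratio/4 and the usual logistic Elo formula, and all names are hypothetical.

```python
import math
from statistics import NormalDist


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction (logistic Elo model)."""
    return 400.0 * math.log10(score / (1.0 - score))


def elo_interval(wins: int, losses: int, draws: int, confidence_pct: float):
    """Return (lower, central, upper) Elo for a two-sided confidence level."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    draw_ratio = draws / games
    # Standard deviation of the mean score, draws taken into account.
    sd_mean = math.sqrt((score * (1.0 - score) - draw_ratio / 4.0) / games)
    # Two-sided level, so use the quantile at 50% + confidence/2.
    z = NormalDist().inv_cdf(0.5 + confidence_pct / 200.0)
    margin = z * sd_mean
    return (elo_from_score(score - margin),
            elo_from_score(score),
            elo_from_score(score + margin))


if __name__ == "__main__":
    for level in (95.0, 99.9):
        low, mid, high = elo_interval(2359, 2138, 8887, level)
        print(f"{level}%: {low:.2f} < {mid:.2f} < {high:.2f}")
```

Under these assumptions the sketch lands within a couple of hundredths of an Elo of the figures quoted above for +2359 -2138 =8887, at both 95% and 99.9% confidence.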
glinscott wrote:
I'll let the modified version keep running for a while, so far it's still within error bars:
Wins: 2359 Losses: 2138 Draws: 8887
LOS: 99.950931%
ELO: 5.711500 +- 99%: 7.753670 95%: 5.889452
Win%: 50.821877 +- 99%: 1.114947 95%: 0.847014
Indeed, 3.41/sqrt(1 - draw_ratio) ~ 3.41/sqrt(1 - 8887/13384) ~ 3.41/sqrt(1 - 0.664) ~ 5.88, which is very close to Gary's 95% value of ~ 5.89 Elo.
IMHO this test says that the improved version is at least 2.33 Elo better than the other version with 95% confidence under those conditions (self test, time control...). I honestly think that LOS in a match between two engines is a very important value. Again according to my reasoning, a LOS value of 99.95% means that the probability of the weaker engine in the match actually being the better one is min(99.95%, 100% - 99.95%) ~ 0.05%, or around one chance in two thousand, if the match is trustworthy.
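LOS itself can be estimated with a short Python sketch. Both functions use a normal approximation and are my own assumption about how such a value might be computed, not the code of either programme: one ignores draws and uses (wins - losses)/sqrt(wins + losses), the other uses the score and the draw ratio.

```python
import math
from statistics import NormalDist


def los_without_draws(wins: int, losses: int) -> float:
    """LOS from decisive games only, normal approximation."""
    z = (wins - losses) / math.sqrt(wins + losses)
    return NormalDist().cdf(z)


def los_with_draws(wins: int, losses: int, draws: int) -> float:
    """LOS from the score fraction, with draws in the variance."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    variance = score * (1.0 - score) - (draws / games) / 4.0
    z = (score - 0.5) * math.sqrt(games / variance)
    return NormalDist().cdf(z)


if __name__ == "__main__":
    print(f"{100 * los_without_draws(2359, 2138):.2f} %")
    print(f"{100 * los_with_draws(2359, 2138, 8887):.2f} %")
```

For +2359 -2138 =8887 both approximations come out at about 99.95%, in line with the values reported in this thread.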
The better version scored 2359 + 8887/2 = 6802.5 points in 2359 + 2138 + 8887 = 13384 games; if I run my programme Minimum_score_for_no_regression with 13384 games and a draw ratio of 66.4%:
Code:
Draw ratio = 66.4%.
MINIMUM NUMBER OF POINTS FOR THE IMPROVED ENGINE:
Games:   LOS = 97.5%:   LOS = 99%:   LOS = 99.5%:   LOS = 99.9%:
------   ------------   ----------   ------------   ------------
13384    6758           6770         6778.5         6796
It is only my model, although I think that it is accurate enough (without wishing to sound arrogant, of course!). In these cases, LOS = {97.5%, 99%, 99.5%, 99.9%} are equivalent to confidence = {95%, 98%, 99%, 99.8%}. After all, IMHO Gary's test shows an Elo gain greater than the Elo uncertainties using my model, except for extremely high confidence levels, obviously. I guess that the changes made in recent pulls are good and bring a few Elo, which are always welcome. Please keep up your good work! SF 2.2.2 has a rating of 2972 at IPON, so I expect that the next version of SF (the latest release 2.3 or a future 2.3.1 including all these changes) will reach at least 2985 (assuming, of course, that Shredder 12's rating stays at 2800). Good luck!
Regards from Spain.
Ajedrecista.