Blitz 10s per game, ponder ON
1 Critter 1.6 64bit +2 +54/=93/-53 50.25% 100.5/200
2 Critter 1.4a 64bit SSE4 -2 +53/=93/-54 49.75% 99.5/200
Blitz 10s per game, ponder OFF
1 Critter 1.6 64bit +23 +56/=101/-43 53.25% 106.5/200
2 Critter 1.4a 64bit SSE4 -23 +43/=101/-56 46.75% 93.5/200
Games played at chess960
Fritz 13 GUI
1 core used
i7 980x
windows 7
Best Regards
Critter 1.6  Critter 1.4a ponder ON/OFF

 Posts: 2042
 Joined: Wed Mar 08, 2006 8:30 pm
Re: Critter 1.6  Critter 1.4a ponder ON/OFF
MM wrote:
Code: Select all
1 Critter 1.6 64bit +23 +56/=101/-43 53.25% 106.5/200
2 Critter 1.4a 64bit SSE4 -23 +43/=101/-56 46.75% 93.5/200
95% error bar is ±34 Elo

 Posts: 766
 Joined: Sun Oct 16, 2011 11:25 am
Re: Critter 1.6  Critter 1.4a ponder ON/OFF
Hi all,
just for the record.
utentePC, Blitz 1m ponder ON, 1 core, 3,33 ghz, no tablebases.
1 Critter 1.4a 64bit SSE4 +110/=280/-110 50.00% 250.0/500 3036.00
2 Critter 1.6 64bit +110/=280/-110 50.00% 250.0/500 3036.00
Best Regards
MM

 Posts: 766
 Joined: Sun Oct 16, 2011 11:25 am
Re: Critter 1.6  Critter 1.4a ponder ON/OFF
MM wrote:
Hi all,
just for the record.
utentePC, Blitz 1m ponder ON, 1 core, 3,33 ghz, no tablebases.
1 Critter 1.4a 64bit SSE4 +110/=280/-110 50.00% 250.0/500 3036.00
2 Critter 1.6 64bit +110/=280/-110 50.00% 250.0/500 3036.00
Best Regards
utentePC, Blitz 1m ponder OFF
1 Critter 1.6 64bit +23 +125/=283/-92 53.30% 266.5/500
2 Critter 1.4a 64bit SSE4 -23 +92/=283/-125 46.70% 233.5/500
MM

 Posts: 2042
 Joined: Wed Mar 08, 2006 8:30 pm
Re: Critter 1.6  Critter 1.4a ponder ON/OFF
MM wrote:
Code: Select all
1 Critter 1.6 64bit +23 +125/=283/-92 53.30% 266.5/500
2 Critter 1.4a 64bit SSE4 -23 +92/=283/-125 46.70% 233.5/500
95% error bar is now ±21 Elo, which means that there is more than 95% probability that with ponder OFF, Critter 1.6 is stronger than Critter 1.4a SSE4.
However, it cannot yet be said that the (ponder OFF) and (ponder ON) distributions are distinct, with 95% probability, the global result (summation ON+OFF) being:
Code: Select all
1 Critter 1.6 64bit +15 +235/=563/-202 51.65% 516.5/1000
2 Critter 1.4a 64bit SSE4 -15 +202/=563/-235 48.35% 483.5/1000

 Posts: 1988
 Joined: Wed Jul 13, 2011 9:04 pm
 Location: Madrid, Spain.
Re: Critter 1.6  Critter 1.4a, ponder ON/OFF.
Hello Ernest:
ernest wrote:
MM wrote:
Code: Select all
1 Critter 1.6 64bit +23 +125/=283/-92 53.30% 266.5/500
2 Critter 1.4a 64bit SSE4 -23 +92/=283/-125 46.70% 233.5/500
95% error bar is now ±21 Elo, which means that there is more than 95% probability that with ponder OFF, Critter 1.6 is stronger than Critter 1.4a SSE4.
However, it cannot yet be said that the (ponder OFF) and (ponder ON) distributions are distinct, with 95% probability, the global result (summation ON+OFF) being:
Code: Select all
1 Critter 1.6 64bit +15 +235/=563/-202 51.65% 516.5/1000
2 Critter 1.4a 64bit SSE4 -15 +202/=563/-235 48.35% 483.5/1000
This means that it cannot be said (with 95% probability), as of yet, that compared to Critter 1.4a SSE4, Critter 1.6 performs better with ponder OFF than with Ponder ON.
I suppose that all these error bars were obtained with the great BayesElo. I ran my own small programme, just to compare my results (both outputs follow).
It seems that my programme more or less agrees with BayesElo in the first match, which is an achievement! However, in the second match, the rating difference is ~ 11.5 Elo, not 15 Elo... which is more likely the error bar. I notice that you are adding two 500-game matches, so I do not know if I can simply use +235 =563 -202 as one 1000-game match or not.
Code: Select all
LOS_and_Elo_uncertainties_calculator, ® 2012.

Calculation of Elo uncertainties in a match between two engines:

(The input and output data is referred to the first engine).
Please write down nonnegative integers.
Write down the number of wins:
125
Write down the number of loses:
92
Write down the number of draws:
283
Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:
3
***************************************
1-sigma confidence ~ 68.27% confidence.
2-sigma confidence ~ 95.45% confidence.
3-sigma confidence ~ 99.73% confidence.
***************************************

Elo interval for 1-sigma confidence:
Elo rating difference: 22.96 Elo
Lower rating difference: 12.75 Elo
Upper rating difference: 33.22 Elo
Lower bound uncertainty: 10.21 Elo
Upper bound uncertainty: 10.25 Elo
Average error: +/- 10.23 Elo
K = (average error)*[sqrt(n)] = 228.80
Elo interval: ] 12.75, 33.22[

Elo interval for 2-sigma confidence:
Elo rating difference: 22.96 Elo
Lower rating difference: 2.56 Elo
Upper rating difference: 43.53 Elo
Lower bound uncertainty: 20.40 Elo
Upper bound uncertainty: 20.56 Elo
Average error: +/- 20.48 Elo
K = (average error)*[sqrt(n)] = 458.00
Elo interval: ] 2.56, 43.53[

Elo interval for 3-sigma confidence:
Elo rating difference: 22.96 Elo
Lower rating difference: -7.62 Elo
Upper rating difference: 53.91 Elo
Lower bound uncertainty: 30.59 Elo
Upper bound uncertainty: 30.95 Elo
Average error: +/- 30.77 Elo
K = (average error)*[sqrt(n)] = 688.01
Elo interval: ] -7.62, 53.91[

Number of games of the match: 500
Score: 53.30 %
Elo rating difference: 22.96 Elo
Draw ratio: 56.60 %
**********************************************
1 sigma: 1.4657 % of the points of the match.
2 sigma: 2.9314 % of the points of the match.
3 sigma: 4.3970 % of the points of the match.
**********************************************
Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.

Calculation of likelihood of superiority (LOS) in a one-sided test:

LOS: 98.78 %
This value of LOS is rounded up to 0.01%
End of the calculations. Approximated elapsed time: 50 ms.
Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.
Code: Select all
LOS_and_Elo_uncertainties_calculator, ® 2012.

Calculation of Elo uncertainties in a match between two engines:

(The input and output data is referred to the first engine).
Please write down nonnegative integers.
Write down the number of wins:
235
Write down the number of loses:
202
Write down the number of draws:
563
Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:
3
***************************************
1-sigma confidence ~ 68.27% confidence.
2-sigma confidence ~ 95.45% confidence.
3-sigma confidence ~ 99.73% confidence.
***************************************

Elo interval for 1-sigma confidence:
Elo rating difference: 11.47 Elo
Lower rating difference: 4.21 Elo
Upper rating difference: 18.74 Elo
Lower bound uncertainty: 7.26 Elo
Upper bound uncertainty: 7.27 Elo
Average error: +/- 7.26 Elo
K = (average error)*[sqrt(n)] = 229.67
Elo interval: ] 4.21, 18.74[

Elo interval for 2-sigma confidence:
Elo rating difference: 11.47 Elo
Lower rating difference: -3.04 Elo
Upper rating difference: 26.02 Elo
Lower bound uncertainty: 14.51 Elo
Upper bound uncertainty: 14.55 Elo
Average error: +/- 14.53 Elo
K = (average error)*[sqrt(n)] = 459.55
Elo interval: ] -3.04, 26.02[

Elo interval for 3-sigma confidence:
Elo rating difference: 11.47 Elo
Lower rating difference: -10.30 Elo
Upper rating difference: 33.33 Elo
Lower bound uncertainty: 21.77 Elo
Upper bound uncertainty: 21.86 Elo
Average error: +/- 21.81 Elo
K = (average error)*[sqrt(n)] = 689.83
Elo interval: ] -10.30, 33.33[

Number of games of the match: 1000
Score: 51.65 %
Elo rating difference: 11.47 Elo
Draw ratio: 56.30 %
**********************************************
1 sigma: 1.0439 % of the points of the match.
2 sigma: 2.0878 % of the points of the match.
3 sigma: 3.1318 % of the points of the match.
**********************************************
Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.

Calculation of likelihood of superiority (LOS) in a one-sided test:

LOS: 94.30 %
This value of LOS is rounded up to 0.01%
End of the calculations. Approximated elapsed time: 47 ms.
Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.
Regards from Spain.
Ajedrecista.

 Posts: 2042
 Joined: Wed Mar 08, 2006 8:30 pm
Re: Critter 1.6  Critter 1.4a, ponder ON/OFF.
Ajedrecista wrote: I suppose that all these error bars were obtained with the great BayesElo.
Hi Jesus,
Not at all, I compute these ratings and error bars by hand (sometimes using a hand calculator), using my basic knowledge (school/university) in statistics.
Here we have a trinomial distribution (win-loss-draw), and when the result is close to 50%, you have SD (sigma) = sqrt(W+L)/(2N), as a fraction of the points (the formula is a little more complicated if the result is not close to 50%).
So I find the 2SD error bar (95% probability) of
Code: Select all
P OFF
1 Critter 1.6 64bit +23 +125/=283/-92 53.30% 266.5/500
2 Critter 1.4a 64bit SSE4 -23 +92/=283/-125 46.70% 233.5/500
and multiplying that % by 7 (valid for low %) you get the Elo error bar = 20.6 rounded to 21.
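The arithmetic behind that hand calculation can be sketched in a few lines of Python. This is only an illustration of the rule of thumb described above (SD = sqrt(W+L)/(2N) near 50%, then about 7 Elo per percent); the function name and the `elo_per_percent` parameter are made up for the example.

```python
import math

def two_sd_error_bar_elo(wins, losses, draws, elo_per_percent=7.0):
    """Approximate 95% (2 SD) Elo error bar for a result close to 50%."""
    n = wins + losses + draws
    sd = math.sqrt(wins + losses) / (2 * n)   # SD of the score, as a fraction
    two_sd_percent = 2 * sd * 100             # 2 SD, in percent of the points
    return two_sd_percent * elo_per_percent   # rule of thumb: 1% is about 7 Elo

# Ponder OFF match: +125 =283 -92 over 500 games
print(round(two_sd_error_bar_elo(125, 92, 283), 1))
```

For the ponder OFF match the 2 SD value is about 2.95% of the points, which gives the 20.6 Elo quoted above.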
Now if we want to see if Ponder OFF or ON makes a significant difference in a match between Critter 1.6 and Critter 1.4a SSE4, we have to consider the global (sum) distribution
Code: Select all
P ON+OFF
1 Critter 1.6 64bit +12 +235/=563/-202 51.65% 516.5/1000
2 Critter 1.4a 64bit SSE4 -12 +202/=563/-235 48.35% 483.5/1000
(you were perfectly right with your "However, in the second match, rating difference is ~ 11.5 Elo, not 15 Elo... which is more likely the error bar").
If from this 1000-game distribution you pick a 500-game sample, you expect that sample to have a mean of 51.65% (or +12 Elo) and a SD of sqrt(1000/500)*15/2 = 11 Elo.
Since the actual P ON sample (50%, 0 Elo) is 12 Elo away from that mean, SD being 11 Elo, that P ON sample does not distinguish itself enough from the P ON+OFF distribution.
Same reasoning for the actual P OFF sample (53.3%, 23 Elo).
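As a rough sketch of that comparison, the snippet below rescales the pooled 2 SD error bar (15 Elo over 1000 games) to a 500-game sample and expresses each sample's distance from the pooled mean in SD units. The function names are hypothetical, and the 2 SD threshold is simply the 95% criterion used in this thread, not a rigorous test.

```python
import math

def sample_sd_elo(pooled_two_sd_elo, pooled_games, sample_games):
    """SD (in Elo) expected for a sub-sample, scaled from the pooled 2 SD bar."""
    return math.sqrt(pooled_games / sample_games) * pooled_two_sd_elo / 2

def deviation_in_sd(sample_elo, pooled_elo, sd_elo):
    """Distance of a sample from the pooled mean, in units of SD."""
    return abs(sample_elo - pooled_elo) / sd_elo

sd = sample_sd_elo(15, 1000, 500)  # sqrt(1000/500) * 15 / 2
for label, elo in (("ponder ON", 0), ("ponder OFF", 23)):
    z = deviation_in_sd(elo, 12, sd)
    print(f"{label}: {z:.2f} SD from the pooled mean, beyond 2 SD: {z > 2}")
```

Both samples sit only about 1 SD from the pooled mean, which is why neither can be called distinct at the 95% level.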

 Posts: 1988
 Joined: Wed Jul 13, 2011 9:04 pm
 Location: Madrid, Spain.
Re: Critter 1.6  Critter 1.4a, ponder ON/OFF.
Hi again!
ernest wrote: I compute these ratings and error bars by hand (sometimes using a hand calculator), using my basic knowledge (school/university) in statistics.
I see. I also used to calculate them by hand with the only help of a hand calculator, until I wrote a programme in Fortran. I use this standard deviation:
n = wins + draws + loses
µ = (wins + draws/2)/n
D = draws/n
σ = sqrt{[µ·(1 - µ) - D/4]/n}
I took this formula from the 22nd post of this thread. I posted two messages in January that might be useful: #1 and #2.
ernest wrote: Here we have a trinomial distribution (win-loss-draw) and when the result is close to 50%, you have SD(sigma)=[sqrt(W+L)]/2N (formula is a little more complicated if the result is not close to 50%)
I did not know that, when µ ~ 0.5, then σ ~ sqrt(wins + loses)/(2n) in this trinomial distribution. It is interesting, so thank you for sharing it. Rewriting your standard deviation using the draw ratio D: σ = sqrt[n·(1 - D)]/(2n) = (1/2)·sqrt[(1 - D)/n]. If I compare our nσ², I obtain:
(Yours): nσ² = (1 - D)/4
(Mine): nσ² = µ·(1 - µ) - D/4; (mine with µ = 0.5): nσ² = (1 - D)/4
Which are exactly the same with µ = 1/2. Your nσ² does not depend on µ, while mine does... although the expression of σ that I use is not good for µ (or 1 - µ) > 0.85 or 0.9, to give a rough figure. For your info: µ must be in the interval [0.15, 0.85] in my programme, else it does not calculate anything. The farther µ is from 1/2, the less accurate the value of σ is; it also has a problem with the extreme case of D = 1 (100% of draws), when σ = 0. But it is just a model that works reasonably well in real cases.
You will find a value for the average error <e> in the post I called #2. This is:
<e> = 200·log[(µ + kσ)(1 - µ + kσ)/[(µ - kσ)(1 - µ - kσ)]]
Where k sets the confidence level (k = 1.96 for ~ 95% confidence, k = 2 for ~ 95.45% confidence, etc.). If I replace µ = 1/2 in that equation:
<e> = 200·log[(0.5 + kσ)(0.5 + kσ)/[(0.5 - kσ)(0.5 - kσ)]] = 400·log[(0.5 + kσ)/(0.5 - kσ)] = [400/ln(10)]·[ln(1 + 2kσ) - ln(1 - 2kσ)]
With kσ > 0 and kσ << 1 (lots of games): ln(1 + 2kσ) ~ 2kσ; ln(1 - 2kσ) ~ -2kσ
<e> ~ 400·4kσ/ln(10) = [1600/ln(10)]·kσ
Here, σ is not in percentage; if you want σ in percentage, then the constant that multiplies kσ is 16/ln(10) ~ 6.9487, which is almost your seven. This could be valid in my approximation with a normal distribution, although this should be valid only when µ = 0.5 and kσ (or σ, for reasonable confidence levels, where k is finite) tends to zero, because it is a rough approximation with those assumptions.
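For readers who want to check the formulas numerically, here is a minimal Python sketch (not the Fortran programme itself, whose source is not shown in this thread). It uses the σ and <e> expressions given above. Two things are assumptions of the sketch: the usual logistic conversion Elo = 400·log10(µ/(1 - µ)), and LOS computed as the normal CDF of (µ - 0.5)/σ. The function name `match_stats` is made up.

```python
import math

def match_stats(wins, losses, draws, k=2.0):
    """Score, sigma, Elo difference, average error and LOS for one match."""
    n = wins + losses + draws
    mu = (wins + draws / 2) / n                     # score of the first engine
    d = draws / n                                   # draw ratio
    sigma = math.sqrt((mu * (1 - mu) - d / 4) / n)  # SD of the score
    elo = 400 * math.log10(mu / (1 - mu))           # assumed logistic Elo model
    # Average error <e>: half the width of the k-sigma Elo interval.
    avg_err = 200 * math.log10(
        (mu + k * sigma) * (1 - mu + k * sigma)
        / ((mu - k * sigma) * (1 - mu - k * sigma))
    )
    # Small-sigma approximation, strictly valid only around mu = 0.5.
    linear = 1600 / math.log(10) * k * sigma
    # Assumed LOS: normal CDF evaluated at (mu - 0.5) / sigma.
    los = 0.5 * (1 + math.erf((mu - 0.5) / (sigma * math.sqrt(2))))
    return {"mu": mu, "sigma": sigma, "elo": elo,
            "avg_err": avg_err, "linear": linear, "los": los}

for w, l, dr in ((125, 92, 283), (235, 202, 563)):
    r = match_stats(w, l, dr)
    print(f"+{w} ={dr} -{l}: {r['elo']:.2f} Elo, "
          f"+/- {r['avg_err']:.2f} (2 sigma), "
          f"linear approx {r['linear']:.2f}, LOS {100 * r['los']:.2f} %")
```

Under these assumptions the values come out close to the figures in the two programme outputs posted earlier (22.96 Elo, +/- 20.48 and LOS 98.78 % for the 500-game match), and the linear [1600/ln(10)]·kσ estimate stays within a fraction of an Elo point of the full expression.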
If you had not posted the trick of multiplying by seven, I would never have noticed this number 16/ln(10), so today I have learnt something. Thanks!
@Maurizio: There is no intention of hijacking your thread, but you can see that statistics applied to error bars could be a whole world! At least it is boundless for me. Thanks for your comprehension and your tests! Please keep up the good work.
Regards from Spain.
Ajedrecista.

 Posts: 766
 Joined: Sun Oct 16, 2011 11:25 am
Re: Critter 1.6  Critter 1.4a, ponder ON/OFF.
Ajedrecista wrote:
@Maurizio: There is no intention of hijacking your thread, but you can see that statistics applied to error bars could be a whole world! At least it is boundless for me. Thanks for your comprehension and your tests! Please keep up the good work.
Regards from Spain.
Ajedrecista.
Hi, I'm only glad of this interest, and I'm interested too. Thanks.
MM

 Posts: 2042
 Joined: Wed Mar 08, 2006 8:30 pm
Re: Critter 1.6  Critter 1.4a, ponder ON/OFF.
Ajedrecista wrote: Hi again!
Hi Jesus,
Thanks for this detailed post, I will study it carefully!
Of course, your program gives more accurate numbers, I only get (not too bad) approximations.
Do you have a comment on my section starting with
"Now if we want to see if Ponder OFF or ON makes a significant difference in a match between Critter 1.6 and Critter 1.4a SSE4, we have to consider the global (sum) distribution"
which shows that so far (i.e. with only those 500+500 games) the difference is NOT significant?