Critter 1.6 - Critter 1.4a ponder ON/OFF

MM · Post by MM » Sun Jun 17, 2012 10:35 pm

Blitz 10s per game, ponder ON

1 Critter 1.6 64-bit +2 +54/=93/-53 50.25% 100.5/200
2 Critter 1.4a 64-bit SSE4 -2 +53/=93/-54 49.75% 99.5/200

Blitz 10s per game, ponder OFF

1 Critter 1.6 64-bit +23 +56/=101/-43 53.25% 106.5/200
2 Critter 1.4a 64-bit SSE4 -23 +43/=101/-56 46.75% 93.5/200

Games played at chess960

Fritz 13 GUI

1 core used

i7 980x

windows 7

Best Regards

ernest · Post by **ernest** » Mon Jun 18, 2012 1:14 am

MM wrote:

Code: Select all

1   Critter 1.6 64-bit             +23  +56/=101/-43 53.25%  106.5/200
2   Critter 1.4a 64-bit SSE4       -23  +43/=101/-56 46.75%   93.5/200

95% error bar is ±34 Elo

MM · Post by MM » Fri Jun 22, 2012 11:56 pm

Hi all,

just for the record.

utente-PC, Blitz 1m ponder ON, 1 core, 3,33 ghz, no tablebases.

1 Critter 1.4a 64-bit SSE4 +110/=280/-110 50.00% 250.0/500 -3036.00
2 Critter 1.6 64-bit +110/=280/-110 50.00% 250.0/500 -3036.00

Best Regards

MM · Post by MM » Sun Jun 24, 2012 12:32 am

MM wrote:Hi all,

just for the record.

utente-PC, Blitz 1m ponder ON, 1 core, 3,33 ghz, no tablebases.

1 Critter 1.4a 64-bit SSE4 +110/=280/-110 50.00% 250.0/500 -3036.00
2 Critter 1.6 64-bit +110/=280/-110 50.00% 250.0/500 -3036.00

Best Regards

utente-PC, Blitz 1m ponder OFF

1 Critter 1.6 64-bit +23 +125/=283/-92 53.30% 266.5/500
2 Critter 1.4a 64-bit SSE4 -23 +92/=283/-125 46.70% 233.5/500

ernest · Post by **ernest** » Sun Jun 24, 2012 2:26 am

MM wrote:

Code: Select all

1	Critter 1.6 64-bit	       +23	+125/=283/-92	53.30%		266.5/500
2	Critter 1.4a 64-bit SSE4	 -23	+92/=283/-125	46.70%		233.5/500

95% error bar is now ±21 Elo, which means that there is more than 95% probability that with ponder OFF, Critter 1.6 is stronger than Critter 1.4a SSE4

However, it cannot yet be said that the (ponder OFF) and (ponder ON) distributions are distinct, with 95% probability, the global result (summation ON+OFF) being:

Code: Select all

1 Critter 1.6 64-bit	       +15	+235/=563/-202	51.65%		516.5/1000
2 Critter 1.4a 64-bit SSE4	 -15	+202/=563/-235	48.35%		483.5/1000

This means that it cannot be said (with 95% probability), as of yet, that compared to Critter 1.4a SSE4, Critter 1.6 performs better with ponder OFF than with Ponder ON.

Ajedrecista · Post by **Ajedrecista** » Sun Jun 24, 2012 10:58 am

Hello Ernest:

ernest wrote:
MM wrote:
Code: Select all
1	Critter 1.6 64-bit	       +23	+125/=283/-92	53.30%		266.5/500
2	Critter 1.4a 64-bit SSE4	 -23	+92/=283/-125	46.70%		233.5/500
95% error bar is now ±21 Elo, which means that there is more than 95% probability that with ponder OFF, Critter 1.6 is stronger than Critter 1.4a SSE4

However, it cannot yet be said that the (ponder OFF) and (ponder ON) distributions are distinct, with 95% probability, the global result (summation ON+OFF) being:
Code: Select all
1 Critter 1.6 64-bit	       +15	+235/=563/-202	51.65%		516.5/1000
2 Critter 1.4a 64-bit SSE4	 -15	+202/=563/-235	48.35%		483.5/1000
This means that it cannot be said (with 95% probability), as of yet, that compared to Critter 1.4a SSE4, Critter 1.6 performs better with ponder OFF than with Ponder ON.

I suppose that all these error bars were obtained with the great BayesElo. I ran my own small programme, just to compare my results:

Code: Select all

LOS_and_Elo_uncertainties_calculator, ® 2012.

----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------

(The input and output data is referred to the first engine).

Please write down non-negative integers.

Write down the number of wins:

125

Write down the number of loses:

92

Write down the number of draws:

283

Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:

3

***************************************
1-sigma confidence ~ 68.27% confidence.
2-sigma confidence ~ 95.45% confidence.
3-sigma confidence ~ 99.73% confidence.
***************************************

---------------------------------------

Elo interval for 1-sigma confidence:

Elo rating difference:     22.96 Elo

Lower rating difference:   12.75 Elo
Upper rating difference:   33.22 Elo

Lower bound uncertainty:  -10.21 Elo
Upper bound uncertainty:   10.25 Elo
Average error:        +/-  10.23 Elo

K = (average error)*[sqrt(n)] =  228.80

Elo interval: ]  12.75,   33.22[
---------------------------------------

Elo interval for 2-sigma confidence:

Elo rating difference:     22.96 Elo

Lower rating difference:    2.56 Elo
Upper rating difference:   43.53 Elo

Lower bound uncertainty:  -20.40 Elo
Upper bound uncertainty:   20.56 Elo
Average error:        +/-  20.48 Elo

K = (average error)*[sqrt(n)] =  458.00

Elo interval: ]   2.56,   43.53[
---------------------------------------

Elo interval for 3-sigma confidence:

Elo rating difference:     22.96 Elo

Lower rating difference:   -7.62 Elo
Upper rating difference:   53.91 Elo

Lower bound uncertainty:  -30.59 Elo
Upper bound uncertainty:   30.95 Elo
Average error:        +/-  30.77 Elo

K = (average error)*[sqrt(n)] =  688.01

Elo interval: ]  -7.62,   53.91[
---------------------------------------

Number of games of the match:                500
Score: 53.30 %
Elo rating difference:   22.96 Elo
Draw ratio: 56.60 %

**********************************************
1 sigma:  1.4657 % of the points of the match.
2 sigma:  2.9314 % of the points of the match.
3 sigma:  4.3970 % of the points of the match.
**********************************************

 Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.

-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------

LOS:  98.78 %

This value of LOS is rounded up to 0.01%

End of the calculations. Approximated elapsed time:  50 ms.

Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.

Code: Select all

LOS_and_Elo_uncertainties_calculator, ® 2012.

----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------

(The input and output data is referred to the first engine).

Please write down non-negative integers.

Write down the number of wins:

235

Write down the number of loses:

202

Write down the number of draws:

563

Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:

3

***************************************
1-sigma confidence ~ 68.27% confidence.
2-sigma confidence ~ 95.45% confidence.
3-sigma confidence ~ 99.73% confidence.
***************************************

---------------------------------------

Elo interval for 1-sigma confidence:

Elo rating difference:     11.47 Elo

Lower rating difference:    4.21 Elo
Upper rating difference:   18.74 Elo

Lower bound uncertainty:   -7.26 Elo
Upper bound uncertainty:    7.27 Elo
Average error:        +/-   7.26 Elo

K = (average error)*[sqrt(n)] =  229.67

Elo interval: ]   4.21,   18.74[
---------------------------------------

Elo interval for 2-sigma confidence:

Elo rating difference:     11.47 Elo

Lower rating difference:   -3.04 Elo
Upper rating difference:   26.02 Elo

Lower bound uncertainty:  -14.51 Elo
Upper bound uncertainty:   14.55 Elo
Average error:        +/-  14.53 Elo

K = (average error)*[sqrt(n)] =  459.55

Elo interval: ]  -3.04,   26.02[
---------------------------------------

Elo interval for 3-sigma confidence:

Elo rating difference:     11.47 Elo

Lower rating difference:  -10.30 Elo
Upper rating difference:   33.33 Elo

Lower bound uncertainty:  -21.77 Elo
Upper bound uncertainty:   21.86 Elo
Average error:        +/-  21.81 Elo

K = (average error)*[sqrt(n)] =  689.83

Elo interval: ] -10.30,   33.33[
---------------------------------------

Number of games of the match:               1000
Score: 51.65 %
Elo rating difference:   11.47 Elo
Draw ratio: 56.30 %

**********************************************
1 sigma:  1.0439 % of the points of the match.
2 sigma:  2.0878 % of the points of the match.
3 sigma:  3.1318 % of the points of the match.
**********************************************

 Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.

-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------

LOS:  94.30 %

This value of LOS is rounded up to 0.01%

End of the calculations. Approximated elapsed time:  47 ms.

Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.

It seems that my programme more or less agrees with BayesElo in the first match, which is an achievement! However, in the second match, the rating difference is ~ 11.5 Elo, not 15 Elo... which is more likely the error bar. I notice that you are adding two 500-game matches, so I do not know whether I can simply treat +235 =563 -202 as one 1000-game match or not.
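For readers who want to check the LOS lines in the two outputs above, here is a minimal Python sketch. It is not the Fortran programme itself. It assumes a normal approximation in which LOS is the probability that the true score exceeds 50%, with z = (µ - 0.5)/σ and the σ formula quoted later in this thread. The function name `los` is illustrative only. Under these assumptions the values come out close to the 98.78 % and 94.30 % shown above.

```python
import math

def los(wins, draws, losses):
    """Likelihood of superiority under a normal approximation (assumed model)."""
    n = wins + draws + losses
    mu = (wins + draws / 2) / n                      # score fraction of the first engine
    d = draws / n                                    # draw ratio
    sigma = math.sqrt((mu * (1 - mu) - d / 4) / n)   # SD of the score fraction
    z = (mu - 0.5) / sigma                           # distance from an even score, in SDs
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))    # standard normal CDF at z

print(f"500 games, ponder OFF: {100 * los(125, 283, 92):.2f} %")
print(f"1000 games, ON + OFF:  {100 * los(235, 563, 202):.2f} %")
```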

Regards from Spain.

Ajedrecista.

ernest · Post by **ernest** » Sun Jun 24, 2012 9:09 pm

Ajedrecista wrote:I suppose that all these error bars were obtained with the great BayesElo.

Hi Jesus,

Not at all, I compute these ratings and error bars by hand (sometimes using a hand calculator), using my basic knowledge (school/university) in statistics.
Here we have a trinomial distribution (win-loss-draw) and when the result is close to 50%, you have SD(sigma)=[sqrt(W+L)]/2N (formula is a little more complicated if the result is not close to 50%)
So I find the 2SD error bar (95% probability) of

Code: Select all

P OFF
1 Critter 1.6 64-bit       +23 +125/=283/-92 53.30% 266.5/500 
2 Critter 1.4a 64-bit SSE4 -23 +92/=283/-125 46.70% 233.5/500

to be [sqrt(125+92)]/500 = 2.95%
and multiplying that % by 7 (valid for low %) you get the Elo error bar = 20.6 rounded to 21.
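
The hand calculation above can be sketched in a few lines of Python. This only restates the rule of thumb from this post (2SD = sqrt(W+L)/N, then about 7 Elo per percent), so it inherits the same restriction to scores close to 50%. The function name `quick_error_bar` is made up for the illustration.

```python
import math

def quick_error_bar(wins, draws, losses):
    """Approximate 2SD (about 95%) error bar in Elo, for scores close to 50%."""
    n = wins + draws + losses
    sd = math.sqrt(wins + losses) / (2 * n)   # SD of the score fraction
    two_sd_percent = 2 * sd * 100             # 2SD expressed in percent
    return 7 * two_sd_percent                 # about 7 Elo per percent near 50%

# 500-game ponder OFF match from this thread
print(round(quick_error_bar(125, 283, 92), 1))
```

For the 500-game ponder OFF match this gives about 20.6 Elo, the figure rounded to 21 above.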

Now if we want to see if Ponder OFF or ON makes a significant difference in a match between Critter 1.6 and Critter 1.4a SSE4, we have to consider the global (sum) distribution

Code: Select all

P ON+OFF
1 Critter 1.6 64-bit       +12 +235/=563/-202 51.65% 516.5/1000 
2 Critter 1.4a 64-bit SSE4 -12 +202/=563/-235 48.35% 483.5/1000

note: see the +12 Elo advantage, not +15 as written by mistake previously; +15 is actually the 2SD of this global (sum) distribution.
(You were perfectly right with your "However, in the second match, rating difference is ~ 11.5 Elo, not 15 Elo... which is more likely the error bar".)

If from this 1000-game distribution you pick a 500-game sample, you expect that sample to have a mean of 51.65% (or +12 Elo) and a SD of sqrt(1000/500)*15/2= 11 Elo
Since the actual P ON sample (50%, 0 Elo) is 12 Elo away from that mean, SD being 11 Elo, that P ON sample does not distinguish itself enough from the P ON+OFF distribution.
Same reasoning for the actual P OFF sample (53.3%, 23 Elo).
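
The sub-sample argument above can be reproduced with a short Python sketch. It takes the 15 Elo 2SD figure for the 1000-game pool from this post as given, and assumes the usual logistic conversion from score to Elo (an assumption, though it matches the rating differences quoted in this thread). The names `elo_from_score` and `sample_check` are illustrative only.

```python
import math

def elo_from_score(score):
    """Elo difference implied by a score fraction (logistic model, assumed)."""
    return 400 * math.log10(score / (1 - score))

def sample_check(sample_score, pooled_score=516.5 / 1000,
                 pooled_two_sd_elo=15.0, pooled_games=1000, sample_games=500):
    """Distance of a sub-sample from the pooled mean, and the sub-sample SD, in Elo."""
    # SD scales with 1/sqrt(games): halve the 2SD, then rescale to the sample size.
    sample_sd = math.sqrt(pooled_games / sample_games) * pooled_two_sd_elo / 2
    distance = elo_from_score(sample_score) - elo_from_score(pooled_score)
    return distance, sample_sd

for label, score in (("ponder ON", 250.0 / 500), ("ponder OFF", 266.5 / 500)):
    distance, sd = sample_check(score)
    print(f"{label}: {distance:+.1f} Elo from pooled mean, SD {sd:.1f}, ratio {distance / sd:+.2f}")
```

Both samples sit roughly one SD from the pooled mean, well short of two, which is the sense in which the ON and OFF results are not yet distinguishable.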

Ajedrecista · Post by **Ajedrecista** » Mon Jun 25, 2012 5:22 pm

Hi again!

ernest wrote:I compute these ratings and error bars by hand (sometimes using a hand calculator), using my basic knowledge (school/university) in statistics.

I see. I also used to calculate them by hand with the only help of a hand calculator, until I did a programme in Fortran. I use this standard deviation:

n = wins + draws + losses
µ = (wins + draws/2)/n
D = draws/n

σ = sqrt{[µ·(1 - µ) - D/4]/n}

I took this formula from the 22nd post of this thread. I posted two messages in January that might be useful: #1 and #2.
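
As an illustration of the standard deviation above, the following Python sketch (not the Fortran programme) computes σ and a k-sigma Elo interval. It assumes the usual logistic conversion 400·log10[µ/(1 - µ)] from score to Elo, which is an assumption here, although it matches the 22.96 Elo shown in the earlier output. The function names are hypothetical.

```python
import math

def elo(score):
    """Elo difference implied by a score fraction (logistic model, assumed)."""
    return 400 * math.log10(score / (1 - score))

def elo_interval(wins, draws, losses, k=2.0):
    """Return (Elo difference, lower bound, upper bound, sigma) for k standard deviations."""
    n = wins + draws + losses
    mu = (wins + draws / 2) / n                      # score fraction
    d = draws / n                                    # draw ratio
    sigma = math.sqrt((mu * (1 - mu) - d / 4) / n)   # formula quoted above
    return elo(mu), elo(mu - k * sigma), elo(mu + k * sigma), sigma

diff, low, high, sigma = elo_interval(125, 283, 92, k=2.0)
print(f"sigma = {100 * sigma:.4f} % of the points")
print(f"Elo difference {diff:.2f}, 2-sigma interval ]{low:.2f}, {high:.2f}[")
```

With +125 =283 -92 this gives σ close to 1.4657 % and a 2-sigma interval close to ]2.56, 43.53[, in line with the earlier output.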

ernest wrote:Here we have a trinomial distribution (win-loss-draw) and when the result is close to 50%, you have SD(sigma)=[sqrt(W+L)]/2N (formula is a little more complicated if the result is not close to 50%)

I did not know that, when µ ~ 0.5, σ ~ sqrt(wins + losses)/(2n) in this trinomial distribution. It is interesting, so thank you for sharing it. Rewriting your standard deviation using the draw ratio D: σ = sqrt[n·(1 - D)]/(2n) = (1/2)·sqrt[(1 - D)/n]. If I compare our nσ², I obtain:

(Yours): nσ² = (1 - D)/4

(Mine): nσ² = µ·(1 - µ) - D/4; (mine with µ = 0.5): nσ² = (1 - D)/4

These are exactly the same when µ = 1/2. Your nσ² does not depend on µ, while mine does... although the expression of σ that I use is not good when µ (or 1 - µ) is greater than about 0.85 or 0.9. For your info: µ must be in the interval [0.15, 0.85] in my programme, otherwise it does not calculate anything. The farther µ is from 1/2, the less accurate the value of σ is; it also has a problem with the extreme case of D = 1 (100% draws), where σ = 0. But it is just a model that works reasonably well in real cases.

You will find a value for the average error |<e>| in the post I called #2. This is:

|<e>| = 200·log{[(µ + kσ)(1 - µ + kσ)]/[(µ - kσ)(1 - µ - kσ)]}

Where k denotes the confidence level (k = 1.96 for ~ 95% confidence, k = 2 for ~ 95.45% confidence, etc.). If I replace µ = 1/2 in that equation:

|<e>| = 200·log{[(0.5 + kσ)(0.5 + kσ)]/[(0.5 - kσ)(0.5 - kσ)]} = 400·log[(0.5 + kσ)/(0.5 - kσ)] = [400/ln(10)]·[ln(1 + 2kσ) - ln(1 - 2kσ)]

With kσ > 0 and kσ << 1 (lots of games): ln(1 + 2kσ) ~ 2kσ; ln(1 - 2kσ) ~ -2kσ

|<e>| ~ 400·4kσ/ln(10) = [1600/ln(10)]·kσ

Here, σ is not in percentage; if you want σ in percentage, then the constant that multiplies kσ is 16/ln(10) ~ 6.9487, which is almost your seven. This holds within my approximation with a normal distribution, but strictly only when µ = 0.5 and kσ tends to zero (that is, σ tends to zero for any reasonable, finite k), because it is a rough approximation built on those assumptions.
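
A quick numerical check of this constant, as a Python sketch: it compares the average error formula given above at µ = 0.5 with the linear approximation [16/ln(10)]·k·σ, with σ in percent. The σ value is the 1.4657 % from the 500-game output earlier in the thread, and the function name `average_error` is made up for the illustration.

```python
import math

def average_error(mu, sigma, k):
    """Average Elo error from the formula above (sigma as a fraction, not percent)."""
    num = (mu + k * sigma) * (1 - mu + k * sigma)
    den = (mu - k * sigma) * (1 - mu - k * sigma)
    return 200 * math.log10(num / den)

sigma = 0.014657          # 1.4657 % of the points, from the 500-game match
k = 2                     # 2-sigma, about 95.45 % confidence
constant = 16 / math.log(10)

exact = average_error(0.5, sigma, k)
approx = constant * k * (sigma * 100)   # sigma converted to percent

print(f"constant 16/ln(10) = {constant:.4f}")
print(f"formula at mu = 0.5: {exact:.2f} Elo, linear approximation: {approx:.2f} Elo")
```

The two values differ by only a few hundredths of an Elo point here, so multiplying by seven is a reasonable shortcut for matches of this size and scores near 50%.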

If you had not posted the trick of multiplying by seven, I would never have noticed this number 16/ln(10), so today I have learnt something. Thanks!

@Maurizio: There is no intention of hijacking your thread, but you can see that statistics applied to error bars could be a whole world! At least it is boundless for me. Thanks for your comprehension and your tests! Please keep up the good work.

Regards from Spain.

Ajedrecista.

MM · Post by MM » Tue Jun 26, 2012 12:34 am

Ajedrecista wrote:

@Maurizio: There is no intention of hijacking your thread, but you can see that statistics applied to error bars could be a whole world! At least it is boundless for me. Thanks for your comprehension and your tests! Please keep up the good work.

Regards from Spain.

Ajedrecista.

Hi, I'm only glad of this interest.

And I'm interested too. Thanks

ernest · Post by **ernest** » Tue Jun 26, 2012 1:18 am

Ajedrecista wrote:Hi again!

Hi Jesus,

Thanks for this detailed post, I will study it carefully!

Of course, your program gives more accurate numbers, I only get (not too bad) approximations.

Do you have a comment on my section starting with
Now if we want to see if Ponder OFF or ON makes a significant difference in a match between Critter 1.6 and Critter 1.4a SSE4, we have to consider the global (sum) distribution
which shows that so far (i.e. with only those 500+500 games) the difference is NOT significant?
