18 days from SF4 release and about ~30+ ELO gain!

lkaufman
Posts: 6257
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: 19 days from SF 4 release and about ~30 Elo gain!

Post by lkaufman »

Ajedrecista wrote:Hello Larry:
lkaufman wrote:Number of games was something like 6 or 7 thousand (I forget exact number and I don't have it handy right now), so error bar was somewhere around 4 elo I think.
Sure? I think that this error bar of circa ± 4 Elo for 6000 or 7000 games corresponds to a one-sigma confidence level, that is, ~ 68.27% confidence. Since we are accustomed to a 95% confidence level (~ 1.96 sigma), and an Elo gap of 11.5 Elo translates into a score of 51.7%-48.3% (near 50%-50%), the error bars for 95% confidence are (to a first approximation) 1.96*(± 4), that is, around ± 8 Elo (from ± 7 to ± 9, because the original ± 4 could be ± 3.6 or ± 4.4 Elo). Please confirm my reasoning. Thanks in advance.

Regards from Spain.

Ajedrecista.
My exact result was 1847 wins, 1566 losses, 5127 draws, so somewhat more games than I recalled. I show the error margin for this as 4.8 elo (that's supposed to be 95% confidence). Do you agree?

Larry
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 19 days from SF 4 release and about ~30 Elo gain!

Post by bob »

lkaufman wrote:
Ajedrecista wrote:Hello Larry:
lkaufman wrote:Number of games was something like 6 or 7 thousand (I forget exact number and I don't have it handy right now), so error bar was somewhere around 4 elo I think.
Sure? I think that this error bar of circa ± 4 Elo for 6000 or 7000 games corresponds to a one-sigma confidence level, that is, ~ 68.27% confidence. Since we are accustomed to a 95% confidence level (~ 1.96 sigma), and an Elo gap of 11.5 Elo translates into a score of 51.7%-48.3% (near 50%-50%), the error bars for 95% confidence are (to a first approximation) 1.96*(± 4), that is, around ± 8 Elo (from ± 7 to ± 9, because the original ± 4 could be ± 3.6 or ± 4.4 Elo). Please confirm my reasoning. Thanks in advance.

Regards from Spain.

Ajedrecista.
My exact result was 1847 wins, 1566 losses, 5127 draws, so somewhat more games than I recalled. I show the error margin for this as 4.8 elo (that's supposed to be 95% confidence). Do you agree?

Larry
It takes me about 30K games to get to +/- 4 using BayesElo, on my cluster testing...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 18 days from SF4 release and about ~30+ ELO gain!

Post by bob »

Uri Blass wrote:
gladius wrote:
lkaufman wrote:
gladius wrote:
lkaufman wrote:
Masta wrote:Yeah...seems that SF will run over other engines like a damn TRUCK!

18 days from release date of SF4 and almost +30 ELO gain. -> http://95.47.140.100/tests/view/522bcb1 ... 2ee68dc04a

Have a nice day yo false magicians. Your days are counted.
Since I found this hard to believe, I ran a similar test myself (SF Sept. 8 vs SF4). While the details differ slightly (book, exact time limit, hardware) the test was quite similar. My result showed a gain of just 11.5 elo. The difference is too large to attribute to sample error. Any other theories?
What were your testing conditions (time control, threads, # of games)? I'm assuming it's 11.5 elo +- some error bar :).

SF4 release version has a few changes that can influence self tests, the TT is not cleared between games, and Idle threads sleep is set to false, but that only affects matches with threads > 1. For this reason, our regression tests are performed against the non-release version.

Otherwise, I'm not really sure to be honest.
Time limit was 2' + 1.2" for about half the games and 30" + .3" on the other half (on faster hardware), so on average about like yours. Number of games was something like 6 or 7 thousand (I forget exact number and I don't have it handy right now), so error bar was somewhere around 4 elo I think. Does clearing TT make a measurable difference in these direct matches? Any other settings or factors that could explain the discrepancy? I used default settings for both versions.
7000 games is 95% error bar of 8 ELO or so, it's entirely possible this was just an unlucky run.

The PGN has 48,491 games, so we should be okay there.
I do not see how you get error bar of 8 elo for 7000 games and I think that it is 4-5 elo.

You have 2.8 error bar after 20,000 games
see for example the regression of latest stockfish

http://tests.stockfishchess.org/tests/v ... 63f25cba49
you should have 2.8*sqrt(20,000/7000) after 7000 games that is between 4 elo and 5 elo.
Here's some serious numbers:

   2 Crafty-23.6-2        2640    4    4 30080   65%  2519   24%
   3 Crafty-23.6-1        2639    4    4 30080   65%  2519   25%
   4 Crafty-23.7R02-50    2636    4    4 30080   64%  2519   24%
   5 Crafty-23.7R03-1     2633    4    4 30080   64%  2519   25%
30,080 games => +/- 4 Elo using BayesElo.
lkaufman
Posts: 6257
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: 19 days from SF 4 release and about ~30 Elo gain!

Post by lkaufman »

bob wrote:
lkaufman wrote:
Ajedrecista wrote:Hello Larry:
lkaufman wrote:Number of games was something like 6 or 7 thousand (I forget exact number and I don't have it handy right now), so error bar was somewhere around 4 elo I think.
Sure? I think that this error bar of circa ± 4 Elo for 6000 or 7000 games corresponds to a one-sigma confidence level, that is, ~ 68.27% confidence. Since we are accustomed to a 95% confidence level (~ 1.96 sigma), and an Elo gap of 11.5 Elo translates into a score of 51.7%-48.3% (near 50%-50%), the error bars for 95% confidence are (to a first approximation) 1.96*(± 4), that is, around ± 8 Elo (from ± 7 to ± 9, because the original ± 4 could be ± 3.6 or ± 4.4 Elo). Please confirm my reasoning. Thanks in advance.

Regards from Spain.

Ajedrecista.
My exact result was 1847 wins, 1566 losses, 5127 draws, so somewhat more games than I recalled. I show the error margin for this as 4.8 elo (that's supposed to be 95% confidence). Do you agree?

Larry
It takes me about 30K games to get to +/- 4 using BayesElo, on my cluster testing...
I think you are talking about play against an array of different opponents. It takes fewer games to reach the same significance for a given Elo difference in a direct match.

Also we don't use BayesElo anymore, as it simply does not agree with the standard elo model due to different assumptions, but I think this has only a modest effect on error bars.

Larry
Uri Blass
Posts: 10876
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: 19 days from SF 4 release and about ~30 Elo gain!

Post by Uri Blass »

lkaufman wrote:
bob wrote:
lkaufman wrote:
Ajedrecista wrote:Hello Larry:
lkaufman wrote:Number of games was something like 6 or 7 thousand (I forget exact number and I don't have it handy right now), so error bar was somewhere around 4 elo I think.
Sure? I think that this error bar of circa ± 4 Elo for 6000 or 7000 games corresponds to a one-sigma confidence level, that is, ~ 68.27% confidence. Since we are accustomed to a 95% confidence level (~ 1.96 sigma), and an Elo gap of 11.5 Elo translates into a score of 51.7%-48.3% (near 50%-50%), the error bars for 95% confidence are (to a first approximation) 1.96*(± 4), that is, around ± 8 Elo (from ± 7 to ± 9, because the original ± 4 could be ± 3.6 or ± 4.4 Elo). Please confirm my reasoning. Thanks in advance.

Regards from Spain.

Ajedrecista.
My exact result was 1847 wins, 1566 losses, 5127 draws, so somewhat more games than I recalled. I show the error margin for this as 4.8 elo (that's supposed to be 95% confidence). Do you agree?

Larry
It takes me about 30K games to get to +/- 4 using BayesElo, on my cluster testing...
I think you are talking about play against an array of different opponents. It takes fewer games to reach the same significance for a given Elo difference in a direct match.

Also we don't use BayesElo anymore, as it simply does not agree with the standard elo model due to different assumptions, but I think this has only a modest effect on error bars.

Larry
I guess the main point is that there are more draws in a match against a previous version, and this is the reason for the smaller error.

The Stockfish team gets more than 64% draws in games against a previous version, and I believe that Hyatt clearly gets fewer draws in his games.

With fewer draws, when they tested against a very old version, the Stockfish team got a larger error bar for 20,000 games (3.3 > 2.8).

Here are 2 tests of the stockfish team
ELO: 56.66 +-3.3 (95%) LOS: 100.0%(candidate version for stockfish 4 against stockfish3)
Total: 20000 W: 6221 L: 2988 D: 10791

regression test of Stockfish development against Stockfish 4
ELO: 24.34 +-2.8 (95%) LOS: 100.0%
Total: 20000 W: 4224 L: 2825 D: 12951
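
For anyone who wants to check these two results, here is a rough Python sketch. It treats each game as worth 1, 0.5 or 0 points, takes the sample variance of that score, and converts the 95% interval to Elo with a first-order approximation of the logistic Elo curve. The function names are made up for illustration, and this is an assumed model, not necessarily what fishtest or BayesElo compute internally.

```python
import math


def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_error(wins, losses, draws, z=1.96):
    """Return (elo, half_width) for a head-to-head match.

    Assumption: independent games, per-game outcome in {1, 0.5, 0},
    normal approximation, and a first-order conversion from score
    error to Elo error.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Sample variance of the per-game score.
    variance = (wins * (1.0 - score) ** 2
                + losses * (0.0 - score) ** 2
                + draws * (0.5 - score) ** 2) / games
    score_error = z * math.sqrt(variance / games)
    # Slope of the Elo curve at the observed score.
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return elo_from_score(score), slope * score_error


results = {
    "SF4 candidate vs SF3": (6221, 2988, 10791),
    "SF development vs SF4": (4224, 2825, 12951),
}
for label, (w, l, d) in results.items():
    elo, err = elo_with_error(w, l, d)
    print(f"{label}: {elo:+.2f} +/- {err:.2f} Elo, draws {d / (w + l + d):.1%}")
```

Under these assumptions the two Elo values come out at the quoted 56.66 and 24.34, and the test with the higher draw ratio gets the narrower interval.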
Ajedrecista
Posts: 2121
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: 19 days from SF 4 release and about ~30 Elo gain!

Post by Ajedrecista »

Hello Larry:
lkaufman wrote:My exact result was 1847 wins, 1566 losses, 5127 draws, so somewhat more games than I recalled. I show the error margin for this as 4.8 elo (that's supposed to be 95% confidence). Do you agree?

Larry
Thanks for sharing the results. 1847 + 1566 + 5127 = 8540 games, more than the 6000 or 7000 games I assumed when reading your previous post. The draw ratio is 5127/8540 ~ 60.04%, which is higher than I expected (I misunderstood: had I realised from the start that it was a self test, I would have supposed a draw ratio near 60%, but I did not, so I assumed it was lower). More games lower the error bars, as does a higher draw ratio (with the model I use). Furthermore, your figure of 4.8 is closer to 5 than to 4. All these misunderstandings added up.

For your result of +1847 -1566 =5127, I get more or less +11.44 ± 4.66 Elo with 95% confidence (LOS ~ 100%). If the scores are near 50%-50%, error bars can be estimated as ±[800*z*sqrt(1 - D)]/[ln(10)*sqrt(games)] ~ ±347.44*z*sqrt[(1 - D)/games], where z is the z-score of a normal distribution (z ~ 1.96 for 95% confidence) and D is the draw ratio. This estimate depends strongly on the draw ratio, as you can see. But when are scores close enough to 50%-50%? At first glance I would require 1/[4*score*(1 - score)] < 1.02 (nothing empirical, of course!), that is, score in ]0.43, 0.57[ (a maximum gap of 49 Elo, more or less). In this case score ~ 51.65% (inside the interval ]43%, 57%[) and D ~ 60%, so: ±800*1.96*sqrt[(1 - 0.6)/8540]/ln(10) ~ ± 4.66 Elo (almost no difference from the result I wrote before).
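
The near-50% estimate above is easy to try out. The following is only a sketch of that formula in Python; the function names are hypothetical and chosen for this example.

```python
import math


def approx_error_bar(games, draw_ratio, z=1.96):
    """Near-50% estimate: 800*z*sqrt((1 - D)/games)/ln(10), in Elo."""
    return 800.0 * z * math.sqrt((1.0 - draw_ratio) / games) / math.log(10.0)


def near_even_factor(score):
    """The quantity 1/[4*score*(1 - score)] used to judge closeness to 50%."""
    return 1.0 / (4.0 * score * (1.0 - score))


# Larry's match: 8540 games, draw ratio about 60%, score about 51.65%.
print(round(approx_error_bar(8540, 0.6), 2))
print(round(near_even_factor(0.5165), 4))

# The same number of games with fewer draws gives a wider interval.
print(round(approx_error_bar(8540, 0.4), 2))
```

The last line shows how strongly the draw ratio matters: assuming 40% draws instead of 60% widens the estimate noticeably for the same number of games.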

------------------------
Uri Blass wrote:after 20,000 games the error bar is 2.8 elo with 95% confidence

see http://tests.stockfishchess.org/tests/v ... 63f25cba49

It means that in the worst case of 6000 games the error bar is
2.8*sqrt(20,000/6000) that is near 5 elo.
Your rule-of-thumb formula can be applied because the score is greater than 43% and less than 57%, and also because the draw ratio is similar, around 60% (Larry's result). With the model I use, the draw ratio plays an important role in the error bars.
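
As a small illustration of Uri's rule of thumb, the sketch below rescales a known error bar to a different number of games. It assumes the score and the draw ratio stay roughly the same between the two matches; the function name is invented for this example.

```python
import math


def rescale_error_bar(error, games_ref, games_new):
    """Scale an error bar by sqrt(games_ref / games_new).

    Assumption: same score and same draw ratio in both matches, so the
    per-game variance is unchanged and only the sample size differs.
    """
    return error * math.sqrt(games_ref / games_new)


# Reference: +/- 2.8 Elo after 20,000 games.
for games in (6000, 7000, 8540):
    print(games, round(rescale_error_bar(2.8, 20000, games), 2))
```

For 6000 and 7000 games this gives values a little above 5 and a little below 5 Elo respectively, in line with the figures quoted.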

------------------------
gladius wrote:I used my rating calculator http://forwardcoding.com/projects/ajaxchess/rating.html, assuming 60% draw rate, and 10 elo advantage. It gives this:

ELO: 9.93 +- 8.15
LOS: 99.99%
Wins: 1600 Losses: 1400 Draws: 4000

However, I just tried it with fishtest's stat_util.py https://github.com/glinscott/fishtest/b ... at_util.py and it gives
ELO: 11.56 +- 6.2
LOS: 99.99%

I would tend to trust stat_util.py more, but I'm not honestly sure.
In your example of +1600 -1400 =4000, I get more or less +9.93 ± 5.33 Elo for 95% confidence (LOS ~ 99.99%). To a first approximation, it looks like you used a 3-sigma confidence level (~ 99.73% confidence): 8.15/(5.33/1.96) = 1.96*(8.15/5.33) ~ 3.
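
The link between sigma multiples and confidence levels can be checked with the error function from the Python standard library. This is only a sketch under the usual normal approximation, with illustrative function names.

```python
import math


def two_sided_confidence(z):
    """Probability that a normal variable lies within z sigma of its mean."""
    return math.erf(z / math.sqrt(2.0))


def implied_z(reported_error, error_at_95, z_95=1.96):
    """Sigma multiple implied by an error bar, given the 95% error bar."""
    return z_95 * reported_error / error_at_95


for z in (1.0, 1.96, 3.0):
    print(z, f"{two_sided_confidence(z):.4%}")

# The calculator's +/- 8.15 against the +/- 5.33 obtained at 95%.
print(round(implied_z(8.15, 5.33), 2))
```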

I do not know exactly which numbers you are using in the second example.

------------------------
Uri Blass wrote:I guess the main point is that there are more draws in a match against a previous version, and this is the reason for the smaller error.

The Stockfish team gets more than 64% draws in games against a previous version, and I believe that Hyatt clearly gets fewer draws in his games.

With fewer draws, when they tested against a very old version, the Stockfish team got a larger error bar for 20,000 games (3.3 > 2.8).

Here are 2 tests of the stockfish team
ELO: 56.66 +-3.3 (95%) LOS: 100.0%(candidate version for stockfish 4 against stockfish3)
Total: 20000 W: 6221 L: 2988 D: 10791

regression test of Stockfish development against Stockfish 4
ELO: 24.34 +-2.8 (95%) LOS: 100.0%
Total: 20000 W: 4224 L: 2825 D: 12951
You caught the point! :) My previous answer in this post to you explains that.

------------------------
Uri Blass wrote:I do not see how you get error bar of 8 elo for 7000 games and I think that it is 4-5 elo.

You have 2.8 error bar after 20,000 games
see for example the regression of latest stockfish

http://tests.stockfishchess.org/tests/v ... 63f25cba49
you should have 2.8*sqrt(20,000/7000) after 7000 games that is between 4 elo and 5 elo.
Another answer for you, Uri. ;) Please see my answer to Larry at the beginning of this post: for some reason I supposed that Larry's test was not a self test, so I expected a lower draw ratio of, let me say, 40%. My wrong assumption altered the estimate a lot! The number of games was also quite different (8540 actual games against the 6000 or 7000 games I assumed at first).

------------------------
bob wrote:Here's some serious numbers:

   2 Crafty-23.6-2        2640    4    4 30080   65%  2519   24% 
   3 Crafty-23.6-1        2639    4    4 30080   65%  2519   25% 
   4 Crafty-23.7R02-50    2636    4    4 30080   64%  2519   24% 
   5 Crafty-23.7R03-1     2633    4    4 30080   64%  2519   25% 
30,080 games => +/- 4 Elo using BayesElo.
At first approximation, I would say for a score of 64.5% and a draw ratio of 24.5%: ±800*1.96*sqrt[1/(4*0.645*0.355) - 0.245]/[ln(10)*sqrt(30080)] ~ ± 3.61 Elo.

Now, if I take +15717 -6993 =7370 (score ~ 64.5% and draw ratio ~ 24.5%; number of games: 30080), I obtain error bars of ± 3.51 Elo, more or less (using my own model). There is a 2.77% difference between the estimate and the true error bar (I remind you once again: always with my model, which is not perfect).

------------------------

Sorry for all this technical/mathematical stuff in the General Topics section.

Regards from Spain.

Ajedrecista.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: 19 days from SF 4 release and about ~30 Elo gain!

Post by Laskos »

Ajedrecista wrote:
At first approximation, I would say for a score of 64.5% and a draw ratio of 24.5%: ±800*1.96*sqrt[1/(4*0.645*0.355) - 0.245]/[ln(10)*sqrt(30080)] ~ ± 3.61 Elo.
Regards from Spain.

Ajedrecista.
I don't understand your formula, isn't it 800/log(10) * 1.96*sqrt(4*0.645*0.355-0.245)/sqrt(30080) ~ 3.22 Elo points?
Ajedrecista
Posts: 2121
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: 19 days from SF 4 release and about ~30 Elo gain!

Post by Ajedrecista »

Hello Kai:
Laskos wrote:
Ajedrecista wrote:
At first approximation, I would say for a score of 64.5% and a draw ratio of 24.5%: ±800*1.96*sqrt[1/(4*0.645*0.355) - 0.245]/[ln(10)*sqrt(30080)] ~ ± 3.61 Elo.
Regards from Spain.

Ajedrecista.
I don't understand your formula, isn't it 800/log(10) * 1.96*sqrt(4*0.645*0.355-0.245)/sqrt(30080) ~ 3.22 Elo points?
[800/ln(10)]*1.96*sqrt(4*0.645*0.355 - 0.245)/sqrt(30080) ~ 3.22 indeed. But please note that it is not what I wrote. Your typo comes in sqrt(4*0.645*0.355 - 0.245) ~ 0.819, while I wrote sqrt[1/(4*0.645*0.355) - 0.245] ~ 0.9202. Of course: (0.9202/0.819)*3.21 ~ 3.61 (my estimate). Thanks for your interest.

Regards from Spain.

Ajedrecista.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: 19 days from SF 4 release and about ~30 Elo gain!

Post by Laskos »

Ajedrecista wrote:Hello Kai:
Laskos wrote:
Ajedrecista wrote:
At first approximation, I would say for a score of 64.5% and a draw ratio of 24.5%: ±800*1.96*sqrt[1/(4*0.645*0.355) - 0.245]/[ln(10)*sqrt(30080)] ~ ± 3.61 Elo.
Regards from Spain.

Ajedrecista.
I don't understand your formula, isn't it 800/log(10) * 1.96*sqrt(4*0.645*0.355-0.245)/sqrt(30080) ~ 3.22 Elo points?
[800/ln(10)]*1.96*sqrt(4*0.645*0.355 - 0.245)/sqrt(30080) ~ 3.22 indeed. But please note that it is not what I wrote. Your typo comes in sqrt(4*0.645*0.355 - 0.245) ~ 0.819, while I wrote sqrt[1/(4*0.645*0.355) - 0.245] ~ 0.9202. Of course: (0.9202/0.819)*3.21 ~ 3.61 (my estimate). Thanks for your interest.

Regards from Spain.

Ajedrecista.
That's what I am questioning: isn't it sqrt(4*0.645*0.355 - 0.245) ~ 0.819 instead of sqrt[1/(4*0.645*0.355) - 0.245] ~ 0.9202?
Ajedrecista
Posts: 2121
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: 19 days from SF 4 release and about ~30 Elo gain!

Post by Ajedrecista »

Hello again:
Laskos wrote:
Ajedrecista wrote:Hello Kai:
Laskos wrote:
Ajedrecista wrote:
At first approximation, I would say for a score of 64.5% and a draw ratio of 24.5%: ±800*1.96*sqrt[1/(4*0.645*0.355) - 0.245]/[ln(10)*sqrt(30080)] ~ ± 3.61 Elo.
Regards from Spain.

Ajedrecista.
I don't understand your formula, isn't it 800/log(10) * 1.96*sqrt(4*0.645*0.355-0.245)/sqrt(30080) ~ 3.22 Elo points?
[800/ln(10)]*1.96*sqrt(4*0.645*0.355 - 0.245)/sqrt(30080) ~ 3.22 indeed. But please note that it is not what I wrote. Your typo comes in sqrt(4*0.645*0.355 - 0.245) ~ 0.819, while I wrote sqrt[1/(4*0.645*0.355) - 0.245] ~ 0.9202. Of course: (0.9202/0.819)*3.21 ~ 3.61 (my estimate). Thanks for your interest.

Regards from Spain.

Ajedrecista.
That's what I am questioning: isn't it sqrt(4*0.645*0.355 - 0.245) ~ 0.819 instead of sqrt[1/(4*0.645*0.355) - 0.245] ~ 0.9202?
I think you are right: sqrt[µ*(1 - µ) - D/4] = 0.5*sqrt[4*µ*(1 - µ) - D]. Thanks for pointing out my mistake. I wrote from memory and it is not a good idea.

3.22 has an error of ~ 10.8% when compared to 3.61, quite high IMHO, so it seems better not to try to estimate error bars in a quick way when the Elo difference is too high (more than 100 Elo in this example).
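
To make the comparison concrete, here is a sketch in Python that evaluates the two square-root expressions discussed above for the Crafty numbers (score 64.5%, draw ratio 24.5%, 30080 games). The third function is an assumption added for illustration: it propagates the per-game standard deviation sqrt[µ*(1 - µ) - D/4] through the slope of the logistic Elo curve, which divides the corrected expression by 4*µ*(1 - µ). All function names are hypothetical.

```python
import math

ELO_SCALE = 800.0 / math.log(10.0)  # about 347.44


def original_estimate(score, draw_ratio, games, z=1.96):
    """Expression written from memory: sqrt[1/(4*mu*(1 - mu)) - D]."""
    x = 4.0 * score * (1.0 - score)
    return ELO_SCALE * z * math.sqrt(1.0 / x - draw_ratio) / math.sqrt(games)


def corrected_estimate(score, draw_ratio, games, z=1.96):
    """Expression after the correction: sqrt[4*mu*(1 - mu) - D]."""
    x = 4.0 * score * (1.0 - score)
    return ELO_SCALE * z * math.sqrt(x - draw_ratio) / math.sqrt(games)


def slope_adjusted_estimate(score, draw_ratio, games, z=1.96):
    """Assumed variant: corrected expression divided by 4*mu*(1 - mu)."""
    x = 4.0 * score * (1.0 - score)
    return corrected_estimate(score, draw_ratio, games, z) / x


for f in (original_estimate, corrected_estimate, slope_adjusted_estimate):
    print(f.__name__, round(f(0.645, 0.245, 30080), 2))
```

With these inputs the three functions give roughly 3.61, 3.22 and 3.51 Elo. Near a 50% score the factor 4*µ*(1 - µ) is close to 1 and the last two agree, which fits the remark that the quick estimate is only safe for small Elo differences.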

Regards from Spain.

Ajedrecista.