What rating list to trust?

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Dann Corbit
Posts: 12803
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: What rating list to trust?

Post by Dann Corbit »

Trust all of them.

Different results are not mutually exclusive and should not surprise us.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What rating list to trust?

Post by bob »

Spock wrote:If you look at the CCRL FRC list, there are 1,200 games played. The other 40/40 and 40/4 lists will no doubt catch up in due course
For programs within 100 points of each other, 1000 games is not nearly enough.
Graham Banks
Posts: 44868
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: What rating list to trust?

Post by Graham Banks »

bob wrote:
Spock wrote:If you look at the CCRL FRC list, there are 1,200 games played. The other 40/40 and 40/4 lists will no doubt catch up in due course
For programs within 100 points of each other, 1000 games is not nearly enough.
Using the BayesElo system you get roughly the following:

When an engine has played 200 games, the error margin is still approximately ±40 Elo; after 500 games it is ±25 Elo, after 1000 games ±17 Elo, and even after 2000 games there is still a ±13 Elo error margin!
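The pattern of figures quoted above can be roughly reproduced from simple binomial statistics. The sketch below is an assumption-laden approximation, not BayesElo's actual computation: `elo_error_margin` is a hypothetical helper, the 40 % draw rate is an assumed typical engine-vs-engine figure, and a 50 % score is assumed so the logistic Elo curve can be linearized around it.

```python
import math

def elo_error_margin(games, draw_rate=0.4, score=0.5, z=1.96):
    """Approximate 95% error margin (in Elo) for a measured score.

    Assumes all games are independent. draw_rate=0.4 and score=0.5
    are illustrative assumptions, not figures from the thread.
    """
    wins = score - draw_rate / 2            # implied win fraction
    # Variance of a single game's score (win=1, draw=0.5, loss=0)
    var = wins + draw_rate / 4 - score ** 2
    se_score = math.sqrt(var / games)       # standard error of the mean score
    # Convert score error to Elo via the logistic curve's slope at `score`
    elo_per_score = 400 / (math.log(10) * score * (1 - score))
    return z * elo_per_score * se_score

for n in (200, 500, 1000, 2000):
    print(n, round(elo_error_margin(n), 1))
```

With these assumptions the margins come out near ±37, ±24, ±17 and ±12 Elo, close to the quoted ±40/±25/±17/±13 pattern; the exact values shift with the assumed draw rate.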
gbanksnz at gmail.com
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What rating list to trust?

Post by bob »

Graham Banks wrote:
bob wrote:
Spock wrote:If you look at the CCRL FRC list, there are 1,200 games played. The other 40/40 and 40/4 lists will no doubt catch up in due course
For programs within 100 points of each other, 1000 games is not nearly enough.
Using the BayesElo system you get roughly the following:

When an engine has played 200 games, the error margin is still approximately ±40 Elo; after 500 games it is ±25 Elo, after 1000 games ±17 Elo, and even after 2000 games there is still a ±13 Elo error margin!
Right. However, real testing shows that the variance is more than that for 1000 games... When you factor in opening books, along with the inherent randomness caused by inaccurate timing provided by the PC real-time clock, the error bar is _far_ wider than what any of the *elo programs would have you believe.
Graham Banks
Posts: 44868
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: What rating list to trust?

Post by Graham Banks »

bob wrote:
Graham Banks wrote:
bob wrote:
Spock wrote:If you look at the CCRL FRC list, there are 1,200 games played. The other 40/40 and 40/4 lists will no doubt catch up in due course
For programs within 100 points of each other, 1000 games is not nearly enough.
Using the BayesElo system you get roughly the following:

When an engine has played 200 games, the error margin is still approximately ±40 Elo; after 500 games it is ±25 Elo, after 1000 games ±17 Elo, and even after 2000 games there is still a ±13 Elo error margin!
Right. However, real testing shows that the variance is more than that for 1000 games... When you factor in opening books, along with the inherent randomness caused by inaccurate timing provided by the PC real-time clock, the error bar is _far_ wider than what any of the *elo programs would have you believe.
Hi Bob,

this is why it's best to look at all the available rating lists. That way you can draw a fairly accurate picture of where a given engine stands.

Regards, Graham.
gbanksnz at gmail.com
hgm
Posts: 28409
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: What rating list to trust?

Post by hgm »

bob wrote:Right. However, real testing shows that the variance is more than that for 1000 games... When you factor in opening books, along with the inherent randomness caused by inaccurate timing provided by the PC real-time clock, the error bar is _far_ wider than what any of the *elo programs would have you believe.
Please be informed that this is complete bullshit. :P Every Elo program in existence assumes that all the games you feed it are totally independent, uncorrelated random events. And most testers in fact go to great lengths to make sure that they are, using external books to prevent duplicate games, etc.

So the error bars represent the statistical errors in an accurate and completely correct way.

Note, however, that the prior assumption made in BayesElo is not quite satisfied in a wide rating list. This leads to a systematic compression of the scale. This is not related to the number of games per engine, though.
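The claim that the error bars are statistically correct for independent games is easy to check by simulation. The sketch below is an illustrative Monte Carlo experiment (not from the thread): it assumes two equal engines with a 0.3/0.4/0.3 win/draw/loss split, measures the Elo estimate from many independent 1000-game matches, and compares the spread of those estimates with the textbook 1-sigma prediction.

```python
import math
import random
import statistics

random.seed(1)

def simulate_elo_estimate(games=1000, win=0.3, draw=0.4):
    """Play `games` independent games between two equal engines
    (assumed win/draw/loss = 0.3/0.4/0.3) and return the measured Elo."""
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < win:
            score += 1.0          # win
        elif r < win + draw:
            score += 0.5          # draw
    p = score / games
    return 400 * math.log10(p / (1 - p))  # logistic Elo from the score

estimates = [simulate_elo_estimate() for _ in range(2000)]
spread = statistics.stdev(estimates)
# Textbook 1-sigma prediction for these assumptions:
# ~695 Elo per unit of score at 50%, per-game score variance 0.15
theory = 695 * math.sqrt(0.15 / 1000)
print(round(spread, 1), round(theory, 1))
```

Both numbers land around 8.5 Elo (1 sigma, i.e. roughly a ±17 Elo 95% bar for 1000 games), which is the sense in which the quoted error bars are accurate for truly independent games.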
Uri Blass
Posts: 10973
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: What rating list to trust?

Post by Uri Blass »

bob wrote:
Graham Banks wrote:
bob wrote:
Spock wrote:If you look at the CCRL FRC list, there are 1,200 games played. The other 40/40 and 40/4 lists will no doubt catch up in due course
For programs within 100 points of each other, 1000 games is not nearly enough.
Using the BayesElo system you get roughly the following:

When an engine has played 200 games, the error margin is still approximately ±40 Elo; after 500 games it is ±25 Elo, after 1000 games ±17 Elo, and even after 2000 games there is still a ±13 Elo error margin!
Right. However, real testing shows that the variance is more than that for 1000 games... When you factor in opening books, along with the inherent randomness caused by inaccurate timing provided by the PC real-time clock, the error bar is _far_ wider than what any of the *elo programs would have you believe.
Real testing shows that the variance is clearly smaller than you think.
I found this simply by comparing different rating lists, because I have no time to run tests with thousands of games.

You can even use one rating list at 120/40 to predict the rating at 20/40 with an error that is always smaller than 35 Elo.


http://www.husvankempen.de/nunn/40_120_ ... liste.html

You have 29 programs in this list

You also have the same programs in the 20/40 rating list

http://www.husvankempen.de/nunn/40_40%2 ... liste.html

The biggest difference between the ratings on these two lists is 31 Elo.

Deep Junior 10 2CPU at 20/40: rating 2803 (+16/−16), 1177 games, 44.9 % score, 2839 average opponent, 34.8 % draws
Deep Junior 10 2CPU at 120/40: rating 2834 (+17/−17), 1050 games, 47.3 % score, 2852 average opponent, 34.1 % draws

Note that I was surprised by this small difference. It suggests that testing at a long time control is almost useless if the target is to get a rating (of course it is not useless if the target is to get better games), because in 29 out of 29 cases you can get the rating with an error smaller than 32 Elo by playing games only at 20/40.

This is surprising because common sense tells me to expect different ratings at different time controls, and it seems that this factor, together with luck when some of the programs have fewer than 1000 games, is not enough to produce even a 32 Elo difference.
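The observation above can be checked against the lists' own error bars: if the two lists independently measure the same underlying strength, the difference of two ratings with 95% margins e1 and e2 has a combined 95% margin of sqrt(e1² + e2²). The sketch below is an illustrative calculation (`combined_margin` is a hypothetical helper, and the ±16/±17 inputs are the figures quoted for Deep Junior 10 above).

```python
import math

def combined_margin(e1, e2):
    """95% margin for the *difference* of two independent rating
    estimates whose individual 95% margins are e1 and e2."""
    return math.sqrt(e1 ** 2 + e2 ** 2)

# Error bars of roughly ±16 and ±17 Elo, as quoted for
# Deep Junior 10 2CPU on the two lists:
print(round(combined_margin(16, 17), 1))
```

This comes out near ±23 Elo, so a worst-case gap of 31 Elo across 29 programs is broadly what statistical noise alone would produce, with little room left for a genuine time-control effect.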

Uri