Charles Roberson

Joined: 13 Mar 2006
Posts: 1697
Location: North Carolina, USA

Post subject: Re: CCRL live lists with 100 Elo reduction    Posted: Wed May 16, 2012 6:13 pm

michiguel wrote:
 Adam Hair wrote: To anyone who reads this: What would be your reaction if we purposely disconnected the CCRL from any comparison to human ratings? What if we make the rating for the top engine equal 0 Elo, so that the ratings are such that the rating of each engine directly indicates how many Elo it is behind the leading program?

This is philosophically sound, but problematic from an experimental point of view. Points of reference need to be as stable as possible. That is why, in science, the reference points chosen are the ones that do not noticeably change and that can be measured most precisely.
For that reason, the best way to have a reference for a list like this, IMHO, is to take the ~16 engines with the most games (i.e. lowest error) across a wide span of the spectrum, average their ratings, and set that average to a fixed, convenient number. That will guarantee maximum stability.

For instance, set the average Elo of these engines to a given number:

Rybka 3 64-bit
Zappa Mexico II 64-bit
Fritz 11
Grapefruit 1.0 32 bit
Bright 0.4a
Spike 1.2 turin (8491 games!)
Ktulu 8
Chess Tiger 2007
Movei 00.8.438 (10 10 10)
Gandalf 6
Aristarch 4.5
Amyan 1.72
SOS 5.1
Ufim 8.02
AnMon 5.60
Ares 1.004

Miguel

This is far better than having one engine as the base value, but it still doesn't solve the FIDE alignment problem. If you are going to make your ratings look human, then they should align with human ratings.

Of course, Miguel's idea can be used even with the top rating set to 0.
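
To make the two anchoring schemes concrete, here is a minimal sketch. Both are just a constant offset applied to every rating, so the differences between engines are untouched. The engine names, ratings, and the target of 2600 are made-up placeholders, not CCRL data, and the function names are only for illustration.

```python
def anchor_to_average(ratings, reference, target):
    """Shift every rating so the mean of the reference engines equals target."""
    mean_ref = sum(ratings[name] for name in reference) / len(reference)
    offset = target - mean_ref
    return {name: r + offset for name, r in ratings.items()}


def anchor_to_top(ratings):
    """Shift every rating so the strongest engine sits at 0."""
    top = max(ratings.values())
    return {name: r - top for name, r in ratings.items()}


# Hypothetical ratings, for illustration only.
ratings = {
    "Engine A": 3100.0,
    "Engine B": 2900.0,
    "Engine C": 2700.0,
    "Engine D": 2500.0,
}
reference = ["Engine B", "Engine C", "Engine D"]  # stand-in for the ~16 reference engines

by_average = anchor_to_average(ratings, reference, 2600.0)
by_top = anchor_to_top(ratings)
```

The practical difference is in what moves over time: with the reference-average anchor a new top engine changes nobody else's number, while with the top-equals-0 anchor every rating shifts whenever the leader changes.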

Can you really align with humans?

So pick a subset of the listed engines and have enough FIDE players with "nearly static" playing strength and accurate ratings play enough games against them (problem: people learn while playing a match). Then create an average as Miguel suggests, or run a round-robin tournament with all the engines listed by Miguel and then create an average.
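
As a rough sketch of how such a calibration match could be turned into a number: using the standard logistic Elo expectation, an engine's rating on the FIDE scale can be estimated as the rating at which its expected total score against the human opponents equals the score it actually made. The function names, the bisection bounds, and the example figures below are assumptions for illustration; this ignores draw rates, colour, and the learning problem mentioned above.

```python
def expected_score(rating, opponent):
    """Standard logistic Elo expectation for one game."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def performance_rating(opponent_ratings, score, lo=0.0, hi=4000.0):
    """Find by bisection the rating whose expected total equals `score`.

    Only meaningful when 0 < score < number of games; a perfect or
    zero score has no finite solution and just returns a bound.
    """
    for _ in range(60):
        mid = (lo + hi) / 2.0
        expected = sum(expected_score(mid, o) for o in opponent_ratings)
        if expected < score:
            lo = mid  # engine must be stronger than mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical match: 10 games against 2400-rated humans, 7.5 points scored.
humans = [2400.0] * 10
engine_fide_estimate = performance_rating(humans, 7.5)
```

The gap between an estimate like this and the engine's list rating, averaged over the reference engines, would give the offset needed to align the list with human ratings.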

Now, we are down to two fundamental problems:
1) How to deal with new faster hardware?
2) Is a 2400 player of 2012 the same as a 2400 player of 1998? Does "rating drift" exist? If so, recalibration is a periodic need.
