CRoberson wrote:
Adam Hair wrote:
To anyone who reads this:
What would be your reaction if we purposely disconnected the CCRL from any comparison to human ratings?
What if we set the rating of the top engine to 0 Elo, so that each engine's rating directly indicates how many Elo it is behind the leading program?
My thought is to treat them as what they really are -- ranking lists, not rating lists. So, drop reporting the ratings; keep them for calculating rankings.
Your solution is reasonable and it does solve the big problem.
I believe that, by not reporting the ratings, we would be throwing out the bit of information that the rating lists are good for. The rating lists do a good job of answering the easier question "What group of engines is Engine X comparable to?". Rankings do not tell us about the gaps in the relative strengths of programs. Looking at 1CPU engines at 40/4, Rybka 4.1 64-bit is ranked 6th and Fritz 13 is ranked 7th. Yet Fritz 13 is not in Rybka's league. Also, it is not clear if Rybka should be ranked 6th or possibly as high as 2nd.
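To illustrate, the "top engine = 0 Elo" idea from my quoted question is a one-line transformation of the existing list, and it keeps exactly the gap information that a bare ranking throws away. A minimal sketch in Python (the engine names and ratings here are invented, not taken from the list):

Code:
# Re-anchor a rating list so the leader sits at 0 Elo and every other
# rating reads as "Elo behind the leader". Numbers are made up.
ratings = {"EngineA": 3150, "EngineB": 3120, "EngineC": 2980}

top = max(ratings.values())
behind = {name: r - top for name, r in ratings.items()}
print(behind)  # {'EngineA': 0, 'EngineB': -30, 'EngineC': -170}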
If rankings were used, the LOS (likelihood of superiority) data should be included. But that is probably indecipherable to some people.
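For anyone curious, LOS is usually estimated from the head-to-head score with a normal approximation. A minimal sketch (this is the common draws-ignored simplification, not necessarily what any particular list uses):

Code:
from math import erf, sqrt

def los(wins, losses):
    # Likelihood of superiority from decisive games only;
    # draws carry no information in this approximation.
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

print(los(60, 40))  # ~0.98: very likely the stronger engine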
It would be interesting to compute rankings using a minimum rank violation model. I have seen it used to model college football rankings. One problem is that a minimum-violation ranking is generally not unique. Also, I don't know if it is computationally feasible to use with 300+ engines.
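To make the feasibility concern concrete: the brute-force version enumerates all n! orderings and keeps one with the fewest upsets, which is hopeless at 300+ engines (practical implementations use heuristics or integer programming instead). A toy sketch with an invented three-engine cycle, which also shows the non-uniqueness:

Code:
from itertools import permutations

# results[a][b] = 1 if a beat b head-to-head. Invented data forming a
# cycle (a > b > c > a), so every ordering has at least one violation.
results = {
    "a": {"b": 1, "c": 0},
    "b": {"a": 0, "c": 1},
    "c": {"a": 1, "b": 0},
}

def violations(order):
    # Count pairs where the lower-ranked engine beat the higher-ranked one.
    return sum(results[order[j]][order[i]]
               for i in range(len(order))
               for j in range(i + 1, len(order)))

best = min(permutations(results), key=violations)
print(best, violations(best))  # several orderings tie at 1 violation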
CRoberson wrote:
The problem is:
Accurate software ratings are impossible. Why? Because everybody in the world has different hardware. Each processor of my computer is 6x faster than your base machine (AMD 4600). This means that nearly all programs on your list get a rating boost from the speed-up. The other big issue is that the boost is dynamic: some programs get more than others due to bugs and so forth.
Without a doubt you are correct. The ratings lose almost all validity when applied under conditions that differ too much from those used to construct the list.
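To put a rough number on that boost: a common rule of thumb (an assumption on my part, not something measured for this list) is somewhere around 50-70 Elo per doubling of speed, and a 6x speed-up is about 2.6 doublings:

Code:
from math import log2

doublings = log2(6)                  # ~2.58 doublings for a 6x speed-up
for elo_per_doubling in (50, 70):    # rule-of-thumb range, not measured
    print(round(doublings * elo_per_doubling))  # prints 129 and 181

So the per-engine boost could plausibly be on the order of 100-200 Elo, and, as you say, it will not be uniform across programs.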
CRoberson wrote:
Ares has played several human GMs (since the last version) online and in person. The best they have done is a draw. Of course, I use my 6x faster hardware. It is quite clear that "one number fits all" doesn't work.
All the more reason for me to believe that the ratings on the 40/40 list are not necessarily too high. As for the 40/4 list, who knows what the valid numbers are? I have not done nearly enough time-odds testing to answer that.
CRoberson wrote:
I see only two directions to fix it.
1) Make the ratings unrelated to humans. They are not well correlated to humans as is.
2) Adjust the time controls to keep up with the best HW, not old HW. If the best HW is 6x faster, make the TCs 6x longer, or get better HW, or do it like the SSDF, which reports a rating for each HW and SW combination. That is likely best if you can line the ratings up with humans.
Solution #2 is not feasible for the CCRL or the CEGT. Solution #1 is unacceptable to many of those who pay attention to the rating lists. Maybe our best bet is to emulate James T. Kirk.
CRoberson wrote:
Sounds like CCRL conformed for the sake of conformity. Sounds bad. OTOH, y'all are trying!