CRoberson wrote:
Adam Hair wrote:
To anyone who reads this:
What would be your reaction if we purposely disconnected the CCRL from any comparison to human ratings?
What if we set the rating of the top engine to 0 Elo, so that each engine's rating directly indicates how many Elo it is behind the leading program?
My thought is to treat them as what they really are -- ranking lists, not rating lists. So, drop reporting the ratings; keep them for calculating rankings.
Your solution is reasonable and it does solve the big problem.
I believe that, by not reporting the ratings, we would be throwing out the bit of information that the rating lists are good for. The rating lists do a good job of answering the easier question "What group of engines is Engine X comparable to?". Rankings do not tell us about the gaps in the relative strengths of programs. Looking at 1CPU engines at 40/4, Rybka 4.1 64-bit is ranked 6th and Fritz 13 is ranked 7th. Yet Fritz 13 is not in Rybka's league. Also, it is not clear if Rybka should be ranked 6th or possibly as high as 2nd.
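To illustrate, the "top engine = 0 Elo" idea from my quoted question is a one-line transformation of the existing list, and it keeps exactly the gap information that a bare ranking throws away. A minimal sketch in Python (the engine names and ratings here are invented, not taken from the list):

Code:
# Re-anchor a rating list so the leader sits at 0 Elo and every other
# rating reads as "Elo behind the leader". Numbers are made up.
ratings = {"EngineA": 3150, "EngineB": 3120, "EngineC": 2980}

top = max(ratings.values())
behind = {name: r - top for name, r in ratings.items()}
print(behind)  # {'EngineA': 0, 'EngineB': -30, 'EngineC': -170}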
If rankings were used, the LOS (likelihood of superiority) data should be included. But that is probably indecipherable to some people.
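For anyone curious, LOS is usually estimated from the head-to-head score with a normal approximation. A minimal sketch (this is the common draws-ignored simplification, not necessarily what any particular list uses):

Code:
from math import erf, sqrt

def los(wins, losses):
    # Likelihood of superiority from decisive games only;
    # draws carry no information in this approximation.
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

print(los(60, 40))  # ~0.98: very likely the stronger engine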
It would be interesting to compute rankings using a minimum rank violation model. I have seen it used to model college football rankings. One problem is that a minimum-violation ranking is generally not unique. Also, I don't know if it is computationally feasible to use with 300+ engines.
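To make the feasibility concern concrete: the brute-force version enumerates all n! orderings and keeps one with the fewest upsets, which is hopeless at 300+ engines (practical implementations use heuristics or integer programming instead). A toy sketch with an invented three-engine cycle, which also shows the non-uniqueness:

Code:
from itertools import permutations

# results[a][b] = 1 if a beat b head-to-head. Invented data forming a
# cycle (a > b > c > a), so every ordering has at least one violation.
results = {
    "a": {"b": 1, "c": 0},
    "b": {"a": 0, "c": 1},
    "c": {"a": 1, "b": 0},
}

def violations(order):
    # Count pairs where the lower-ranked engine beat the higher-ranked one.
    return sum(results[order[j]][order[i]]
               for i in range(len(order))
               for j in range(i + 1, len(order)))

best = min(permutations(results), key=violations)
print(best, violations(best))  # several orderings tie at 1 violation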
CRoberson wrote:
The problem is:
Accurate software ratings are impossible. Why? Because everybody in the world has different hardware. Each processor of my computer is 6x faster than your base machine (AMD 4600). This means that nearly all programs on your list get a rating boost from the speed-up. The other big issue is that the boost is dynamic: some programs get more than others due to bugs and so forth.
Without a doubt you are correct. The ratings lose almost all validity when applied under conditions that differ too much from those used to construct the list.
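To put a rough number on that boost: a common rule of thumb (an assumption on my part, not something measured for this list) is somewhere around 50-70 Elo per doubling of speed, and a 6x speed-up is about 2.6 doublings:

Code:
from math import log2

doublings = log2(6)                  # ~2.58 doublings for a 6x speed-up
for elo_per_doubling in (50, 70):    # rule-of-thumb range, not measured
    print(round(doublings * elo_per_doubling))  # prints 129 and 181

So the per-engine boost could plausibly be on the order of 100-200 Elo, and, as you say, it will not be uniform across programs.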
CRoberson wrote:
Ares has played several human GMs (since the last version) online and in person. The best they have done is a draw. Of course, I use my 6x faster hardware. It is quite clear that "one number fits all" doesn't work.
All the more reason for me to believe that the ratings on the 40/40 list are not necessarily too high. As for the 40/4 list, who knows what the valid numbers are? I have not done nearly enough time-odds testing to answer that.
CRoberson wrote:
I see only two directions to fix it.
1) Make the ratings unrelated to humans. They are not well correlated to humans as is.
2) Adjust the time controls to keep up with the best HW, not old HW. If the best HW is 6x faster, make the TCs 6x longer, or get better HW, or do it like the SSDF, which reports a rating for each HW and SW combination. That is likely best if you can line the ratings up with humans.
Solution #2 is not feasible for the CCRL or the CEGT. Solution #1 is unacceptable to many of those who pay attention to the rating lists. Maybe our best bet is to emulate James T. Kirk.
CRoberson wrote:
Sounds like CCRL conformed for the sake of conformity. Sounds bad. OTOH, y'all are trying!