ChessWar XI Promotion : list of participants

YL84

Re: Rating-scale problem

Post by YL84 »

Hi,
don't worry too much about Elo ratings; they are just statistics, so they cannot be deterministic. There are a lot of limitations to Elo evaluation (you know them all, H.G.): for instance, you should play against an infinite number of opponents
over a large range of strengths (or at least a random range of strengths), which is not realistic. There are a lot of threads dealing with the strangeness of Elo ratings, and this one will not be the last (probably we should build a FAQ on the limitations of ratings). Obtaining a good rating between engines with a large difference in strength would require thousands of games for reliable results. The Elo rating is not perfect at all.
My 2 cents,
Yves
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Rating-scale problem

Post by hgm »

After checking the ratings against the third-round results:

It is interesting to see that the results from the first 30 matches (the best-performing engines so far) are quite well predicted by both Olivier's ratings and mine. I should say that of the 42 matches for which I have rated both opponents, there are only 9 amongst these first 30. I predict 2.7 pts for the weaker engines in these, and observe 2.5. Olivier predicts 9.88 (out of 30) and observes 9. Apparently the ratings obtained from other events (many of these engines are demotees from division F) are not as distorted as those from the Promo.

For the remaining matches it doesn't work so well: Olivier's ratings predict 9.7 pts (out of 43) for the weaker engines, and we see only 3.5. My ratings predict 7.7 (out of 33; there were some closely matched pairings in this round), and I observe only 1.5. So even my ratings predict too high a score.
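
For reference, a prediction of this kind can be sketched as below. This is only a minimal sketch assuming the standard logistic Elo formula; the thread does not say which model either rating list actually uses, and the function names and the ratings in the example are made up for illustration.

```python
def expected_score(rating_deficit):
    """Expected score per game of the weaker engine.

    rating_deficit: stronger rating minus weaker rating, in Elo.
    Assumes the standard logistic Elo curve (an assumption, not
    necessarily the model used for the ratings in this thread).
    """
    return 1.0 / (1.0 + 10.0 ** (rating_deficit / 400.0))


def predicted_points(pairings):
    """Sum the expected points of the weaker engine over all pairings.

    pairings: iterable of (stronger_rating, weaker_rating, games).
    """
    return sum(
        games * expected_score(stronger - weaker)
        for stronger, weaker, games in pairings
    )


# Hypothetical ratings, one game per match, purely for illustration.
example_pairings = [
    (2000, 2000, 1),  # closely matched: contributes about half a point
    (2400, 2000, 1),  # 400 Elo apart: contributes less than 0.1 point
]
print(round(predicted_points(example_pairings), 2))
```

Note how the closely matched pairing supplies most of the predicted points, so a few misranked engines among such pairings can shift the total noticeably.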

I am not sure this prediction error can be ascribed to the rating scale still being too compressed. As I said, most predicted points came from closely matched engines. So it might be that the overprediction in this case comes about from having some engines ranked wrong on an otherwise accurate scale. (These ratings were derived only from the previous Promo, 11 games per engine. So there are bound to be many engines that were quite lucky there.)
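
To put a rough number on how lucky an engine can be in 11 games, here is a back-of-the-envelope sketch. It assumes independent games, a logistic Elo curve, opponents whose ratings are known exactly, and a first-order (delta-method) conversion from score error to Elo error; the function name and the default per-game standard deviation of 0.5 (the no-draws worst case) are assumptions, not something taken from the Promo data.

```python
import math


def elo_standard_error(games, score=0.5, per_game_sd=0.5):
    """Approximate standard error (in Elo) of a performance rating.

    games:       number of independent games played
    score:       average score fraction, strictly between 0 and 1
    per_game_sd: standard deviation of a single game result
                 (0.5 is the maximum, reached at 50% with no draws)
    """
    # Standard error of the mean score over `games` games.
    se_score = per_game_sd / math.sqrt(games)
    # Slope of the logistic Elo curve: d(Elo)/d(score).
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return slope * se_score


print(round(elo_standard_error(11)))       # 11 games, as in the previous Promo
print(round(elo_standard_error(1000)))     # for comparison: a thousand games
```

Under these assumptions the one-sigma error after 11 games is on the order of 100 Elo, which is large compared to the rating differences between closely matched engines.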

I have not yet studied the results of individual games.