SCCT Rating List - Calculation by EloStat 1.3

Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: SCCT Rating List - Calculation by EloStat 1.3

Post by Adam Hair »

This is what I found a few weeks ago when studying the effects of changing the White advantage, drawElo, and scale parameters:

Elo Diff is the difference between Houdini 2.0c 6 CPU and Numpty Omega
The equation for the scale is: scale = (4*10^(drawElo/400))/(1+10^(drawElo/400))^2

Code:

                   Elo Diff            drawElo              scale
PURE LIST
default              1969                97.3                .925497
mm 1 1               1956                92.7639             .931969
mm 1 1, scale 1      2099                92.7639             1

COMPLETE LIST
default              1980                97.3                .925497
mm 1 1               2034               117.81               .893294
mm 1 1, scale 1      2277               117.81               1
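
For anyone who wants to reproduce the scale column, here is a minimal Python sketch of the equation above. The function name `bayeselo_scale` is only illustrative (it is not a Bayeselo command), and the drawElo inputs are simply the values from the table.

```python
def bayeselo_scale(draw_elo: float) -> float:
    """scale = 4*x / (1 + x)^2 with x = 10^(drawElo/400), as in the equation above."""
    x = 10.0 ** (draw_elo / 400.0)
    return 4.0 * x / (1.0 + x) ** 2


# drawElo values taken from the table above
for label, draw_elo in [("default", 97.3),
                        ("pure, mm 1 1", 92.7639),
                        ("complete, mm 1 1", 117.81)]:
    print(f"{label:18s} drawElo={draw_elo:8.4f}  scale={bayeselo_scale(draw_elo):.6f}")
```

The printed values should agree with the scale column up to rounding. A drawElo of 0 gives a scale of exactly 1, and a larger drawElo gives a smaller scale.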

The pure list and the complete list were definitely comparable with the default values. That is because the White advantage, drawElo, and (by default calculation) scale values were the same for both lists.

When 'mm 1 1' alone is used, we start to see some differences. The average Elo difference between opponents in the pure database is greater, leading to a lower draw rate (and a lower drawElo value). The Elo difference between Numpty and Houdini actually decreases somewhat as compared to the pure list with default values. The scale increased, but the lower drawElo value causes a lower Elo difference estimate for any particular result. Meanwhile, the difference between Numpty and Houdini on the complete list increases due to the higher drawElo value.

When scale is set to 1, the Elo difference between the reference engines increases for both lists.

I sometimes forget that I have previously thought over certain things. I remember now that I decided to worry a little less about the effects of setting the scale to 1, because using 'mm 1 1' does make comparisons between different types of databases less reasonable. The CCRL 40/4 database (complete or pure) has different characteristics than IPON. It makes complete sense to me to use the most accurate settings to compute the IPON ratings, because it is a more compact database. The IPON database is more homogeneous (the characteristics of the higher Elo games and lower Elo games of IPON are not too different) than the CCRL 40/4 list. As for the CCRL 40/4, the average draw rate does not reflect the draw rates of the top or bottom of the list very well. Thus, I do not think the drawElo value computed from the 40/4 games necessarily gives more accurate ratings. I think that it surely must be more accurate than using the default values, but maybe (given the more heterogeneous nature of the 40/4 database) it is not very much more accurate.

I have been striving to make the CCRL ratings more accurate, but maybe any improvement that has resulted from my suggestions is not worth it.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: SCCT Rating List - Calculation by EloStat 1.3

Post by michiguel »

Adam Hair wrote: This is what I found a few weeks ago when studying the effects of changing the White advantage, drawElo, and scale parameters: [...]
FWIW, I just calculated both lists (fixing Gosu at 2300) with Ordo and got:

Pure list

Code:

   # ENGINE                             : RATING    POINTS  PLAYED    (%)
   1 Houdini 2.0c 64-bit 6CPU           : 3308.7    1001.0    1350   74.1%
...
 140 Gosu 0.16                          : 2300.0    1650.5    3153   52.3%
...
 344 Numpty Omega                       : 1281.5     276.5    1022   27.1%

Complete list

Code:

   # ENGINE                                  : RATING    POINTS  PLAYED    (%)
   1 Houdini 2.0c 64-bit 6CPU                : 3300.6    1963.5    2859   68.7%
...
 715 Gosu 0.16                               : 2300.0    2764.5    5383   51.4%
...
1009 Numpty Omega                            : 1277.3     294.5    1082   27.2%

Which is a difference of 2027 and 2023, respectively.

Miguel
Modern Times
Posts: 3783
Joined: Thu Jun 07, 2012 11:02 pm

Re: SCCT Rating List - Calculation by EloStat 1.3

Post by Modern Times »

The problem we have:

Houdini 2.0 64-bit 6CPU - pure list 3309
Houdini 2.0 64-bit 6CPU - overall list 3396

When someone looks at the CCRL list for a rating, he finds two very different values. So is it 3309 or 3396? If an engine author wants to quote a CCRL Elo value, which value does he choose?

This is the big problem we have. It is not that our lists can't be compared to CEGT or anyone else, but that they can't be compared *within themselves*.

In my opinion we need to revert to Bayeselo default values. The situation above is unacceptable to me.
Modern Times
Posts: 3783
Joined: Thu Jun 07, 2012 11:02 pm

Re: SCCT Rating List - Calculation by EloStat 1.3

Post by Modern Times »

Here is the current rating list (scale 1 and mm 1 1)
http://www.computerchess.org.uk/ccrl/404/

Here is bayeselo default (temporary link just for a few days)
http://www.computerchess.org.uk/ccrl/404.default/

And here is just mm 1 1 (temporary link just for a few days)
http://www.computerchess.org.uk/ccrl/404.mm11/

You can see immediately how mm 1 1 on its own affects the pure list vs the complete list.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: SCCT Rating List - Calculation by EloStat 1.3

Post by Daniel Shawul »

michiguel wrote: Which is a difference of 2027 and 2023, respectively.

Miguel
The way you try to push the idea that Ordo is giving better results is amusing, to say the least. Ordo doesn't have a draw model (in other words, it lacks something important), but Bayeselo does detect the draw ratios very well. For the curious observer, it is natural to expect a higher draw ratio for the complete list, because the complete list contains multiple versions of the same engine, which will draw more often against each other. I am sure EloStat would do even better at equalizing the Houdini ratings (for whatever reason), since it lacks even more than Ordo. But that doesn't make it better at all, period.
Modern Times
Posts: 3783
Joined: Thu Jun 07, 2012 11:02 pm

Re: SCCT Rating List - Calculation by EloStat 1.3

Post by Modern Times »

Daniel Shawul wrote: The way you try to push ordo is giving better results is amusing to say the least.
I don't think that is what he is doing. The thread started with a comparison of EloStat and Bayeselo, and it is useful to see another perspective.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: SCCT Rating List - Calculation by EloStat 1.3

Post by Daniel Shawul »

Modern Times wrote:The problem we have:

Houdini 2.0 64-bit 6CPU - pure list 3309
Houdini 2.0 64-bit 6CPU - overall list 3396

When someone looks at the CCRL list for a rating, he finds 2 very different values. So is it a 3309 or 3396 ? If an engine author wants to quote a CCRL Elo value, which value does he choose ?
Of course he can quote the CCRL pure list and complete list ratings separately, just as he would mention a blitz rating separately. Why would you think they will have the same rating? The complete list has a higher draw ratio, which is to be expected anyway because many similar versions are in that list. With versions of the same engine, it is to be expected that more draws result. Bayeselo adjusts to that nicely. You can only expect your reference engine to have the same rating in both, because for a rating tool it is a completely different collection of games to process. Say I have Engine A play 10 against previous versions and only 1 against another engine; when you then compute the pure list, you should get quite different ratings. Nothing surprising there at all.
This is the big problem we have. Not that our lists can't be compared to CEGT or anyone else, but they can't be compared *within themselves*

In my opinion we need to revert to Bayeselo default values. The situation above is unacceptable to me.
Did you see that the scale factors are about 0.9, which means a 100 Elo difference is magnified to 111 Elo with scale = 1? Also, switching back to the default draw ratio values just to get somewhat similar ratings, for a reason you misunderstood when you said the lists can't be compared within themselves, is not correct. If the draw model is significantly different, as Bayeselo nicely detects unlike other tools, then you take that as a fault?
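
To put numbers on the magnification, here is a small Python sketch. It assumes that setting scale to 1 simply divides every rating difference by the computed scale factor. That assumption is inferred from the figures in Adam's table (1956 / .931969 and 2034 / .893294), not taken from the Bayeselo source, and the helper name `unscaled_diff` is made up for illustration.

```python
def unscaled_diff(elo_diff: float, scale: float) -> float:
    """Elo difference expected with scale = 1, assuming diff_scale1 = diff / scale."""
    return elo_diff / scale


# Rounded scale of 0.9, as in the post above
print(round(unscaled_diff(100, 0.9)))        # 111

# 'mm 1 1' rows from Adam's table: (Elo diff, scale)
print(round(unscaled_diff(1956, 0.931969)))  # 2099 (pure list)
print(round(unscaled_diff(2034, 0.893294)))  # 2277 (complete list)
```

The last two results match the 'mm 1 1, scale 1' rows of the table, which is consistent with the assumption but does not prove that this is how Bayeselo applies the scale internally.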
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: SCCT Rating List - Calculation by EloStat 1.3

Post by Daniel Shawul »

Modern Times wrote:
Daniel Shawul wrote: The way you try to push ordo is giving better results is amusing to say the least.
I don't think that is what he is doing. The thread started with a comparison of EloStat and Bayeselo, it is useful to see another perspective
It is not an opinion but a fact that Ordo doesn't have a draw model and Bayeselo does, and yet the way it is presented here is misleading to the non-curious observer, which is what I object to. The fact is that non-existent problems like that, such as the 'compression effect', have been smartly spread around to downplay Bayeselo. Ask anyone in the programmers forum and he will tell you what that is. I don't want to see that happening again. It would be more honest to say "since Ordo doesn't have a draw model, it gives more or less similar ratings". Then we would ask whether it is good or bad to have a draw model. I say it is good.
Sedat Canbaz
Posts: 3018
Joined: Thu Mar 09, 2006 11:58 am
Location: Antalya/Turkey

Re: SCCT Rating List - Calculation by EloStat 1.3

Post by Sedat Canbaz »

Dear Experts,

I have a few questions:


I do not understand why BayesElo calculated such strange Elo results:
1) I wonder why Houdini 2.0t3's Elo went down from 3363 to 3359.
*Note that Houdini 2.0t3's Elo fell without it playing a single game.

I also noticed another strange result: Fruit 090705's Elo increased by 16.
*Note also that Fruit 090705's Elo increased after playing only 50 more games (against Rybka 4.1 NO-SSE x64 6c).

Is such a +16 Elo increase possible, even after such a low performance by Fruit?

Code:

Individual statistics:
Fruit 090705 x64 6c  vs Rybka 4.1 NO-SSE x64 6c 
 50 (+  0,=  8,- 42),  8.0 % 


Rank Name                          Elo    +    - games score oppo. draws 
   1 Houdini 2.0t3 Pro x64 6c     3363   12   12  1700   70%  3212   39% 
  39 Fruit 090705 x64 6c          2965   15   15  1150   23%  3178   29% 

Rank Name                          Elo    +    - games score oppo. draws 
   1 Houdini 2.0t3 Pro x64 6c     3359   14   14  1700   70%  3217   39% 
  39 Fruit 090705 x64 6c          2981   18   18  1200   23%  3190   29%

Thanks in advance,
Sedat
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: SCCT Rating List - Calculation by EloStat 1.3

Post by michiguel »

Daniel Shawul wrote:
Which is a difference of 2027 and 2023, respectively.

Miguel
The way you try to push ordo is giving better results is amusing to say the least.
What??
I just released a new version, I am testing it, and I am sharing what I see. Am I supposed to hide the results? Who is trying to push anything?

Ordo don't have a draw model (in other words lacks something important)
I disagree.

Miguel

but bayeselo do detect the draw ratios very well. For the curious observer, it is natural to expect more draw ratios for the complete list, because complete list contains has duplicate versions of the same engine which will have more draw ratios. I am sure elostat would do better to equal Houdinig ratings (for whatever reason) since it even lacks more than ordo. But that doesn't make it better at all period.