Also-Rans list updated

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

User avatar
Ajedrecista
Posts: 2188
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Also-Rans list updated.

Post by Ajedrecista »

Hello:
Adam Hair wrote:There is a definite difference.

The ratings for the Also-Rans were actually computed from the entire CCRL 40/4 database, from which I am using the engines rated 2200 Elo or less (as computed with Ordo). Here are the Ordo ratings for the top and bottom of the entire database:

Code: Select all

 # ENGINE                                     : RATING    POINTS  PLAYED    (%) 
   1 Houdini 3 64-bit 4CPU                      : 3417.5    1320.5    1644   80.3% 
   2 Houdini 2.0c 64-bit 4CPU                   : 3360.1    1833.5    2465   74.4% 
   3 Houdini 1.5a 64-bit 4CPU                   : 3345.7    1196.5    1582   75.6% 
   4 Critter 1.6a 64-bit 4CPU                   : 3308.2     980.0    1450   67.6% 
   5 Houdini 3 64-bit                           : 3306.7     317.5     472   67.3% 
   6 Stockfish 2.3.1 64-bit 4CPU                : 3298.7     566.0     951   59.5% 
   7 Critter 1.2 64-bit 4CPU                    : 3287.8     913.5    1331   68.6% 
   8 Rybka 4.1 64-bit 4CPU                      : 3283.2    1375.0    2072   66.4% 
.................................................................................................................. 
1198 MicroChess 1976                            :  451.4     110.5     196   56.4% 
1199 NEG 0.3d                                   :  402.8     164.0     435   37.7% 
1200 Ram 2.0                                    :  375.2     144.5     435   33.2% 
1201 LaMoSca 0.10                               :  305.4      89.0     286   31.1% 
1202 CPP1                                       :  282.2      73.0     255   28.6% 
1203 ACE 0.1                                    :  144.9      67.0     473   14.2% 
1204 POS 1.20                                   :  110.9      55.5     298   18.6% 
1205 Brutus RND                                 :    0.0      32.0     306   10.5%



Now, here are the BayesElo ratings, using the computed drawelo and the default scale parameter:

Code: Select all

Rank Name                                      Elo    +    - games score oppo. draws 
   1 Houdini 3 64-bit 4CPU                    3177   17   17  1644   80%  2944   29% 
   2 Houdini 2.0c 64-bit 4CPU                 3127   14   14  2465   74%  2943   32% 
   3 Houdini 1.5a 64-bit 4CPU                 3117   17   17  1582   76%  2920   31% 
   4 Houdini 3 64-bit                         3081   27   27   472   67%  2962   40% 
   5 Critter 1.6a 64-bit 4CPU                 3080   16   16  1450   68%  2956   42% 
   6 Stockfish 2.3.1 64-bit 4CPU              3071   19   19   951   60%  3009   47% 
   7 Critter 1.2 64-bit 4CPU                  3060   17   17  1331   69%  2925   40% 
   8 Rybka 4.1 64-bit 4CPU                    3058   14   14  2072   66%  2940   39% 
............................................................................................................... 
1198 MicroChess 1976                           373   55   55   196   56%   322   31% 
1199 NEG 0.3d                                  330   45   45   435   38%   473   29% 
1200 Ram 2.0                                   311   46   46   435   33%   489   30% 
1201 LaMoSca 0.10                              257   49   49   286   31%   449   61% 
1202 CPP1                                      227   54   54   255   29%   447   23% 
1203 ACE 0.1                                   111   52   52   473   14%   609   22% 
1204 POS 1.20                                   80   56   56   298   19%   462   22% 
1205 Brutus RND                                  0   60   60   306   10%   462   21%
Michel wrote:Thanks. This is interesting. The difference seems to be about 10% over the entire ELO scale.
It could be interesting to compare not only [(maximum rating of Ordo) - (minimum rating of Ordo)]/[(maximum rating of BayesElo) - (minimum rating of BayesElo)] but also the distributions of the ratings in a dimensionless way. I propose the following math:

Code: Select all

o_i = [(Ordo rating_i) - (minimum rating of Ordo)]/[(maximum rating of Ordo) - (minimum rating of Ordo)]
b_i = [(BayesElo rating_i) - (minimum rating of BayesElo)]/[(maximum rating of BayesElo) - (minimum rating of BayesElo)]
Then plot the sequences {o_1 = 1, o_2, ..., o_1204, o_1205 = 0} and {b_1 = 1, b_2, ..., b_1204, b_1205 = 0} and compare them. In this case, since the minimum rating is 0 in both lists: o_i = (Ordo rating_i)/3417.5; b_i = (BayesElo rating_i)/3177. For the top and bottom engines (rounded to four decimal places):

Code: Select all

     Engine:                        o_i         b_i

Houdini 3 64-bit 4CPU              1.0000      1.0000
Houdini 2.0c 64-bit 4CPU           0.9832      0.9843
Houdini 1.5a 64-bit 4CPU           0.9790      0.9811
Critter 1.6a 64-bit 4CPU           0.9680      0.9695
Houdini 3 64-bit                   0.9676      0.9698
Stockfish 2.3.1 64-bit 4CPU        0.9652      0.9666
Critter 1.2 64-bit 4CPU            0.9620      0.9632
Rybka 4.1 64-bit 4CPU              0.9607      0.9625
[...]
MicroChess 1976                    0.1321      0.1174
NEG 0.3d                           0.1179      0.1039
Ram 2.0                            0.1098      0.0979
LaMoSca 0.10                       0.0894      0.0809
CPP1                               0.0826      0.0715
ACE 0.1                            0.0424      0.0349
POS 1.20                           0.0325      0.0252
Brutus RND                         0.0000      0.0000
I did the calculations with a Casio calculator, so this table may contain errors.

You can see that the differences between the columns are not negligible at all. Whether the table is actually useful is another matter.
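The normalization is easy to reproduce. A minimal Python sketch, with a handful of the Ordo and BayesElo ratings from the tables above hard-coded for illustration:

```python
# Dimensionless rescaling of two rating lists to [0, 1], as proposed above.
# Since the minimum rating is 0 in both lists (Brutus RND is the anchor),
# o_i and b_i reduce to rating_i / max_rating.

def normalize(ratings):
    lo, hi = min(ratings), max(ratings)
    return [(r - lo) / (hi - lo) for r in ratings]

# A few ratings copied from the tables above (top three, MicroChess, Brutus RND):
ordo = [3417.5, 3360.1, 3345.7, 451.4, 0.0]
bayes = [3177, 3127, 3117, 373, 0]

o = normalize(ordo)
b = normalize(bayes)
print([round(x, 4) for x in o])  # [1.0, 0.9832, 0.979, 0.1321, 0.0]
print([round(x, 4) for x in b])  # [1.0, 0.9843, 0.9811, 0.1174, 0.0]
```

This reproduces the corresponding rows of the o_i / b_i table.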

Regards from Spain.

Ajedrecista.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Also-Rans list updated

Post by lucasart »

Michel wrote:If you run a test that way then the (scaled) BayesElo differences will be smaller than the logistic ones (which I assume you want to fix at 500 Elo).

BayesElo and Logistic are simply different Elo models, which are also both different from the "real" model, which would need to take into account electron spin, etc. Scaling is an attempt to match both models in the realistic scenario that the great majority of games are between engines which are not too far apart Elo-wise.
If I understand correctly, the scale used by BayesElo transforms BayesElo ratings into Elo and vice versa. The transformation that cannot be done analytically is (Elo, DrawElo) --> BayesElo. Rémi uses an approximate formula which is designed to work for small Elo values, but if we were to replace this scale() function with a simple dichotomy (bisection), wouldn't that improve at least the approximation part of the scaling? It still wouldn't make the results match, but perhaps they would get closer.
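Such a dichotomy is straightforward to sketch. The Python illustration below (my own code, not BayesElo's) solves for the BayesElo difference that yields the same expected score as a given logistic difference, using the win/loss probabilities of Coulom's model:

```python
# Expected score under the BayesElo model (Coulom), with d the rating
# difference and drawelo the draw parameter:
#   P(win)  = 1 / (1 + 10**((drawelo - d) / 400))
#   P(loss) = 1 / (1 + 10**((drawelo + d) / 400))
#   E(d)    = P(win) + P(draw)/2 = (1 + P(win) - P(loss)) / 2

def bayes_expected_score(d, drawelo):
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - d) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + d) / 400.0))
    return (1.0 + p_win - p_loss) / 2.0

def logistic_expected_score(d):
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def logistic_to_bayes(d_logistic, drawelo, lo=-5000.0, hi=5000.0):
    """Invert E(d) by bisection: find the BayesElo difference giving the
    same expected score as the logistic difference d_logistic."""
    target = logistic_expected_score(d_logistic)
    for _ in range(100):  # E(d) is monotonically increasing in d
        mid = 0.5 * (lo + hi)
        if bayes_expected_score(mid, drawelo) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Since the expected score is strictly increasing in d, 100 halvings of the bracket pin the answer down far below any practical precision.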
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
User avatar
hgm
Posts: 28457
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Also-Rans list updated

Post by hgm »

Michel wrote:If you run a test that way then the (scaled) BayesElo differences will be smaller than the logistic ones (which I assume you want to fix at 500 Elo).
BayesElo does use a logistic distribution, doesn't it?
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Also-Rans list updated

Post by Michel »

BayesElo does use a logistic distribution, doesn't it?
Yes, but the incorporation of draw_elo changes the derivative of Elo versus expected score. The default scaling (which is a function of draw_elo) makes the derivatives of the logistic model and the modified version used by BayesElo equal at elo = 0.

But this has the effect that for large Elo differences, where the effect of draw_elo is less important, BayesElo ratings become smaller than logistic ones (because they are scaled).

Now, large Elo differences are not measured directly but are a composition of many small Elo differences. This makes the effect of the "rating compression" much less pronounced. According to Adam, the difference is about 10%.
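The derivative matching can be made concrete numerically. Writing x = 10^(draw_elo/400) and equating the slopes of the two expected-score curves at elo = 0 gives a scale factor of 4x/(1+x)^2; the Python sketch below (my own illustration of the matching, not BayesElo's code) checks this against a finite-difference slope:

```python
# Match the slope of the BayesElo expected-score curve to the logistic one
# at d = 0. The analytic result is scale = 4*x / (1+x)**2, x = 10**(drawelo/400).

def default_scale(drawelo):
    x = 10.0 ** (drawelo / 400.0)
    return 4.0 * x / (1.0 + x) ** 2

def bayes_expected_score(d, drawelo):
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - d) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + d) / 400.0))
    return (1.0 + p_win - p_loss) / 2.0

def logistic_expected_score(d):
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

# Numerical check of the derivative ratio at d = 0 (central differences):
h = 1e-4
drawelo = 100.0
slope_bayes = (bayes_expected_score(h, drawelo) - bayes_expected_score(-h, drawelo)) / (2 * h)
slope_logistic = (logistic_expected_score(h) - logistic_expected_score(-h)) / (2 * h)
print(slope_bayes / slope_logistic)  # ~0.9215 for drawelo = 100
print(default_scale(drawelo))        # same value
```

The scale is below 1, which is exactly the compression described above: unscaled BayesElo differences are larger than logistic ones, and multiplying by the scale shrinks them.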
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Also-Rans list updated

Post by Michel »

The transformation that cannot be done analytically is to transform (ELO, DrawELO) --> BayesELO.
Do you mean, (fixing DrawElo), computing analytically

LogisticElo--->ExpectedScore--->BayesElo

In other words computing analytically

ExpectedScore--->BayesElo

It seems to me that the fact this function cannot be computed analytically is not really a problem since it can be trivially computed numerically (it is the inverse of a nice monotonically increasing function).

Nonetheless it seems like a nice mathematical problem to find an approximate formula. I'll think about it.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Also-Rans list updated

Post by Michel »

Another approach might be simple curve fitting.
Let f be the function

LogisticElo--->UnscaledBayesElo

What do we know about f?

(1) f(0)=0
(2) f'(0) can be computed (I did not bother to do it yet).
(3) f(-x)= -f(x) (f is "odd") (this implies (1) of course).
(4) f(x)=x for x=plus or minus infinity

Put f'(0)=a. The simplest function that I could come up with that has these properties
is

f(x)=(x^3+ax)/(x^2+1)

I did not check if this particular function works well.
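A quick numerical sanity check of this ansatz is easy to sketch in Python. Below, the exact map is obtained by inverting the BayesElo expected-score function by bisection, and a = f'(0) = (1+x0)^2/(4*x0) with x0 = 10^(drawelo/400) comes from matching slopes at 0 (my derivation, so treat it as a hedged illustration):

```python
# Compare the ansatz f(x) = (x**3 + a*x)/(x**2 + 1) against the exact
# LogisticElo -> UnscaledBayesElo map, obtained by bisection on the
# BayesElo expected-score function.

def bayes_expected_score(d, drawelo):
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - d) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + d) / 400.0))
    return (1.0 + p_win - p_loss) / 2.0

def exact_f(d_logistic, drawelo):
    target = 1.0 / (1.0 + 10.0 ** (-d_logistic / 400.0))
    lo, hi = -5000.0, 5000.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if bayes_expected_score(mid, drawelo) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

drawelo = 100.0
x0 = 10.0 ** (drawelo / 400.0)
a = (1.0 + x0) ** 2 / (4.0 * x0)  # slope of the map at 0

def ansatz(x):
    return (x ** 3 + a * x) / (x ** 2 + 1.0)

for d in (10.0, 50.0, 200.0, 800.0):
    print(d, ansatz(d), exact_f(d, drawelo))
```

Printing a few sample points side by side makes it easy to see how well the candidate tracks the exact map.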
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Also-Rans list updated

Post by Laskos »

Michel wrote:Another approach might be simple curve fitting.
Let f be the function

LogisticElo--->UnscaledBayesElo

What do we know about f?

(1) f(0)=0
(2) f'(0) can be computed (I did not bother to do it yet).
(3) f(-x)= -f(x) (f is "odd") (this implies (1) of course).
(4) f(x)=x for x=plus or minus infinity

Put f'(0)=a. The simplest function that I could come up with that has these properties
is

f(x)=(x^3+ax)/(x^2+1)

I did not check if this particular function works well.
Are you sure f(x) = x as x goes to infinity? For the same ExpectedScore -> 1, f(x) could be b*x, with b > 0. Or even non-linear.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Also-Rans list updated

Post by Michel »

Are you sure f(x)=x for x infinity?
No, perhaps not! Thanks! There seems to be an offset. It can be found by expanding both the BayesElo and logistic scores near +infinity in terms of 10^(-x/400).

I have no time now to do a more precise computation.

So my simple-minded function would need to be adapted to take this offset into account.
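For what it's worth, the expansion can be sketched and checked numerically. Near +infinity both scores behave like 1 - c*10^(-x/400), and comparing the constants c suggests an offset of 400*log10((x0 + 1/x0)/2), with x0 = 10^(drawelo/400). This is my own back-of-the-envelope derivation under the BayesElo model, not a result stated in the thread:

```python
import math

# Near +infinity:
#   logistic score ~ 1 - 10**(-d/400)
#   BayesElo score ~ 1 - ((x0 + 1/x0)/2) * 10**(-d/400),  x0 = 10**(drawelo/400)
# so UnscaledBayesElo - LogisticElo tends to 400*log10((x0 + 1/x0)/2).

def bayes_expected_score(d, drawelo):
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - d) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + d) / 400.0))
    return (1.0 + p_win - p_loss) / 2.0

def predicted_offset(drawelo):
    x0 = 10.0 ** (drawelo / 400.0)
    return 400.0 * math.log10((x0 + 1.0 / x0) / 2.0)

# Check: take a large BayesElo difference, convert its expected score back
# to a logistic difference, and look at the gap.
drawelo = 100.0
d_bayes = 3000.0
s = bayes_expected_score(d_bayes, drawelo)
d_logistic = 400.0 * math.log10(s / (1.0 - s))
print(d_bayes - d_logistic)       # ~27.3 for drawelo = 100
print(predicted_offset(drawelo))  # same limit
```

So for a typical drawelo of around 100, the asymptotic offset is on the order of a few tens of Elo points.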
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Also-Rans list updated

Post by lucasart »

Michel wrote:
The transformation that cannot be done analytically is to transform (ELO, DrawELO) --> BayesELO.
Do you mean, (fixing DrawElo), computing analytically

LogisticElo--->ExpectedScore--->BayesElo

In other words computing analytically

ExpectedScore--->BayesElo
Yes. The reverse operation is trivial, since
(BayesELO, DrawELO) <--> (P(win), P(loss)) --> ELO
[an arrow means the step can be computed analytically.]
Michel wrote: It seems to me that the fact this function cannot be computed analytically is not really a problem since it can be trivially computed numerically (it is the inverse of a nice monotonically increasing function).
Exactly; as I said, it can be done with a simple dichotomy. But that's not what BayesElo does, and the scale used by BayesElo will be wrong for large Elo differences. My naive thought was basically: can't we just patch BayesElo to use a dichotomy instead of the local scale formula?
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Also-Rans list updated

Post by Michel »

Exactly; as I said, it can be done with a simple dichotomy. But that's not what BayesElo does, and the scale used by BayesElo will be wrong for large Elo differences. My naive thought was basically: can't we just patch BayesElo to use a dichotomy instead of the local scale formula?
I don't think the issue is fixable (although I would be happy to be proven wrong). Logistic and BayesElo are simply different models, and the values they give cannot be related by a single formula, since they depend on the Elo distribution of the games that are used as input.

It simply makes a big difference whether you measure large Elo differences directly or as a composition of many small differences.

Now, the second scenario is the more realistic one, and it is taken care of by the default scaling in BayesElo. From Adam Hair's data it seems this actually works quite well, although perhaps the scaling should be tweaked a bit to give an even better match.