Problem with Normalized ELO for large ELO span

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Problem with Normalized ELO for large ELO span

Post by Laskos »

Normalized ELO of Michel (I will write it as Norm_ELO)
http://hardy.uhasselt.be/Toga/normalized_elo.pdf
has nice statistical properties; for example, its inverse square gives the number of games needed to reach a given statistical significance of the result. But for it to be consistent in rating schemes, one condition has to be checked, namely additivity in the case of several engines:

Engines 1,2,3
Norm_ELO(2,1) + Norm_ELO(3,2) = Norm_ELO(3,1)
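
To make the quantity concrete, here is a minimal sketch of how Norm_ELO can be computed from win/draw/loss counts. It assumes that Norm_ELO is the score excess over 50% divided by the per-game standard deviation of the result; this is my reading rather than a quote from the paper, but it reproduces, for example, the SF2 vs T2 value (about 0.5467) reported later in this thread. The function names are illustrative only.

```python
import math

def norm_elo(wins, draws, losses):
    """Normalized Elo from one side's results (assumed definition, see lead-in)."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, which takes the values 1, 0.5 and 0.
    var = (wins + 0.25 * draws) / n - score ** 2
    return (score - 0.5) / math.sqrt(var)

def half_width_95(games):
    """1.96 standard deviations of Norm_ELO, taking sigma = 1/sqrt(games)."""
    return 1.96 / math.sqrt(games)

# SF2 vs T2 from the first database: +4574 =969 -1457 over 7000 games.
print(norm_elo(4574, 969, 1457), half_width_95(7000))
```

The function fails with a division by zero if all games have the same result, since the variance is then zero.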

In this thread I used a database to check whether the Logistic ELO or the Gaussian ELO is more adequate for engines:
http://www.talkchess.com/forum/viewtopic.php?t=60791

As I will use the same database (plus one more from this past weekend), I will repost the OP:

For accuracy, I ran a massive number of games (105,000 in total) in a round-robin at fixed nodes between different engines such as Stockfish, Texel, Andscacs, etc. Neighbouring engines were spaced on the order of 200 ELO points apart, so that each individual interval between them is almost linear in ELO versus score and nearly independent of the ELO model. The largest total difference between engines was on the order of 1400 ELO points; I needed such large differences because ELO models only diverge noticeably at large ELO differences. For each individual match I computed the total Logistic ELO difference directly over the large ELO interval. This is the horizontal axis. The consistent ELO is then the sum of the small differences between neighbouring engines, accumulated to give the total difference. If the Logistic model is consistent, these two should be equal, and the diagonal from (0,0) to (1400,1400) would be the fit. If the Gaussian or another model were more consistent, the dots would deviate from the diagonal. They do not, at least not by much. The Gaussian model seems ruled out, and the Logistic ELO model for computer chess engines seems to hold up well in this test. My earlier results were mixed because of fewer data points and fewer games per data point.

The data:

Individual statistics:

1 SF2                       : 2381  35000 (+32134,=1275,-1591), 93.6 %

T1                            : 7000 (+6950,= 44,-  6), 99.6 %
Ha1                           : 7000 (+6996,=  4,-  0), 100.0 %
T2                            : 7000 (+4574,=969,-1457), 72.3 %
R2                            : 7000 (+6625,=248,-127), 96.4 %
R1                            : 7000 (+6989,= 10,-  1), 99.9 %

2 T2                        : 2232  35000 (+28760,=1323,-4917), 84.1 %

SF2                           : 7000 (+1457,=969,-4574), 27.7 %
T1                            : 7000 (+6968,= 29,-  3), 99.8 %
Ha1                           : 7000 (+6991,=  8,-  1), 99.9 %
R2                            : 7000 (+6355,=308,-337), 93.0 %
R1                            : 7000 (+6989,=  9,-  2), 99.9 %

3 R2                        : 2051  35000 (+20528,=1016,-13456), 60.1 %

SF2                           : 7000 (+127,=248,-6625),  3.6 %
T1                            : 7000 (+6302,=332,-366), 92.4 %
Ha1                           : 7000 (+6910,= 45,- 45), 99.0 %
T2                            : 7000 (+337,=308,-6355),  7.0 %
R1                            : 7000 (+6852,= 83,- 65), 98.5 %

4 T1                        : 1898  35000 (+11060,=1952,-21988), 34.4 %

SF2                           : 7000 (+  6,= 44,-6950),  0.4 %
Ha1                           : 7000 (+5750,=554,-696), 86.1 %
T2                            : 7000 (+  3,= 29,-6968),  0.2 %
R2                            : 7000 (+366,=332,-6302),  7.6 %
R1                            : 7000 (+4935,=993,-1072), 77.6 %

5 R1                        : 1778  35000 (+5667,=1666,-27667), 18.6 %

SF2                           : 7000 (+  1,= 10,-6989),  0.1 %
T1                            : 7000 (+1072,=993,-4935), 22.4 %
Ha1                           : 7000 (+4527,=571,-1902), 68.8 %
T2                            : 7000 (+  2,=  9,-6989),  0.1 %
R2                            : 7000 (+ 65,= 83,-6852),  1.5 %

6 Ha1                       : 1661  35000 (+2644,=1182,-31174),  9.2 %

SF2                           : 7000 (+  0,=  4,-6996),  0.0 %
T1                            : 7000 (+696,=554,-5750), 13.9 %
T2                            : 7000 (+  1,=  8,-6991),  0.1 %
R2                            : 7000 (+ 45,= 45,-6910),  1.0 %
R1                            : 7000 (+1902,=571,-4527), 31.2 %
The plot:

Image
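
The consistency check described above can be sketched in a few lines. This is only an illustration under the usual logistic formula ELO = 400 * log10(s / (1 - s)) for a score fraction s; the helper name is mine, and the counts are copied from the head-to-head lines in the data above for the chain SF2, T2, R2, T1, R1, Ha1.

```python
import math

def logistic_elo(wins, draws, losses):
    """Logistic ELO difference implied by a match score (undefined for 0% or 100%)."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    return 400.0 * math.log10(score / (1.0 - score))

# (wins, draws, losses) of the stronger engine against its neighbour in strength.
chain = [
    ("SF2 vs T2", (4574, 969, 1457)),
    ("T2 vs R2", (6355, 308, 337)),
    ("R2 vs T1", (6302, 332, 366)),
    ("T1 vs R1", (4935, 993, 1072)),
    ("R1 vs Ha1", (4527, 571, 1902)),
]

direct = logistic_elo(6996, 4, 0)  # SF2 vs Ha1, the widest pairing
summed = sum(logistic_elo(*wdl) for _, wdl in chain)
print(f"direct: {direct:.1f}  sum of small steps: {summed:.1f}")
```

For this widest pairing the two numbers differ by only a small fraction of the total span, which is the sense in which the dots stay close to the diagonal.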

Re: Problem with Normalized ELO for large ELO span

Post by Laskos »

With the same database I computed Norm_ELO (very large maximum ELO span of about 1500 ELO points):

Opponents      Normalized ELO     Games     1.96*sigma
====================================================================

===============================
('SF2 vs T1', (10.098144625114557, 7000, 0.023426480742954114))
--------------------------------
('SF2 vs T2', (0.5467469418819643, 7000, 0.023426480742954114))
('T2 vs R2',  (1.8460056693806666, 7000, 0.023426480742954114))
('R2 vs T1',  (1.7550225550420455, 7000, 0.023426480742954114))



================================
('SF2 vs Ha1',(41.821047332652974, 7000, 0.023426480742954114))
--------------------------------
('SF2 vs T2', (0.5467469418819643, 7000, 0.023426480742954114))
('T2 vs R2',  (1.8460056693806666, 7000, 0.023426480742954114))
('R2 vs T1',  (1.7550225550420455, 7000, 0.023426480742954114))
('T1 vs R1',  (0.7417029054404695, 7000, 0.023426480742954114))
('R1 vs Ha1', (0.4252029816341901, 7000, 0.023426480742954114))



================================
('SF2 vs R2', (2.8944403984992437, 7000, 0.023426480742954114))
--------------------------------
('SF2 vs T2', (0.5467469418819643, 7000, 0.023426480742954114))
('T2 vs R2',  (1.8460056693806666, 7000, 0.023426480742954114))



================================
('SF2 vs R1', (22.338765368634377, 7000, 0.023426480742954114))
--------------------------------
('SF2 vs T2', (0.5467469418819643, 7000, 0.023426480742954114))
('T2 vs R2',  (1.8460056693806666, 7000, 0.023426480742954114))
('R2 vs T1',  (1.7550225550420455, 7000, 0.023426480742954114))
('T1 vs R1',  (0.7417029054404695, 7000, 0.023426480742954114))



================================
('T2 vs T1',  (13.02893759835486, 7000, 0.023426480742954114))
--------------------------------
('T2 vs R2', (1.8460056693806666, 7000, 0.023426480742954114))
('R2 vs T1', (1.7550225550420455, 7000, 0.023426480742954114))



================================
('T2 vs Ha1', (24.132159957602394, 7000, 0.023426480742954114))
--------------------------------
('T2 vs R2',  (1.8460056693806666, 7000, 0.023426480742954114))
('R2 vs T1',  (1.7550225550420455, 7000, 0.023426480742954114))
('T1 vs R1',  (0.7417029054404695, 7000, 0.023426480742954114))
('R1 vs Ha1', (0.4252029816341901, 7000, 0.023426480742954114))



================================
('T2 vs R1',  (20.268698723430518, 7000, 0.023426480742954114))
--------------------------------
('T2 vs R2',  (1.8460056693806666, 7000, 0.023426480742954114))
('R2 vs T1',  (1.7550225550420455, 7000, 0.023426480742954114))
('T1 vs R1',  (0.7417029054404695, 7000, 0.023426480742954114))



================================
('R2 vs Ha1',  (5.502089077265169, 7000, 0.023426480742954114))
--------------------------------
('R2 vs T1',  (1.7550225550420455, 7000, 0.023426480742954114))
('T1 vs R1',  (0.7417029054404695, 7000, 0.023426480742954114))
('R1 vs Ha1', (0.4252029816341901, 7000, 0.023426480742954114))



================================
('R2 vs R1',   (4.422055802547341, 7000, 0.023426480742954114))
--------------------------------
('R2 vs T1',  (1.7550225550420455, 7000, 0.023426480742954114))
('T1 vs R1',  (0.7417029054404695, 7000, 0.023426480742954114))



================================
('T1 vs Ha1', (1.1421918389289303, 7000, 0.023426480742954114))
--------------------------------
('T1 vs R1',  (0.7417029054404695, 7000, 0.023426480742954114))
('R1 vs Ha1', (0.4252029816341901, 7000, 0.023426480742954114))
================================
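
The additivity test on the table above amounts to comparing each directly measured value with the sum of the values along the chain of neighbouring engines. A small sketch, with the Norm_ELO values copied from the table and rounded to four decimals (the variable and function names are just for illustration):

```python
# Engines ordered from strongest to weakest, as in the first database.
order = ["SF2", "T2", "R2", "T1", "R1", "Ha1"]

# Norm_ELO between neighbouring engines, rounded from the table above.
adjacent = {
    ("SF2", "T2"): 0.5467,
    ("T2", "R2"): 1.8460,
    ("R2", "T1"): 1.7550,
    ("T1", "R1"): 0.7417,
    ("R1", "Ha1"): 0.4252,
}

# A few directly measured values, also rounded from the table above.
direct = {
    ("SF2", "T1"): 10.0981,
    ("SF2", "Ha1"): 41.8210,
    ("R2", "Ha1"): 5.5021,
    ("T1", "Ha1"): 1.1422,
}

def chained_sum(a, b):
    """Sum of neighbour-to-neighbour Norm_ELO values from engine a down to engine b."""
    i, j = order.index(a), order.index(b)
    return sum(adjacent[(order[k], order[k + 1])] for k in range(i, j))

for (a, b), value in direct.items():
    print(f"{a} vs {b}: direct {value:.3f}, chained sum {chained_sum(a, b):.3f}")
```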

We see that on large ELO spans Norm_ELO is highly non-additive: for example, the direct SF2 vs T1 value is about 10.1, while the three intermediate values sum to about 4.1. It is approximately additive only for the last data point (T1 vs Ha1), which covers a small ELO span compared to 1500 ELO. This weekend I wanted to check additivity for a small ELO span, and I got the following database of 24,000 games (also at fixed nodes):

Individual statistics:

1 AND1                      :   27  16000 (+7469,=2917,-5614), 55.8 %

TEX1                          : 8000 (+3664,=1354,-2982), 54.3 %
SF1                           : 8000 (+3805,=1563,-2632), 57.3 %

2 TEX1                      :   -0  16000 (+6570,=2846,-6584), 50.0 %

AND1                          : 8000 (+2982,=1354,-3664), 45.7 %
SF1                           : 8000 (+3588,=1492,-2920), 54.2 %

3 SF1                       :  -27  16000 (+5552,=3055,-7393), 44.2 %

AND1                          : 8000 (+2632,=1563,-3805), 42.7 %
TEX1                          : 8000 (+2920,=1492,-3588), 45.8 %

For this small ELO span, additivity of Norm_ELO is roughly satisfied:

Opponents         Normalized ELO     Games     1.96*sigma
====================================================================

================================
('AND1 vs SF1',  (0.1656884264095710, 8000, 0.021913466179497937))
--------------------------------
('AND1 vs TEX1', (0.0939436039539778, 8000, 0.021913466179497937))
('TEX1 vs SF1',  (0.0929772759705353, 8000, 0.021913466179497937))

Re: Problem with Normalized ELO for large ELO span

Post by Laskos »

With all this data, and knowing that ELO (not Norm_ELO) is additive for a logistic model, I roughly inverted the logistic, and the rough expression log(1 + Norm_ELO) came out as possibly additive. In fact it is a bit more complicated: Norm_ELO, like ELO, can be negative and large too, so the correct expression is sign(Norm_ELO) * log(1 + Abs[Norm_ELO]), but let's skip this. I checked the consistency of additivity of log(1 + Norm_ELO) on large ELO spans, as I did in the Logistic versus Gaussian case:
Image

The consistency is not perfect, but very good. I am not sure what to call it: additivity of log(1 + Norm_ELO) or multiplicativity of 1 + Norm_ELO. The latter seems to have a better statistical interpretation. Maybe Michel can find better expressions with these properties, as I just roughly inverted what ELO does with the logistic. For small ELO differences, Norm_ELO can be used additively as it is, but for rating engines with a wildly spread ELO it cannot.
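
As a rough numerical illustration of this check, the sketch below applies sign(x) * log(1 + Abs[x]) to Norm_ELO values from the first database (rounded to four decimals) and compares the direct value with the sum along the chain of neighbouring engines. The function name signed_log is mine, and only two pairings are shown.

```python
import math

def signed_log(x):
    """sign(x) * log(1 + |x|), the transform proposed above."""
    return math.copysign(math.log1p(abs(x)), x)

# Norm_ELO between neighbouring engines (SF2, T2, R2, T1, R1, Ha1), rounded.
steps = [0.5467, 1.8460, 1.7550, 0.7417, 0.4252]

# SF2 vs T1 spans the first three steps, SF2 vs Ha1 all five.
direct_t1 = signed_log(10.0981)
chained_t1 = sum(signed_log(x) for x in steps[:3])
direct_ha1 = signed_log(41.8210)
chained_ha1 = sum(signed_log(x) for x in steps)

print(f"SF2 vs T1:  direct {direct_t1:.3f}, chained {chained_t1:.3f}")
print(f"SF2 vs Ha1: direct {direct_ha1:.3f}, chained {chained_ha1:.3f}")
```

The transformed values agree far better than the raw Norm_ELO sums do, though not exactly, in line with "not perfect, but very good".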

Re: Problem with Normalized ELO for large ELO span

Post by Laskos »

Since (1 + Norm_ELO) obeys multiplicativity, a "Multiplicative Rating" (Mult_Rating) can be proposed. It is not very elegant on some grounds (there is a problem with negative Norm_ELO), but it scales better than ELO (as Norm_ELO does), has a statistical interpretation related to Norm_ELO, and is multiplicative (rating differences are obtained by dividing ratings). Mult_Ratings always range from 0 to infinity. The square of the deviation from 1 in Mult_Rating gives the inverse of the number of games necessary to separate the engines statistically in strength.

N = number of games

For positive ELO difference or Norm_ELO,
Mult_Rating = 1 + Norm_ELO
The standard deviation is 1/sqrt(N), identical to that of Norm_ELO

For negative ELO difference or Norm_ELO,
Mult_Rating = 1 / (1 + Abs[Norm_ELO])
The standard deviation is tricky in this case, and I cannot write it in general, but since 1/sqrt(N) is always much smaller than (1 + Abs[Norm_ELO]), a Taylor expansion gives standard deviation = 1 / [sqrt(N) * (1 + Abs[Norm_ELO])**2], or (Mult_Rating)**2 / sqrt(N). That's the ugliest property of the Mult_Rating.
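
The two cases can be put into one small function. This is a sketch of the definitions just given, not a full rating program; the function name is mine, and the inputs in the usage lines are hypothetical Norm_ELO values picked so that the ratings land near 1.716 and 0.635 for 1350 games.

```python
import math

def mult_rating(norm_elo, games):
    """Return (Mult_Rating, standard deviation) for a Norm_ELO measured over `games` games."""
    sigma = 1.0 / math.sqrt(games)  # standard deviation of Norm_ELO itself
    if norm_elo >= 0:
        return 1.0 + norm_elo, sigma
    rating = 1.0 / (1.0 + abs(norm_elo))
    # First-order (Taylor) propagation: d(rating)/d|Norm_ELO| = -rating**2.
    return rating, rating ** 2 * sigma

for x in (0.716, -0.5748):  # hypothetical inputs, see lead-in
    rating, sd = mult_rating(x, 1350)
    print(f"Norm_ELO {x:+.4f}: Mult_Rating {rating:.3f} +/- {1.96 * sd:.3f}")
```

The shrinking error for ratings below 1 comes entirely from the rating**2 factor.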

Example of such rating compared to regular ELO rating from FGRL 60min list:

ELO Ordo:

   # PLAYER              : RATING  ERROR    POINTS  PLAYED     (%)   CFS(next)

   1 Komodo 11.01        :  142.5   11.5     945.0    1350    70.0      63    
   2 Stockfish 8         :  139.6   11.7     940.0    1350    69.6     100    
   3 Houdini 5.01        :  116.1   11.0     898.5    1350    66.6     100    
   4 Deep Shredder 13    :   11.1   10.4     698.5    1350    51.7     100    
   5 Fire 5              :   -9.2   10.2     658.5    1350    48.8     100    
   6 Andscacs 0.91       :  -47.0   10.2     584.5    1350    43.3      82    
   7 Fizbo 1.9           :  -54.2   10.5     570.5    1350    42.3      78    
   8 Gull 3              :  -60.4   10.3     558.5    1350    41.4     100    
   9 Chiron 4            : -113.2   11.2     459.0    1350    34.0      93    
  10 Fritz 15            : -125.4   10.8     437.0    1350    32.4     ---

Mult_Rating:

   # PLAYER              : RATING  ERROR    POINTS  PLAYED     (%)

   1 Stockfish 8         :  1.716  0.053     940.0    1350    69.6  
   2 Komodo 11.01        :  1.714  0.053     945.0    1350    70.0      
   3 Houdini 5.01        :  1.597  0.053     898.5    1350    66.6    
   4 Deep Shredder 13    :  1.055  0.053     698.5    1350    51.7    
   5 Fire 5              :  0.963  0.049     658.5    1350    48.8    
   6 Andscacs 0.91       :  0.829  0.036     584.5    1350    43.3     
   7 Fizbo 1.9           :  0.807  0.035     570.5    1350    42.3    
   8 Gull 3              :  0.784  0.033     558.5    1350    41.4    
   9 Chiron 4            :  0.667  0.024     459.0    1350    34.0   
  10 Fritz 15            :  0.635  0.021     437.0    1350    32.4

To see the rating difference in Mult_Rating, one needs to divide ratings, not subtract them as in regular ELO. As you can see, the error (1.96 standard deviations) behaves a bit oddly in Mult_Rating for the weaker engines. I brought up Mult_Rating because plain Norm_ELO cannot be used in such rating lists: it is neither additive nor multiplicative on larger ELO spans.

Re: Problem with Normalized ELO for large ELO span

Post by Laskos »

And to clarify a bit the interpretation of ratings:

1/ If Mult_Rating_2 / Mult_Rating_1 = 1, the engines are equal in strength according to Mult_Rating (and Norm_ELO). The strength difference is given by how far this ratio deviates from 1.

2/ The error of Mult_Rating difference:

Mult_Rating_2 / Mult_Rating_1,

can be derived as follows:

For two independent, normally distributed A, B
Variance(A*B) = Variance(A)*Variance(B) + Variance(A)*(Mean(B))**2 + Variance(B)*(Mean(A))**2

In our case, Variance(A)*Variance(B) is negligible (1/N is almost always much smaller than Mult_Rating**2)

Mean(A) = Mult_Rating_2
Variance(A) = Variance_2

Mean(B) = 1/Mult_Rating_1
Variance(B) = Variance_1 / Mult_Rating_1**4

Variance(Mult_Rating_2 / Mult_Rating_1) = Variance_2 / Mult_Rating_1**2 + Variance_1 * Mult_Rating_2**2 / Mult_Rating_1**4
Not that bad, if a bit clumsy compared to additive ELO calculations.

In our list: Stockfish 8 versus Komodo 11.01 - the difference in Mult_Rating:
1.001 +/- 0.044 (95% confidence)

Stockfish 8 versus Fritz 15
2.702 +/- 0.122 (95% confidence)
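
A short sketch of this error calculation, using Variance(A*B) with the negligible Variance(A)*Variance(B) term dropped. With Mean(B) = 1/Mult_Rating_1, the first term is Variance_2 / Mult_Rating_1**2. The function name is mine; the inputs are the ratings and 95% errors from the Mult_Rating list, treated as independent, which is an assumption since ratings from one list are in general correlated.

```python
import math

def ratio_with_error(r2, e2, r1, e1):
    """Ratio r2/r1 and its error, by first-order propagation.

    e2 and e1 are errors at the same confidence level (here 1.96 sigma);
    the returned error is at that same level.
    """
    ratio = r2 / r1
    # Var(A*B) ~ Var(A)*Mean(B)**2 + Var(B)*Mean(A)**2, with A = r2 and B = 1/r1,
    # where Var(B) ~ e1**2 / r1**4.
    var = (e2 / r1) ** 2 + (e1 * r2 / r1 ** 2) ** 2
    return ratio, math.sqrt(var)

sf_vs_komodo = ratio_with_error(1.716, 0.053, 1.714, 0.053)
sf_vs_fritz = ratio_with_error(1.716, 0.053, 0.635, 0.021)
print("Stockfish 8 / Komodo 11.01: %.3f +/- %.3f" % sf_vs_komodo)
print("Stockfish 8 / Fritz 15:     %.3f +/- %.3f" % sf_vs_fritz)
```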

All of this is clumsy for a regular developer or user, but it is all I can do to make Normalized ELO usable for large ELO differences.