1/ Wilo rating doesn't compress or dilate ratings from STC to LTC. Elo rating does compress ratings from STC to LTC.
2/ Wilo rating gives higher LOS (p-values), showing more sensitivity than Elo rating, so fewer games are needed with the Wilo model than with the Elo model to show significant differences between engines.
Therefore, considering draws as non-games is better both for calibration of ratings (no time control dependence) and for the number of games needed for significance.
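To make "draws as non-games" concrete, here is a minimal sketch. It assumes the usual logistic Elo formula and applies it once to all games (Elo) and once to decisive games only (Wilo). The function names and the win/draw/loss counts are hypothetical, chosen only for illustration, and the actual rating lists are fitted over a whole pool of engines rather than a single pairing.

```python
import math

def logistic_rating_diff(score):
    """Rating difference implied by a score fraction under the logistic model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_diff(wins, draws, losses):
    """Elo-style difference: a draw counts as half a point."""
    games = wins + draws + losses
    return logistic_rating_diff((wins + 0.5 * draws) / games)

def wilo_diff(wins, draws, losses):
    """Wilo-style difference (assumed form): draws are discarded as non-games."""
    return logistic_rating_diff(wins / (wins + losses))

# Hypothetical match results (wins, draws, losses), NOT taken from FGRL.
# Same 3:2 win/loss ratio, but a higher draw rate at the longer time control.
stc = (300, 500, 200)
ltc = (150, 750, 100)

for name, result in (("STC", stc), ("LTC", ltc)):
    print(f"{name}: Elo {elo_diff(*result):+.1f}, Wilo {wilo_diff(*result):+.1f}")
```

In this made-up example the Elo difference shrinks as the draw rate rises, while the Wilo difference stays the same because the win/loss ratio is unchanged.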
------------------------------------
Results from FGRL rating lists ( http://www.fastgm.de/ ):
------------------------------------
I/ Compression/Dilation of ratings
Elo ratings:
Mean deviation of ratings 60''+0.6'': 119 Elo points
Mean deviation of ratings 60'+15'': 84 Elo points
Compression: 29.5% +/- 4% ---> significant compression
Wilo ratings (discarding draws):
Mean deviation of ratings 60''+0.6'': 251 Wilo points
Mean deviation of ratings 60'+15'': 260 Wilo points
Dilation: 3.6% +/- 8% ---> insignificant dilation
We see that, within error margins, Wilo ratings neither compress nor dilate when the time control is lengthened by about 60x. Elo ratings do compress significantly over the same 60x increase.
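The compression and dilation percentages can be checked from the quoted mean deviations. This is only a sketch of the arithmetic, assuming the percentage is the relative change of the spread from bullet to LTC. The function name is hypothetical, and the error margins are not reproduced here.

```python
def relative_change(stc_spread, ltc_spread):
    """Percent change of the rating spread from STC to LTC.

    Negative means compression, positive means dilation.
    """
    return (ltc_spread - stc_spread) / stc_spread * 100.0

# Mean deviations quoted above (rounded to whole points).
elo_change = relative_change(119, 84)
wilo_change = relative_change(251, 260)

print(f"Elo:  {elo_change:+.1f}%")
print(f"Wilo: {wilo_change:+.1f}%")
```

With the rounded inputs this gives about 29.4% compression for Elo and 3.6% dilation for Wilo. The small gap to the quoted 29.5% presumably comes from rounding of the mean deviations.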
II/ Norm (regular Frobenius) of LOS (p-values) matrix.
The minimal Norm of the p-value matrix is 0 (all engines are perfectly equal in strength, LOS=50% between any two engines, Null Hypothesis is accepted with 100% probability).
The maximal Norm of a 10x10 symmetric p-value matrix is sqrt(90) ~ 9.487 (all engines are clearly separated by strength, LOS=100% between any two engines, Null Hypothesis is rejected with 100% probability).
Elo list 60''+0.6'': Norm of p-value matrix is 9.15
Wilo list 60''+0.6'': Norm of p-value matrix is 9.22
Wilo list shows generally higher LOS values (more sensitivity).
Elo list 60'+15'': Norm of p-value matrix is 9.09
Wilo list 60'+15'': Norm of p-value matrix is 9.13
Wilo list shows again higher sensitivity.
Both time controls confirm that Wilo ratings show more sensitivity, so fewer games are needed for the same confidence compared to Elo ratings.
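The bounds 0 and sqrt(90) can be reproduced with a small sketch. How LOS values are mapped into matrix entries is not spelled out above, so this assumes each off-diagonal entry is 2*LOS - 1 (0 at LOS=50%, magnitude 1 at LOS=100% or 0%) with the diagonal ignored. That mapping matches both stated bounds, but it is an assumption, and the helper names are hypothetical.

```python
import math

def los_norm(los):
    """Frobenius norm of the rescaled LOS matrix (assumed entries: 2*LOS - 1)."""
    n = len(los)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:  # the diagonal (engine vs itself) is ignored
                total += (2.0 * los[i][j] - 1.0) ** 2
    return math.sqrt(total)

def uniform_los(n, upper):
    """Toy LOS matrix: engine i leads engine j (i < j) with LOS `upper`."""
    return [
        [0.5 if i == j else (upper if i < j else 1.0 - upper) for j in range(n)]
        for i in range(n)
    ]

all_equal = los_norm(uniform_los(10, 0.5))      # no engine separated
all_separated = los_norm(uniform_los(10, 1.0))  # every pair fully separated

print(all_equal, all_separated, math.sqrt(90))
```

The norms reported for the actual lists (9.09 to 9.22) sit close to the upper bound, which is what one expects when most of the 10 engines are already well separated.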
III/ Scaling
With Wilo empirically shown, on our FGRL data, to be a better rating system than Elo for computer chess (no dependence on time control and higher sensitivity), I have tabulated the scaling of the top 10 engines between bullet (60''+0.6'') and LTC (60'+15''), roughly a 60x time factor:
     Engine                    Scaling Wilo
  --------------------------------------------
   1 Andscacs 0.90       :          91
   2 Komodo 10.4         :          39
   3 Stockfish 8         :          30
   4 Deep Shredder 13    :          19
   5 Fire 5              :          18
   6 Gull 3              :           2
   7 Houdini 5.01        :         -22
   8 Chiron 4            :         -24
   9 Fizbo 1.9           :         -76
  10 Fritz 15            :         -77
