Fishy Bayeselo numbers?


tpoppins
Posts: 919
Joined: Tue Nov 24, 2015 9:11 pm
Location: upstate

Fishy Bayeselo numbers?

Post by tpoppins »

Questions have been raised about Elo differences and error margins calculated with Bayeselo in the following threads:

SF-McBrain v3.0 TCEC-X RELEASE
Noomen KI Fianchetto H6.02 vs. K11.2.2 30m+30s 16-core
ernest
Posts: 2041
Joined: Wed Mar 08, 2006 8:30 pm

Re: Fishy Bayeselo numbers?

Post by ernest »

Indeed, for example:
how can a match with a final score of 56-44 (+15 =82 -3),
leading to the "usual" (Elostat) +42 Elo difference,

give a measly +15 Elo with the BayesElo computation?
(Or does +15 for the winner and -15 for the loser mean a +30 Elo difference? That is still quite different from +42, but less startling...)

And the results can also be very different concerning the error bars (2-sigma).
MikeB
Posts: 4889
Joined: Thu Mar 09, 2006 6:34 am
Location: Pen Argyl, Pennsylvania

Re: Fishy Bayeselo numbers?

Post by MikeB »

ernest wrote:Indeed, for example:
how can a match with a final score of 56-44 (+15 =82 -3),
leading to the "usual" (Elostat) +42 Elo difference,

give a measly +15 Elo with the BayesElo computation?
(Or does +15 for the winner and -15 for the loser mean a +30 Elo difference? That is still quite different from +42, but less startling...)

And the results can also be very different concerning the error bars (2-sigma).
This just illustrates the different settings that can be used. The PGN file below is the current TCEC 10 PGN file:

[Mac-Pro:~/cluster.mfb]

Code: Select all

michaelbyrne% bay
version 0058, Copyright (C) 1997-2016 Remi Coulom and updated by Michael Byrne.
compiled Jul 24 2016 00:03:35.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under the terms and conditions of the GNU General Public License.
See http://www.gnu.org/copyleft/gpl.html for details.

ResultSet>
ResultSet>rp /Users/michaelbyrne/Downloads/dl.php-7.pgn 
112 game(s) loaded
ResultSet>elo
ResultSet-EloRating>mm
00:00:00,00
ResultSet-EloRating>r
Rank Name                  Rating   Δ     +    -     #     Σ    Σ%     W    L    D   W%    =%   OppR 
---------------------------------------------------------------------------------------------------------
   1 Stockfish 041017       3394   0.0  211  211     9    8.5  94.4    8    0    1  88.9  11.1  3105 
   2 Komodo 1937.00         3302  92.5  174  174    10    8.5  85.0    7    0    3  70.0  30.0  3089 
   3 Fire 6.1               3212  89.5  180  180     9    6.0  66.7    5    2    2  55.6  22.2  3109 
   4 Houdini 6.02           3200  11.7  155  155    10    7.0  70.0    4    0    6  40.0  60.0  3086 
   5 Ginkgo 2               3188  12.3  170  170     9    6.0  66.7    4    1    4  44.4  44.4  3101 
   6 Chiron 040917          3187   1.4  160  160     9    5.5  61.1    3    1    5  33.3  55.6  3139 
   7 Andscacs 0.92          3182   4.9  158  158     9    6.0  66.7    3    0    6  33.3  66.7  3072 
   8 Jonny 8.1              3180   2.0  172  172     9    5.0  55.6    4    3    2  44.4  22.2  3154 
   9 Bobcat 8               3158  21.4  168  168     9    6.0  66.7    4    1    4  44.4  44.4  3060 
  10 Vajolet2 2.3.2         3142  16.7  155  155     9    5.0  55.6    2    1    6  22.2  66.7  3110 
  11 Hannibal 121017        3136   5.9  159  159     9    4.5  50.0    2    2    5  22.2  55.6  3132 
  12 Gull 3                 3133   2.4  167  167     9    6.0  66.7    4    1    4  44.4  44.4  3038 
  13 Booot 6.2              3114  19.9  150  150    10    6.0  60.0    3    1    6  30.0  60.0  3060 
  14 Nirvana 2.4            3104   9.1  154  154    10    4.5  45.0    2    3    5  20.0  50.0  3128 
  15 Texel 1.07a35          3092  12.8  158  158    10    5.0  50.0    3    3    4  30.0  40.0  3076 
  16 Fizbo 1.91             3091   1.0  179  179     9    4.0  44.4    3    4    2  33.3  22.2  3112 
  17 Wasp 2.5               3055  35.3  170  170     9    4.0  44.4    2    3    4  22.2  44.4  3084 
  18 Rybka 4.1              3037  18.2  161  161     9    3.5  38.9    1    3    5  11.1  55.6  3102 
  19 Gaviota 1.01           2983  54.0  172  172     9    2.5  27.8    1    5    3  11.1  33.3  3099 
  20 Arasan 20.2            2977   6.2  176  176    10    3.0  30.0    2    6    2  20.0  20.0  3099 
  21 Fruit 3.2              2969   7.9  182  182     9    2.5  27.8    2    6    1  22.2  11.1  3097 
  22 Nemorino 3.04          2894  75.0  186  186    10    1.0  10.0    0    8    2   0.0  20.0  3132 
  23 Laser 200917           2859  35.2  202  202     9    1.0  11.1    0    7    2   0.0  22.2  3099 
  24 Hakkapeliitta 210416   2811  48.1  211  211    10    1.0  10.0    1    9    0  10.0   0.0  3089 
---------------------------------------------------------------------------------------------------------
  Δ = delta from the next higher rated opponent
  # = number of games played
  Σ = total score, 1 point for win, 1/2 point for draw
These were the settings I used; they are embodied in the script I use to call Bayeselo (a minimal sketch of such a wrapper follows the output below):

Code: Select all

ResultSet-EloRating>x ## this goes back to the previous menu
ResultSet>reset ## this resets Bayeselo to a clean state
ResultSet>rp /Users/michaelbyrne/Downloads/dl.php-7.pgn
112 game(s) loaded
ResultSet>elo
ResultSet-EloRating>mm 1 1  ## what I normally use
Iteration 100: 3e-05 
00:00:00,00
ResultSet-EloRating>covariance  ## what I normally use
ResultSet-EloRating>r  
Rank Name                  Rating   Δ     +    -     #     Σ    Σ%     W    L    D   W%    =%   OppR 
---------------------------------------------------------------------------------------------------------
   1 Stockfish 041017       3360   0.0  117  117     9    8.5  94.4    8    0    1  88.9  11.1  3105 
   2 Komodo 1937.00         3285  75.0  102  102    10    8.5  85.0    7    0    3  70.0  30.0  3089 
   3 Fire 6.1               3206  79.2  110  110     9    6.0  66.7    5    2    2  55.6  22.2  3109 
   4 Houdini 6.02           3193  13.0  102  102    10    7.0  70.0    4    0    6  40.0  60.0  3087 
   5 Chiron 040917          3190   3.2  106  106     9    5.5  61.1    3    1    5  33.3  55.6  3134 
   6 Ginkgo 2               3189   0.7  107  107     9    6.0  66.7    4    1    4  44.4  44.4  3100 
   7 Jonny 8.1              3175  14.2  106  106     9    5.0  55.6    4    3    2  44.4  22.2  3148 
   8 Andscacs 0.92          3173   2.0  106  106     9    6.0  66.7    3    0    6  33.3  66.7  3077 
   9 Bobcat 8               3153  20.5  103  103     9    6.0  66.7    4    1    4  44.4  44.4  3066 
  10 Vajolet2 2.3.2         3136  16.5  104  104     9    5.0  55.6    2    1    6  22.2  66.7  3109 
  11 Gull 3                 3131   5.6  101  101     9    6.0  66.7    4    1    4  44.4  44.4  3045 
  12 Hannibal 121017        3128   2.9  103  103     9    4.5  50.0    2    2    5  22.2  55.6  3128 
  13 Booot 6.2              3114  13.3   95   95    10    6.0  60.0    3    1    6  30.0  60.0  3063 
  14 Nirvana 2.4            3099  15.2   93   93    10    4.5  45.0    2    3    5  20.0  50.0  3127 
  15 Texel 1.07a35          3086  12.7  101  101    10    5.0  50.0    3    3    4  30.0  40.0  3077 
  16 Fizbo 1.91             3081   5.8  105  105     9    4.0  44.4    3    4    2  33.3  22.2  3110 
  17 Wasp 2.5               3063  17.6  105  105     9    4.0  44.4    2    3    4  22.2  44.4  3086 
  18 Rybka 4.1              3040  23.3  100  100     9    3.5  38.9    1    3    5  11.1  55.6  3102 
  19 Arasan 20.2            2992  48.1  102  102    10    3.0  30.0    2    6    2  20.0  20.0  3101 
  20 Gaviota 1.01           2986   5.5   99   99     9    2.5  27.8    1    5    3  11.1  33.3  3098 
  21 Fruit 3.2              2974  12.5  102  102     9    2.5  27.8    2    6    1  22.2  11.1  3096 
  22 Nemorino 3.04          2918  56.1  104  104    10    1.0  10.0    0    8    2   0.0  20.0  3129 
  23 Laser 200917           2872  45.3  118  118     9    1.0  11.1    0    7    2   0.0  22.2  3100 
  24 Hakkapeliitta 210416   2856  16.2  115  115    10    1.0  10.0    1    9    0  10.0   0.0  3090 
---------------------------------------------------------------------------------------------------------
  Δ = delta from the next higher rated opponent
  # = number of games played
  Σ = total score, 1 point for win, 1/2 point for draw

ResultSet-EloRating>
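For anyone who wants to reproduce this without typing at the prompt, the same commands can be piped into the program's standard input. Below is a minimal sketch of such a wrapper in Python (simplified, not my actual script); the binary name "bay" and the PGN path are placeholders.

Code: Select all

# Minimal sketch: drive a Bayeselo session by piping commands to its stdin.
# Assumptions: the binary is on the PATH as "bay"; the PGN path is a placeholder.
import subprocess

commands = "\n".join([
    "readpgn /path/to/games.pgn",  # same as the "rp" shorthand used above
    "elo",
    "mm 1 1",                      # the settings shown above
    "covariance",
    "ratings",
    "x",
]) + "\n"

result = subprocess.run(["bay"], input=commands,
                        capture_output=True, text=True)
print(result.stdout)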
Any experts out there?
Ajedrecista
Posts: 1968
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Fishy Bayeselo numbers?

Post by Ajedrecista »

Hello:

Just my two cents.
ernest wrote:Indeed, for example:
how can a match with a final score of 56-44 (+15 =82 -3),
leading to the "usual" (Elostat) +42 Elo difference,

give a measly +15 Elo with the BayesElo computation?
(Or does +15 for the winner and -15 for the loser mean a +30 Elo difference? That is still quite different from +42, but less startling...)

And the results can also be very different concerning the error bars (2-sigma).
Rest assured that it means a 15 - (-15) = 30 Bayeselo difference. Please remember that 1 Bayeselo ≠ 1 logistic Elo!

The draw ratio is very high, so unexpected results may appear.
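To see where the two numbers come from: the "usual" +42 is just the logistic conversion of the 56% score, while Bayeselo's +15/-15 are ratings on Bayeselo's own internal scale, which models draws explicitly and applies a prior. A quick Python check of the logistic side (the textbook formula, not Elostat's or Bayeselo's actual code):

Code: Select all

# Quick check: the logistic Elo difference implied by a 56% score
# (textbook formula only, not Elostat or Bayeselo themselves).
import math

wins, losses, draws = 15, 3, 82               # +15 =82 -3
games = wins + losses + draws
score = (wins + 0.5 * draws) / games          # 0.56
elo = 400 * math.log10(score / (1 - score))   # ~ +41.9, i.e. the "usual" +42
print(f"score = {score:.2f}, logistic Elo difference ~ {elo:+.1f}")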

I tried the following version of Bayeselo with the 100-game PGN provided by Michael:

Code: Select all

version 0057.2, Copyright (C) 1997-2010 Remi Coulom.
compiled Apr  5 2012 17:26:01.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under the terms and conditions of the GNU General Public License.
See http://www.gnu.org/copyleft/gpl.html for details.
ResultSet>readpgn H602-K1122 Noomen KI Fianchetto.pgn
100 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
Here are the results with a few different settings:

------------------------

1.- Poppins' settings:

http://www.talkchess.com/forum/viewtopi ... 84&t=65404
tpoppins wrote:

Code: Select all

ResultSet>elo
ResultSet-EloRating>mm
00:00:00,00
ResultSet-EloRating>exactdist
00:00:00,00
ResultSet-EloRating>ratings
Rank Name                          Elo    +    - games score oppo. draws
   1 Houdini 6.02 Pro x64-popcnt    15   24   24   100   56%   -15   82%
   2 Komodo 11.2.2 64-bit          -15   24   24   100   44%    15   82%
Myself:

Code: Select all

ResultSet>elo
ResultSet-EloRating>mm
00:00:00,00
ResultSet-EloRating>exactdist
00:00:00,00
ResultSet-EloRating>ratings
Rank Name                            Elo     Diff     +     -      Games  Score   Oppo.   Draws     Win          W-L-D
   1 Houdini 6.02 Pro x64-popcnt   15.25     0.00  24.46  24.25      100  56.00%  -15.25  82.00%  15.00%        15-3-82
   2 Komodo 11.2.2 64-bit         -15.25   -30.50  24.25  24.46      100  44.00%   15.25  82.00%   3.00%         3-15-82
Identical results, except for the number of decimals.

------------------------

2.- Own settings (mm 0 1; 68.27% confidence ~ 1-sigma confidence):

Code: Select all

ResultSet>elo
ResultSet-EloRating>confidence 0.6827
0.6827
ResultSet-EloRating>mm 0 1
Iteration 100: 0.00214375
00:00:00,00
ResultSet-EloRating>ratings
Rank Name                            Elo     Diff     +     -      Games  Score   Oppo.   Draws     Win          W-L-D
   1 Houdini 6.02 Pro x64-popcnt   18.78     0.00  11.06  11.06      100  56.00%  -18.78  82.00%  15.00%        15-3-82
   2 Komodo 11.2.2 64-bit         -18.78   -37.56  11.06  11.06      100  44.00%   18.78  82.00%   3.00%         3-15-82
ResultSet-EloRating>los
                             Ho Ko
Houdini 6.02 Pro x64-popcnt     99
Komodo 11.2.2 64-bit          0
ResultSet-EloRating>
I get 37.56 Bayeselo difference with 1-sigma error bars of ±11.06 Bayeselo. LOS(Houdini) > 99% and LOS(Komodo) < 1%.

------------------------

3.- Own settings (mm 0 1; 95% confidence ~ 1.96-sigma confidence):

Code: Select all

ResultSet>elo
ResultSet-EloRating>confidence 0.95
0.95
ResultSet-EloRating>mm 0 1
Iteration 100: 0.00214375
00:00:00,00
ResultSet-EloRating>ratings
Rank Name                            Elo     Diff     +     -      Games  Score   Oppo.   Draws     Win          W-L-D
   1 Houdini 6.02 Pro x64-popcnt   18.78     0.00  21.67  21.67      100  56.00%  -18.78  82.00%  15.00%        15-3-82
   2 Komodo 11.2.2 64-bit         -18.78   -37.56  21.67  21.67      100  44.00%   18.78  82.00%   3.00%         3-15-82
ResultSet-EloRating>los
                             Ho Ko
Houdini 6.02 Pro x64-popcnt     99
Komodo 11.2.2 64-bit          0
I get 37.56 Bayeselo difference again. 1.96-sigma error bars are ±21.67 Bayeselo. LOS is the same as before.

------------------------

I used two different confidence levels to see what happens. Since scores are close enough to 50%, I would expect that:

Code: Select all

z2-sigma ==> ±err2
z3-sigma ==> ±err3

(z3)/(z2) ~ |err3|/|err2|
This happens in the EloSTAT model for scores close to 50%, IIRC. In fact, with z2 = 1 and z3 = 1.96:

Code: Select all

(21.67)/(11.06) ~ 1.9593
This confirms that the confidence setting works in the version that I use.
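As a sanity check on that ratio, these are the z-values behind the two confidence levels, computed with the Python standard library (the helper below is mine, not part of Bayeselo):

Code: Select all

# z-values (numbers of sigmas) for two-sided confidence levels;
# standard library only, the helper name is mine, not Bayeselo's.
from statistics import NormalDist

def z_for_confidence(conf):
    return NormalDist().inv_cdf(0.5 + conf / 2)

z1 = z_for_confidence(0.6827)   # ~ 1.00 (1-sigma)
z2 = z_for_confidence(0.95)     # ~ 1.96
print(z2 / z1)                  # ~ 1.96, close to 21.67/11.06 ~ 1.9593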

I have not used Bayeselo for years and I struggled a bit to remember what I used to type:

Code: Select all

readpgn [...].pgn
elo
confidence 0.95
mm 0 1
ratings
los
There is no special reason to use these settings, but in this particular 100-game match people might feel more comfortable with 37.56 ± 21.67 than with 30 ± 24.

Just for the record, I get more or less 42 ± 28 Elo (95% confidence) and LOS(Houdini) ~ 99.8% with my own calculator, which gives very similar results to EloSTAT in two-engine matches. Again, it has been a long time since I last computed error bars.
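For reference, the common normal-approximation LOS formula, LOS ≈ Φ((W - L)/sqrt(2·(W + L))), reproduces that ~99.8% figure; this is only a generic check, not necessarily the exact formula inside my calculator:

Code: Select all

# Generic check: normal-approximation LOS ~ Phi((W - L) / sqrt(2*(W + L))),
# ignoring draws; not necessarily the exact formula of my own calculator.
from math import erf, sqrt

def los(wins, losses):
    return 0.5 * (1 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

print(f"LOS(Houdini) ~ {los(15, 3):.1%}")   # ~ 99.8% for +15 -3 =82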

Regards from Spain.

Ajedrecista.
MikeB
Posts: 4889
Joined: Thu Mar 09, 2006 6:34 am
Location: Pen Argyl, Pennsylvania

Re: Fishy Bayeselo numbers?

Post by MikeB »

Here is a run using the PGN above with those parameters:

Code: Select all

Mac-Pro:cluster.mfb michaelbyrne$ bay
version 0058, Copyright (C) 1997-2016 Remi Coulom and updated by Michael Byrne.
compiled Jul 24 2016 00:03:35.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under the terms and conditions of the GNU General Public License.
See http://www.gnu.org/copyleft/gpl.html for details.

ResultSet>rp /Users/michaelbyrne/Downloads/H602-K1122NoomenKIFianchetto.pgn
100 game(s) loaded
ResultSet>elo
ResultSet-EloRating>confidence 0.95
0.9
ResultSet-EloRating>mm 0 1
Iteration 100: 0.002
00:00:00,00
ResultSet-EloRating>r
Rank Name                         Rating   Δ     +    -     #     Σ    Σ%     W    L    D   W%    =%   OppR
---------------------------------------------------------------------------------------------------------
   1 Houdini 6.02 Pro x64-popcnt   3119   0.0   22   22   100   56.0  56.0   15    3   82  15.0  82.0  3081 
   2 Komodo 11.2.2 64-bit          3081  37.6   22   22   100   44.0  44.0    3   15   82   3.0  82.0  3119 
---------------------------------------------------------------------------------------------------------
  Δ = delta from the next higher rated opponent
  # = number of games played
  Σ = total score, 1 point for win, 1/2 point for draw

ResultSet-EloRating>los
                             Ho Ko
Houdini 6.02 Pro x64-popcnt     99
Komodo 11.2.2 64-bit          0   
ResultSet-EloRating>
This is aligned with your last two runs. I will use these settings going forward.