CCRL question re bayeselo



Re: CCRL question re bayeselo

Post by Gabor Szots »

This is what Ray has found:

Using a value of 150 makes no difference to the Elo improvement of Dragon, see below.

BayesElo default


Rank Name                                      Elo    +    - games score oppo. draws 
   5 Dragon by Komodo 3.2 64-bit 4CPU          770   15   15  1142   63%   696   73% 
   8 Dragon by Komodo 3 64-bit 4CPU            765   15   15  1126   62%   696   75% 
   9 Dragon by Komodo 3.1 64-bit 4CPU          765   15   15  1140   62%   696   75% 
  13 Dragon by Komodo 2.6 64-bit 4CPU          762   18   18   864   64%   678   70% 
  19 Dragon by Komodo 2.5 64-bit 4CPU          750   18   18   798   63%   678   73% 
With the 150 setting:


Rank Name                                      Elo    +    - games score oppo. draws 
   4 Dragon by Komodo 3.2 64-bit 4CPU          825   14   14  1142   63%   745   73% 
   7 Dragon by Komodo 3 64-bit 4CPU            820   14   14  1126   62%   745   75% 
   9 Dragon by Komodo 3.1 64-bit 4CPU          819   14   14  1140   62%   745   75% 
  13 Dragon by Komodo 2.6 64-bit 4CPU          816   17   17   864   64%   726   70% 
  18 Dragon by Komodo 2.5 64-bit 4CPU          804   17   17   798   63%   726   73% 
So +20 Elo from v2.5 to 3.2 in one, and +21 Elo in the other.

The 150 figure does give a bigger spread: top to bottom of the list spans 2293 Elo with the 150 setting versus 2156 Elo with the default setting. So it reduces the so-called "compression" effect. However, it introduces a nasty effect on the error margins at the bottom of the list, similar to what we currently see on the blitz list:

4015 default


Rank Name                                      Elo    +    - games score oppo. draws 
3368 MSCP 1.6h                               -1229   25   25   628   44% -1178   15% 
3369 Apep 0.1.0                              -1279   25   25   683   41% -1207   10% 
3370 Saruman 2017.08.10 64-bit               -1292   18   18  1249   31% -1140   27% 
3371 Pooky 2.7                               -1319   26   26   626   32% -1174   17% 
3372 Dreamer 0.3.0 64-bit                    -1332   25   26   640   34% -1210   17% 
3373 Hokus-Pokus 0.6.3                       -1368   28   18   628   27% -1171   11% 
3374 BACE 0.46                               -1380   24    8  1027   21% -1123   10% 
4015 with 150 setting


Rank Name                                      Elo    +    - games score oppo. draws 
3368 MSCP 1.6h                               -1308   65  -55   628   44% -1254   15% 
3369 Apep 0.1.0                              -1360  114 -107   683   41% -1283   10% 
3370 Saruman 2017.08.10 64-bit               -1374  126 -121  1249   31% -1214   27% 
3371 Pooky 2.7                               -1403  155 -150   626   32% -1250   17% 
3372 Dreamer 0.3.0 64-bit                    -1418  170 -165   640   34% -1288   17% 
3373 Hokus-Pokus 0.6.3                       -1454  206 -201   628   27% -1247   11% 
Gabor Szots
CCRL testing group

Post by lkaufman »

Interesting. So the increased eloDraw spreads out the ratings by about 7%, much less than I would have guessed, but at least in the right direction. It is quite surprising that it would have such a huge effect on error margins at the lower end; I would just say that assuming 150 is close to the correct value for eloDraw (I was looking at blitz data, it should be even higher for Rapid data, maybe 170 or so), it might just be telling you that the error margins shown for the weak engines were unrealistically low. I suppose that they have very few draws whereas the pool overall has a lot of draws, which when properly accounted for by eloDraw has this consequence. But I can't explain the magnitude of the effect.
Komodo rules!
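Larry's reasoning about draw rates can be made concrete with a few lines of code. The sketch below uses the BayesElo-style draw model as described in the bayeselo documentation (the exact parametrisation here is my reconstruction, not quoted from the program) to show what different eloDraw settings imply for the draw rate between equal engines:

```python
def wdl_probs(delta, elo_advantage=32.8, elo_draw=97.3):
    """White's win/draw/loss probabilities under a BayesElo-style model:
      P(white win) = f(delta + eloAdvantage - eloDraw)
      P(black win) = f(-delta - eloAdvantage - eloDraw)
      P(draw)      = 1 - P(white win) - P(black win)
    where f(x) = 1/(1 + 10**(-x/400)) and delta = eloWhite - eloBlack.
    Defaults are bayeselo's stock values (eloAdvantage 32.8, eloDraw 97.3)."""
    def f(x):
        return 1.0 / (1.0 + 10.0 ** (-x / 400.0))
    p_win = f(delta + elo_advantage - elo_draw)
    p_loss = f(-delta - elo_advantage - elo_draw)
    return p_win, 1.0 - p_win - p_loss, p_loss

# Draw rate between equal engines for the settings discussed in the thread:
for d in (97.3, 150.0, 192.0):
    _, p_draw, _ = wdl_probs(0.0, elo_draw=d)
    print(f"eloDraw={d:5.1f} -> draw rate between equals ~ {p_draw:.0%}")
```

Under this parametrisation the default 97.3 implies only about 27% draws between equals, 150 implies about 40%, and 192 implies roughly half, matching the "half draws between equals" remark later in the thread.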

Post by Modern Times »

lkaufman wrote: Sat Apr 29, 2023 10:55 pm It is quite surprising that it would have such a huge effect on error margins at the lower end;
That already exists on the blitz list (with default values) to a shocking degree. Unclear if this is the truth, or some sort of bayeselo bug:


Rank                 Engine                   Elo   +    -   Score  AvOp  Games
    634 DarkFusch 0.9                           1212  +28   -1  63.5% -122.6   682
    635 Belofte 2.1.3 64-bit                    1211  +25    0  50.2%   -6.0   830
    636 Tikov 0.6.3                             1166  +61  -45  42.9%  +52.2   463
    637 Enxadrista 1.01                         1147  +73  -63  52.9%  -34.9   801
    638 Neurone XXVII                           1129  +89  -82  45.3%  +24.4   878
    639 Megalodon 1.0.0 64-bit                  1117  +99  -93  46.6%  +14.4  1012
    640 Cassandre 0.24                          1096 +121 -114  57.6%  -65.3   786
    641 Iota 1.0                                1081 +136 -129  56.2%  -62.5   737
    642 ROBOKewlper 0.047a                      1059 +157 -151  44.8%  +34.7   778
    643 Safrad 2.2.40.360                        975 +244 -235  55.0%  -40.3   301
    644 Delimiter 0.1.1 64-bit                   909 +307 -302  36.6% +106.9   757
    645 Zoe 0.1                                  880 +336 -330  52.0%  -12.1   367
    646 Pyotr 0.6 AM                             781 +434 -429  69.9% -131.6   498
    647 Monchester 1.0 64-bit                    774 +442 -437  40.5%  +77.4   835
    648 Dika 0.4209                              725 +490 -485  63.5%  -75.1   504
    649 Chessputer revision 7 64-bit             689 +527 -522  59.8%  -51.2   462
    650 AcquaD 3.9.1 LDC                         670 +546 -540  43.4%  +65.4   843
    651 Alouette 0.1.4 64-bit                    660 +556 -551  11.8% +369.8   436
    652 EasyPeasy 1.0                            429 +786 -781  23.2% +179.0   381

Post by lkaufman »

Modern Times wrote: Sat Apr 29, 2023 11:26 pm
lkaufman wrote: Sat Apr 29, 2023 10:55 pm It is quite surprising that it would have such a huge effect on error margins at the lower end;
That already exists on the blitz list (with default values) to a shocking degree. Unclear if this is the truth, or some sort of bayeselo bug:


Rank                 Engine                   Elo   +    -   Score  AvOp  Games
    634 DarkFusch 0.9                           1212  +28   -1  63.5% -122.6   682
    635 Belofte 2.1.3 64-bit                    1211  +25    0  50.2%   -6.0   830
    636 Tikov 0.6.3                             1166  +61  -45  42.9%  +52.2   463
    637 Enxadrista 1.01                         1147  +73  -63  52.9%  -34.9   801
    638 Neurone XXVII                           1129  +89  -82  45.3%  +24.4   878
    639 Megalodon 1.0.0 64-bit                  1117  +99  -93  46.6%  +14.4  1012
    640 Cassandre 0.24                          1096 +121 -114  57.6%  -65.3   786
    641 Iota 1.0                                1081 +136 -129  56.2%  -62.5   737
    642 ROBOKewlper 0.047a                      1059 +157 -151  44.8%  +34.7   778
    643 Safrad 2.2.40.360                        975 +244 -235  55.0%  -40.3   301
    644 Delimiter 0.1.1 64-bit                   909 +307 -302  36.6% +106.9   757
    645 Zoe 0.1                                  880 +336 -330  52.0%  -12.1   367
    646 Pyotr 0.6 AM                             781 +434 -429  69.9% -131.6   498
    647 Monchester 1.0 64-bit                    774 +442 -437  40.5%  +77.4   835
    648 Dika 0.4209                              725 +490 -485  63.5%  -75.1   504
    649 Chessputer revision 7 64-bit             689 +527 -522  59.8%  -51.2   462
    650 AcquaD 3.9.1 LDC                         670 +546 -540  43.4%  +65.4   843
    651 Alouette 0.1.4 64-bit                    660 +556 -551  11.8% +369.8   436
    652 EasyPeasy 1.0                            429 +786 -781  23.2% +179.0   381
Such enormous numbers for MOE don't seem remotely plausible. Same issue on the bottom ratings of the archived list. Note also the first two entries in your above list, which show ranges of +28 to -1 and +25 to 0. Almost all other ranges are roughly symmetrical, with the plus and minus numbers usually differing by a small percentage only. It makes no sense to have a 25 elo margin on the upside and zero on the downside. Something is clearly wrong, I just don't know what it is. Anyway I think it means that you shouldn't pay any attention to MOE values for the low-rated engines until this is fixed or explained.

Post by Modern Times »

A bug seems most likely, so yes, those MOEs have to be ignored unless Remi would be inclined to take a look and fix it.

Post by Modern Times »

Resurrecting a 9-month old thread in case Larry is still looking at the forum and is still interested...

We've solved the error margin problems with bayeselo at the bottom of the blitz and other lists. Using the parameter "covariance" instead of "exactdist" that we had been using gives error values that are as you would expect.
Bayeselo offers four different algorithms for computing confidence intervals. This is the list of options, from the fastest and least accurate to the slowest and most accurate:

Default: assume opponents ratings are their true ratings, and Gaussian distribution

"exactdist": assume opponents ratings are their true ratings, but does not assume Gaussian distribution. This will produce asymmetric intervals, especially for very high or very low winning rates. Cost is linear in the number of players.

"covariance": assume Gaussian distribution, but not that the rating of opponents are true. This may be very costly if you have thousands of players, but it is more accurate than the default. The cost is cubic in the number of players (it is a matrix inversion)

"jointdist": computes a numerical estimation of the whole distribution. It is the most accurate, but the cost is exponential in the number of players. May work for 3-4 players. You should reduce the resolution of the discretization for more players.

In terms of the draw values discussed earlier in the thread, we talked around a 117.861 value that was experimented with 10 years ago on blitz. Further digging reveals that there was also a 4015 value of 154.278 that was calculated at that same time. Very close, Larry, to the numbers of 150-160 you were suggesting might be more suitable. That number is probably even higher now. At the time, one of the team extracted all engine pairings within 50 Elo of each other as the basis for making those calculations, and he also looked at them in 100 Elo buckets. He is no longer with us unfortunately, so we don't know exactly how he arrived at the numbers. I am investigating further.

Post by lkaufman »

Modern Times wrote: Thu Jan 04, 2024 2:28 pm Resurrecting a 9-month old thread in case Larry is still looking at the forum and is still interested...

We've solved the error margin problems with bayeselo at the bottom of the blitz and other lists. Using the parameter "covariance" instead of "exactdist" that we had been using gives error values that are as you would expect.
[…]
Thanks. Glad that the MOE issue is resolved, and that there is some confirmation of my higher DrawElo value estimate. I suppose the "4015" in your post is a typo, probably you meant the year 2015?

Post by Gabor Szots »

lkaufman wrote: Thu Jan 04, 2024 7:39 pmI suppose the "4015" in your post is a typo, probably you meant the year 2015?
No, it refers to the 40/15 (then 40/40) list.

Post by Modern Times »

Yes, sorry for the confusion, I should have typed 40/15.

The bayeselo defaults are from a fairly small sample in today's world:
eloAdvantage indicates the advantage of playing first. eloDraw indicates how likely draws are. The default values in the program were obtained by finding their maximum-likelihood values over 29,610 games of Leo Dijksman's WBEC. The values measured, with 95% confidence intervals, are:

eloAdvantage = 32.8 +/- 4
eloDraw = 97.3 +/- 2
bayeselo finds the maximum-likelihood ratings, using a minorization-maximization (MM) algorithm. A description of this algorithm is available in the Links section below.
A value of 154.278 gets very close to the Elo spread that Ordo calculates, and a value of 192 (which, as you estimated, "would mean half draws between equals") gives a spread only 3 Elo different from Ordo's. I did pull out all games from 40/15 where opponents were less than 50 Elo apart, and then got rid of engines below 2500 Elo. The draw ratio on those 600,000+ games is just under 55%.

Even though I don't have the detailed backup for the 154.278, it came from Adam Hair, who worked very closely with Miguel on Ordo and is credited in the Ordo user manual, so he has some good knowledge. Knowing now where the bayeselo defaults came from, I'm firmly of the view that we ought to be using Adam's numbers, or maybe even 192. They may be 10 years old now, but I'm convinced they are a better fit for our database than the defaults.
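The measured ~55% draw ratio between near-equal opponents can be converted directly into an implied eloDraw. A minimal sketch under the same draw model, ignoring eloAdvantage (an approximation):

```python
import math

def implied_elo_draw(draw_rate):
    """Invert the draw model between equal opponents, ignoring the
    first-move advantage: P(draw) = 1 - 2*f(-eloDraw) with
    f(x) = 1/(1 + 10**(-x/400)), which solves to
    eloDraw = 400 * log10((1 + d) / (1 - d))."""
    return 400.0 * math.log10((1.0 + draw_rate) / (1.0 - draw_rate))

print(implied_elo_draw(0.50))   # ~191, close to the 192 figure
print(implied_elo_draw(0.55))   # ~215, i.e. even higher than 192
```

Under this approximation, half draws between equals corresponds to eloDraw around 191 (roughly the 192 figure once the first-move advantage is folded in), and a 55% draw ratio implies a value above 210, consistent with the suggestion that the number is probably even higher now.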

Post by lkaufman »

Modern Times wrote: Fri Jan 05, 2024 2:46 am Yes, sorry for the confusion, I should have typed 40/15.

[…]
Even though I don't have the detailed backup for the 154.278, it came from Adam Hair who worked very closely with Miguel on Ordo and is credited in the Ordo user manual. So he has some good knowledge. Knowing now where the bayeselo defaults came from, I'm firmly of the view that we ought to be using Adam's numbers, or maybe even 192. They may be 10 years old now but I'm convinced they are a better fit for our database than the default.
Basically it comes down to which range of engines you want to get "right". The assumption of constant drawElo is just not realistic for a huge range of elo values, so Ordo is better to get the entire range right, but if you mostly care about a specific range (presumably the top portion), then using BayesElo with a fairly high DrawElo is okay. The value of 192 would probably be about right for the 2700 or 2800 to 4000 range. The 154 value might be best for a range like 2200 to 4000 or so. The current value might be right for the entire range, but surely there is far more interest in 3000 level ratings than in 1500 level ratings, given that the lower ones don't correlate as well with human ratings anyway.