BayesianElo or Ordo ?

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, chrisw, Rebel

lkaufman
Posts: 6108
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: BayesianElo or Ordo ?

Post by lkaufman »

Raphexon wrote: Sun Oct 17, 2021 12:05 am
Modern Times wrote: Sat Oct 16, 2021 7:09 pm CCRL did reduce its ratings by 100 Elo a few years back because we felt they were too high.

Bayeselo is my preference. Some say that it compresses ratings, I'd turn that around and say that Ordo expands them. I don't think one is better than the other, they both have sound statistical grounding, but they work differently.

Daniel Shawul had this to say about the two last year:



forum3/viewtopic.php?f=7&t=73761&p=8413 ... lo#p841358

by Daniel Shawul » Sat Apr 25, 2020 2:08 pm

Why are you using Ordo anyway? It clearly has inferior algorithms to bayeselo, which is based on a Bayesian approach.
Here https://www.remi-coulom.fr/Bayesian-Elo/ Remi discusses some of its advantages over the prior algorithm, EloStat.
For accurate standard deviations, there is an option to compute the covariance matrix (a bit slower), which is not the default.
Ordo probably uses Monte Carlo sampling of some sort for that, but in bayeselo you find better theory and algorithms.

Bayeselo does take the home-field advantage (color) into consideration but does not take the draw ratio into account.
It was later extended to do so using the Davidson model, which turned out to be the best of the three draw models tried.
Bayeselo is nice when a game has no or few draws.
Remi has a lot of history in the computer Go community, and it shows...

Ordo is nicer for (modern computer) chess.
Yes, I think the issue with BayesElo is that it assumes a constant draw percentage between equally rated opponents (a parameter setting, but once set, a constant for all games rated). But as we know, the draw percentage between 3600-rated engines is much higher than the draw percentage between 2600-rated engines. So it is fundamentally flawed for chess, where there are many draws and the draw percentage varies with level. Ordo doesn't have this problem.
But all current rating systems have the problem that rating differences contract at longer time limits (human or engine, it doesn't matter) due to more draws, and also that when opening positions are stipulated (with color reversal), the size of one side's advantage affects the rating spread. Ideally a rating system should be immune (in terms of overall spread) to time control and to opening choice in reversal play. I don't know of any that qualify, but I do have a proposal that might solve, or at least dramatically reduce, these two problems.

I'm assuming reversal testing of paired games with specified start positions. My proposal is simply to discard from the database to be rated all pairs of games that result in a 1-1 tied score (whether due to two draws or two wins), then run the remaining games through Ordo. This is completely fair, but will obviously result in much larger rating differences. Extremely drawish openings will mostly be tossed due to many tied matches, while easily won positions will also be tossed for the same reason. What's left (when fairly equal opponents are playing) will be openings where both a win and a draw are plausible results. If you want the ratings to resemble human Elo, just scale the differences (from the reference engine) by whatever percentage is needed.

With this method I am certain that we will see continued progress at a good rate for years to come; a slightly better engine should win most of the matches that are not tied, even if most of them are tied and discarded. The Stockfish crowd should love this idea; they already talk about their results in such two-game matches rather than in individual games. It could even be used for human chess where pairs of games are played.
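As a sketch of the proposal (my own illustration, not an existing tool): walk the games in reversed-color pairs, drop every pair whose two-game mini-match ended 1-1, and feed the survivors to Ordo. The pairing convention below (a pair is two consecutive entries with colors reversed) is an assumption for the example.

```python
def filter_tied_pairs(games):
    """Keep only game pairs whose two-game mini-match was decisive.

    `games` is a list of (white, black, white_score) tuples, with each
    reversed-color pair stored as two consecutive entries (an assumed
    convention for this sketch, not any PGN standard)."""
    kept = []
    for g1, g2 in zip(games[0::2], games[1::2]):
        # Mini-match score of the first game's White player.
        score = g1[2] + (1.0 - g2[2])
        if score != 1.0:  # 1-1 ties (two draws, or one win each) are discarded
            kept.extend([g1, g2])
    return kept

games = [
    ("Engine1", "Engine2", 0.5), ("Engine2", "Engine1", 0.5),  # two draws: tossed
    ("Engine1", "Engine2", 1.0), ("Engine2", "Engine1", 1.0),  # one win each: tossed
    ("Engine1", "Engine2", 1.0), ("Engine2", "Engine1", 0.5),  # decisive: kept
]
print(filter_tied_pairs(games))  # only the last pair survives
```

Note that the filter throws away drawish and lopsided openings alike, which is exactly the intent of the proposal.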
Komodo rules!
ChickenLogic
Posts: 154
Joined: Sun Jan 20, 2019 11:23 am
Full name: kek w

Re: BayesianElo or Ordo ?

Post by ChickenLogic »

lkaufman wrote: Sun Oct 17, 2021 1:45 am
Normalized elo is already used at fishtest: https://hardy.uhasselt.be/Fishtest/normalized_elo.pdf
With the pentanomial model used by fishtest it makes no sense to talk about single games anyway.
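For readers who want the mechanics: the statistic behind normalized Elo can be computed straight from pentanomial frequencies. This is only my sketch of the idea in the linked pdf; the counts are invented, and the 800/ln 10 scaling is the convention I believe fishtest uses.

```python
import math

# Invented pentanomial counts for one match: how often a reversed-color
# game pair scored 0, 0.5, 1, 1.5 or 2 points (out of 2).
counts = [10, 25, 80, 60, 25]
N = sum(counts)
per_game = [0.0, 0.25, 0.5, 0.75, 1.0]  # per-game score of each pair outcome

mean = sum(c * s for c, s in zip(counts, per_game)) / N
sigma = math.sqrt(sum(c * (s - mean) ** 2 for c, s in zip(counts, per_game)) / N)

# t-statistic per game pair; my understanding is that the pdf's normalized
# Elo is this quantity rescaled by the constant 800/ln(10).
t_per_pair = (mean - 0.5) / sigma
nelo = t_per_pair * 800.0 / math.log(10)
print(round(mean, 5), round(nelo, 1))
```

Because the variance is estimated over pairs, correlated results within a pair (say, two decisive games from a lopsided opening) are accounted for, which a single-game (trinomial) model cannot do.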
lkaufman
Posts: 6108
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: BayesianElo or Ordo ?

Post by lkaufman »

ChickenLogic wrote: Sun Oct 17, 2021 2:21 am
Normalized elo is already used at fishtest: https://hardy.uhasselt.be/Fishtest/normalized_elo.pdf
With the pentanomial model used by fishtest it makes no sense to talk about single games anyway.
Thanks, I'll have to study that pdf. Has anyone produced a rating list of multiple (unrelated) engines using this method? Would it work properly for a pool of games where some strong engines draw 90% of their games with equals while weak engines draw 30% of their games with equals (for example)?
Komodo rules!
Modern Times
Posts: 3638
Joined: Thu Jun 07, 2012 11:02 pm

Re: BayesianElo or Ordo ?

Post by Modern Times »

Rebel wrote: Sun Oct 17, 2021 12:44 am
Modern Times wrote: Sat Oct 16, 2021 10:30 pm
Rebel wrote: Sat Oct 16, 2021 8:55 pm
CCRL 40/15
SF12 - 3476
SF11 - 3433
Only +43

CEGT 40/20
SF12 - 3530
SF11 - 3435
+95

How do you explain ?
Explain what? Run the 40/15 database through Ordo and the rating diff is +52 Elo (rather than +43). Not a lot different, so the rating tool isn't making much difference. It is what it is.
I did as well :

Code: Select all

   # PLAYER                                     :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)    W     D    L  D(%)  OppAvg
  17 Stockfish 12 64-bit                        :  3521.0   17.9   479.0     738    65      57  234   490   14    66  3405.2
  40 Stockfish 11 64-bit                        :  3433.0    6.3   304.0     506    60      57  119   370   17    73  3379.6
3521 - 3433 = +88
We must have a different Ordo then, or we're using different databases:

Code: Select all

 # PLAYER                                     :  RATING  POINTS  PLAYED   (%)
 36 Stockfish 11 64-bit                        :  3562.2  1136.5    1549    73
 22 Stockfish 12 64-bit                        :  3613.7  1200.5    1612    74
3614 - 3562 = +52 Elo
amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Re: BayesianElo or Ordo ?

Post by amanjpro »

Modern Times wrote: Sun Oct 17, 2021 8:58 am
Maybe the 4CPU results are used instead of the 1CPU ones
User avatar
Rebel
Posts: 7255
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: BayesianElo or Ordo ?

Post by Rebel »

Modern Times wrote: Sun Oct 17, 2021 8:58 am
Downloaded the latest 40/15 database and made a selection of all games between 3200+ Elo rated engines, 109,676 games in total, as I assume you don't test a 3433-rated engine (SF11) against engines below 3200 Elo.

I guess it then depends on how you use Ordo; that would explain it. I use 4 anchor engines for this pool.

anchor.csv

Code: Select all

"Stockfish 11 64-bit"       , 3433
"Houdini 6 64-bit"          , 3345
"Stockfish 8 64-bit"        , 3299
"Fire 7.1 64-bit"           , 3257
Ordo

Code: Select all

ordo-win64 -p ccrl.pgn -W -U "0,1,2,3,4,5,6,7,8,9,10,11" -o output.txt -m anchor.csv -C -J -s10 -V -D -j head-to-head.txt
start output.txt
Output.txt

Code: Select all

   # PLAYER                                     :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)    W     D     L  D(%)  OppAvg
  17 Stockfish 12 64-bit                        :  3520.3   15.6  1200.5    1612    74      50  806   789    17    49  3315.9
  37 Stockfish 11 64-bit                        :  3433.0    7.4  1082.0    1487    73      55  697   770    20    52  3280.0
3520 - 3433 = 87
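The gap difference (87 vs 52) is what you would expect once several anchors are pinned: the ratings are no longer free to float, and the fit has to compromise between all the fixed points. A toy illustration with invented results and a naive gradient-ascent Elo fit (not Ordo's actual algorithm):

```python
# Invented head-to-head scores: score[i][j] = points engine i took off
# engine j (draws already counted as half); 100 games per pairing.
names = ["A", "B", "C"]
score = [[0, 60, 80],
         [40, 0, 65],
         [20, 35, 0]]

def expected(ra, rb):
    # Standard logistic Elo expectation.
    return 1.0 / (1.0 + 10.0 ** (-(ra - rb) / 400.0))

def fit(ratings, fixed, iters=20000):
    """Naive gradient ascent on the Elo log-likelihood; indices listed in
    `fixed` are anchors and never move."""
    r = list(ratings)
    for _ in range(iters):
        for i in range(len(r)):
            if i in fixed:
                continue
            grad = 0.0
            for j in range(len(r)):
                if i != j:
                    games = score[i][j] + score[j][i]
                    grad += score[i][j] - games * expected(r[i], r[j])
            r[i] += grad / 200.0  # small step, proportional to the gradient
    return r

free = fit([0.0, 0.0, 0.0], fixed={0})            # only A pinned: gaps float
anchored = fit([0.0, 0.0, -300.0], fixed={0, 2})  # A and C both pinned
print([round(x) for x in free], [round(x) for x in anchored])
```

With A and C pinned further apart than the game results alone suggest, B's fitted rating is dragged along and the A-B gap widens; the same mechanism lets a four-anchor Ordo run spread SF11/SF12 more than an unanchored one.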
90% of coding is debugging, the other 10% is writing bugs.
User avatar
Rebel
Posts: 7255
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: BayesianElo or Ordo ?

Post by Rebel »

I ran Ordo without anchor engines and indeed I get your 53 elo.

Code: Select all

   # PLAYER                                     :  RATING  POINTS  PLAYED   (%)
  18 Stockfish 12 64-bit                        :  3590.9  1200.5    1612    74
  29 Stockfish 11 64-bit                        :  3537.3  1082.0    1487    73
3590 - 3537 = 53
90% of coding is debugging, the other 10% is writing bugs.
Modern Times
Posts: 3638
Joined: Thu Jun 07, 2012 11:02 pm

Re: BayesianElo or Ordo ?

Post by Modern Times »

So there are statistics, and there are statistics. You can tell whatever story you like from a set of data, as our politicians often do!
User avatar
Rebel
Posts: 7255
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: BayesianElo or Ordo ?

Post by Rebel »

"Denial ain't just a river in Egypt."
- Mark Twain (1835-1910)
90% of coding is debugging, the other 10% is writing bugs.
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: BayesianElo or Ordo ?

Post by Ferdy »

A sample scenario where, I believe, bayeselo is better than Ordo.

Code: Select all

A scored 2.5/3 vs B, A has 3 whites
B scored 0.5/1 vs C, B has 1 white
C scored 4.5/5 vs D, C has 5 whites

Code: Select all

Head to head statistics:

1) C  163 :      6 (+4,=2,-0),  83.3 %

   vs.     :  games ( +, =, -),   (%) :   Diff,   SD, CFS (%)
   D       :      5 ( 4, 1, 0),  90.0 :   +110,   59,   96.9
   B       :      1 ( 0, 1, 0),  50.0 :   +275,   63,  100.0

2) D   53 :      5 (+0,=1,-4),  10.0 %

   vs.     :  games ( +, =, -),   (%) :   Diff,   SD, CFS (%)
   C       :      5 ( 0, 1, 4),  10.0 :   -110,   59,    3.1

3) A -104 :      3 (+2,=1,-0),  83.3 %

   vs.     :  games ( +, =, -),   (%) :   Diff,   SD, CFS (%)
   B       :      3 ( 2, 1, 0),  83.3 :     +7,   67,   54.4

4) B -112 :      4 (+0,=2,-2),  25.0 %

   vs.     :  games ( +, =, -),   (%) :   Diff,   SD, CFS (%)
   C       :      1 ( 0, 1, 0),  50.0 :   -275,   63,    0.0
   A       :      3 ( 0, 1, 2),  16.7 :     -7,   67,   45.6
Ordo anchored at 0.
Options: -W -D -G

Code: Select all

   # PLAYER    :  RATING  ERROR  POINTS  PLAYED   (%)    W    L
   1 C         :     163     86     5.0       6  83.3    4    0
   2 D         :      53     92     0.5       5  10.0    0    4
   3 A         :    -104    101     2.5       3  83.3    2    0
   4 B         :    -112     86     1.0       4  25.0    0    2

Ordo anchored at 0.

Code: Select all

   # PLAYER    :  RATING  ERROR  POINTS  PLAYED   (%)    W    L
   1 A         :     308    154     2.5       3  83.3    2    0
   2 C         :      26     68     5.0       6  83.3    4    0
   3 B         :      26     68     1.0       4  25.0    0    2
   4 D         :    -359    140     0.5       5  10.0    0    4

Bayeselo:
advantage of playing first = 0 (default)
drawelo = 0 (default)

Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A      108  219  181     3   83%     6   33%
   2 C       27  168  141     6   83%  -117   33%
   3 B        6  155  174     4   25%    88   50%
   4 D     -141  158  210     5   10%    27   20%
I like the ranking of bayeselo here: "A" defeated "B", and "B" was able to draw with "C" once.


Bayeselo:
advantage of playing first = 33

Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A       57  221  183     3   83%   -16   33%
   2 C       47  170  142     6   83%   -76   33%
   3 B      -16  157  176     4   25%    55   50%
   4 D      -88  160  212     5   10%    47   20%

The so-called Elo compression in bayeselo is like favoring pessimistic behavior. When testing players x and y, they are initially assumed to be close in strength. As the actual results come in, the ratings are adjusted. To be ranked best, a player has to show more wins in as many games as possible to pull away from the initial assumption. I prefer this behavior.
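For context on the two parameters Ferdy is setting ("advantage of playing first" and "drawelo"): in bayeselo's model, as I read Rémi Coulom's page, those two numbers shift and widen a logistic curve. A minimal sketch of that model; the default values in the function signature are from memory of bayeselo's built-ins and may differ (Ferdy explicitly set both to 0 in his first run).

```python
def f(delta):
    # Logistic expectation on the Elo scale.
    return 1.0 / (1.0 + 10.0 ** (delta / 400.0))

def outcome_probs(elo_white, elo_black, elo_advantage=32.8, elo_draw=97.3):
    """Win/draw/loss probabilities in the bayeselo model as I understand it:
    the first-move advantage shifts the effective rating difference, and
    eloDraw carves out a draw band around equality."""
    delta = elo_white - elo_black + elo_advantage
    p_white = f(elo_draw - delta)
    p_black = f(elo_draw + delta)
    return p_white, 1.0 - p_white - p_black, p_black

# With drawelo = 0 the model's draw probability collapses to zero, so the
# "drawelo = 0" run above behaves like a pure win/loss model.
print(outcome_probs(0, 0, elo_advantage=0, elo_draw=0))  # (0.5, 0.0, 0.5)
```

A single drawelo for the whole pool is exactly the "constant draw percentage" assumption lkaufman objects to earlier in the thread.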