BayesianElo or Ordo ?

lkaufman
Posts: 6260
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: BayesianElo or Ordo ?

Post by lkaufman »

Ferdy wrote: Sun Oct 17, 2021 12:32 pm A sample scenario where I believe bayeselo is better than ordo.

Code: Select all

A scored 2.5/3 vs B, A has 3 whites
B scored 0.5/1 vs C, B has 1 white
C scored 4.5/5 vs D, C has 5 whites

Code: Select all

Head to head statistics:

1) C  163 :      6 (+4,=2,-0),  83.3 %

   vs.     :  games ( +, =, -),   (%) :   Diff,   SD, CFS (%)
   D       :      5 ( 4, 1, 0),  90.0 :   +110,   59,   96.9
   B       :      1 ( 0, 1, 0),  50.0 :   +275,   63,  100.0

2) D   53 :      5 (+0,=1,-4),  10.0 %

   vs.     :  games ( +, =, -),   (%) :   Diff,   SD, CFS (%)
   C       :      5 ( 0, 1, 4),  10.0 :   -110,   59,    3.1

3) A -104 :      3 (+2,=1,-0),  83.3 %

   vs.     :  games ( +, =, -),   (%) :   Diff,   SD, CFS (%)
   B       :      3 ( 2, 1, 0),  83.3 :     +7,   67,   54.4

4) B -112 :      4 (+0,=2,-2),  25.0 %

   vs.     :  games ( +, =, -),   (%) :   Diff,   SD, CFS (%)
   C       :      1 ( 0, 1, 0),  50.0 :   -275,   63,    0.0
   A       :      3 ( 0, 1, 2),  16.7 :     -7,   67,   45.6
Ordo anchored at 0.
Options: -W -D -G

Code: Select all

   # PLAYER    :  RATING  ERROR  POINTS  PLAYED   (%)    W    L
   1 C         :     163     86     5.0       6  83.3    4    0
   2 D         :      53     92     0.5       5  10.0    0    4
   3 A         :    -104    101     2.5       3  83.3    2    0
   4 B         :    -112     86     1.0       4  25.0    0    2

Ordo anchored at 0.

Code: Select all

   # PLAYER    :  RATING  ERROR  POINTS  PLAYED   (%)    W    L
   1 A         :     308    154     2.5       3  83.3    2    0
   2 C         :      26     68     5.0       6  83.3    4    0
   3 B         :      26     68     1.0       4  25.0    0    2
   4 D         :    -359    140     0.5       5  10.0    0    4

Bayeselo:
advantage of playing first = 0 (default)
drawelo = 0 (default)

Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A      108  219  181     3   83%     6   33%
   2 C       27  168  141     6   83%  -117   33%
   3 B        6  155  174     4   25%    88   50%
   4 D     -141  158  210     5   10%    27   20%
I like the ranking of bayeselo here: "A" defeated "B", and "B" was able to hold "C" to a draw in their single game.


Bayeselo:
advantage of playing first = 33

Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A       57  221  183     3   83%   -16   33%
   2 C       47  170  142     6   83%   -76   33%
   3 B      -16  157  176     4   25%    55   50%
   4 D      -88  160  212     5   10%    47   20%

The so-called Elo compression in bayeselo amounts to favoring a pessimistic behavior. When testing players x and y, they are initially assumed to be close in strength; as the actual results come in, the ratings are adjusted. To establish which one is best, a player has to show more wins in as many games as possible in order to pull away from that initial assumption. I prefer this behavior.
You seem to be suggesting that with enough games, the compression disappears. But I think it has to do with draws. If A plays one million games with B and scores 76%, under the Elo system (Ordo or even Elostat, doesn't matter) A will end up 200 above B, regardless of whether it was 760000 wins to 240000 losses or 520000 wins and 480000 draws (or anything in between). But with BayesElo, the rating gap will be less than 200 by an amount depending on the number of draws, unless I am badly mistaken. Am I wrong?
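For reference, the 200 figure follows straight from the logistic Elo curve that the score-based tools are built on; here is a quick back-of-the-envelope check in Python (just the formula, not output from Ordo or Elostat):

Code: Select all

import math

def elo_gap_from_score(score):
    # Invert the logistic expectation: score = 1 / (1 + 10 ** (-gap / 400)).
    return -400.0 * math.log10(1.0 / score - 1.0)

# A 76% score maps to roughly a 200 Elo gap, regardless of how that
# score is split between wins and draws.
print(round(elo_gap_from_score(0.76)))  # ~200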
Komodo rules!
Ferdy
Posts: 4851
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: BayesianElo or Ordo ?

Post by Ferdy »

lkaufman wrote: Sun Oct 17, 2021 5:52 pm
You seem to be suggesting that with enough games, the compression disappears. But I think it has to do with draws. If A plays one million games with B and scores 76%, under the Elo system (Ordo or even Elostat, doesn't matter) A will end up 200 above B, regardless of whether it was 760000 wins to 240000 losses or 520000 wins and 480000 draws (or anything in between). But with BayesElo, the rating gap will be less than 200 by an amount depending on the number of draws, unless I am badly mistaken. Am I wrong?
With more draws the Bayeselo gap decreases, but only up to a certain limit.

Score rate at 76% with draws.

Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A       79   18   18  1000   76%   -79   48%
   2 B      -79   18   18  1000   24%    79   48%

More games and more draws, but the score rate is the same. The rating gap stays the same, but with a tighter confidence interval.

Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A       80    6    6 10000   76%   -80   48%
   2 B      -80    6    6 10000   24%    80   48%
To increase the gap, more wins are needed and draws have to be minimized.

BTW, the bayeselo program has a command to change the rating scale, thereby adjusting the gap to match Ordo and classical Elo.
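To illustrate how the draw fraction alone moves the fitted gap, here is a rough sketch of a maximum-likelihood fit under a Rao-Kupper style win/draw/loss model, which is the kind of model bayeselo is generally described as using. The fixed drawelo of 97 and the crude grid search are assumptions made purely for illustration; the real program also applies a prior and can fit drawelo from the data, so the exact numbers will not match the tables above.

Code: Select all

import math

def probs(diff, drawelo=97.0):
    # Rao-Kupper style model: the draw probability takes mass away
    # from both the win and the loss probabilities.
    p_win = 1.0 / (1.0 + 10 ** ((drawelo - diff) / 400.0))
    p_loss = 1.0 / (1.0 + 10 ** ((drawelo + diff) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def fitted_gap(win_frac, draw_frac, loss_frac):
    # Crude grid search for the rating difference that maximizes the
    # per-game log-likelihood of the observed result fractions.
    best_diff, best_ll = 0.0, float('-inf')
    for step in range(5000):
        diff = step / 10.0
        p_win, p_draw, p_loss = probs(diff)
        ll = 0.0
        for frac, p in ((win_frac, p_win), (draw_frac, p_draw), (loss_frac, p_loss)):
            if frac > 1e-12:
                ll += frac * math.log(p)
        if ll > best_ll:
            best_diff, best_ll = diff, ll
    return best_diff

# Same 76% score, different draw fractions: the fitted gap shrinks
# as the draw fraction grows.
for draws in (0.0, 0.24, 0.48):
    wins = 0.76 - draws / 2.0
    losses = 1.0 - wins - draws
    print(f'draws={draws:.0%}  fitted gap ~ {fitted_gap(wins, draws, losses):.0f} Elo')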
lkaufman
Posts: 6260
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: BayesianElo or Ordo ?

Post by lkaufman »

Ferdy wrote: Sun Oct 17, 2021 8:22 pm
Thanks. So based on this it seems that once the engines have played a thousand games or so (typical in the rating lists), playing more games doesn't have much effect on the scale of the list, only on the accuracy of individual engine ratings. If I assume that with no draws BayesElo matches Ordo given large numbers of games (i.e. 76% wins = a 200 Elo gap; is this assumption correct?), then the conclusion is that the contraction from using BayesElo ranges from nothing with no draws to 20% with no wins for the inferior engine (with enough games). If this is accurate, then changing the scale of the list will result in overstating the rating differences at the low end (where there are few draws) while still understating them at the high end (where the weaker engine rarely wins). I think it's just not suitable for the actual situation of rating lists, where the same 76% score will show anywhere from a 160 to 200 Elo gap (with a lot of games) depending on draw frequency, which in turn depends on engine strength and time limit.
Komodo rules!
Ferdy
Posts: 4851
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: BayesianElo or Ordo ?

Post by Ferdy »

lkaufman wrote: Sun Oct 17, 2021 9:48 pm
Comparison without draws at 76% score rate.

Image
Ferdy
Posts: 4851
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: BayesianElo or Ordo ?

Post by Ferdy »

CMCanavessi wrote: Sat Oct 16, 2021 9:06 pm What about Glicko, Glicko-2, Trueskill, etc? I can't even find tools to parse a .pgn file with those.
Rating gap comparison with Glicko2. If the RD is high, the rating gap is also high.

Image

The 8-game run is a single rating period starting from r=1500, rd=50; the 16-game run and the others are the same, also starting from r=1500, rd=50.

I use the glicko2 module from here.
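The RD effect is easy to reproduce even with the plain Glicko-1 update formulas. The sketch below does not use the glicko2 module from the chart; it just applies the published Glicko-1 equations to one made-up rating period (6 wins and 2 losses against a 1500 / RD 50 opponent, numbers chosen only for illustration), showing how a larger starting RD turns the same results into a much larger rating move.

Code: Select all

import math

Q = math.log(10) / 400.0

def g(rd):
    # Attenuation factor for the opponent's rating deviation.
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd / math.pi) ** 2)

def expected(r, r_opp, rd_opp):
    # Expected score against a single opponent.
    return 1.0 / (1.0 + 10 ** (-g(rd_opp) * (r - r_opp) / 400.0))

def glicko1_period(r, rd, results):
    # results: list of (opponent rating, opponent RD, score) for one period.
    d2_inv = Q * Q * sum(
        g(opp_rd) ** 2 * expected(r, opp_r, opp_rd) * (1.0 - expected(r, opp_r, opp_rd))
        for opp_r, opp_rd, _ in results)
    new_rd = math.sqrt(1.0 / (1.0 / rd ** 2 + d2_inv))
    new_r = r + Q * new_rd ** 2 * sum(
        g(opp_rd) * (score - expected(r, opp_r, opp_rd))
        for opp_r, opp_rd, score in results)
    return new_r, new_rd

# The same 6-2 result moves a high-RD player much further than a low-RD one.
games = [(1500.0, 50.0, 1.0)] * 6 + [(1500.0, 50.0, 0.0)] * 2
for start_rd in (50.0, 350.0):
    new_r, new_rd = glicko1_period(1500.0, start_rd, games)
    print(f'start RD {start_rd:3.0f} -> rating {new_r:6.1f}, RD {new_rd:5.1f}')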
Ferdy
Posts: 4851
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: BayesianElo or Ordo ?

Post by Ferdy »

Ferdy wrote: Mon Oct 18, 2021 4:51 pm

Using TrueSkill from a single round-robin tournament

Initial rating

Code: Select all

name: Cheng 4.40 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Spike 1.4 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Wasp 4.5 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Hiarcs 14 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Stockfish 13 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Cheese 2.2 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: CT800 V1.42 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Arasan 21.1 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Minic 2.51 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Amyan 1.72 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Deuterium v2021.1.38.118 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Ufim v8.02 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Rhetoric 1.4.3 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Deuterium v2019.2.37.73 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Rodent IV 022 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
The sigma or rating deviation indicates the degree of uncertainty, and mu is the mean skill or rating.


TrueSkill updated rating list after the tournament

Code: Select all

                                     name     rating     sigma
1                  Spike 1.4 UCI_Elo 1500  39.468082  3.877891
2                 Cheese 2.2 UCI_Elo 1500  37.579741  3.237980
3                 Minic 2.51 UCI_Elo 1500  31.911882  2.878162
4                 Amyan 1.72 UCI_Elo 1500  31.390036  2.992734
5             Rhetoric 1.4.3 UCI_Elo 1500  29.043300  2.709709
6                 Ufim v8.02 UCI_Elo 1500  28.067483  2.724383
7                   Wasp 4.5 UCI_Elo 1500  24.020628  2.712146
8               Stockfish 13 UCI_Elo 1500  23.431118  2.894256
9              Rodent IV 022 UCI_Elo 1500  22.783380  2.506261
10               CT800 V1.42 UCI_Elo 1500  22.713217  2.746842
11                Cheng 4.40 UCI_Elo 1500  22.376597  2.520873
12               Arasan 21.1 UCI_Elo 1500  18.383414  2.780081
13  Deuterium v2021.1.38.118 UCI_Elo 1500  17.977033  2.546783
14                 Hiarcs 14 UCI_Elo 1500  16.576647  2.735936
15   Deuterium v2019.2.37.73 UCI_Elo 1500  13.339400  3.455419

Ranking with bayeselo
ResultSet>readpgn tmp_ucielo1500.pgn
105 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm
00:00:00,00
ResultSet-EloRating>ratings

Code: Select all

Rank Name                                    Elo    +    - games score oppo. draws
   1 Spike 1.4 UCI_Elo 1500                  477  261  261    14  100%   -34    0%
   2 Cheese 2.2 UCI_Elo 1500                 373  220  220    14   93%   -27    0%
   3 Minic 2.51 UCI_Elo 1500                 226  192  192    14   79%   -16    0%
   4 Amyan 1.72 UCI_Elo 1500                 217  190  190    14   82%   -15    7%
   5 Ufim v8.02 UCI_Elo 1500                 101  175  175    14   64%    -7    0%
   6 Rhetoric 1.4.3 UCI_Elo 1500              77  168  168    14   61%    -6    7%
   7 Wasp 4.5 UCI_Elo 1500                   -37  166  166    14   46%     3    7%
   8 Stockfish 13 UCI_Elo 1500               -57  169  169    14   43%     4    0%
   9 Rodent IV 022 UCI_Elo 1500              -63  163  163    14   43%     4   14%
  10 Cheng 4.40 UCI_Elo 1500                 -80  165  165    14   43%     6   14%
  11 CT800 V1.42 UCI_Elo 1500               -143  168  168    14   29%    10   14%
  12 Deuterium v2021.1.38.118 UCI_Elo 1500  -208  181  181    14   25%    15    7%
  13 Arasan 21.1 UCI_Elo 1500               -228  172  172    14   21%    16   14%
  14 Hiarcs 14 UCI_Elo 1500                 -290  181  181    14   14%    21   14%
  15 Deuterium v2019.2.37.73 UCI_Elo 1500   -364  213  213    14    7%    26    0%
Ordo output is not shown because there is a perfect score from Spike 1.4 UCI_Elo 1500.

The top 4; ranks 7, 8 and 9; and the last 2 are the same for trueskill and bayeselo.

I am using this trueskill library.
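A side note on reading the mu/sigma pairs: TrueSkill write-ups commonly collapse them into a single conservative leaderboard value, mu - 3 * sigma, so that highly uncertain ratings do not rank above well-established ones. This is only an optional extra on top of the script below, not something the script itself does.

Code: Select all

from trueskill import Rating

def conservative_rating(r: Rating) -> float:
    # A value the player's true skill is unlikely to fall below:
    # the mean minus three standard deviations.
    return r.mu - 3.0 * r.sigma

# Example: rank the tsplayers dict built in rating_list() by this value
# instead of by mu alone.
# ranked = sorted(tsplayers.items(),
#                 key=lambda kv: conservative_rating(kv[1]), reverse=True)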

Code to process pgn file

Code: Select all

"""
trueskill_rating.py

Read pgn from a single round-robin tournament and create trueskill list.

Limitation:
    Can only handle results from a single round-robin tournament.

Requirements:
    * Install Python 3.8 or higher
    * Install trueskill
        pip install trueskill
    * Install pandas
        pip install pandas
    * Install python chess
        pip install chess
"""

from trueskill import Rating, rate_1vs1
import chess
import chess.pgn
import pandas as pd


def get_players(pgnfn):
    """
    Get a list of player names.
    """
    players = []
    with open(pgnfn) as h:
        while True:
            game = chess.pgn.read_game(h)
            if game is None:
                break
            wp = game.headers['White']
            bp = game.headers['Black']
            players.append(wp)
            players.append(bp)
    
    return list(set(players))


def rating_list(pgnfn):
    """
    * Read players and define player objects.
    * Read game results one at a time and update player object.
    * Create a rating list.
    """
    
    players = get_players(pgnfn)

    # Build a dict of trueskill players.
    tsplayers = {}
    for p in players:
        tsplayers.update({p: Rating()})

    print('Initial rating')
    for p, r in tsplayers.items():
        print(f'name: {p}, data: {r}')
    print()

    # Update ratings.
    with open(pgnfn) as h:
        while True:
            game = chess.pgn.read_game(h)
            if game is None:
                break
            wp = game.headers['White']
            bp = game.headers['Black']
            res = game.headers['Result']

            if res == '1-0':
                # Get the current ratings of both players.
                wr, br = tsplayers[wp], tsplayers[bp]

                # Send result.
                wr, br = rate_1vs1(wr, br)

                # Update player data.
                tsplayers.update({wp: wr})
                tsplayers.update({bp: br})

            elif res == '0-1':
                br, wr = tsplayers[bp], tsplayers[wp]
                br, wr = rate_1vs1(br, wr)
                tsplayers.update({wp: wr})
                tsplayers.update({bp: br})

            elif res == '1/2-1/2':
                wr, br = tsplayers[wp], tsplayers[bp]
                wr, br = rate_1vs1(wr, br, drawn=True)
                tsplayers.update({wp: wr})
                tsplayers.update({bp: br})

    # Show in console.
    fplayers = []
    for n, r in tsplayers.items():
        name = n
        rating = r.mu
        sigma = r.sigma
        fplayers.append({'name': name, 'rating': rating, 'sigma': sigma})

    print('Updated rating list')
    df = pd.DataFrame(fplayers)
    df = df.sort_values(by=['rating', 'sigma'], ascending=[False, True])
    df = df.reset_index(drop=True)
    df.index += 1
    print(df.to_string())


def main():
    pgnfn = 'tmp_ucielo1500.pgn'
    rating_list(pgnfn)


if __name__ == "__main__":
    main()

lkaufman
Posts: 6260
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: BayesianElo or Ordo ?

Post by lkaufman »

Ferdy wrote: Mon Oct 18, 2021 7:01 am
Comparison without draws at 76% score rate.

Image
So it seems the effect of draws on BayesElo is actually twice as large as I thought. Given 1000+ game samples, with no draws BayesElo actually spreads ratings by 20%, whereas with maximal draws it contracts them by 20%. So this should mean that at the low end (especially the blitz list), CCRL ratings (which use BayesElo) may actually be expanded, while at the high end they are surely contracted. Scaling can't fix that.
Komodo rules!
ArthurKanishov
Posts: 3
Joined: Wed Oct 20, 2021 5:16 pm
Full name: Arthur Kanishov

Re: BayesianElo or Ordo ?

Post by ArthurKanishov »

Ferdy wrote: Tue Oct 19, 2021 4:10 am

Hi Ferdy,

Apologies for connecting with you through a reply to a random post; unfortunately I do not yet have a high enough post count on this forum to respond directly to the PM you sent me about another thread (a thread I made on cheat detection development).

I'd love to see if there's a way we can collaborate. A friend and I are building a chess platform and want to nail down cheat detection; we think we have a good shot at it, but we need help from someone with expertise in chess, statistics, and some programming knowledge.

What's the best way to get in touch with you so that we're not just chatting over forums? Can you PM me an email address, Discord, LinkedIn, or phone number where I can reach you?