You seem to be suggesting that with enough games, the compression disappears. But I think it has to do with draws. If A plays one million games with B and scores 76%, under the Elo system (Ordo or even Elostat, it doesn't matter) A will end up 200 above B, regardless of whether that was 760,000 wins to 240,000 losses or 520,000 wins and 480,000 draws (or anything in between). But with BayesElo, the rating gap will be less than 200 by an amount that depends on the number of draws, unless I am badly mistaken. Am I wrong?

Ferdy wrote: ↑Sun Oct 17, 2021 12:32 pm
A sample scenario where bayeselo is, I believe, better than ordo.
Code: Select all
A scored 2.5/3 vs B, A has 3 whites
B scored 0.5/1 vs C, B has 1 white
C scored 4.5/5 vs D, C has 5 whites

Ordo anchored at 0.
Code: Select all
Head to head statistics:

1) C  163 :     6 (+4,=2,-0), 83.3 %
   vs.    : games (  +,  =,  -),  (%) :  Diff,  SD, CFS (%)
   D      :     5 (  4,  1,  0), 90.0 :  +110,  59,   96.9
   B      :     1 (  0,  1,  0), 50.0 :  +275,  63,  100.0

2) D   53 :     5 (+0,=1,-4), 10.0 %
   vs.    : games (  +,  =,  -),  (%) :  Diff,  SD, CFS (%)
   C      :     5 (  0,  1,  4), 10.0 :  -110,  59,    3.1

3) A -104 :     3 (+2,=1,-0), 83.3 %
   vs.    : games (  +,  =,  -),  (%) :  Diff,  SD, CFS (%)
   B      :     3 (  2,  1,  0), 83.3 :    +7,  67,   54.4

4) B -112 :     4 (+0,=2,-2), 25.0 %
   vs.    : games (  +,  =,  -),  (%) :  Diff,  SD, CFS (%)
   C      :     1 (  0,  1,  0), 50.0 :  -275,  63,    0.0
   A      :     3 (  0,  1,  2), 16.7 :    -7,  67,   45.6
Options: -W -D -G
Code: Select all
   # PLAYER : RATING  ERROR  POINTS  PLAYED    (%)  W  L
   1 C      :    163     86     5.0       6   83.3  4  0
   2 D      :     53     92     0.5       5   10.0  0  4
   3 A      :   -104    101     2.5       3   83.3  2  0
   4 B      :   -112     86     1.0       4   25.0  0  2
Ordo anchored at 0.
Code: Select all
   # PLAYER : RATING  ERROR  POINTS  PLAYED    (%)  W  L
   1 A      :    308    154     2.5       3   83.3  2  0
   2 C      :     26     68     5.0       6   83.3  4  0
   3 B      :     26     68     1.0       4   25.0  0  2
   4 D      :   -359    140     0.5       5   10.0  0  4
Bayeselo:
advantage of playing first = 0 (default)
drawelo = 0 (default)
I like the ranking of bayeselo here: "A" defeated "B", and "B" was able to hold "C" to a draw once.
Code: Select all
Rank Name   Elo    +    - games score oppo. draws
   1 A      108  219  181     3   83%     6   33%
   2 C       27  168  141     6   83%  -117   33%
   3 B        6  155  174     4   25%    88   50%
   4 D     -141  158  210     5   10%    27   20%
Bayeselo:
advantage of playing first = 33
Code: Select all
Rank Name   Elo    +    - games score oppo. draws
   1 A       57  221  183     3   83%   -16   33%
   2 C       47  170  142     6   83%   -76   33%
   3 B      -16  157  176     4   25%    55   50%
   4 D      -88  160  212     5   10%    47   20%
The so-called Elo compression in bayeselo amounts to favoring a pessimistic prior. When testing players x and y, they are initially assumed to be close in strength; as the actual results come in, the ratings are adjusted away from that assumption. To establish which is best, one has to show more wins over as many games as possible to get away from the initial assumption. I prefer this behavior.
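To make the draw effect concrete, here is a minimal Python sketch (not bayeselo itself) of the kind of win/draw/loss model bayeselo is built on: a draw band of width draw_elo sits between the win and loss probabilities, so the maximum-likelihood gap for a fixed score shrinks as the share of draws grows. The draw_elo value of 97 and the simple grid search are illustrative assumptions only; bayeselo also fits the draw and first-move-advantage parameters from the data, with a prior, so its exact numbers will differ.

Code: Select all

import math

def f(x):
    """Logistic expectation on the Elo scale."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def log_likelihood(delta, wins, draws, losses, draw_elo=97.0):
    """Log-likelihood of a W/D/L record for a rating gap 'delta'."""
    p_win = f(delta - draw_elo)
    p_loss = f(-delta - draw_elo)
    p_draw = max(1.0 - p_win - p_loss, 1e-12)
    ll = 0.0
    for count, p in ((wins, p_win), (draws, p_draw), (losses, p_loss)):
        if count:
            ll += count * math.log(p)
    return ll

def fitted_gap(wins, draws, losses):
    """Grid-search the maximum-likelihood gap over 0..600 Elo."""
    return max(range(601), key=lambda d: log_likelihood(d, wins, draws, losses))

# Same 76% score out of 1000 games, different draw shares:
print(fitted_gap(760, 0, 240))    # all decisive games -> larger fitted gap
print(fitted_gap(520, 480, 0))    # about half the games drawn -> smaller fitted gap
# Plain logistic (Ordo/Elostat-style) conversion of the same score:
print(round(400 * math.log10(0.76 / 0.24)))   # about 200, independent of draws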
-
lkaufman
- Posts: 6260
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: BayesianElo or Ordo ?
Komodo rules!
-
Ferdy
- Posts: 4851
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: BayesianElo or Ordo ?
lkaufman wrote: ↑Sun Oct 17, 2021 5:52 pm
But with BayesElo, the rating gap will be less than 200 by an amount depending on the number of draws, unless I am badly mistaken. Am I wrong?

With more draws the Bayeselo gap decreases, but only up to a certain limit.
Score rate at 76% with draws.
Code: Select all
Rank Name   Elo    +    - games score oppo. draws
   1 A       79   18   18  1000   76%   -79   48%
   2 B      -79   18   18  1000   24%    79   48%

More games and more draws, but the score rate is the same. The rating gap stays the same, only with a tighter confidence interval. To increase the gap, more wins are needed and draws have to be minimized.
Code: Select all
Rank Name   Elo    +    - games score oppo. draws
   1 A       80    6    6 10000   76%   -80   48%
   2 B      -80    6    6 10000   24%    80   48%

By the way, the bayeselo program has a command to change the scale of the ratings, thereby changing the gap to agree with Ordo and Elo.
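As a rough illustration of why the error bars shrink while the central gap stays put, one can convert the score and its sampling error to the Elo scale with a normal approximation. This sketch uses the plain logistic scale (so the gap comes out near 200 rather than bayeselo's compressed value), and bayeselo's +/- columns come from its own likelihood, so only the roughly 1/sqrt(games) shrinkage should be expected to match.

Code: Select all

import math

def elo_gap(score):
    """Ordo/Elostat-style logistic conversion of a score fraction to Elo."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_error(win_frac, draw_frac, games):
    """Approximate 1-sigma error of that gap from the W/D/L mix and game count."""
    score = win_frac + 0.5 * draw_frac
    var_per_game = win_frac + 0.25 * draw_frac - score ** 2
    se_score = math.sqrt(var_per_game / games)
    # Delta method: d(elo)/d(score) = 400 / (ln 10 * score * (1 - score))
    return 400.0 / (math.log(10) * score * (1.0 - score)) * se_score

for n in (1000, 10000):
    # Gap stays the same; error shrinks by about sqrt(10) when games go x10.
    print(n, round(elo_gap(0.76)), round(elo_error(0.52, 0.48, n), 1))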
-
lkaufman
- Posts: 6260
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: BayesianElo or Ordo ?
Ferdy wrote: ↑Sun Oct 17, 2021 8:22 pm
With more draws the Bayeselo gap decreases, but only up to a certain limit.

Thanks. So based on this it seems that once the engines have played a thousand games or so (typical for the rating lists), playing more games doesn't have much effect on the scale of the list, only on the accuracy of the individual engine ratings. If I assume that with no draws BayesElo matches Ordo given large numbers of games (i.e. 76% wins = a 200 Elo gap) (is this assumption correct?), then the conclusion is that the contraction from using BayesElo ranges from nothing with no draws to 20% with no wins for the inferior engine (given enough games). If this is accurate, then rescaling the list would overstate the rating differences at the low end (where there are few draws) while still understating them at the high end (where the weaker engine rarely wins). I think it's just not suitable for the actual situation of rating lists, where the same 76% score will show anywhere from a 160 to a 200 Elo gap (with a lot of games) depending on draw frequency, which in turn depends on engine strength and time limit.
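A back-of-the-envelope check of why a single scale factor cannot fix both ends, using only figures already quoted in this thread (about 200 Elo for a 76% score on the Ordo scale, about 160 on the bayeselo scale when nearly half the games are drawn, and the assumption above that the gap stays near 200 when there are no draws):

Code: Select all

ordo_gap = 200           # 76% score on the Ordo/Elostat scale, any draw rate
bayes_gap_drawish = 160  # 76% with ~48% draws, from Ferdy's bayeselo table
bayes_gap_drawless = 200 # assumption: with no draws the compression vanishes

scale = ordo_gap / bayes_gap_drawish    # factor chosen to fix the drawish end
print(scale * bayes_gap_drawish)        # 200.0 -> the drawish (top) end is now right
print(scale * bayes_gap_drawless)       # 250.0 -> the drawless (bottom) end is overstated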
Komodo rules!
-
Ferdy
- Posts: 4851
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: BayesianElo or Ordo ?
lkaufman wrote: ↑Sun Oct 17, 2021 9:48 pm
If I assume that with no draws BayesElo matches Ordo given large numbers of games (i.e. 76% wins = a 200 Elo gap) (is this assumption correct?) ...

Comparison without draws at a 76% score rate:
[image: comparison without draws at a 76% score rate]
-
Ferdy
- Posts: 4851
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: BayesianElo or Ordo ?
CMCanavessi wrote: ↑Sat Oct 16, 2021 9:06 pm
What about Glicko, Glicko-2, Trueskill, etc? I can't even find tools to parse a .pgn file with those.

Rating gap comparison with Glicko-2. If the RD is high, the rating gap is also high.
[image: Glicko-2 rating gap comparison]
The 8-game run is one rating period starting from r=1500, rd=50; the 16-game and longer runs are the same, they also started from r=1500, rd=50.
I use the glicko2 module from here.
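For anyone curious about the mechanism behind that, here is a minimal Glicko-1 style update (the simpler predecessor of Glicko-2, without the volatility term, and not the glicko2 module used for the chart above) just to illustrate why a larger starting RD lets the rating move much further over one rating period. The opponents and results below are made-up illustrative values.

Code: Select all

import math

Q = math.log(10) / 400.0

def g(rd):
    """Attenuation factor from the opponent's rating deviation."""
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd) ** 2 / math.pi ** 2)

def expected(r, r_opp, rd_opp):
    """Expected score against one opponent."""
    return 1.0 / (1.0 + 10.0 ** (-g(rd_opp) * (r - r_opp) / 400.0))

def glicko1_update(r, rd, games):
    """games: list of (opponent_rating, opponent_rd, score) for one period."""
    d2_inv = 0.0
    delta_sum = 0.0
    for r_opp, rd_opp, score in games:
        e = expected(r, r_opp, rd_opp)
        d2_inv += (Q ** 2) * (g(rd_opp) ** 2) * e * (1.0 - e)
        delta_sum += g(rd_opp) * (score - e)
    denom = 1.0 / rd ** 2 + d2_inv
    new_r = r + (Q / denom) * delta_sum
    new_rd = math.sqrt(1.0 / denom)
    return new_r, new_rd

# One 8-game period: 6 wins and 2 draws against 1500-rated opponents (RD 50).
period = [(1500.0, 50.0, 1.0)] * 6 + [(1500.0, 50.0, 0.5)] * 2
for start_rd in (50.0, 200.0):
    new_r, new_rd = glicko1_update(1500.0, start_rd, period)
    # A larger starting RD produces a much bigger rating move from the same games.
    print(f'start RD {start_rd:5.1f} -> new rating {new_r:7.1f}, new RD {new_rd:5.1f}')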
-
Ferdy
- Posts: 4851
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: BayesianElo or Ordo ?
Ferdy wrote: ↑Mon Oct 18, 2021 4:51 pm
Rating gap comparison with Glicko-2. If the RD is high, the rating gap is also high.
Using TrueSkill from a single round-robin tournament
Initial rating

The sigma or rating deviation indicates the degree of uncertainty, and mu is the mean skill or rating.
Code: Select all
name: Cheng 4.40 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Spike 1.4 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Wasp 4.5 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Hiarcs 14 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Stockfish 13 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Cheese 2.2 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: CT800 V1.42 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Arasan 21.1 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Minic 2.51 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Amyan 1.72 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Deuterium v2021.1.38.118 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Ufim v8.02 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Rhetoric 1.4.3 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Deuterium v2019.2.37.73 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)
name: Rodent IV 022 UCI_Elo 1500, data: trueskill.Rating(mu=25.000, sigma=8.333)

TrueSkill updated rating list after the tournament
Code: Select all
name rating sigma
1 Spike 1.4 UCI_Elo 1500 39.468082 3.877891
2 Cheese 2.2 UCI_Elo 1500 37.579741 3.237980
3 Minic 2.51 UCI_Elo 1500 31.911882 2.878162
4 Amyan 1.72 UCI_Elo 1500 31.390036 2.992734
5 Rhetoric 1.4.3 UCI_Elo 1500 29.043300 2.709709
6 Ufim v8.02 UCI_Elo 1500 28.067483 2.724383
7 Wasp 4.5 UCI_Elo 1500 24.020628 2.712146
8 Stockfish 13 UCI_Elo 1500 23.431118 2.894256
9 Rodent IV 022 UCI_Elo 1500 22.783380 2.506261
10 CT800 V1.42 UCI_Elo 1500 22.713217 2.746842
11 Cheng 4.40 UCI_Elo 1500 22.376597 2.520873
12 Arasan 21.1 UCI_Elo 1500 18.383414 2.780081
13 Deuterium v2021.1.38.118 UCI_Elo 1500 17.977033 2.546783
14 Hiarcs 14 UCI_Elo 1500 16.576647 2.735936
15 Deuterium v2019.2.37.73 UCI_Elo 1500 13.339400 3.455419

Ranking with bayeselo
ResultSet>readpgn tmp_ucielo1500.pgn
105 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm
00:00:00,00
ResultSet-EloRating>ratings

Ordo is not shown because there is a perfect player in Spike 1.4 UCI_Elo 1500.
Code: Select all
Rank Name Elo + - games score oppo. draws
1 Spike 1.4 UCI_Elo 1500 477 261 261 14 100% -34 0%
2 Cheese 2.2 UCI_Elo 1500 373 220 220 14 93% -27 0%
3 Minic 2.51 UCI_Elo 1500 226 192 192 14 79% -16 0%
4 Amyan 1.72 UCI_Elo 1500 217 190 190 14 82% -15 7%
5 Ufim v8.02 UCI_Elo 1500 101 175 175 14 64% -7 0%
6 Rhetoric 1.4.3 UCI_Elo 1500 77 168 168 14 61% -6 7%
7 Wasp 4.5 UCI_Elo 1500 -37 166 166 14 46% 3 7%
8 Stockfish 13 UCI_Elo 1500 -57 169 169 14 43% 4 0%
9 Rodent IV 022 UCI_Elo 1500 -63 163 163 14 43% 4 14%
10 Cheng 4.40 UCI_Elo 1500 -80 165 165 14 43% 6 14%
11 CT800 V1.42 UCI_Elo 1500 -143 168 168 14 29% 10 14%
12 Deuterium v2021.1.38.118 UCI_Elo 1500 -208 181 181 14 25% 15 7%
13 Arasan 21.1 UCI_Elo 1500 -228 172 172 14 21% 16 14%
14 Hiarcs 14 UCI_Elo 1500 -290 181 181 14 14% 21 14%
  15 Deuterium v2019.2.37.73 UCI_Elo 1500  -364 213 213    14    7%    26    0%

The top 4, ranks 7, 8 and 9, and the last 2 are the same in the TrueSkill and bayeselo lists.
I am using this trueskill library.
Code to process pgn file
Code: Select all
"""
trueskill_rating.py
Read pgn from a single round-robin tournament and create trueskill list.
Limitation:
Can only handle a single round-robin results.
Requirements:
* Install Python 3.8 or higher
* Install trueskill
pip install trueskill
* Install pandas
pip install pandas
* Install python chess
pip install chess
"""
from trueskill import Rating, rate_1vs1
import chess
import chess.pgn
import pandas as pd
def get_players(pgnfn):
"""
Get a list of player names.
"""
players = []
with open(pgnfn) as h:
while True:
game = chess.pgn.read_game(h)
if game is None:
break
wp = game.headers['White']
bp = game.headers['Black']
players.append(wp)
players.append(bp)
return list(set(players))
def rating_list(pgnfn):
"""
* Read players and define player objects.
* Read game results one at a time and update player object.
* Create a rating list.
"""
players = get_players(pgnfn)
# Build a dict of trueskill players.
tsplayers = {}
for p in players:
tsplayers.update({p: Rating()})
print('Initial rating')
for p, r in tsplayers.items():
print(f'name: {p}, data: {r}')
print()
# Update ratings.
with open(pgnfn) as h:
while True:
game = chess.pgn.read_game(h)
if game is None:
break
wp = game.headers['White']
bp = game.headers['Black']
res = game.headers['Result']
if res == '1-0':
# define
wr, br = tsplayers[wp], tsplayers[bp]
# Send result.
wr, br = rate_1vs1(wr, br)
# Update player data.
tsplayers.update({wp: wr})
tsplayers.update({bp: br})
elif res == '0-1':
br, wr = tsplayers[bp], tsplayers[wp]
br, wr = rate_1vs1(br, wr)
tsplayers.update({wp: wr})
tsplayers.update({bp: br})
elif res == '1/2-1/2':
wr, br = tsplayers[wp], tsplayers[bp]
wr, br = rate_1vs1(wr, br, drawn=True)
tsplayers.update({wp: wr})
tsplayers.update({bp: br})
# Show in console.
fplayers = []
for n, r in tsplayers.items():
name = n
rating = r.mu
sigma = r.sigma
fplayers.append({'name': name, 'rating': rating, 'sigma': sigma})
print('Updated rating list')
df = pd.DataFrame(fplayers)
df = df.sort_values(by=['rating', 'sigma'], ascending=[False, True])
df = df.reset_index(drop=True)
df.index += 1
print(df.to_string())
def main():
pgnfn = 'tmp_ucielo1500.pgn'
rating_list(pgnfn)
if __name__ == "__main__":
main()
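One possible follow-up to the script above (not part of the original): TrueSkill leaderboards are often sorted by the conservative estimate mu - 3*sigma rather than by mu alone, so an entry is only ranked highly once its uncertainty has come down. A small helper along these lines (the function and column names are my own) could be appended to the script:

Code: Select all

import pandas as pd

def conservative_ranking(df: pd.DataFrame) -> pd.DataFrame:
    """Re-rank a frame with 'rating' (mu) and 'sigma' columns by mu - 3*sigma."""
    out = df.copy()
    out['conservative'] = out['rating'] - 3 * out['sigma']
    out = out.sort_values(by='conservative', ascending=False)
    out = out.reset_index(drop=True)
    out.index += 1
    return out

Calling print(conservative_ranking(df).to_string()) right after the existing print in rating_list() would show that ordering below the mu-sorted one.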
-
lkaufman
- Posts: 6260
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: BayesianElo or Ordo ?
Ferdy wrote: ↑Mon Oct 18, 2021 7:01 am
Comparison without draws at a 76% score rate.

So it seems the effect of draws on BayesElo is actually twice as large as I thought. Given samples of 1000+ games, with no draws BayesElo actually spreads the ratings by 20%, whereas with maximal draws it contracts them by 20%. So this should mean that at the low end (especially on the blitz list), CCRL ratings (which use BayesElo) may actually be expanded, while at the high end they are surely contracted. Scaling can't fix that.
Komodo rules!
-
ArthurKanishov
- Posts: 3
- Joined: Wed Oct 20, 2021 5:16 pm
- Full name: Arthur Kanishov
Re: BayesianElo or Ordo ?
Ferdy wrote: ↑Tue Oct 19, 2021 4:10 am
Using TrueSkill from a single round-robin tournament ...

Hi Ferdy,
Apologies for connecting with you via a reply to a random post; unfortunately I do not yet have a high enough post count on this forum to respond directly to the PM you sent me about another thread (a thread I made on cheat detection development).

I'd love to see if there's a way we can collaborate. A friend and I are building a chess platform and want to nail down cheat detection; we think we have a good shot at it, but we need help from someone with expertise in chess, statistics, and some programming.

What's the best way to get in touch with you so we're not just chatting over the forums? Could you PM me an email address, a Discord handle, a LinkedIn profile, or a phone number I can text?