With the help of Mark Watkins, I can now present the solution to the question of why the IPON ratings for top engines keep coming out significantly lower than the average of the performance ratings.
The answer is pretty clear. BayesElo (which IPON uses) has a parameter called "drawelo". I assume (unless Ingo says otherwise) that IPON uses the default value. This default value was taken from a study of a database with a rather low percentage of draws. The frequency of draws in the IPON games is substantially higher than the frequency that is implied by the default value. The drawelo parameter would have to be much higher to reflect the actual draw percentage in the IPON data. The consequence of using a too-low value for "drawelo" is that the ratings get contracted towards the mean. This is exactly what we are observing. Mystery solved!
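A rough sketch of how a drawelo value maps to an implied draw rate between equal players. It assumes the draw model described later in this thread (draw probability F(x+d) - F(x-d) with a logistic F) and ignores any white-advantage term. The function names are hypothetical and the draw rates are illustrative, not IPON's measured figures or BayesElo's actual default.

```python
import math

def logistic(x):
    """Expected score for a rating advantage of x Elo (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def draw_rate_equal(drawelo):
    """Draw probability between two equal players: F(d) - F(-d)."""
    return logistic(drawelo) - logistic(-drawelo)

def drawelo_for_rate(p_draw):
    """Invert draw_rate_equal: the drawelo that yields draw rate p_draw."""
    return 400.0 * math.log10((1.0 + p_draw) / (1.0 - p_draw))

# Illustrative draw rates only (assumptions, not taken from any database).
for p in (0.25, 0.35, 0.45):
    print(f"draw rate {p:.0%} -> drawelo about {drawelo_for_rate(p):.0f}")
```

Under these assumptions, a data set with more draws calls for a noticeably larger drawelo, which is the mismatch described above.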
This means that all the talk about performance ratings being meaningless or "prior" being important (I'm guilty on that one) was nonsense. If "drawelo" actually matched the data, averaging the performance ratings would come quite close to predicting the final rating.
So now the question is whether IPON should "fix" the problem by using a drawelo value that corresponds to the data. It's a matter of opinion, but my opinion would be to leave things as they are. The reasons for this are twofold:
1. BayesElo and EloStat would be much farther apart in general if not for the drawelo problem. EloStat compresses the ratings for a completely different reason, having to do with the incorrectness of averaging ratings. Purely by coincidence, the compression of BayesElo ratings caused by using the drawelo default is roughly the same as the compression caused by EloStat (given the IPON data), so this "error" avoids huge disparities in general.
2. Engine vs. engine ratings overstate rating differences in terms of how the engines would perform against humans. The artificial compression caused by using the drawelo default accidentally makes the ratings more realistic (relative to one another) in terms of how the engines would perform against the top human players.
So, although the use of the default is an "error", I say "leave it alone"! Presumably the above also applies to CCRL and to any rating groups that use BayesElo with default values.
Larry
The IPON BayesElo mystery solved.
Re: The IPON BayesElo mystery solved.
There is no mystery at all.
During an IPON-run you see the following (example):
Engine A vs Engine B (ELO 2755) 60.0-40.0 perf=2825
Engine A vs Engine C (ELO 2705) 82.0-18.0 perf=2968
Engine A vs Engine D (ELO 2815) 62.0-38.0 perf=2900
Engine A vs Engine E (ELO 2800) 65.0-35.0 perf=2908
Engine A vs Engine F (ELO 2680) 85.0-15.0 perf=2981
If you made the blunder to calculate:
(2825+2968+2900+2908+2981) / 5
you will get ELO 2916.
But the correct calculation is:
ELO (2755+2705+2815+2800+2680) / 5 = ELO 2751 (average)
Perf. = 60+82+62+65+85 = 354 out of 500 games = 70.8% = ELO 2905.
The more results above ~80% (or below ~20%) you get in such matches,
the larger the discrepancy will be.
Best wishes,
G.S.
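The two calculations in this post can be reproduced with a short sketch. It assumes the standard logistic Elo formula (rating difference = 400 * log10(p / (1 - p))), which matches the perf figures quoted above; the function and variable names are hypothetical.

```python
import math

def perf(opp_elo, score_fraction):
    """Performance rating against one opponent rating (logistic Elo)."""
    return opp_elo + 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

# (opponent Elo, points scored out of 100 games), taken from the example.
matches = [(2755, 60.0), (2705, 82.0), (2815, 62.0), (2800, 65.0), (2680, 85.0)]

# Method 1: average the five per-match performance ratings.
per_match = [perf(elo, pts / 100.0) for elo, pts in matches]
avg_of_perfs = sum(per_match) / len(per_match)

# Method 2: pool all games against the average opponent rating.
avg_opp = sum(elo for elo, _ in matches) / len(matches)
total_points = sum(pts for _, pts in matches)
pooled = perf(avg_opp, total_points / (100.0 * len(matches)))

print(f"average of performances: {avg_of_perfs:.1f}")
print(f"pooled performance:      {pooled:.1f}")
```

The first number differs from the post's 2916 by a fraction of a point only because the post averages already-rounded perf values.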
Re: The IPON BayesElo mystery solved.
ThatsIt wrote:
There is no mystery at all.
During an IPON-run you see the following (example):
Engine A vs Engine B (ELO 2755) 60.0-40.0 perf=2825
Engine A vs Engine C (ELO 2705) 82.0-18.0 perf=2968
Engine A vs Engine D (ELO 2815) 62.0-38.0 perf=2900
Engine A vs Engine E (ELO 2800) 65.0-35.0 perf=2908
Engine A vs Engine F (ELO 2680) 85.0-15.0 perf=2981
If you made the blunder to calculate:
(2825+2968+2900+2908+2981) / 5
you will get ELO 2916.
But the correct calculation is:
ELO (2755+2705+2815+2800+2680) / 5 = ELO 2751 (average)
Perf. = 60+82+62+65+85 = 354 out of 500 games = 70.8% = ELO 2905.
The more results above ~80% (or below ~20%) you get in such matches,
the larger the discrepancy will be.
Best wishes,
G.S.

The last calculation is not correct.
Scoring 354 out of 500 against a fixed rating of 2751 is a performance of 2905.
When your opponents are at different levels, it is clearly harder to score 354 out of 500.
For example, imagine that 300 games are against an opponent rated 2905 and 200 games are against an opponent rated 2520, so the average of the opponents is 2751.
A 2905 player is expected to score 150 out of 300 in the first 300 games,
and it is impossible for him to score 204 out of 200 in the rest of the games.
This means that, in practice, 354 out of 500 against the opponents in my example is a performance that is clearly higher than 2905.
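The example above can be checked numerically. This is a minimal sketch assuming the logistic Elo expectation; it finds, by bisection, the rating whose expected total against the mixed field equals 354 points. The helper names are hypothetical.

```python
def expected(diff):
    """Expected score per game for a rating advantage of diff Elo."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def expected_points(rating, field):
    """Expected total points against a field of (opponent Elo, games) pairs."""
    return sum(games * expected(rating - opp) for opp, games in field)

def performance_rating(points, field, lo=0.0, hi=6000.0):
    """Rating whose expected total equals `points` (bisection search)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_points(mid, field) < points:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# The field from the example: average opponent rating is 2751.
field = [(2905, 300), (2520, 200)]

print(f"expected points for a 2905 player: {expected_points(2905, field):.1f}")
print(f"rating needed to expect 354/500:   {performance_rating(354, field):.0f}")
```

Under these assumptions a 2905 player expects well under 354 points against this field, so the rating that fits 354/500 comes out clearly above 2905.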
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: The IPON BayesElo mystery solved.
ThatsIt wrote:
There is no mystery at all.
During an IPON-run you see the following (example):
Engine A vs Engine B (ELO 2755) 60.0-40.0 perf=2825
Engine A vs Engine C (ELO 2705) 82.0-18.0 perf=2968
Engine A vs Engine D (ELO 2815) 62.0-38.0 perf=2900
Engine A vs Engine E (ELO 2800) 65.0-35.0 perf=2908
Engine A vs Engine F (ELO 2680) 85.0-15.0 perf=2981
If you made the blunder to calculate:
(2825+2968+2900+2908+2981) / 5
you will get ELO 2916.
But the correct calculation is:
ELO (2755+2705+2815+2800+2680) / 5 = ELO 2751 (average)
Perf. = 60+82+62+65+85 = 354 out of 500 games = 70.8% = ELO 2905.
The more results above ~80% (or below ~20%) you get in such matches,
the larger the discrepancy will be.
Best wishes,
G.S.

LOL
You are adding some apples and oranges there, then divide them by grapefruits, getting the "correct" and "fruitless" number. Maybe I will try to do some simple cases rigorously using trinomials with "Mathematica", to see what happens, but what Larry said makes some sense.
Kai
Re: The IPON BayesElo mystery solved.
Hi Uri!
The issue is the discrepancy, no more, no less!
And the calc is correct:
Wins = 300
Draws = 108
Losses = 92
Av.Op. Elo = 2751
Result : 354.0/500 (+300,=108,-92)
Perf. : 70.8 %
Margins :
68 % : (+ 1.7,- 1.8 %) -> [ 69.0, 72.5 %]
95 % : (+ 3.3,- 3.5 %) -> [ 67.3, 74.1 %]
99.7 % : (+ 5.0,- 5.4 %) -> [ 65.4, 75.8 %]
Elo : 2905
Margins :
68 % : (+ 15,- 15) -> [2890,2919]
95 % : (+ 29,- 29) -> [2876,2934]
99.7 % : (+ 44,- 43) -> [2862,2949]
@ Kai = no need to answer such a post.
Best wishes,
G.S.
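The margins in this output can be approximated with a simple normal-approximation sketch. This is an assumption about the method: the tool that produced the figures above is not named, and its slightly asymmetric score intervals suggest it does something a little different. The function names are hypothetical.

```python
import math

def elo_from_score(p):
    """Elo difference implied by a score fraction p (logistic Elo)."""
    return 400.0 * math.log10(p / (1.0 - p))

def score_stats(wins, draws, losses):
    """Mean score per game and its standard error (win=1, draw=0.5, loss=0)."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    second_moment = (wins + 0.25 * draws) / n
    se = math.sqrt((second_moment - mean ** 2) / n)
    return mean, se

def elo_interval(avg_opp, mean, se, z):
    """Elo interval obtained by mapping mean -/+ z*se through the Elo curve."""
    return (avg_opp + elo_from_score(mean - z * se),
            avg_opp + elo_from_score(mean + z * se))

mean, se = score_stats(300, 108, 92)
print(f"score {mean:.1%}, standard error {se:.2%}")
for label, z in (("68 %", 1.0), ("95 %", 2.0), ("99.7 %", 3.0)):
    lo, hi = elo_interval(2751, mean, se, z)
    print(f"{label:>6}: [{lo:.0f}, {hi:.0f}]")
```

Note that the draw count enters the standard error: more draws at the same total score give a narrower interval.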
Re: The IPON BayesElo mystery solved.
Kai Laskos wrote:
You are adding some apples and oranges there, then divide them by grapefruits, getting the "correct" and "fruitless" number. Maybe I will try to do some simple cases rigorously using trinomials with "Mathematica", to see what happens, but what Larry said makes some sense.
Kai

You are the typical guy speaking about something you think you understand, and embarrassing yourself while doing it. Are you a politician?
@Uri:
You have a point, but because the number of games is always the same against all engines, there is no problem in averaging all the opponents' ELO (not the performance!) and then combining it with the results to get the overall performance Elo.
I don't know BayesElo and I don't know that "drawelo" ... probably there is a mistake in the algorithm. But the most common problem is people averaging performance.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: The IPON BayesElo mystery solved.
IGarcia wrote:
Kai Laskos wrote:
You are adding some apples and oranges there, then divide them by grapefruits, getting the "correct" and "fruitless" number. Maybe I will try to do some simple cases rigorously using trinomials with "Mathematica", to see what happens, but what Larry said makes some sense.
Kai
You are the typical guy speaking about something you think you understand, and embarrassing yourself while doing it. Are you a politician?
@Uri:
You have a point, but because the number of games is always the same against all engines, there is no problem in averaging all the opponents' ELO (not the performance!) and then combining it with the results to get the overall performance Elo.
I don't know BayesElo and I don't know that "drawelo" ... probably there is a mistake in the algorithm. But the most common problem is people averaging performance.

Sorry, when things are really hard, and I see you or that other guy making a joke of a derivation, I can only say that you are both free to speak here; it's a free forum, for all sorts of folks like you. At least I didn't say it's clear to me what happens.
Kai
Re: The IPON BayesElo mystery solved.
Laskos wrote:
Sorry, when the things are really hard, and I see you or that other guy making a joke of a derivation, I can only say that you are both free to speak here, it's a free forum, for all sorts of folks like you.
Kai

Sure, it's free to speak (write). Still, it is hard to read:

Laskos wrote:
"You are adding some apples and oranges there, then divide them by grapefruits, getting the "correct" and "fruitless" number"

as a correct answer to a problem. Please don't take it personally.
Regards.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: The IPON BayesElo mystery solved.
IGarcia wrote:
Laskos wrote:
Sorry, when the things are really hard, and I see you or that other guy making a joke of a derivation, I can only say that you are both free to speak here, it's a free forum, for all sorts of folks like you.
Kai
Sure, it's free to speak (write). Still, it is hard to read:
Laskos wrote:
"You are adding some apples and oranges there, then divide them by grapefruits, getting the "correct" and "fruitless" number"
as a correct answer to a problem. Please don't take it personally.
Regards.

Why is it hard to read? One cannot simply add numbers there; one has to take convolutions of trinomial distributions and use cumulative distribution functions. You (or that other guy) didn't even give the number of draws, which is important too. I don't know the answer, but I know that even using "Mathematica", the rigorous result is not straightforward.
Kai
-
- Posts: 27986
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: The IPON BayesElo mystery solved.
I don't get it. The drawValue shouldn't affect the ratings in BayesElo, should it? Given a certain rating difference x, one can calculate the probability for a draw, and it will be higher if drawValue is higher (because it will be equal to F(x+drawValue) - F(x-drawValue), where F is the cumulative Elo distribution). But unless drawValue is ridiculously large, the shape of this draw probability distribution is practically independent of it, as the expression is quite accurately approximated by 2*drawValue*(d/dx)F(x), i.e. it is proportional to the bell-shaped Elo curve itself.
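The approximation in this paragraph is easy to check numerically. The sketch below assumes a logistic F on the usual 400-point scale and uses illustrative drawValue settings (not BayesElo's actual default); the function names are hypothetical.

```python
import math

def F(x):
    """Cumulative (logistic) Elo distribution: expected score at difference x."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def dF(x):
    """Derivative of F: the bell-shaped Elo curve."""
    f = F(x)
    return math.log(10.0) / 400.0 * f * (1.0 - f)

def p_draw(x, draw_value):
    """Draw probability at rating difference x: F(x + d) - F(x - d)."""
    return F(x + draw_value) - F(x - draw_value)

def p_draw_approx(x, draw_value):
    """First-order approximation: 2 * d * F'(x)."""
    return 2.0 * draw_value * dF(x)

# Compare the shape (draw probability relative to x = 0) for two settings.
for x in (0, 200, 400, 800):
    shapes = [p_draw(x, d) / p_draw(0, d) for d in (50.0, 100.0)]
    ratio = p_draw(x, 100.0) / p_draw_approx(x, 100.0)
    print(f"x={x:4d}  shape(d=50)={shapes[0]:.3f}  shape(d=100)={shapes[1]:.3f}  exact/approx={ratio:.3f}")
```

With these settings the exact and approximate values stay within a few percent of each other, and the normalized shape changes only modestly when drawValue is doubled.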
In Bayesian analysis, only the shape of the curve is important, and the absolute magnitude is divided out. If it sees there is a draw game, no matter how small the probability of draw games in general, it will always be taken as evidence that the ratings are close (+/- the SD of the Elo curve). That it judges the draw probability quite small factors out.
That BayesElo tends to strongly compress the rating scale with the standard prior (of 2 draws) if the group has a very wide Elo range (e.g. spread over >1000 Elo) is well known. You only need a single game between a top and a bottom engine, and it will never believe that their difference can be anywhere near 1000 Elo, as it counts two draws between them, and this would have an astronomically small probability when the players are >1000 Elo apart, as most Elo models have exponential tails. It would rather believe that all other differences along the scale are mostly due to luck than believe in those two virtual draws between such widely separated players!