The IPON BayesElo mystery solved.

Discussion of computer chess matches and engine tournaments.

Moderators: hgm, Rebel, chrisw

ThatsIt
Posts: 992
Joined: Thu Mar 09, 2006 2:11 pm

Re: The IPON BayesElo mystery solved.

You (or that other guy) didn't even give the number of draws, which is important too.
[...snip...]
Oooh boy, it's not (important).
During the IPON tests, EloStat shows the calculation.
Perhaps you should read the IPON conditions a bit more carefully from now on?
lkaufman
Posts: 5981
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: The IPON BayesElo mystery solved.

hgm wrote:I don't get it. The drawValue shouldn't affect the ratings in BayesElo, should it? Given a certain rating difference x, one can calculate the probability for a draw, and it will be higher if drawValue is higher (because it will be equal to F(x+drawValue) - F(x-drawValue), where F is the cumulative Elo distribution). But unless drawValue is ridiculously large, the shape of this draw probability distribution is practically independent of it, as the expression is a quite accurate estimate for 2*drawValue*(d/dx)F(x), i.e. proportional to the Bell-shaped Elo curve itself.

In Bayesian analysis, only the shape of the curve is important, and the absolute magnitude is divided out. If it sees there is a draw game, no matter how small the probability of draw games in general, it will always be taken as evidence that the ratings are close (+/- the SD of the Elo curve). That it judges the draw probability quite small factors out.

That BayesElo tends to strongly compress the rating scale with the standard prior (of 2 draws) if the group has a very wide Elo range (e.g. spread over >1000 Elo) is well known. You only need a single game between a top and a bottom engine, and it will never believe that their difference can be anywhere near 1000 Elo, because it counts two draws between them, and those would get an astronomically small probability if the players really were >1000 Elo apart, as most Elo models have exponential tails. It would sooner believe that all the other differences along the scale are mostly due to luck than dismiss those two virtual draws between such widely separated players!
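The shape argument in the quoted post is easy to check numerically. This is a minimal sketch, assuming the logistic Elo curve F(x) = 1/(1 + 10^(-x/400)) and two arbitrary drawValue settings; after dividing each draw-probability curve by its peak, the shapes nearly coincide:

```python
def F(x):
    # Cumulative logistic Elo curve: expected score at rating difference x
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def draw_prob(x, draw_value):
    # Draw probability at rating difference x for a given drawValue (in Elo):
    # F(x + drawValue) - F(x - drawValue)
    return F(x + draw_value) - F(x - draw_value)

# The absolute draw rates differ a lot between the two settings,
# but the peak-normalized shapes are nearly identical.
for dv in (50.0, 150.0):
    peak = draw_prob(0.0, dv)
    shape = [round(draw_prob(x, dv) / peak, 3) for x in (-400, -200, 0, 200, 400)]
    print(dv, shape)
```

The normalized values agree to within a few percent, which is the sense in which only the shape of the draw curve, not its magnitude, enters the Bayesian analysis.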
I won't argue about what BayesElo "should" do, but the bottom line is that the assumed drawelo value has a major effect. Mark Watkins gave me an extreme example showing that for a 76% score, which "should" give a 200 Elo difference, BayesElo with default values will output anywhere from about 160 to 240 Elo difference depending on the percentage of draws. So this is clearly a major issue for computer testing, because the actual draw percentage is not close to the figure implied by the defaults. Whether this is a flaw in BayesElo or just a feature I leave to others to debate. You may also be correct that the assumed two virtual draws significantly further compress the ratings.
Others who have written in this thread about why averaging performances doesn't predict the resultant rating are ignoring the fact that the ratings are calculated by BayesElo. The fact that they don't match the outputs of other calculations such as EloStat is irrelevant if BayesElo is used.
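Mark Watkins' effect can be reproduced with a small sketch. This is not BayesElo itself, only a maximum-likelihood fit under the win/draw/loss model BayesElo is usually described with, where drawelo shifts the logistic curve; the default of about 97 Elo used below is an assumption. Both hypothetical 100-game samples score 76%, yet the fitted differences disagree by roughly 100 Elo:

```python
import math

def F(x):
    # Cumulative logistic Elo curve
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def mle_elo(wins, draws, losses, drawelo=97.0):
    """Grid-search the rating difference maximizing the trinomial likelihood.

    Model (as commonly described for BayesElo):
      P(win)  = F(x - drawelo)
      P(loss) = F(-x - drawelo)
      P(draw) = 1 - P(win) - P(loss)
    """
    best_x, best_ll = 0.0, float("-inf")
    for i in range(4001):
        x = i * 0.1  # scan 0 .. 400 Elo in 0.1 steps
        pw, pl = F(x - drawelo), F(-x - drawelo)
        pd = 1.0 - pw - pl
        ll = 0.0
        if wins:
            ll += wins * math.log(pw)
        if draws:
            ll += draws * math.log(pd)
        if losses:
            ll += losses * math.log(pl)
        if ll > best_ll:
            best_x, best_ll = x, ll
    return best_x

decisive = mle_elo(76, 0, 24)   # 76% score, no draws
drawish  = mle_elo(52, 48, 0)   # 76% score, 48% draws
print(round(decisive), round(drawish))  # roughly 260 vs 170 under this model
```

The draw-heavy sample lands far below the all-decisive one, in line with the 160-240 spread mentioned above: under this model, draws are evidence that the players are closer in rating.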
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm

Re: The IPON BayesElo mystery solved.

ThatsIt wrote:
You (or that other guy) didn't even give the number of draws, which is important too.
[...snip...]
Oooh boy, it's not (important).
During the IPON tests, EloStat shows the calculation.
Perhaps you should read the IPON conditions a bit more carefully from now on?
OK, last time for you: in correct rating calculations, _trinomials_ are used, not _binomials_. For simplicity, some people approximate trinomials with binomials, but it is only an approximation. I was talking about rigorous results (for simple examples only), not about EloStat, nor about BayesElo.
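A minimal illustration of why the trinomial matters, with two hypothetical 100-game samples that both score 76%: under the trinomial model, each game contributes 1, 0.5, or 0 points, so the uncertainty of the score depends on the draw count, which a binomial (win/loss only) approximation cannot capture:

```python
def score_sd(wins, draws, losses):
    # Standard error of the mean score under the trinomial model:
    # each game is worth 1 (win), 0.5 (draw), or 0 (loss) points.
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    var = (wins * 1.0 + draws * 0.25) / n - mean ** 2
    return (var / n) ** 0.5

# Same 76% score, very different uncertainty:
print(score_sd(76, 0, 24))   # all decisive games: SD about 0.043
print(score_sd(52, 48, 0))   # nearly half draws:  SD about 0.025
```

A binomial treatment would assign both samples the same error bar, so any Elo calculation built on it is, as said, an approximation.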

Weird folks here.
Kai
ThatsIt
Posts: 992
Joined: Thu Mar 09, 2006 2:11 pm

Re: The IPON BayesElo mystery solved.

lkaufman wrote: [...snip...]
Others who have written in this thread about why averaging performances doesn't predict the resultant rating are ignoring the fact that the ratings are calculated by BayesElo. The fact that they don't match the outputs of other calculations such as EloStat is irrelevant if BayesElo is used.
http://www.talkchess.com/forum/viewtopi ... 22&t=41655
The mystery you wrote about was displayed by EloStat, not by BayesElo!

Best wishes,
G.S.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: The IPON BayesElo mystery solved.

lkaufman wrote:I won't argue about what BayesElo "should" do, but the bottom line is that the assumed drawelo value has a major effect. Mark Watkins gave me an extreme example showing that for a 76% score, which "should" give a 200 Elo difference, BayesElo with default values will output anywhere from about 160 to 240 Elo difference depending on the percentage of draws. So this is clearly a major issue for computer testing, because the actual draw percentage is not close to the figure implied by the defaults. Whether this is a flaw in BayesElo or just a feature I leave to others to debate. You may also be correct that the assumed two virtual draws significantly further compress the ratings.
BayesElo has a command "mm" which is usually called without arguments. If you call "mm 1 1" instead, the program will calculate the real values of "advantage" and "drawelo" from the given PGN database and will use these for the subsequent rating calculation. You can also use "mm 0 1" to let only the "drawelo" parameter be recalculated while keeping the default for "advantage". I have tried both for the IPON games. There was a small rescaling of rating values: the top 20 engines got about 9-13 Elo points more than before, and the bottom 20 engines correspondingly fewer. The difference between recalculating "advantage" and keeping its default value was negligible.

And the rating difference between Houdini 2.0 and Komodo 4 remained at 40.

Regarding "prior" and the two virtual draws assumed by BayesElo: I don't think this has any measurable influence for the high number of games we have at IPON, CCRL or other bigger engine rating systems. AFAIK these two draws are only added between opponents who actually have played each other, and since most games are played between engines which are not many hundreds of ELO points away from each other, the actual influence of two additional draws should be very small in practice.
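Sven's expectation can be sanity-checked with a small maximum-likelihood sketch (the commonly described drawelo model with an assumed default of 97 Elo, not BayesElo itself): two virtual draws barely move a 1000-game estimate, but they dominate a single-game pairing, which is exactly the wide-range compression scenario:

```python
import math

def F(x):
    # Cumulative logistic Elo curve
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def mle_elo(w, d, l, drawelo=97.0):
    # Grid search for the rating difference maximizing the trinomial likelihood
    best_x, best_ll = 0.0, float("-inf")
    for i in range(8001):
        x = i * 0.1 - 400.0  # scan -400 .. +400 Elo
        pw, pl = F(x - drawelo), F(-x - drawelo)
        pd = 1.0 - pw - pl
        ll = w * math.log(pw) + l * math.log(pl) + (d * math.log(pd) if d else 0.0)
        if ll > best_ll:
            best_x, best_ll = x, ll
    return best_x

# 1000 games at a 70% score: two extra draws barely shift the estimate
print(mle_elo(600, 200, 200), mle_elo(600, 202, 200))

# A single decisive game: the estimate runs to the edge of the scan range,
# while two added virtual draws pull it all the way back to roughly +100 Elo
print(mle_elo(1, 0, 0), mle_elo(1, 2, 0))
```

So the prior is harmless for well-connected pairings with many games, and only matters for sparse games between widely separated engines.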
lkaufman wrote:Others who have written in this thread about why averaging performances doesn't predict the resultant rating are ignoring the fact that the ratings are calculated by BayesElo. The fact that they don't match the outputs of other calculations such as EloStat is irrelevant if BayesElo is used.
Larry, I have two questions:

1) What is the idea you have in mind about "match performance ratings" in chess engine tournaments? How do you think they are, or should be, obtained, given one large PGN file of "all" games?

2) What is that difference between EloStat and BayesElo regarding applicability of "averaging" that you are thinking of?

Sven
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: The IPON BayesElo mystery solved.

ThatsIt wrote:
lkaufman wrote: [...snip...]
Others who have written in this thread about why averaging performances doesn't predict the resultant rating are ignoring the fact that the ratings are calculated by BayesElo. The fact that they don't match the outputs of other calculations such as EloStat is irrelevant if BayesElo is used.
http://www.talkchess.com/forum/viewtopi ... 22&t=41655
The mystery you wrote about was displayed by EloStat, not by BayesElo!

Best wishes,
G.S.
Hi Gerhard,

please note: averaging "match performances" (whatever that really means) is wrong in both rating calculation models, EloStat as well as BayesElo. That follows immediately from the non-linearity of the percentage expectancy curve. People seem to accept this fact here from time to time but tend to forget it one second after clicking on "Submit".
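A toy example of that non-linearity, with hypothetical numbers, the standard logistic expectancy curve, and the simplifying assumption that the engine's two opponents are rated equally: averaging the two performance ratings overstates the pooled result by more than 40 Elo.

```python
import math

def perf_elo(score):
    # Invert the logistic expectancy curve: score fraction -> Elo difference
    return -400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical engine: 90% vs opponent A, 50% vs opponent B (equal-rated, equal games)
avg_of_perfs = (perf_elo(0.90) + perf_elo(0.50)) / 2.0   # about +191
pooled_perf  = perf_elo(0.70)                            # about +147
print(round(avg_of_perfs, 1), round(pooled_perf, 1))
```

The 90% result sits on the flat tail of the curve, so its performance rating is disproportionately large; the pooled 70% score is what a consistent model actually reproduces.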

Sven
hgm
Posts: 27945
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The IPON BayesElo mystery solved.

lkaufman wrote: I won't argue about what Bayeselo "should" do, but the bottom line is that the assumed drawelo value has a major effect. Mark Watkins gave me extreme example showing that for a 76% score, which "should" give 200 elo difference, Bayeselo with default values will output anywhere from about 160 to 240 elo difference depending on the percentage of draws.
Yes, of course it depends on the percentage of draws. But that is an entirely different matter from depending on the (assumed) drawValue.

With the hyperbolic secant it uses as its Elo model, a single draw is equivalent to one loss plus one win. (By a mathematical coincidence, (d/dx)F(x) is proportional to F(x) * F(-x), i.e. to P(win) * P(loss). With a Gaussian distribution you would not have that.) This holds for any assumed 'drawValue'. It just means that a 75% score obtained by one draw and one win predicts exactly the same as two wins and one loss (after 'expanding' the draw), which is a 67% score!

This conclusion depends only on the shape of the rating curve.
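The "mathematical coincidence" can be verified numerically for the logistic curve (whose density is the sech-squared shape): F'(x) divided by F(x)·F(−x) is the same constant, ln(10)/400, at every rating difference, so in a likelihood a draw contributes the same x-dependence as one win times one loss:

```python
import math

def F(x):
    # Logistic Elo curve; its density is proportional to sech^2
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def dF(x, h=1e-3):
    # Central-difference numerical derivative of F
    return (F(x + h) - F(x - h)) / (2.0 * h)

# F'(x) / (F(x) * F(-x)) should be the constant ln(10)/400 for every x.
# Note F(-x) = 1 - F(x) for this symmetric curve.
const = math.log(10.0) / 400.0
for x in (-300.0, -100.0, 0.0, 100.0, 300.0):
    ratio = dF(x) / (F(x) * F(-x))
    print(x, round(ratio / const, 6))  # always 1.0 up to numerical error
```

With a Gaussian F this ratio would vary with x, so the draw = win + loss equivalence really is specific to the logistic/sech model.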
So this is clearly a major issue for computer testing, because the actual draw percentage is not close to the figure implied by the defaults. Whether this is a flaw in BayesElo or just a feature I leave to others to debate.
I am not sure what the problem is, then. If you believe the sech rating model to be correct, this is a true effect (draws are stronger evidence that the players are close in rating than wins and losses with the same total score). Any analysis based on that model should show it too. If you don't believe that, you should base the analysis on another score-vs-Elo curve.
Michel
Posts: 2277
Joined: Mon Sep 29, 2008 1:50 am

Re: The IPON BayesElo mystery solved.

hgm wrote:I don't get it. The drawValue shouldn't affect the ratings in BayesElo, should it? Given a certain rating difference x, one can calculate the probability for a draw, and it will be higher if drawValue is higher (because it will be equal to F(x+drawValue) - F(x-drawValue), where F is the cumulative Elo distribution). But unless drawValue is ridiculously large, the shape of this draw probability distribution is practically independent of it, as the expression is a quite accurate estimate for 2*drawValue*(d/dx)F(x), i.e. proportional to the Bell-shaped Elo curve itself.

In Bayesian analysis, only the shape of the curve is important, and the absolute magnitude is divided out.
That's a very nice observation. I had not noticed that the value of drawelo falls out of the computations for relatively small Elo differences. To me that is quite surprising.

I just checked it for a tournament I am running, and it is indeed true. Of course, the larger drawelo is, the smaller the Elo differences have to be for the observation to hold.
With the hyperbolic secant it uses as its Elo model, a single draw is equivalent to one loss plus one win. (By a mathematical coincidence, (d/dx)F(x) is proportional to F(x) * F(-x), i.e. to P(win) * P(loss). With a Gaussian distribution you would not have that.) This holds for any assumed 'drawValue'. It just means that a 75% score obtained by one draw and one win predicts exactly the same as two wins and one loss (after 'expanding' the draw), which is a 67% score!
Very nice observation too!
lkaufman
Posts: 5981
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: The IPON BayesElo mystery solved.

ThatsIt wrote:
lkaufman wrote: [...snip...]
Others who have written in this thread about why averaging performances doesn't predict the resultant rating are ignoring the fact that the ratings are calculated by BayesElo. The fact that they don't match the outputs of other calculations such as EloStat is irrelevant if BayesElo is used.
http://www.talkchess.com/forum/viewtopi ... 22&t=41655
The mystery you wrote about was displayed by EloStat, not by BayesElo!

Best wishes,
G.S.
The phenomenon occurs with both EloStat and BayesElo. With EloStat there is no mystery: it's due to the incorrectness of the model itself, which (improperly) averages ratings. With BayesElo the reasons for the discrepancy are much less obvious. I attributed it to the prior, but how much of an effect that has is not yet clear to me.
lkaufman
Posts: 5981
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: The IPON BayesElo mystery solved.

Sven Schüle wrote:
lkaufman wrote:I won't argue about what BayesElo "should" do, but the bottom line is that the assumed drawelo value has a major effect. Mark Watkins gave me an extreme example showing that for a 76% score, which "should" give a 200 Elo difference, BayesElo with default values will output anywhere from about 160 to 240 Elo difference depending on the percentage of draws. So this is clearly a major issue for computer testing, because the actual draw percentage is not close to the figure implied by the defaults. Whether this is a flaw in BayesElo or just a feature I leave to others to debate. You may also be correct that the assumed two virtual draws significantly further compress the ratings.
BayesElo has a command "mm" which is usually called without arguments. If you call "mm 1 1" instead, the program will calculate the real values of "advantage" and "drawelo" from the given PGN database and will use these for the subsequent rating calculation. You can also use "mm 0 1" to let only the "drawelo" parameter be recalculated while keeping the default for "advantage". I have tried both for the IPON games. There was a small rescaling of rating values: the top 20 engines got about 9-13 Elo points more than before, and the bottom 20 engines correspondingly fewer. The difference between recalculating "advantage" and keeping its default value was negligible.

And the rating difference between Houdini 2.0 and Komodo 4 remained at 40.

Regarding "prior" and the two virtual draws assumed by BayesElo: I don't think this has any measurable influence for the high number of games we have at IPON, CCRL or other bigger engine rating systems. AFAIK these two draws are only added between opponents who actually have played each other, and since most games are played between engines which are not many hundreds of ELO points away from each other, the actual influence of two additional draws should be very small in practice.
lkaufman wrote:Others who have written in this thread about why averaging performances doesn't predict the resultant rating are ignoring the fact that the ratings are calculated by BayesElo. The fact that they don't match the outputs of other calculations such as EloStat is irrelevant if BayesElo is used.
Larry, I have two questions:

1) What is the idea you have in mind about "match performance ratings" in chess engine tournaments? How do you think they are, or should be, obtained, given one large PGN file of "all" games?

2) What is that difference between EloStat and BayesElo regarding applicability of "averaging" that you are thinking of?

Sven
I only disagree with your characterization of 9-13 Elo points as "negligible". This is roughly half of the discrepancy we have been talking about.