The IPON BayesElo mystery solved.

Discussion of computer chess matches and engine tournaments.

Moderators: hgm, Rebel, chrisw, Ras

hgm
Posts: 28268
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The IPON BayesElo mystery solved.

Post by hgm »

Indeed, this function seems to have many amazing properties. The derivative (a hyperbolic secant) is also an eigenfunction of the Fourier transform. (For Gaussians this is well known, but it surprised me that it holds for sech too.)
hgm
Posts: 28268
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The IPON BayesElo mystery solved.

Post by hgm »

lkaufman wrote:It suddenly occurred to me that if BayesElo is correct, then when doing normal sequential ratings (like USCF or FIDE) wouldn't it be correct to rate each draw twice, or alternatively to only half-rate wins and losses? That doesn't sound right, but it would seem to be logical if the underlying assumption of BayesElo is right. Only in that way would one draw = one win plus one loss.
This sounds reasonable. But I thought FIDE ratings were based on a Gaussian model, and I never calculated how it would work out for that. But it cannot be excluded that draws should have a different weight from wins or losses.

Of course Sonas showed that the FIDE model is no good at all, so it could use a lot more changes than just the draw weight.
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: The IPON BayesElo mystery solved.

Post by IWB »

Hello Larry,

It is even more complicated!

1. Yes, for my final rating list I use Bayeselo with default settings!
2. During the run of an engine, I use the automatic calculation of the Shredder GUI, which is the pure Elo formula seen here: http://en.wikipedia.org/wiki/Elo_rating_system in the section "Theory".
3. When I start a tourney I enter the last Bayeselo number into the program, which then calculates pure Elo. (Knowing that the two do not fit together 100%.)

For me, as a "mathematical illiterate", the conclusion is easy:
The overall performance (lower end of the live list) fits magically well with the later computed Bayeselo result (I do not remember a 5 Elo difference!). The individual results are so random with their low 100 games that I do not care about them (knowing that some do)!

For me there is no riddle at all.

Regarding the draw rate, I have this for the latest IPON list:
  • Games : 161200 (finished)

    White Wins : 60011 (37.2 %)
    Black Wins : 43748 (27.1 %)
    Draws : 57441 (35.6 %)

    White Perf. : 55.0 %
    Black Perf. : 45.0 %
Comparing these percentages with other lists, which I use to check my results more or less continuously, I am absolutely in line, so I do not have any reason to be worried about my result :-)
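For anyone who wants to recheck the arithmetic, here is a minimal sketch that recomputes the percentages from the raw counts quoted above. The variable names are made up for illustration; only the three counts come from the list.

```python
# Raw counts from the IPON list quoted above.
white_wins, black_wins, draws = 60011, 43748, 57441
games = white_wins + black_wins + draws

# A draw counts as half a point for each side.
white_perf = (white_wins + 0.5 * draws) / games
black_perf = (black_wins + 0.5 * draws) / games

print(f"Games      : {games}")
print(f"White wins : {100 * white_wins / games:.1f} %")
print(f"Black wins : {100 * black_wins / games:.1f} %")
print(f"Draws      : {100 * draws / games:.1f} %")
print(f"White perf.: {100 * white_perf:.1f} %")
print(f"Black perf.: {100 * black_perf:.1f} %")
```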

This might be completely mathematically ignorant, but ... it makes sense - to me :-)

Have a nice 2012
INgo
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: The IPON BayesElo mystery solved.

Post by michiguel »

hgm wrote:I never heard the term 'Logistic' before, but I assume this is the expression

score = 100%/(1+exp(-EloDifference/WIDTH)) = F(EloDifference)
This equation could come from Gibbs energy, Boltzmann distribution etc. etc.

F = 100%/(1+exp(-E/RT))

F = percentage of a molecule at a given state A out of two possible states (A and B)
E = Difference of Gibbs Energy between A and B
RT = Gas constant * Temperature

I reached this logistic equation from a different model, looking at strength as energy.

Miguel
which is indeed a commonly used alternative for the Gaussian (normal) model (using WIDTH = 400/ln(10), so the exponential becomes 10^(-EloDif/400)). This is what BayesElo uses, except that it models not only the average score, but wins, draws and losses separately, by

wins = F(EloDiff - drawValue)

from which it automatically follows (as one player's win is another player's loss)

losses = F(-EloDiff - drawValue) = 100% - F(EloDiff + drawValue)

(F(-x) = 1 - F(x) for all x)

and thus

draws = 100% - wins - losses = F(EloDiff + drawValue) - F(EloDiff - drawValue)

From this it follows that a draw between two players is twice as strong evidence for their equality as a win or loss is for their inequality. But that conclusion is of course only as good as the model predicting the WDL frequencies.
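The formulas above can be written out as a minimal sketch (fractions instead of percentages). The draw_value of 100 is an arbitrary illustrative number, not a BayesElo default, and the names are mine. The last column checks a property of the logistic curve: draws / (wins * losses) is the same for every Elo difference, which is the sense in which one draw carries the same information as one win plus one loss in this model.

```python
import math

WIDTH = 400 / math.log(10)  # makes exp(-x/WIDTH) equal to 10**(-x/400)

def F(elo_diff):
    """Logistic expected-score curve, as a fraction in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-elo_diff / WIDTH))

def wdl(elo_diff, draw_value):
    """Win, draw and loss probabilities from the formulas above."""
    wins = F(elo_diff - draw_value)
    losses = F(-elo_diff - draw_value)
    draws = 1.0 - wins - losses
    return wins, draws, losses

# draw_value = 100 is an assumption chosen only for illustration.
for diff in (0, 100, 200, 400):
    w, d, l = wdl(diff, 100.0)
    # For the logistic curve this ratio does not depend on diff.
    print(f"{diff:4d}  W={w:.3f} D={d:.3f} L={l:.3f}  D/(W*L)={d / (w * l):.4f}")
```

With a Gaussian curve in place of F the ratio would change with the Elo difference, so the equivalence is specific to the logistic shape.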

The only way to say anything sensible about that is to actually plot the win, draw and loss frequencies as a function of Elo difference (e.g. take a huge set of games, calculate the ratings, divide the games over rating-difference bins, calculate the WDL percentages for each bin, and plot the results in the same graph as F(EloDiff)).

If this confirms the model, the ratings were OK. If not, you should repeat the rating calculation with an improved model, plot the results again using the new ratings, and so on, until you reach consistency.
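The binning step of that check could look roughly like the sketch below. It assumes the games are already available as (elo_diff, result) pairs, with the result given from the first player's point of view. The function name, the 50-point bin width and the six toy games are all hypothetical; a real check needs a huge set of rated games and a plot instead of a printout.

```python
import math
from collections import defaultdict

def F(elo_diff):
    """Logistic expected score as a fraction."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400))

def wdl_by_bin(games, bin_width=50):
    """Group (elo_diff, result) pairs into rating-difference bins.

    result is 'W', 'D' or 'L' from the first player's point of view.
    Returns {bin_start: (win_frac, draw_frac, loss_frac, n_games)}.
    """
    counts = defaultdict(lambda: {"W": 0, "D": 0, "L": 0})
    for elo_diff, result in games:
        start = math.floor(elo_diff / bin_width) * bin_width
        counts[start][result] += 1
    table = {}
    for start, c in sorted(counts.items()):
        n = c["W"] + c["D"] + c["L"]
        table[start] = (c["W"] / n, c["D"] / n, c["L"] / n, n)
    return table

# Hypothetical toy sample, for illustration only.
games = [(12, "W"), (30, "D"), (45, "L"), (110, "W"), (120, "W"), (140, "D")]
for start, (w, d, l, n) in wdl_by_bin(games).items():
    mid = start + 25  # centre of a 50-point bin
    score = w + 0.5 * d
    print(f"[{start:4d},{start + 50:4d})  n={n}  W={w:.2f} D={d:.2f} L={l:.2f}  "
          f"score={score:.2f}  F(mid)={F(mid):.2f}")
```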
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: The IPON BayesElo mystery solved.

Post by michiguel »

lkaufman wrote:
michiguel wrote:
lkaufman wrote:I do not know what BayesELO is doing, but the difference here could be because of a use of a different formula. Still, note that there are minor differences in between, and some engines that are 9th are 10th etc.

I believe that this shows that fighting for 5 points elo or so is not worth it. It is a meaningless difference, IMHO. It only means something in head to head competition (with enough games, of course).

Miguel
First point: there are two versions of the elo formula, going back to the publication of his book on ratings. One uses the normal distribution, and the other uses the Logistic, exactly the one you say your model uses.
I always thought that ELO used a Gaussian curve, and Wikipedia seems to confirm that, but they may be wrong.
Some rating agencies use normal, others use Logistic. I believe at least the USCF uses Logistic. I don't know whether Bayeselo uses normal or logistic. If Logistic, it should match your formula exactly or almost exactly with the proper options set and no PRIOR.
That holds if the convergence of both methods is equally good. With BayesElo, can you calculate the performance of a single engine and get the same rating?
As you say, the differences between the two distributions are in general pretty tiny, except at the extremes. So you have independently re-invented one version of the Elo formula. I did the same in the early 1970s, before the book was published; I was using the Logistic version of "Elo" to rate blitz events before he published his book in which that idea was introduced.
I disagree about five points being "meaningless". It doesn't matter whether you play head to head or against identical opponents, if you play enough games it will have real meaning. Given the samples that the rating agencies actually play, its meaning is limited. But it's rather moot; we strive to gain five points because if we do it ten times, we have fifty points! We don't release a new version if it's just five or ten elo better in our judgment.
But that is 5 Elo measured in your own constant setup. Yes, that is meaningful! But drop the engine into a pool of different engines, and different models could give you slightly different answers. Gaussian? Logistic? Something else? Why is one correct and the other not? At that point, 5 Elo may reflect not only the noise, but also a systematic error in the model, the engines picked, etc.

Miguel
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: The IPON BayesElo mystery solved.

Post by Sven »

michiguel wrote:I always thought that ELO used a Gaussian curve, and wikipedia seems to confirm that, but they may be wrong.
Arpad E. Elo used a Gaussian curve. I have read his book. The FIDE rating system is still using that. Long ago the USCF already switched to the logistic curve, based on findings that it was more appropriate for chess ratings. BayesElo definitely uses the logistic curve. I have not found out whether this applies to EloStat, too, but I'm pretty sure it does.

The logistic curve is almost equal to the Gaussian curve within the range of roughly +/- 230 ELO points, then they start to diverge slightly, with win expectancy differences of more than 0.01 for ELO differences of more than roughly 400 points, and even more divergence for more extreme ELO differences.
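To put rough numbers on that comparison, here is a small sketch of both curves. It assumes the conventional Gaussian parameterization in which the rating difference has a standard deviation of 200*sqrt(2); that value is my assumption and is not stated in this thread.

```python
import math

def logistic_expectancy(diff):
    """Expected score from the logistic curve."""
    return 1.0 / (1.0 + 10 ** (-diff / 400))

def gaussian_expectancy(diff, sigma=200 * math.sqrt(2)):
    """Expected score from a normal CDF; sigma is an assumed spread."""
    return 0.5 * (1.0 + math.erf(diff / (sigma * math.sqrt(2))))

for diff in (0, 100, 230, 400, 600, 800):
    lg = logistic_expectancy(diff)
    ga = gaussian_expectancy(diff)
    print(f"{diff:4d}  logistic={lg:.4f}  gaussian={ga:.4f}  gap={ga - lg:+.4f}")
```

Under this assumption the gap stays at a few thousandths up to about 230 points and passes 0.01 around 400 points.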

Sven
lkaufman
Posts: 6114
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: The IPON BayesElo mystery solved.

Post by lkaufman »

hgm wrote:
lkaufman wrote:It suddenly occurred to me that if BayesElo is correct, then when doing normal sequential ratings (like USCF or FIDE) wouldn't it be correct to rate each draw twice, or alternatively to only half-rate wins and losses? That doesn't sound right, but it would seem to be logical if the underlying assumption of BayesElo is right. Only in that way would one draw = one win plus one loss.
This sounds reasonable. But I thought FIDE ratings were based on a Gaussian model, and I never calculated how it would work out for that. But it cannot be excluded that draws should have a different weight from wins or losses.

Of course Sonas showed that the FIDE model is no good at all, so it could use a lot more changes than just the draw weight.
FIDE ratings may use the normal distribution rather than the Logistic, I'm not sure, but the differences are insignificant except for extreme Elo differences. But they consider only the player's score against the given opponents, not how it was composed of wins and draws, so it follows that for FIDE one win plus one loss is identical to two draws. If the model behind BayesElo implies that one win plus one loss is the same as one draw, that is a HUGE difference. Ironically, if FIDE were to switch to counting draws twice to conform to BayesElo, the effect would be to favor top players who had fewer draws. These days, many events award prizes based on effectively giving each player only 1/3 of a point for draws. So a revised, BayesElo-like FIDE rating system would accidentally work to favor the strong players who made fewer draws, the same ones favored by this prize distribution! Of course it would have the opposite effect for tail-enders, but their ratings are generally considered less important.
This idea of double-rating draws is not of just hypothetical importance to me. As a member of the USCF rating committee, I am in a position to propose the idea for serious consideration. But one thing about it bothers me. Let's say K (the limiting value of a loss against a very weak opponent) is 32. Normally a huge upset win would get 32 and a huge upset draw would get 16, but if we give draws double credit they also get 32, which is absurd. So maybe double credit for draws just works if the players are closely matched, and the multiplier gradually decays to 1 as the rating spread increases. Does this sound right to you? Can you make a better proposal for modifying sequential ratings in the spirit of BayesElo?
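The K = 32 example can be put into a small sketch of a sequential update with a draw multiplier. The doubling rule follows the idea described above. The decaying multiplier 1 + 4*e*(1-e) is purely a hypothetical choice of mine that equals 2 for equal players and tends to 1 for very unequal ones; it is not a worked-out proposal.

```python
def expected(diff):
    """Logistic expected score for a player rated `diff` above the opponent."""
    return 1.0 / (1.0 + 10 ** (-diff / 400))

def always_one(e):
    return 1.0

def always_double(e):
    return 2.0

def decaying(e):
    # Hypothetical choice: 2 when e = 0.5, tending to 1 as e goes to 0 or 1.
    return 1.0 + 4.0 * e * (1.0 - e)

def rating_change(diff, score, k=32.0, draw_multiplier=always_one):
    """Rating change for one game; the multiplier applies to draws only."""
    e = expected(diff)
    m = draw_multiplier(e) if score == 0.5 else 1.0
    return m * k * (score - e)

for diff in (0, -200, -400, -800):
    win = rating_change(diff, 1.0)
    plain = rating_change(diff, 0.5)
    doubled = rating_change(diff, 0.5, draw_multiplier=always_double)
    soft = rating_change(diff, 0.5, draw_multiplier=decaying)
    print(f"{diff:5d}  win={win:6.2f}  draw={plain:6.2f}  "
          f"doubled={doubled:6.2f}  decaying={soft:6.2f}")
```

At a deficit of 800 points the doubled draw is worth almost as much as the win, which is the absurdity mentioned above, while the decaying version stays close to the ordinary draw value. Because the multiplier is symmetric in e and 1 - e, the two players' changes still cancel.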
lkaufman
Posts: 6114
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: The IPON BayesElo mystery solved.

Post by lkaufman »

IWB wrote:Hello Larry,

It is even more complicated!

1. Yes, for my final rating list I use Bayeselo with default settings!
2. During the run of an engine, I use the automatic calculation of the Shredder GUI, which is the pure Elo formula seen here: http://en.wikipedia.org/wiki/Elo_rating_system in the section "Theory".
3. When I start a tourney I enter the last Bayeselo number into the program, which then calculates pure Elo. (Knowing that the two do not fit together 100%.)

For me, as a "mathematical illiterate", the conclusion is easy:
The overall performance (lower end of the live list) fits magically well with the later computed Bayeselo result (I do not remember a 5 Elo difference!). The individual results are so random with their low 100 games that I do not care about them (knowing that some do)!

For me there is no riddle at all.

Regarding the draw rate, I have this for the latest IPON list:
  • Games : 161200 (finished)

    White Wins : 60011 (37.2 %)
    Black Wins : 43748 (27.1 %)
    Draws : 57441 (35.6 %)

    White Perf. : 55.0 %
    Black Perf. : 45.0 %
Comparing these percentages with other lists, which I use to check my results more or less continuously, I am absolutely in line, so I do not have any reason to be worried about my result :-)

This might be completely mathematically ignorant, but ... it makes sense - to me :-)

Have a nice 2012
INgo
The individual performances in 100 games are unimportant, but people are concerned that the final ratings of top engines are clearly well below the average of these numbers. I believe I have explained why in this thread. You need not change anything; I'm only giving you a way to answer such complaints!

Your draw percentage is normal, but it is much higher than the draw percentage assumed by BayesElo. This is the main cause of the above phenomenon. You could fix it by using the option that lets BayesElo calculate the "DrawElo" from your own data, but I'm not recommending that you do so. Reducing or eliminating "PRIOR" would also help, but again I'm not recommending that.
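As a very rough illustration of the size of that mismatch, the sketch below inverts the logistic model on the aggregate IPON counts. It leans on three assumptions that should be checked before relying on it: that all games can be treated as if played between equally rated engines (they are not), that a white advantage enters as an extra term next to the draw value, and that 32.8 and 97.3 are BayesElo's default advantage and drawelo, which is how I recall them. The function names are made up.

```python
import math

def f(x):
    """Logistic curve on the Elo scale."""
    return 1.0 / (1.0 + 10 ** (-x / 400))

def inv_f(p):
    """Elo argument x such that f(x) == p."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def implied_parameters(white_wins, black_wins, draws):
    """Advantage and draw value that reproduce the aggregate WDL split,
    ASSUMING every game was between equally rated opponents."""
    n = white_wins + black_wins + draws
    x_white = inv_f(white_wins / n)   # equals advantage - draw_elo
    x_black = inv_f(black_wins / n)   # equals -advantage - draw_elo
    draw_elo = -(x_white + x_black) / 2
    advantage = (x_white - x_black) / 2
    return advantage, draw_elo

def draw_fraction(advantage, draw_elo):
    """Draw rate between equal players for the given parameters."""
    return 1.0 - f(advantage - draw_elo) - f(-advantage - draw_elo)

adv, d_elo = implied_parameters(60011, 43748, 57441)
print(f"implied advantage = {adv:.1f}, implied draw value = {d_elo:.1f}")
print(f"draw rate for those values   : {draw_fraction(adv, d_elo):.3f}")
# 32.8 and 97.3 are assumed defaults; verify them against BayesElo itself.
print(f"draw rate for assumed defaults: {draw_fraction(32.8, 97.3):.3f}")
```

Since real pairings are not between equal engines, the draw value that BayesElo would fit from the actual games will differ from this crude estimate.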

Have a great 2012 yourself! I hope we have something improved for you to test in the near future.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: The IPON BayesElo mystery solved.

Post by Sven »

lkaufman wrote: FIDE ratings may use the normal distribution rather than the Logistic, I'm not sure, but the differences are insignificant except for extreme Elo differences. But they consider only the player's score against the given opponents, not how it was composed of wins and draws, so it follows that for FIDE one win plus one loss is identical to two draws. If the model behind BayesElo implies that one win plus one loss is the same as one draw, that is a HUGE difference. Ironically, if FIDE were to switch to counting draws twice to conform to BayesElo, the effect would be to favor top players who had fewer draws. [...]
As far as I understand, BayesElo does not change the weight of draws for calculating the rating, only for the uncertainty values. I may be wrong, but that is how I always perceived it. To be sure about it you can look at the GPL source code on the BayesElo page of its author, Rémi Coulom.

Sven
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: The IPON BayesElo mystery solved.

Post by Laskos »

lkaufman wrote:
hgm wrote:
lkaufman wrote: I won't argue about what Bayeselo "should" do, but the bottom line is that the assumed drawelo value has a major effect. Mark Watkins gave me an extreme example showing that for a 76% score, which "should" give 200 elo difference, Bayeselo with default values will output anywhere from about 160 to 240 elo difference depending on the percentage of draws.
Yes, of course it depends on the percentage of draws. But that is an entirely different matter than that it would depend on the (assumed) drawValue.

With the hyperbolic-secant model it uses as Elo model, a single draw is equivalent to one loss and one win. (By a mathematical coincidence (d/dx)F(x) is proportional to F(x) * (1 - F(x)) = F(x) * F(-x). With a Gaussian distribution you would not have that.) This would be true for any assumed 'drawValue'. It just means that 75% obtained by one draw and one win predicts exactly the same as two wins and one loss (after 'expanding' the draw), which is a 66% score!

This conclusion is only dependent on the shape of the rating curve.
So this is clearly a major issue for computer testing, because the actual draw percentage is not close to the figure implied by the defaults. Whether this is a flaw in BayesElo or just a feature I leave to others to debate.
I am not sure what the problem is, then. If you believe the sech rating model to be correct, this is a true effect (draws are stronger evidence that the players are close in rating than wins and losses with the same total score). Any analysis based on that model should get it too. If you don't believe that, you should base the analysis on another score-vs-Elo curve.
Okay, that's a very informative post. You say that BayesElo assumes that one win and one draw predicts the same as two wins and one loss. This seems wrong to me. I would think one draw should be considered like (half a win + half a loss), not like (one win plus one loss). In other words one win and one loss are like two draws, not like one draw. At least that's the assumption of the real Elo rating system and of Elostat, as well as the way events are scored. To me it makes BayesElo suspect, although I'm open-minded on this and could be convinced otherwise. Do you really believe that this model underlying BayesElo is more correct than the normal assumption?
I don't know Bayeselo, but what is this "BayesElo assumes that one win and one draw predicts the same as two wins and one loss"? With trinomials, draws are not equivalent to any combination of wins or losses; only when one approximates trinomials with binomials must draws be modeled on wins/losses. The approximation is inherently inaccurate for large ranges of distributions. If I understood correctly, Bayeselo is making some assumptions about draws as related to wins/losses? I repeat, any such assumption is an approximation, a better or a worse one, depending on the concrete results.
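One way to see what is being argued about is to fit the trinomial model directly. The sketch below is a plain maximum-likelihood fit of the Elo difference with the logistic win/draw/loss formulas quoted earlier in the thread, no prior and no white advantage. The draw value of 100 is an arbitrary illustrative number, the function names are mine, and this is not BayesElo's actual code, so it will not reproduce its output.

```python
import math

def f(x):
    """Logistic curve on the Elo scale."""
    return 1.0 / (1.0 + 10 ** (-x / 400))

def log_likelihood(diff, wins, draws, losses, draw_elo):
    """Trinomial log-likelihood of a W/D/L record at Elo difference diff."""
    pw = f(diff - draw_elo)
    pl = f(-diff - draw_elo)
    pd = 1.0 - pw - pl
    return wins * math.log(pw) + draws * math.log(pd) + losses * math.log(pl)

def mle_diff(wins, draws, losses, draw_elo, lo=-1500.0, hi=1500.0):
    """Ternary search for the maximum; the log-likelihood is concave in diff."""
    for _ in range(200):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if (log_likelihood(m1, wins, draws, losses, draw_elo)
                < log_likelihood(m2, wins, draws, losses, draw_elo)):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

DRAW_ELO = 100.0  # arbitrary illustrative value, not a BayesElo default

# One win plus one draw versus two wins plus one loss.
print(mle_diff(1, 1, 0, DRAW_ELO), mle_diff(2, 0, 1, DRAW_ELO))

# The same 76% score composed with different numbers of draws (per 100 games).
for w, d, l in ((76, 0, 24), (64, 24, 12), (52, 48, 0)):
    print(w, d, l, round(mle_diff(w, d, l, DRAW_ELO), 1))
```

In this model the first two fits agree, and the fitted difference for a fixed 76% score shrinks as more of the score comes from draws. Whether that behaviour matches real game data is exactly the empirical question raised in the thread.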

Kai