Hi,
If some of you are interested in rating algorithms, here is my CG2008 paper:
http://remi.coulom.free.fr/WHR/
Title: Whole-History Rating: A Bayesian Rating System for Players of Time-Varying Strength
Abstract: Whole-History Rating (WHR) is a new method to estimate the time-varying strengths of players involved in paired comparisons. Like many variations of the Elo rating system, the whole-history approach is based on the dynamic Bradley-Terry model. But, instead of using incremental approximations, WHR directly computes the exact maximum a posteriori over the whole rating history of all players. This additional accuracy comes at a higher computational cost than traditional methods, but computation is still fast enough to be easily applied in real time to large-scale game servers (a new game is added in less than 0.001 second). Experiments demonstrate that, in comparison to Elo, Glicko, TrueSkill, and decayed-history algorithms, WHR produces better predictions.
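For readers who want the model made concrete: in the Bradley-Terry model underlying Elo, the probability that one player beats another depends only on the difference of their strength parameters. A minimal sketch on the conventional 400-point logistic Elo scale (the scale is the usual Elo convention, not something taken from the paper):

```python
import math

def win_prob(rating_a, rating_b):
    """Bradley-Terry / Elo: probability that A beats B, from the rating
    difference only, on the usual 400-point logistic scale."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Equal ratings give an expected score of 0.5; a 200-point edge
# gives roughly a 76% win probability.
```

WHR's dynamic version makes each player's rating a function of time, but the per-game outcome probability has this same form.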
Rémi
Bayeselo for players of time-varying strength
Moderator: Ras
-
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Bayeselo for players of time-varying strength
Perhaps a bit off topic, but I am still worried a lot more by the fact that BayesElo does give extremely poor predictions when you feed it with the data from the ChessWar Promo division with prior=2, than by any problems with time-dependent player strength.
But this brings up the following fundamental issue, which you refer to here as well: what exactly do you mean by "better predictions", and how do you evaluate that? Am I correct in assuming that you take a sample of games between the rated players, calculate their expected total score and variance of this score from the assigned ratings and the underlying rating model, and then calculate the sum of squares of the difference between the actual (sampled) scores and the predicted scores, weighted by the inverse variance of those scores as a measure of quality of the prediction?
Would a prediction method be "better" if it has a lower expectation value of this sum of squares, calculated over some assumed sampling probability? (e.g. that each possible set of pairings of the players is equally likely to be selected, so that nothing can be gained by trying to get the rating of some players more accurate, possibly at the expense of less accuracy for some other players.)
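The criterion described above could be sketched as follows (a hypothetical illustration of the proposal, not anything from BayesElo; the binomial variance and the draw-free simplification are my assumptions):

```python
def weighted_sq_error(pairings):
    """pairings: list of (predicted_win_prob, actual_score, n_games).
    Sum of squared differences between actual and predicted total
    scores, each weighted by the inverse of the predicted binomial
    variance of that total score (draws ignored for simplicity)."""
    total = 0.0
    for p, actual, n in pairings:
        expected = n * p
        variance = n * p * (1.0 - p)  # binomial variance of the total score
        total += (actual - expected) ** 2 / variance
    return total
```

A lower value would then mean the assigned ratings predicted the sampled results better, in exactly the weighted-least-squares sense asked about above.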
-
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: Bayeselo for players of time-varying strength
hgm wrote: Perhaps a bit off topic, but I am still worried a lot more by the fact that BayesElo does give extremely poor predictions when you feed it with the data from the ChessWar Promo division with prior=2, than by any problems with time-dependent player strength.

Is this what we talked about by e-mail, or something else? I thought that discussion was settled, but I would think again about it if not.
hgm wrote: But this brings up the following fundamental issue, which you refer to here as well: what exactly do you mean by "better predictions", and how do you evaluate that?

I evaluate by computing the frequency at which a game is won by the player with the highest rating.
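That ordering criterion, the frequency with which the higher-rated player wins, might look like this in code (a hypothetical sketch, not Rémi's actual implementation; skipping games between equally rated players is my assumption):

```python
def ordering_accuracy(games):
    """games: list of (rating_a, rating_b, a_won).
    Fraction of games won by the player with the higher rating;
    games between equally rated players are skipped."""
    correct = total = 0
    for rating_a, rating_b, a_won in games:
        if rating_a == rating_b:
            continue
        total += 1
        if (rating_a > rating_b) == a_won:
            correct += 1
    return correct / total
```

Note that this score only depends on the ordering of the ratings, not on their numerical spacing, which is why it sidesteps the scale-compression problem discussed below.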
hgm wrote: Am I correct in assuming that you take a sample of games between the rated players, calculate their expected total score and variance of this score from the assigned ratings and the underlying rating model, and then calculate the sum of squares of the difference between the actual (sampled) scores and the predicted scores, weighted by the inverse variance of those scores as a measure of quality of the prediction?

This would be another possibility. I spent a lot of time thinking about the best method. The method I used focuses more on ordering players correctly, rather than predicting winning frequencies.
An important aspect here is that the dynamic Bradley-Terry model is wrong. The KGS data that I used mixes very strong players and very weak players. All the algorithms I tested assume that all players have the same rating variability. But this is wrong. The consequence is that the ratings of beginners are "compressed". That is to say, they are ranked in the right order, but the differences in ratings are smaller than they should be.
I also considered using mean log-evidence as an indicator of performance. That is to say, the geometric average of probabilities given to game outcomes. But it does not work, because of outliers. Again, this is because the model is wrong.
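To see why outliers break the log-evidence measure, here is a small sketch (my own illustration, not from the paper): a single game to which the model assigned a near-zero probability drags down the geometric average for the whole data set.

```python
import math

def mean_log_evidence(probs):
    """Geometric mean of the probabilities the model assigned to the
    actual game outcomes, computed via the average log-probability."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

# 99 well-predicted games at 0.7, plus one outlier the model gave
# probability 1e-6: the single outlier noticeably lowers the score
# for the whole set, even though 99% of predictions were good.
```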
When the model is wrong (almost always the case in Bayesian inference), it is difficult to draw any conclusion about the quality of the inference system by measuring how well frequencies match the model. I am not completely happy with my criterion, but I feel that it is rather robust and universal. It could even be applied to rating systems that don't predict outcome probabilities.
Rémi
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Bayeselo for players of time-varying strength
Yes, it is the same issue as we discussed by e-mail. But that discussion did not really solve my problem; it more or less got stuck because we seemed to differ on the fundamental justification of the formula used in BayesElo, in particular whether one should use the rating estimates that produce maximum likelihood of the observed results, or the expectation value of the ratings under the Bayesian likelihood. The issue was further obfuscated by the fact that, for the simplified case of two players, both approaches in the end produce the same expression, given another prior distribution.
To resolve the issue, we really should have an objective definition of the purpose of the rating estimates, to be able to quantify what difference measure for predicted and observed results it is we want to optimize.
Having a correct ordering is indeed a completely different matter than having correct win probabilities. I have no issue with the ordering of the programs from the ChessWar Promo. I just noticed that in the first round, when the top half plays the bottom half, the Bayesian ratings on average differed by ~500, while the result of 80 games (which should be enough to be significant) was more like they differed by ~1000. This is the compression of the rating scale you refer to.
I still have not found a satisfactory procedure to prevent this compression. (It is also a problem in my "Battle of the Goths" 10x8 Championship, where the ratings run over a wide range.) The work-around is to set the prior very low (0.1 or so), but this also makes you lose most advantages of the Bayesian method. So I am still looking for a better method.
-
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: Bayeselo for players of time-varying strength
hgm wrote: Yes, it is the same issue as we discussed by e-mail. But this discussion did not really solve my problem, but more or less got stuck because we seemed to differ in the fundamental justification of the formula used in BayesElo, in particular if one should use the rating estimates that produce maximum-likelihood of the observed results, or if one should use the expectation value of the ratings under the Bayesian likelihood.

I don't remember that discussion very well, but I would agree that the expectation value is better than maximum likelihood. The problem is that I don't know how to compute it.
Also, this is if you really must represent strength by one number. It is better to always consider the whole distribution.
Rémi
-
- Posts: 12777
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Bayeselo for players of time-varying strength
Rémi Coulom wrote: I don't remember that discussion very well, but I would agree that the expectation value is better than maximum likelihood. The problem is that I don't know how to compute it. Also, this is if you really must represent strength by one number. It is better to always consider the whole distribution.

I suggest that the solution is to run a series of experiments.
1. Make predictions based on Remi's model.
2. Make predictions based on any competing model.
3. See which predictions are more accurate.
The entire value of the Elo model is to:
1. Create rankings to separate a set of players by ability and classify them.
2. To predict (in a broad sense) what the outcome will be when players of a given known category compete against players of a different known category (but of course it will not predict the exact outcome of any single event).
The model that predicts these things best is the best model. If there is some other goal that I have not thought of, and the alternative model predicts the outcomes better, then we could consider the alternative model superior for that purpose.
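One concrete way to score step 3 above (my suggestion, not something specified in the thread) is the Brier score, the mean squared error of the predicted win probabilities against the actual results; the model with the lower score predicted better:

```python
def brier_score(predictions, outcomes):
    """Mean squared error between predicted win probabilities and
    actual results (1 = win, 0 = loss); lower is better."""
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)

# A perfect predictor scores 0.0; always predicting 0.5 scores 0.25.
```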
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Bayeselo for players of time-varying strength
This is what you would do to assess the validity of the underlying model (i.e. win / loss / draw probability vs Elo difference).
But the problem I noticed is not with the validity of the model, but with the ability of the method to extract the ratings well from a set of data that was generated according to a given model (the same model as assumed by BayesElo) with given ratings.
And for alternative extraction methods, the expected error can be determined experimentally, but it can most likely also simply be calculated, and in that case I would prefer the latter.
-
- Posts: 5287
- Joined: Thu Mar 09, 2006 9:40 am
- Full name: Vincent Lejeune
Re: Bayeselo for players of time-varying strength
Rémi Coulom wrote: Hi,
If some of you are interested in rating algorithms, here is my CG2008 paper:
http://remi.coulom.free.fr/WHR/
Title: Whole-History Rating: A Bayesian Rating System for Players of Time-Varying Strength
...

I'm curious to see the ratings from a > 1 million games database (20 games minimum per player).

Let's show the top 20 !!
C U