Advantage for White; Bayeselo (to Rémi Coulom)

hgm · Post by **hgm** » Sun Mar 04, 2012 5:33 pm

Edmund wrote: I cannot easily recalculate the Elos; I have taken the data from CCRL.

Too bad. In what form do you have the data? Is it a PGN of complete games? Would it be possible to reduce it to just player and result tags? (That would be enough to feed it to BayesElo).

Taking into account the elo-delta of the players my model predicts the game outcome on average 0.15 percent-points better.

0.15%?

How do you calculate that? In places (e.g. around +200 delta-Elo) the difference between the green curve and the data points is as much as 5 percentage points, and the statistical noise (based on the spread of the points) there is very much lower. Is it just that there are comparatively few games in those points, and a huge number of games in the points from -20 to +20 Elo?

It seems optically clear to me that you can improve the fit (i.e. reduce the 0.15% error) by scaling the ratings up by some 15-20%. I.e. use

Elo(x) = 1 / (1 + 10^(1.2*x/400))

to predict the score.
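As a rough illustration of the rescaling proposed above, the sketch below evaluates a logistic score prediction with an adjustable scale factor. The function name `expected_score` and the sign convention (d is own rating minus opponent's rating, so the score rises with d) are assumptions made for this example; the formula in the post corresponds to x = -d.

```python
def expected_score(d, scale=1.0):
    """Logistic score prediction for rating difference d (own minus opponent).

    scale=1.0 is the standard Elo curve; scale=1.2 stretches the rating
    difference by 20% before applying it, as suggested in the post above.
    """
    return 1.0 / (1.0 + 10.0 ** (-scale * d / 400.0))


if __name__ == "__main__":
    for d in (0, 100, 200, 400):
        plain = expected_score(d)
        steep = expected_score(d, scale=1.2)
        print(f"d={d:4d}  standard={plain:.3f}  scaled={steep:.3f}")
```

Scaling by 1.2 is the same as dividing by about 333 instead of 400, so the scaled curve rises more steeply around d = 0.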

Rémi Coulom · Post by **Rémi Coulom** » Sun Mar 04, 2012 6:13 pm

hgm wrote:
Rémi Coulom wrote: Even for one-dimensional models, I can imagine distributions of player ratings that have no bias in predicting the winning frequency, but produce poor predictions.
Not sure what you mean by that. What else is there to predict on a game than the winning frequency? Do you mean it might predict the winning frequency against a group of players, but not against the individual players of that group?

I mean that it is nice to have an unbiased estimator of the probability of winning, but it does not necessarily produce the best predictions. For instance, if you have a formula that produces an unbiased estimate of the probability of winning as a function of the rating difference between players, you might still beat the quality of prediction of that unbiased estimator by using another model that takes the mean rating of players as an additional parameter. Since it seems that the probability of draws increases with rating, that more advanced model might produce better predictions than your simple unbiased model.

So a model can be unbiased, and still can be improved in terms of prediction quality.

Prediction quality should be measured on data that were not used for computing the ratings. It can be measured by the average log-probability of results, for instance.
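A minimal sketch of the average log-probability measure mentioned above. The function name, the dictionary layout of the predictions and the sample numbers are all hypothetical; the only point is that each game contributes the logarithm of the probability the model assigned to the result that actually occurred.

```python
import math


def mean_log_probability(predictions, results):
    """Average log-probability assigned to the results that actually occurred.

    predictions: list of dicts mapping "1-0", "1/2-1/2", "0-1" to probabilities
    results:     list of result strings, one per game, taken from games that
                 were NOT used to compute the ratings
    Values closer to 0 indicate better predictions.
    """
    if len(predictions) != len(results):
        raise ValueError("need exactly one prediction per game")
    total = 0.0
    for probs, result in zip(predictions, results):
        # math.log raises ValueError if the model gave this result probability 0
        total += math.log(probs[result])
    return total / len(results)


if __name__ == "__main__":
    # Made-up predictions for three held-out games.
    preds = [
        {"1-0": 0.50, "1/2-1/2": 0.30, "0-1": 0.20},
        {"1-0": 0.25, "1/2-1/2": 0.45, "0-1": 0.30},
        {"1-0": 0.70, "1/2-1/2": 0.20, "0-1": 0.10},
    ]
    outcomes = ["1-0", "1/2-1/2", "1-0"]
    print(mean_log_probability(preds, outcomes))
```

Two models can then be compared on the same held-out games: the one with the higher (less negative) average predicts better.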

Rémi

Edmund · Post by **Edmund** » Sun Mar 04, 2012 6:18 pm

I downloaded the full PGN, then used a script to strip everything away but the game_id, result and both Elos. I then imported the data into Excel, where I am manipulating it now.

Abs(Elo-delta) vs. Number of games looks like this:

hgm · Post by **hgm** » Sun Mar 04, 2012 6:27 pm

OK, I see what you mean. By adding more parameters, one can always get a better fit. I was just looking for the best possible model that only takes rating difference into account.

I agree that in general predictions should not be tested on the data they are derived from, because that makes you err in the direction of thinking that they are better than they really are. But for a very large data set it hardly matters. (Compare the N/(N-1) correction you need for a variance computed around a mean derived from the points themselves, rather than around an independently given one.)
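To put a number on "hardly matters", here is a small sketch of the N/(N-1) correction. The helper names are made up for this example; it only shows how quickly the factor approaches 1 as the number of points grows.

```python
def variance(values, mean=None):
    """Variance of values.

    If mean is None it is estimated from the values themselves and the sum
    of squares is divided by N - 1; with an independently given mean the
    sum of squares is divided by N.
    """
    n = len(values)
    if mean is None:
        m = sum(values) / n
        return sum((v - m) ** 2 for v in values) / (n - 1)
    return sum((v - mean) ** 2 for v in values) / n


def correction_factor(n):
    """Ratio N / (N - 1) between the two normalisations."""
    return n / (n - 1)


if __name__ == "__main__":
    for n in (5, 100, 100_000):
        print(n, correction_factor(n))
```

For five points the factor is 1.25; for a hundred thousand points it differs from 1 by about one part in 100,000.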

In that light, what do you think of the fact that the data points seem to stipulate a steeper-rising curve than the green logistic on which they are supposed to be based? Is this an indication that BayesElo's default approach is not the optimal way to extract the ratings?

Rémi Coulom · Post by **Rémi Coulom** » Sun Mar 04, 2012 6:32 pm

hgm wrote: In that light, what do you think of the fact that the data points seem to stipulate a steeper-rising curve than the green logistic on which they are supposed to be based? Is this an indication that BayesElo's default approach is not the optimal way to extract the ratings?

I am not really sure, but the steeper-rising curve might be an effect of the prior.

Rémi

Adam Hair · Post by **Adam Hair** » Sun Jul 01, 2012 1:00 am

hgm wrote: It is interesting that the Gaussian seems to give a better fit, despite the fact that the ratings were derived using the logistic. (Now hope the logistic doesn't give a better fit on ratings derived with the Gaussian...) It could be that for this large data set based mostly on low delta-Elo data the obtained ratings are not very sensitive to the model used.

What worries me is that the empirical data seems steeper than the curve of the model from which they were derived. That means that Elo differences you have to put into the logistic formula to get the true score percentage are larger than those spit out by BayesElo. In other words, BayesElo systematically underestimates rating differences, compressing the rating scale.

I wonder if this is an artifact caused by the prior. Do you calculate the ratings for this data set yourself? If so, could you recalculate them using a smaller prior (e.g. 0.1 instead of the standard 2.0)?

I think that the compression is due more to the use of the default eloAdvantage and eloDraw values.

For the CCRL 40/40 database that Edmund was studying, the Elo ratings were computed using the default values (prior=2; eloAdvantage=32.8; eloDraw=97.3). The spread (Elo max - Elo min) of the Elo values is 1392. If, instead of prior=2, prior is set to 0.1, then the spread is 1400, a 0.57% difference. However, if eloAdvantage and eloDraw are computed from the database (eloAdvantage=32.967, not much difference; eloDraw=137.113, a big difference), then the spread with prior=2 becomes 1498, which is a 7.6% difference. Setting prior=0.1 as well makes the spread 1505 (8.1%).
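The percentages above can be checked with a few lines of arithmetic. This sketch only reuses the spreads quoted in the post; the helper name `percent_change` and the dictionary labels are made up for the example.

```python
def percent_change(new, reference):
    """Relative change of new with respect to reference, in percent."""
    return 100.0 * (new - reference) / reference


BASELINE_SPREAD = 1392  # prior=2 with default eloAdvantage and eloDraw

SPREADS = {
    "prior=0.1, default parameters": 1400,
    "prior=2, computed parameters": 1498,
    "prior=0.1, computed parameters": 1505,
}

if __name__ == "__main__":
    for label, spread in SPREADS.items():
        change = percent_change(spread, BASELINE_SPREAD)
        print(f"{label}: spread {spread}, {change:.2f}% wider")
```

Changing the prior alone moves the spread by well under 1%, while switching to the computed eloAdvantage and eloDraw moves it by more than 7%.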

Adam Hair · Post by **Adam Hair** » Sun Jul 01, 2012 5:48 am

Edmund wrote: Thanks to hgm and lucas for the suggestions.

Below you find updated graphs representing the same data.
1) bin-size is 4 elo-points
2) minimum bin-size is 4 samples
3) in the elo-delta graph I shifted all models by 20 elo points to compensate for the white to move advantage
4) I added the cdf of the normal distribution with sd=250
5) I added the function hgm suggested to estimate draws scaling by 40/25

Agreed, the gauss function is a better fit than either the linear function or the logistic function.
Looking at the new graphs I am not so sure about hgm's suggestion regarding the progression of the avg-elo score function. You are right that the next step is to take elo-delta into the equation.

I have replicated Edmund's methods with the CCRL 40/4 database. I too used 4-Elo bins. I do have 4 bins with fewer than 4 samples, but I have 294 bins with a mean of 2946 samples per bin, and only 9 bins in total with fewer than 10 samples. With the weighted regression, I judge that these 4 bins have little effect. I shifted the models by 26.973 (which is the computed eloAdvantage). I have also compared the logistic model with a Gaussian cdf, though I have been forced to use an approximation, because the site I am using to perform the regressions does not recognize erf(). However, the approximation is quite good (1/(1 + exp(-0.07056*(X**3)-1.5976*X))).
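The quality of that approximation can be checked against the exact Gaussian cdf wherever erf() is available. The sketch below does this with Python's standard `math.erf`; the function names and the scan range of -5 to 5 standard deviations are choices made for this example.

```python
import math


def normal_cdf(x):
    """Exact standard normal cdf, expressed through erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_cdf_approx(x):
    """Logistic-style approximation quoted in the post above."""
    return 1.0 / (1.0 + math.exp(-0.07056 * x ** 3 - 1.5976 * x))


def max_abs_error(lo=-5.0, hi=5.0, steps=2001):
    """Largest absolute difference between the two on an evenly spaced grid."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        worst = max(worst, abs(normal_cdf(x) - normal_cdf_approx(x)))
    return worst


if __name__ == "__main__":
    print("largest absolute error on [-5, 5]:", max_abs_error())
```

On a 0 to 100 score scale, an absolute error below 0.001 corresponds to less than a tenth of a percentage point.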

This is White Score versus Average Elo:

The regression equation is White Score = 46.27 + 0.00285 Average Elo

This is Draw Ratio versus Average Elo:

The regression equation is Draw Ratio = -17.23 + 0.0179 Average Elo. I checked to see if the outliers exerted much leverage on the regression line by trimming the points outside the interval (1900, 3100). The confidence interval for the slope of the regression line for the trimmed data includes 0.0179. Therefore, given the weighted data, I believe that the draw ratio regression line is not affected much by the outliers.
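For readers who want to try the same kind of check, here is a minimal sketch of a weighted straight-line fit. It assumes each bin is weighted by its number of games, which the post does not state explicitly, and the bin values in the demo are synthetic (generated from the quoted line plus one made-up outlier), not CCRL data.

```python
def weighted_linear_fit(xs, ys, weights):
    """Return (intercept, slope) minimising sum(w * (y - a - b * x) ** 2)."""
    w_sum = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(weights, ys)) / w_sum
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


if __name__ == "__main__":
    # Synthetic bins: average Elo, draw ratio in percent, games per bin.
    avg_elo = [1700, 2200, 2500, 2800, 3100]
    draw_ratio = [40.0] + [-17.23 + 0.0179 * e for e in avg_elo[1:]]
    games = [4, 3000, 5000, 3000, 500]

    print("all bins:    ", weighted_linear_fit(avg_elo, draw_ratio, games))
    print("outlier gone:", weighted_linear_fit(avg_elo[1:], draw_ratio[1:], games[1:]))
```

With weights like these, a bin holding a handful of games barely moves the fitted slope, which is the effect described for the sparsely populated bins.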

Here is Draw Ratio vs Elo Delta (Elo Diff):

Finally, here is White Score vs Elo Delta:

The logistic equation is 100/(1+10^(-(x+26.973)/~380)).

This is the data with the Gaussian model:

sd=278.18. The approximation used was 100/(1 + exp((-0.07056*(((X+26.973)/278.18)**3)-1.5976*((X+26.973)/278.18))))

The two equations model the data equally well. The logistic model is compressed ~5%. I believe this is related to the fact that the eloDraw computed from the data is 102.647, which is slightly higher than the default value.
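To see how close the two fitted curves are, the sketch below evaluates both using the constants quoted in this post. The function names are made up, and the sign of the logistic exponent is chosen here so that White's score rises with X, matching the Gaussian approximation.

```python
import math

SHIFT = 26.973       # computed eloAdvantage for the 40/4 data
LOGISTIC_DIV = 380   # approximate divisor quoted for the logistic fit
GAUSS_SD = 278.18    # standard deviation quoted for the Gaussian fit


def logistic_score(x):
    """White score in percent from the fitted logistic curve."""
    return 100.0 / (1.0 + 10.0 ** (-(x + SHIFT) / LOGISTIC_DIV))


def gaussian_score(x):
    """White score in percent from the approximated Gaussian cdf."""
    z = (x + SHIFT) / GAUSS_SD
    return 100.0 / (1.0 + math.exp(-0.07056 * z ** 3 - 1.5976 * z))


if __name__ == "__main__":
    for x in (-400, -200, 0, 200, 400):
        print(f"X={x:5d}  logistic={logistic_score(x):6.2f}  "
              f"gaussian={gaussian_score(x):6.2f}")
    # A divisor of 380 instead of the nominal 400 is the ~5% compression.
    print("compression:", 400 / LOGISTIC_DIV - 1)
```

The ratio 400/380 is about 1.053, which is where the figure of roughly 5% comes from.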

I believe my next step is to recompute the Elo ratings for the 40/40 database and examine the cause for its compression. It is compressed noticeably more than the 40/4 ratings. Any results will be more definitive with this database. As I said in the previous post, I believe the cause is the use of the default eloDraw value (the default eloAdvantage causes the shift). It will take me a day or two to do this. There is one part of the data extraction that I have to do by hand, and it took several hours to do this for the 40/4 database.

mcostalba · Post by **mcostalba** » Sun Jul 01, 2012 9:17 am

hgm wrote: A second point is that the draw probability vs Elo-difference looks like a parabola, i.e. it seems indeed proportional to the product score(deltaE)*(1-score(deltaE)). This indeed justifies the analysis of BayesElo, where a single draw is equivalent to one win + one loss (i.e. counts as 2 games).

I am not an expert but it seems to me there is an asymmetry in this graph.

It is as if the draw ratio quickly decreases when White is stronger, whereas if Black is stronger the draw ratio stays around the maximum for longer.

It is as if a weaker player with the white pieces is able to hold the draw better than in the opposite case.

IOW, draw probability vs Elo-difference is correlated with the stronger player's color.
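One possible contributor to this asymmetry can be sketched with the win/draw/loss model as BayesElo is commonly described; the exact form used below is an assumption of this example, as are the function names. The default eloAdvantage and eloDraw are the values quoted earlier in the thread. In this model the draw curve is symmetric around a rating difference of -eloAdvantage rather than around 0, so it looks lopsided when plotted against the plain Elo difference.

```python
ELO_ADVANTAGE = 32.8  # BayesElo default quoted earlier in the thread
ELO_DRAW = 97.3       # BayesElo default quoted earlier in the thread


def f(delta):
    """Logistic curve: probability of beating an opponent rated delta higher."""
    return 1.0 / (1.0 + 10.0 ** (delta / 400.0))


def outcome_probabilities(elo_white, elo_black,
                          advantage=ELO_ADVANTAGE, draw=ELO_DRAW):
    """Return (P(white wins), P(draw), P(black wins)) under the assumed model."""
    p_white = f(elo_black - elo_white - advantage + draw)
    p_black = f(elo_white - elo_black + advantage + draw)
    return p_white, 1.0 - p_white - p_black, p_black


if __name__ == "__main__":
    for diff in (-200, -100, 0, 100, 200):
        _, p_draw, _ = outcome_probabilities(2500 + diff, 2500)
        print(f"white - black = {diff:5d}  P(draw) = {p_draw:.3f}")
```

In this model the draw probability is also exactly proportional to the product of the win and loss probabilities, which is the sense in which a draw counts as one win plus one loss.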

Adam Hair · Post by **Adam Hair** » Sun Jul 01, 2012 10:47 pm

The following is a graph of the CCRL 40/40 data, prepared exactly as Edmund did. However, I have recomputed the ratings to take into account the eloAdvantage and eloDraw as computed from these games. I left prior equal to 2.

The equation for the logistic model seen in the graph is:

White Score = 100/(1+10^(-1.028*(X+32.976)/400))

where eloAdvantage = 32.976. Also, R²=0.972 for this model.

I guess if the prior was adjusted down, the 2.8% compression might completely go away. However, I believe that this shows that the discrepancy between the CCRL Elo ratings and the Bayeselo model is due to the fact that the default values for eloAdvantage and eloDraw are used to compute the ratings. If the computed values are used (which act like location and scale parameters for White Score), the resulting logistic model matches the data quite well (as it should).
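A rough way to see why eloDraw can act like a scale parameter is to compute the expected White score (win plus half a draw) for one fixed rating difference under two eloDraw values. The model form is the one usually attributed to BayesElo and is an assumption of this sketch, as is the function name `white_score`; the parameter values are the defaults and the computed 40/40 values quoted in the thread.

```python
def f(delta):
    """Logistic curve: probability of beating an opponent rated delta higher."""
    return 1.0 / (1.0 + 10.0 ** (delta / 400.0))


def white_score(delta, advantage, draw):
    """Expected White score for delta = white Elo minus black Elo."""
    p_white = f(-delta - advantage + draw)
    p_black = f(delta + advantage + draw)
    p_draw = 1.0 - p_white - p_black
    return p_white + 0.5 * p_draw


if __name__ == "__main__":
    delta = 200
    default = white_score(delta, advantage=32.8, draw=97.3)
    computed = white_score(delta, advantage=32.976, draw=137.113)
    print(f"default eloDraw:  {default:.4f}")
    print(f"computed eloDraw: {computed:.4f}")
```

With the larger eloDraw, the same rating difference predicts a smaller score advantage, so reproducing the same observed scores requires larger rating differences. That is consistent with the wider rating spread reported when the computed values are used.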

Edmund · Post by **Edmund** » Sun Jul 01, 2012 11:55 pm

Interesting follow-up, Adam. Thanks for sharing.

I never played around much with the settings of bayeselo, so I wasn't aware what could be achieved by changing the parameters and recalculating the Elo values.

You also showed that eloadvantage and elodraw are correlated with absolute elo, so the whole model could benefit from making these parameters dynamic.

Is CCRL planning to change its model to use your new adjustment parameters?
