Advantage for White; Bayeselo (to Rémi Coulom)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Advantage for White; Bayeselo (to Rémi Coulom)

Post by Edmund »

Inspired by Daniel's topic "why chess has small chance of being a black win?" I conducted some tests on the correlation between the White advantage and the average Elo of both players.

As test data I used the current CCRL 40/40 database. After removing all games with incomplete Elo information, I was left with 394860 games.

Using the formulas from Bayeselo
f(Delta) = 1 / (1 + 10^(Delta/400))
P(WhiteWins) = f(eloBlack - eloWhite - eloAdvantage + eloDraw)
P(BlackWins) = f(eloWhite - eloBlack + eloAdvantage + eloDraw)
P(Draw) = 1 - P(WhiteWins) - P(BlackWins)
I searched for the maximum-likelihood values for eloAdvantage and eloDraw.
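For anyone wanting to reproduce this, the model and its likelihood can be sketched in a few lines of Python (a minimal sketch; the game-tuple format and function names are mine, not Bayeselo's):

```python
import math

def f(delta):
    # Bayeselo logistic: maps an Elo gap to a probability
    return 1.0 / (1.0 + 10.0 ** (delta / 400.0))

def outcome_probs(elo_white, elo_black, elo_advantage, elo_draw):
    # The three outcome probabilities from the Bayeselo model above
    p_white = f(elo_black - elo_white - elo_advantage + elo_draw)
    p_black = f(elo_white - elo_black + elo_advantage + elo_draw)
    return p_white, p_black, 1.0 - p_white - p_black

def neg_log_likelihood(games, elo_advantage, elo_draw):
    # games: iterable of (elo_white, elo_black, result),
    # result in {1, 0, 0.5}; minimize this over the two parameters
    nll = 0.0
    for ew, eb, result in games:
        pw, pb, pd = outcome_probs(ew, eb, elo_advantage, elo_draw)
        p = pw if result == 1 else pb if result == 0 else pd
        nll -= math.log(max(p, 1e-12))
    return nll
```

Any generic optimizer (grid search, Nelder-Mead, etc.) over (eloAdvantage, eloDraw) then gives the maximum-likelihood fit.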

According to the webpage the default-values for Bayeselo are
eloAdvantage = 32.8
eloDraw = 97.3
In my tests I found a significant correlation with the average Elo of the two players.
I am getting the best fit with:
eloDraw = avg * 0.096 - 135
eloAdvantage = avg * 0.0108 - 2.4

In other words, with increasing level of play the draw rate increases (no new information really; see http://kirill-kryukov.com/chess/kcec/draw_rate.html), but the advantage of moving first increases as well.
Another interesting finding was that with increasing level of play the probability of White scoring (= P(White wins) + 0.5 * P(draw)) increases. Regarding Daniel's topic, I see this as an indication that chess is not a win for Black. Extrapolating the trends suggests that at an Elo of 8000, 94.5% of games will end in a draw, 4% will be won by White and the remaining 1.5% by Black.

Rémi Coulom, is Bayeselo still under development? If so, I would suggest improving its output by taking the level of play of the opponents into account. Especially for rating lists like CCRL the effect would be significant: at the low end the average Elo of the two engines is 1945 (with an eloDraw of 51.72), and at the high end it is 3280.5 (with an eloDraw of 179.928).
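As a sanity check, the fitted linear formulas reproduce the two endpoint values just quoted (a minimal sketch; the function names are mine):

```python
def elo_draw(avg):
    # Edmund's fitted linear model for eloDraw
    return avg * 0.096 - 135

def elo_advantage(avg):
    # Edmund's fitted linear model for eloAdvantage
    return avg * 0.0108 - 2.4

# low and high ends of the CCRL 40/40 list
print(elo_draw(1945.0), elo_draw(3280.5))
print(elo_advantage(1945.0), elo_advantage(3280.5))
```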
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by hgm »

Would it also be easy for you to test the rating model? I.e. obtain the score and draw-rate statistics as a function of Elo difference rather than Elo average, and see how well the Logistic fits them?
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by Edmund »

I plotted two graphs for you.

The first displays draw ratio and White score against average Elo. For each data point (average Elo) an average of 163 games were played (min=1; max=757), which explains the few outliers at 0%, 50% and 100%. The two lines indicate my MLE fit, where I do take the weighting for sample size into account.

The second graph displays draw ratio and White score against Elo delta. In green I have plotted the well-known Elo formula P(win) = 1/(1+10^(-delta/400)).

[Graph 1: draw ratio and White score vs. average Elo]
[Graph 2: draw ratio and White score vs. Elo delta]
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by lucasart »

Thanks Edmund. Very interesting analysis.
* P(draw) increases with Elo: fairly obvious, but it's always good to be able to quantify it.
* E(score|white) increases with Elo: that one is less obvious.
* 8000 elo extrapolation: I'm currently working on my 8000 elo engine to verify your predictions. Almost there, just 5500 elo to go ;)
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by hgm »

Thanks, Edmund! This is really interesting and important data. For one, it shows that the Sonas observation, that a linear model gives a better fit than Gaussian or Logistic, holds for computer games as well.

A second point is that the draw probability vs Elo-difference looks like a parabola, i.e. it seems indeed proportional to the product score(deltaE)*(1-score(deltaE)). This indeed justifies the analysis of BayesElo, where a single draw is equivalent to one win + one loss (i.e. counts as 2 games).
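That proportionality can be sketched as follows (a minimal illustration; the constant c is hypothetical and chosen here only so the draw rate peaks near the ~40% level visible in the graphs):

```python
def expected_score(delta):
    # logistic expected score as a function of Elo difference
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

def draw_prob(delta, c=1.6):
    # parabola model: P(draw) proportional to score * (1 - score);
    # c is an illustrative constant, not a fitted value
    s = expected_score(delta)
    return c * s * (1.0 - s)
```

The model is symmetric in the sign of the Elo difference and peaks at delta = 0, which matches the parabola-like shape of the observed draw-rate curve.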

Some remarks: it seems the horizontal resolution of your graphs is higher than needed, considering the vertical spread of the data points. It would be better to increase the bin size by at least a factor of 4 (or apply some other form of smoothing); this would reduce the vertical spread by a factor of 2, without visibly increasing the horizontal uncertainty.

Perhaps you should use extra large Elo bins in the tails, where you have only 2 or 3 data points per bin now, so that the percentages quantize to 50% or 33%, contributing nothing visible to the estimation of how the curve looks. (Perhaps you should just combine bins until they hold a certain minimum number of games, keeping track of the average Elo (difference) as well as the average score / draw rate.)
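That bin-combining idea could look like this (a minimal sketch; the data layout and minimum bin size are assumptions of mine):

```python
def adaptive_bins(games, min_games=200):
    # games: (elo_diff, white_score) pairs sorted by elo_diff,
    # with white_score in {1.0, 0.5, 0.0}.
    # Returns (mean_elo_diff, mean_score, n_games) per bin.
    bins, cur = [], []
    for g in games:
        cur.append(g)
        if len(cur) >= min_games:
            bins.append(cur)
            cur = []
    if cur:
        if bins:
            bins[-1].extend(cur)  # fold the remainder into the last bin
        else:
            bins.append(cur)
    out = []
    for b in bins:
        n = len(b)
        mean_diff = sum(d for d, _ in b) / n
        mean_score = sum(s for _, s in b) / n
        out.append((mean_diff, mean_score, n))
    return out
```

Because each reported point carries its own mean Elo difference, widening the tail bins does not distort the horizontal axis.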

I get the impression that the draw vs. average Elo graph would be better fitted by a line with a break in it around 2950, first sloping up, and then saturating at about 42%.

One would have to guard against systematic errors: the points at the high end of the average-Elo scale are likely to come from games with a low Elo difference between the participants, and from the Elo-difference graph we can see that the draw rate goes up when the Elo difference is small. So part of the increase in draw rate could be due to that. It would therefore be useful to also calculate the variance of the Elo difference as a function of average Elo. If that varies significantly across the scale, the contribution of variance to the suppression of draws (known from the second graph) could be used to correct the observed draw rate for sampling bias.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by lucasart »

hgm wrote: I get the impression that the draw vs. average Elo graph would be better fitted by a line with a break in it around 2950, first sloping up, and then saturating at about 42%.
Be careful, the green curve is effectively the logistic distribution. If you replace it with another, it must still be a differentiable function.

But yes, this shows that the logistic has excessively fat tails. Perhaps one could find a compromise between the Gaussian and the logistic...
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by hgm »

I think you are talking about the other graph. Why should the curve be differentiable, or even continuous? Player strength could very well be quantized. Especially for computer players.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by Daniel Shawul »

Hi Edmund. Great work as usual.
Edmund wrote: Another interesting finding was that with increasing level of play the Probability for White Scoring (=P(white win) + P(draw) * 0.5) increases. Regarding Daniels topic I see this as an indication that chess is not won by black. Extrapolating the trends suggest that at an Elo of 8000 94.5% of the games will end in a draw, 4% are won for white and the remaining 1.5% are won for black.
Indeed the eloAdvantage of White should not be constant. Only masters and GMs are able to exploit it well. The BayesElo of weak players playing Black may be slightly inflated as a result.
Draws really dominate chess at high Elos. I bet most of them are games where White is up something but can't really force a win. In fact, Kaufman claimed that after e4 / d4 White gets an advantage that persists through the endgame. If that advantage persists throughout, then the high draw ratio should not be due to Black finding compensation for White's advantage, but due to rules that allow you to draw even when you are at a material disadvantage.
Hex, by contrast, is really simple to prove as a white win:

Code: Select all

No draws (only one side can connect)
+
Strategy stealing (a move can only improve your position, i.e. no zugzwang)
=
white wins!
---
cheers
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by Rémi Coulom »

Thanks Edmund for your interesting data.

I certainly won't have time to modify bayeselo in the near, or even not-so-near, future. But bayeselo lets you set DrawElo and EloAdvantage, and you can also find the maximum-likelihood values for these parameters from your own data. That should be good enough when testing programs that don't differ hugely in playing strength.

Another important question is the model for the draw distribution. I tried to compare a few different models some time ago, but could not reach a clear conclusion.

Comparing frequencies with what the model predicts is interesting, but it should be considered carefully. If the frequency does not match the model, then it is a sign that the model is bad. But if it does match the model, it does not mean that the model is good, because the ratings were computed with the model in the first place.

The only really proper way to compare models is to measure the quality of predictions they make. The main problem is that this approach requires huge amounts of data in order to be able to make statistically significant comparisons.

Rémi
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by hgm »

Rémi Coulom wrote:If the frequency does not match the model, then it is a sign that the model is bad. But if it does match the model, it does not mean that the model is good, because the ratings were computed with the model in the first place.
I am not sure I buy that. The number of games is so large that splitting the data set in two and deriving ratings from one half would not give significantly different ratings. The other half of the data set would then be good enough to define the empirical curve using those derived ratings.

If these empirical frequencies then match the model prediction, the model is by definition perfect, because that was all the model was supposed to do: derive ratings that can be used to accurately predict frequencies.
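The split-half test described here could be sketched as follows (a minimal illustration; the ratings dict is assumed to come from fitting on the first half with some external tool such as bayeselo, and all names are mine):

```python
import math
import random

def f(delta):
    # Bayeselo logistic
    return 1.0 / (1.0 + 10.0 ** (delta / 400.0))

def split_half(games, seed=0):
    # Randomly split the game list into a fitting half and a test half
    games = list(games)
    random.Random(seed).shuffle(games)
    mid = len(games) // 2
    return games[:mid], games[mid:]

def log_loss(games, ratings, elo_advantage, elo_draw):
    # Mean negative log-likelihood of the held-out games under the model,
    # using ratings derived from the other half; lower is better
    total = 0.0
    for white, black, result in games:
        ew, eb = ratings[white], ratings[black]
        pw = f(eb - ew - elo_advantage + elo_draw)
        pb = f(ew - eb + elo_advantage + elo_draw)
        pd = 1.0 - pw - pb
        p = pw if result == 1 else pb if result == 0 else pd
        total -= math.log(max(p, 1e-12))
    return total / len(games)
```

Comparing this held-out log loss across candidate models (logistic, Gaussian, linear) would give the kind of prediction-quality comparison Rémi describes, without the circularity of evaluating a model on the same games its ratings came from.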