Empirically 1 win + 1 loss ~ 2 draws

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Empirically 1 win + 1 loss ~ 2 draws

Post by Laskos »

There are databases of computer chess games that could prove it too, but they are messy, each point has a large uncertainty, and the Elo span in direct matches is usually too small to probe the shorter (Rao-Kupper) or longer (Davidson) tails of the distributions. So I performed some more clinical tests, using only 4 engines in 5 runs, with 9-11 data points per run but with many games (1000 games per data point).

There are clearly two competing models:

Code: Select all

Rao-Kupper: 
d = C*w*(1 - w - d)
d -> (C*w - C*w^2)/(1 + C*w)

Code: Select all

Davidson:
d^2 = C*w*(1 - w - d)
d -> 1/2 (-C*w + Sqrt[C]*Sqrt[w]*Sqrt[4 - 4 w + C*w])
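The closed forms follow from solving each implicit equation for d (a quadratic in Davidson's case); a quick symbolic check, as a Python/sympy sketch:

Code: Select all

import sympy as sp

C, w, d = sp.symbols('C w d', positive=True)

# Rao-Kupper: d = C*w*(1 - w - d), linear in d
print(sp.solve(sp.Eq(d, C*w*(1 - w - d)), d))
# -> C*w*(1 - w)/(C*w + 1), i.e. (C*w - C*w^2)/(1 + C*w)

# Davidson: d^2 = C*w*(1 - w - d), quadratic in d; keep the positive root
print(sp.solve(sp.Eq(d**2, C*w*(1 - w - d)), d))
# -> positive root is (-C*w + sqrt(C*w*(4 - 4*w + C*w)))/2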
The model to fit the data is

Code: Select all

d^a = C*w*(1 - w - d)

where a and C are now free parameters fitted to the data points, a=1 giving the Rao-Kupper model (used in BayesElo) and a=2 the Davidson model.
The Elo span in the matches is 1000-1500 Elo points, so the tails are visible.

Code: Select all

Anchor: SF depth 6
Komodo depths:

K1                            : 1000 (+986,= 12,-  2), 99.2 % 
K2                            : 1000 (+930,= 60,- 10), 96.0 % 
K3                            : 1000 (+837,= 95,- 68), 88.4 % 
K4                            : 1000 (+663,=215,-122), 77.0 % 
K5                            : 1000 (+488,=207,-305), 59.2 % 
K6                            : 1000 (+192,=248,-560), 31.6 % 
K7                            : 1000 (+ 84,=168,-748), 16.8 % 
K8                            : 1000 (+ 18,=108,-874),  7.2 % 
K9                            : 1000 (+ 14,= 46,-940),  3.7 % 
K10                           : 1000 (+  2,= 26,-972),  1.5 %

model: d^a = C*w*(1-d-w)
Least Squares:
a = 1.955 
C = 0.426 



Anchor: IvanHoe depth 5
Komodo depths:

K1                            : 1000 (+982,= 14,-  4), 98.9 %
K2                            : 1000 (+946,= 39,- 15), 96.5 %
K3                            : 1000 (+834,= 99,- 67), 88.3 %
K4                            : 1000 (+630,=214,-156), 73.7 %
K5                            : 1000 (+440,=249,-311), 56.5 %
K6                            : 1000 (+173,=248,-579), 29.7 %
K7                            : 1000 (+ 55,=169,-776), 14.0 %
K8                            : 1000 (+ 19,=102,-879),  7.0 %
K9                            : 1000 (+  4,= 34,-962),  2.1 %

model: d^a = C*w*(1-d-w)
Least Squares:
a = 1.639 
C = 0.818



Anchor: SF depth 5
Houdini depths:

H1                            : 1000 (+907,= 79,- 14), 94.7 %
H2                            : 1000 (+815,=148,- 37), 88.9 %
H3                            : 1000 (+594,=293,-113), 74.1 %
H4                            : 1000 (+403,=375,-222), 59.1 %
H5                            : 1000 (+203,=413,-384), 40.9 %
H6                            : 1000 (+ 86,=312,-602), 24.2 %
H7                            : 1000 (+ 30,=243,-727), 15.2 %
H8                            : 1000 (+ 18,=191,-791), 11.3 %
H9                            : 1000 (+  3,= 95,-902),  5.1 %

model: d^a = C*w*(1-d-w)
Least Squares:
a = 1.941 
C = 1.771



Anchor: Komodo depth 5
Houdini depths:

H1                            : 1000 (+928,= 67,-  5), 96.2 %
H2                            : 1000 (+878,=106,- 16), 93.1 %
H3                            : 1000 (+724,=211,- 65), 83.0 %
H4                            : 1000 (+524,=301,-175), 67.5 %
H5                            : 1000 (+310,=396,-294), 50.8 %
H6                            : 1000 (+139,=385,-476), 33.1 %
H7                            : 1000 (+ 54,=327,-619), 21.8 %
H8                            : 1000 (+ 22,=252,-726), 14.8 %
H9                            : 1000 (+  5,=113,-882),  6.2 %

model: d^a = C*w*(1-d-w)
Least Squares:
a = 2.476
C = 0.931



Anchor: Houdini depth 5
SF depths:

S1                            : 1000 (+925,= 72,-  3), 96.1 %
S2                            : 1000 (+883,=109,-  8), 93.8 %
S3                            : 1000 (+762,=208,- 30), 86.6 %
S4                            : 1000 (+620,=311,- 69), 77.5 %
S5                            : 1000 (+405,=394,-201), 60.2 %
S6                            : 1000 (+241,=380,-379), 43.1 %
S7                            : 1000 (+117,=315,-568), 27.5 %
S8                            : 1000 (+ 41,=235,-724), 15.8 %
S9                            : 1000 (+  5,=143,-852),  7.6 %
S10                           : 1000 (+  7,= 72,-921),  4.3 %
S11                           : 1000 (+  0,= 49,-951),  2.5 %

model: d^a = C*w*(1-d-w)
Least Squares:
a = 2.167
C = 1.457
We have 5 values for a: {1.955, 1.639, 1.941, 2.476, 2.167}, with an average of a=2.036, very close to the Davidson model's a=2. Rao-Kupper (a=1) is ruled out.
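For reproducibility, here is a minimal Python sketch of the fit, assuming a plain least-squares minimization of the implicit residual d^a - C*w*l (the exact residual to minimize is a choice; the data below are from the first run, which reported a = 1.955, C = 0.426):

Code: Select all

import numpy as np
from scipy.optimize import least_squares

# (wins, draws, losses) per 1000 games, from the "Anchor: SF depth 6" run
wdl = np.array([
    (986,  12,   2), (930,  60,  10), (837,  95,  68), (663, 215, 122),
    (488, 207, 305), (192, 248, 560), ( 84, 168, 748), ( 18, 108, 874),
    ( 14,  46, 940), (  2,  26, 972),
], dtype=float) / 1000.0
w, d, l = wdl[:, 0], wdl[:, 1], wdl[:, 2]

def residuals(params):
    a, C = params
    return d**a - C * w * l   # implicit form of d^a = C*w*(1 - w - d)

fit = least_squares(residuals, x0=[2.0, 1.0])
a_hat, C_hat = fit.x
print(a_hat, C_hat)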

All in all, the Davidson model P(D|E) = C*Sqrt[P(W|E)*P(L|E)], which assumes that 1 win + 1 loss = 2 draws, fits the empirical data much better than the Rao-Kupper model used in BayesElo, P(D|E) = C*P(W|E)*P(L|E), where 1 win + 1 loss = 1 draw. So it's not advisable to use BayesElo for computer chess ratings. The model assumed in Ordo is 1 win + 1 loss = 2 draws, so Ordo would be the better choice for computer chess ratings.
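The "1 win + 1 loss = N draws" statement is just the algebra of the two draw models: under Davidson, P(W|E)*P(L|E) = P(D|E)^2/C^2, so swapping a (win, loss) pair for two draws changes the likelihood only by the rating-independent constant C^2; under Rao-Kupper, P(W|E)*P(L|E) = P(D|E)/C, so the pair is worth a single draw. A small numeric check with hypothetical probabilities:

Code: Select all

# Davidson: P(D) = C*sqrt(P(W)*P(L)). The likelihood ratio of a (win, loss)
# pair to a (draw, draw) pair is P(W)*P(L)/P(D)^2 = 1/C^2, independent of the
# rating difference, so both pairs shift maximum-likelihood ratings identically.
C = 0.9                                              # hypothetical draw constant
for pw, pl in [(0.7, 0.1), (0.4, 0.3), (0.2, 0.6)]:  # hypothetical probabilities
    pd = C * (pw * pl) ** 0.5
    print(pw * pl / pd**2)                           # always 1/C^2 ~ 1.2346
    # Rao-Kupper analogue: P(D) = C*P(W)*P(L) gives P(W)*P(L)/P(D) = 1/C,
    # so there a (win, loss) pair counts as one draw.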
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Empirically 1 win + 1 loss ~ 2 draws

Post by Daniel Shawul »

What the heck Kai, you basically tried to redo Remi's and my study, botched it completely, but still came to the same conclusion. You have to do a cross-correlation test to know which model is better. That is very important, because each model fits ratings calculated by itself much better than the others. So it is meaningless to use a least-squares fit, as there is no absolute true rating of engines. We compared Rao-Kupper, Davidson and Glenn-David (a model you haven't included). DV came out clearly better than RK, but not by that much over GD. Here is a link to the paper we wrote but never got published: https://dl.dropboxusercontent.com/u/552 ... tcomes.pdf
If you want to do more research, go from there and improve upon it, not redo everything and fail at that :)
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Empirically 1 win + 1 loss ~ 2 draws

Post by Laskos »

Daniel Shawul wrote:What the heck Kai, you basically tried to redo Remi's and my study, botched it completely, but still came to the same conclusion. You have to do a cross-correlation test to know which model is better. That is very important, because each model fits ratings calculated by itself much better than the others. So it is meaningless to use a least-squares fit, as there is no absolute true rating of engines. We compared Rao-Kupper, Davidson and Glenn-David (a model you haven't included). DV came out clearly better than RK, but not by that much over GD. Here is a link to the paper we wrote but never got published: https://dl.dropboxusercontent.com/u/552 ... tcomes.pdf
If you want to do more research, go from there and improve upon it, not redo everything and fail at that :)
I didn't use ratings, only draw models. What was there to botch? One model says the rate of draws is roughly proportional to the rate of wins on the left tail; the other goes with its square root. The square root is what was confirmed, and that is Davidson. Sure, using ratings, each model predicts its own results best. By the way, correlation might be misleading, since it is insensitive to an overall factor and a shift.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Empirically 1 win + 1 loss ~ 2 draws

Post by Daniel Shawul »

Laskos wrote:
Daniel Shawul wrote:What the heck Kai, you basically tried to redo Remi's and my study, botched it completely, but still came to the same conclusion. You have to do a cross-correlation test to know which model is better. That is very important, because each model fits ratings calculated by itself much better than the others. So it is meaningless to use a least-squares fit, as there is no absolute true rating of engines. We compared Rao-Kupper, Davidson and Glenn-David (a model you haven't included). DV came out clearly better than RK, but not by that much over GD. Here is a link to the paper we wrote but never got published: https://dl.dropboxusercontent.com/u/552 ... tcomes.pdf
If you want to do more research, go from there and improve upon it, not redo everything and fail at that :)
I didn't use ratings, only draw models. What was there to botch? One model says the rate of draws is roughly proportional to the rate of wins on the left tail; the other goes with its square root. The square root is what was confirmed, and that is Davidson. Sure, using ratings, each model predicts its own results best. By the way, correlation might be misleading, since it is insensitive to an overall factor and a shift.
Too many faults in your setup then: correlation, small data set, wrong methodology, and not computing ratings at all...?? The last one is especially funny, because your conclusion is about rating tools. Like I said, reporting results of a least-squares regression is meaningless. One can easily pick data that fits one model better, and in fact an overfitting model may perform badly when you later validate it on data it hasn't visited yet... That is why you need to do a cross-validation test to really compare the draw models: you compute the parameters on part of the data, and then test the model on the rest of the data, which the fitted model has no idea of (its parameters were not computed from it). We did 10-fold cross-validation tests on the huge CEGT, CCRL 40/40 and blitz data sets to reach a conclusion.
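As a sketch of the procedure (hypothetical Python, not our actual code): fit the draw constant C on nine folds, score the squared prediction error on the held-out fold, and compare the models by out-of-sample error.

Code: Select all

import numpy as np
from scipy.optimize import least_squares

# points: rows of observed (w, d, l) frequencies
def rao_kupper(params, pts):          # d = C*w*l
    (C,) = params
    return pts[:, 1] - C * pts[:, 0] * pts[:, 2]

def davidson(params, pts):            # d^2 = C*w*l
    (C,) = params
    return pts[:, 1]**2 - C * pts[:, 0] * pts[:, 2]

def cv_error(points, model, n_folds=10, seed=0):
    """Fit C on n_folds-1 folds, sum squared residuals on the held-out fold."""
    idx = np.random.default_rng(seed).permutation(len(points))
    folds = np.array_split(idx, n_folds)
    err = 0.0
    for k in range(n_folds):
        train = points[np.concatenate(folds[:k] + folds[k+1:])]
        fit = least_squares(lambda p: model(p, train), x0=[1.0])
        err += float(np.sum(model(fit.x, points[folds[k]])**2))
    return err  # lower out-of-sample error = better draw model

# usage: compare cv_error(wdl_points, davidson) vs cv_error(wdl_points, rao_kupper)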

Also, you should at least report a goodness-of-fit result (chi-square?), which I did the first time I tried this. Fix a to 2 and to 1, do the regression on C, and report the result with its goodness of fit. You can see even in the small data you have provided that the value of a (which is an exponent!!) is nowhere close to being stable. You should probably regress on a log if you haven't done so...

And another thing: the regression should not be done on the WHOLE data set like you did. Note the Delta in P(W|Delta), P(L|Delta) and P(D|Delta). The formulas and the regression are meant for two opponents at a given difference in strength. You substituted that with the average winning ratio over all the players, discarding the all-important rating difference in the process. I don't even know anymore what to make of your test result...
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Empirically 1 win + 1 loss ~ 2 draws

Post by Laskos »

Daniel Shawul wrote:
Laskos wrote:
Daniel Shawul wrote:What the heck Kai, you basically tried to redo Remi's and my study, botched it completely, but still came to the same conclusion. You have to do a cross-correlation test to know which model is better. That is very important, because each model fits ratings calculated by itself much better than the others. So it is meaningless to use a least-squares fit, as there is no absolute true rating of engines. We compared Rao-Kupper, Davidson and Glenn-David (a model you haven't included). DV came out clearly better than RK, but not by that much over GD. Here is a link to the paper we wrote but never got published: https://dl.dropboxusercontent.com/u/552 ... tcomes.pdf
If you want to do more research, go from there and improve upon it, not redo everything and fail at that :)
I didn't use ratings, only draw models. What was there to botch? One model says the rate of draws is roughly proportional to the rate of wins on the left tail; the other goes with its square root. The square root is what was confirmed, and that is Davidson. Sure, using ratings, each model predicts its own results best. By the way, correlation might be misleading, since it is insensitive to an overall factor and a shift.
Too many faults in your setup then: correlation, small data set, wrong methodology, and not computing ratings at all...?? The last one is especially funny, because your conclusion is about rating tools. Like I said, reporting results of a least-squares regression is meaningless. One can easily pick data that fits one model better, and in fact an overfitting model may perform badly when you later validate it on data it hasn't visited yet... That is why you need to do a cross-validation test to really compare the draw models: you compute the parameters on part of the data, and then test the model on the rest of the data, which the fitted model has no idea of (its parameters were not computed from it). We did 10-fold cross-validation tests on the huge CEGT, CCRL 40/40 and blitz data sets to reach a conclusion.

Also, you should at least report a goodness-of-fit result (chi-square?), which I did the first time I tried this. Fix a to 2 and to 1, do the regression on C, and report the result with its goodness of fit. You can see even in the small data you have provided that the value of a (which is an exponent!!) is nowhere close to being stable. You should probably regress on a log if you haven't done so...

And another thing: the regression should not be done on the WHOLE data set like you did. Note the Delta in P(W|Delta), P(L|Delta) and P(D|Delta). The formulas and the regression are meant for two opponents at a given difference in strength. You substituted that with the average winning ratio over all the players, discarding the all-important rating difference in the process. I don't even know anymore what to make of your test result...
That's how I did it initially: I fixed a=1 and a=2 and performed a regression on C (least squares) for the following:

Code: Select all

Rao-Kupper: 
d = C*w*(1 - w - d)
d -> (C*w - C*w^2)/(1 + C*w)

Code: Select all

Davidson:
d^2 = C*w*(1 - w - d)
d -> 1/2 (-C*w + Sqrt[C]*Sqrt[w]*Sqrt[4 - 4 w + C*w])
The better fit of Davidson was visible to the naked eye. I may produce a chi-square goodness-of-fit result, but it's pretty tedious to re-enter my data for that. As for the rest, I don't know what you are blabbering about there.
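For reference, the fixed-a regression with the closed forms above takes only a few lines; a minimal Python sketch, assuming an ordinary least-squares fit of C through d(w), with the first run's frequencies as example data:

Code: Select all

import numpy as np
from scipy.optimize import curve_fit

def d_rao_kupper(w, C):     # d = C*w*(1 - w - d) solved for d
    return (C*w - C*w**2) / (1 + C*w)

def d_davidson(w, C):       # d^2 = C*w*(1 - w - d) solved for d, positive root
    return 0.5 * (-C*w + np.sqrt(C*w*(4 - 4*w + C*w)))

# win and draw frequencies from the "Anchor: SF depth 6" run
w = np.array([.986, .930, .837, .663, .488, .192, .084, .018, .014, .002])
d = np.array([.012, .060, .095, .215, .207, .248, .168, .108, .046, .026])

for name, f in [("Rao-Kupper", d_rao_kupper), ("Davidson", d_davidson)]:
    (C,), _ = curve_fit(f, w, d, p0=[1.0], bounds=(0, np.inf))
    sse = float(np.sum((f(w, C) - d)**2))
    print(name, "C =", round(C, 3), "SSE =", round(sse, 4))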

The important thing is that I brought nothing new: you and Remi already showed that 1 win + 1 loss = 2 draws in computer chess ratings several months ago. I thought you stood by the position that it's unknown which draw model fits the data. So, do you admit that BayesElo, as it is now, uses a wrong draw model, one which doesn't fit the empirical data in computer chess?
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Empirically 1 win + 1 loss ~ 2 draws

Post by hgm »

The problem is that the draw model used by BayesElo follows in a sort of plausible way from the total rating model: the actual strength difference has to exceed a certain threshold before it turns into a win. In this case the draw probability becomes proportional to the derivative of the rating distribution.

So if you see a different relation between the win, draw and loss probabilities than the Logistic model predicts, it is likely that the Logistic model is no good. It seems that the weight of draws would be the least of your problems, in that case.
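Concretely, for the Logistic curve that threshold picture gives exactly the Rao-Kupper draw model; a small numeric sketch (t being a hypothetical draw threshold):

Code: Select all

import math

F = lambda x: 1 / (1 + math.exp(-x))  # logistic rating curve

# Threshold model: win if the realized strength difference exceeds t,
# loss if it is below -t, draw in between.
t = 0.1
for delta in (-2.0, 0.0, 1.5):
    w = F(delta - t)          # P(win)
    l = 1 - F(delta + t)      # P(loss)
    d = 1 - w - l             # P(draw)
    print(d / (w * l))        # constant (= exp(2*t) - 1): d = C*w*l, Rao-Kupper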
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Empirically 1 win + 1 loss ~ 2 draws

Post by Laskos »

hgm wrote:The problem is that the draw model used by BayesElo follows in a sort of plausible way from the total rating model: the actual strength difference has to exceed a certain threshold before it turns into a win. In this case the draw probability becomes proportional to the derivative of the rating distribution.

This "plausible way" with the derivative gives for Logisitc automatically d=C*w*l, which is empirically proven to be wrong (too bad I was too late). Davidson also uses Logistic, although I don't know what "plausible explanation" he gives for the draw model. It fits the empirical data. Also, "plausible" is Glenn-David model too. The distribution of a sum of independent and identically distributed random values has the shape of a Gaussian, so it assumes accumulation of small errors for the draw model.

hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Empirically 1 win + 1 loss ~ 2 draws

Post by hgm »

Laskos wrote:The distribution of a sum of independent and identically distributed random values has the shape of a Gaussian.
Not necessarily. E.g., not when the independent variables have a Cauchy distribution.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Empirically 1 win + 1 loss ~ 2 draws

Post by Laskos »

hgm wrote:
Laskos wrote:The distribution of a sum of independent and identically distributed random values has the shape of a Gaussian.
Not necessarily. E.g., not when the independent variables have a Cauchy distribution.
Ok, with finite variance. Cauchy has an undefined variance.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Empirically 1 win + 1 loss ~ 2 draws

Post by hgm »

Indeed. But individual errors in games do not have a limited impact. It can very well be that games in general are dominated by the worst error.