Ordo vs. Bayeselo

Modern Times
Posts: 3706
Joined: Thu Jun 07, 2012 11:02 pm

Re: Ordo vs. Bayeselo

Post by Modern Times »

Daniel Shawul wrote:
Modern Times wrote:
Daniel Shawul wrote: Kai, I think you simply don't want to accept the need for scaling. Here is my post trying to convince CCRL to go back to using a calculated scale.
CCRL are dropping "scale 1" and going back to default scaling.
Thanks Ray for confirming that.

God knows if it is the right thing to do or not. I'm not qualified to judge myself, and even the experts don't agree.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Ordo vs. Bayeselo

Post by Michel »

I see a problem with the "default" ratings. A 200-point difference in BayesElo default ratings does not predict a 75% performance
Stop spouting this nonsense. I just did the calculation for drawelo=200 (fairly large) and the default scale 0.730126. One then obtains that a 200-point rating difference computed by BayesElo corresponds to a score of 0.7717, which for the logistic distribution corresponds to 210 Elo, which is close enough for such a high Elo difference.

For smaller Elo differences or smaller values of drawelo, the differences between BayesElo and the logistic are almost invisible.

I added a graph of the Elo/score relation for default-scaled BayesElo (red) and the logistic (blue) for drawelo=200.

http://hardy.uhasselt.be/Toga/elo_score.png


Now I am wondering why I am wasting all this precious time of mine on you, since I already said that the logistic and the scaled BayesElo performance graphs are almost the same... You simply refuse to read.
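
For anyone who wants to check the arithmetic, here is a minimal Python sketch of how I read the BayesElo draw model (wins and losses as logistic curves shifted by drawelo, draws taking the remainder; this is my own reconstruction, not Rémi's code), and it reproduces the 0.7717 figure:

Code:
from math import log10

def logistic(d):
    # Expected score under the plain logistic model for a rating difference d.
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def bayeselo_score(d_internal, drawelo):
    # Expected score for an *internal* (unscaled) BayesElo rating difference.
    p_win = logistic(d_internal - drawelo)
    p_loss = logistic(-d_internal - drawelo)
    p_draw = 1.0 - p_win - p_loss
    return p_win + 0.5 * p_draw

drawelo, scale = 200.0, 0.730126
score = bayeselo_score(200.0 / scale, drawelo)   # 200 scaled Elo -> internal Elo
print(score)                                     # ~0.7717
print(400.0 * log10(score / (1.0 - score)))      # ~211.6, i.e. roughly 210 logistic Elo
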
Last edited by Michel on Mon Oct 01, 2012 5:35 pm, edited 1 time in total.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

Modern Times wrote:
Daniel Shawul wrote:
Modern Times wrote:
Daniel Shawul wrote: Kai, I think you simply don't want to accept the need for scaling. Here is my post trying to convince CCRL to go back to using a calculated scale.
CCRL are dropping "scale 1" and going back to default scaling.
Thanks Ray for confirming that.

God knows if it is the right thing to do or not. I'm not qualified to judge myself, and even the experts don't agree.
Then why did you change it? Indeed you are not qualified to judge. Let the author of the tool make the judgement; surely he is the expert on his own tool, no? Also, one would expect CCRL to have stayed loyal, since one of the points of the CEGT-CCRL split was to use BayesElo efficiently. CEGT still uses EloStat. I for one think you made the right decision just now, i.e. if you don't wobble and go back again...
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

I respect Kai, but he is deliberately ignoring facts. So it is indeed better not to waste time arguing about this. People want to see what they want to see...
Modern Times
Posts: 3706
Joined: Thu Jun 07, 2012 11:02 pm

Re: Ordo vs. Bayeselo

Post by Modern Times »

BayesElo is a better tool than EloStat. There is no reason whatsoever to use EloStat instead of BayesElo.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ordo vs. Bayeselo

Post by Laskos »

Michel wrote:
I see a problem with the "default" ratings. A 200-point difference in BayesElo default ratings does not predict a 75% performance
Stop spouting this nonsense. I just did the calculation for drawelo=200 (fairly large) and the default scale 0.730126. One then obtains that a 200-point rating difference computed by BayesElo corresponds to a score of 0.7717, which for the logistic distribution corresponds to 210 Elo, which is close enough for such a high Elo difference.

For smaller Elo differences or smaller values of drawelo, the differences between BayesElo and the logistic are almost invisible.

I added a graph of the Elo/score relation for default-scaled BayesElo (red) and the logistic (blue) for drawelo=200.

http://hardy.uhasselt.be/Toga/elo_score.png


Now I am wondering why I am wasting all this precious time of mine on you, since I already said that the logistic and the scaled BayesElo performance graphs are almost the same... You simply refuse to read.
I don't know what you are babbling about there. Adam shows a compression of 10-30% on ranges of 800 Elo points or larger. You show a compression of (212-191)/191 ~ 10% on the 200 Elo point range, a quantity which is still large. My claim is that in CCRL ratings based on BayesElo "default", the difference between Crafty and Houdini is compressed by at least 40 Elo points. A similar compression occurs with EloStat, but for different reasons. Ordo correctly predicts a larger difference between Crafty and Houdini. I hope you will not waste your time showing that the ~10% compression, shown even by you on a smaller range, is irrelevant.

Kai
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ordo vs. Bayeselo

Post by Laskos »

By the way, your plot is a bit misleading: it makes it seem that the difference in the tails vanishes. Plot log(1-score) vs. rating for the right tail to see better that it does not; in fact, the tails are even more distorted in the ratings.
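
Something along these lines (a rough sketch using the same toy draw model and the drawelo=200 numbers from above, not the CCRL data) makes the right tail visible on a log scale:

Code:
from math import log10

def logistic(d):
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def bayeselo_score(d_internal, drawelo):
    # Win/loss logistics shifted by drawelo, draws from the remainder.
    p_win = logistic(d_internal - drawelo)
    p_loss = logistic(-d_internal - drawelo)
    return p_win + 0.5 * (1.0 - p_win - p_loss)

drawelo, scale = 200.0, 0.730126
for d in (200, 400, 600, 800, 1000):
    tail_scaled = 1.0 - bayeselo_score(d / scale, drawelo)   # scaled BayesElo right tail
    tail_logistic = 1.0 - logistic(d)                        # logistic right tail
    print(d, round(log10(tail_scaled), 3), round(log10(tail_logistic), 3))

The two columns drift further apart as the rating difference grows, which is the distortion I mean.
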

Kai
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Ordo vs. Bayeselo

Post by Michel »

You show a compression of (212-191)/191 ~ 10% on the 200 Elo point range,
No, it is (210-200)/200 = 5%, in a situation with a high drawelo (much higher than what would apply to CCRL).

5% is just noise. The Elo model is only approximate; there is no true Elo. It is expected that different ways of estimating Elo will produce slightly different results, since the underlying model is simply incomplete.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Ordo vs. Bayeselo

Post by Michel »

Kai: to reply to your comment on the tails:

Assuming the default drawelo of 97.3 and the default scale 0.925497, one finds that an Elo difference of 1000 in the logistic gives the same score as 950 in scaled BayesElo.

Still only 5%. For larger values of drawelo the difference becomes somewhat larger.
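
As a check, here is a small sketch with the same toy model as before; the 4x/(1+x)^2 factor is my reconstruction of where the default scale comes from (it makes the scaled curve's slope match the logistic at a zero rating difference, and it reproduces both 0.925497 and 0.730126):

Code:
def logistic(d):
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def bayeselo_score(d_internal, drawelo):
    p_win = logistic(d_internal - drawelo)
    p_loss = logistic(-d_internal - drawelo)
    return p_win + 0.5 * (1.0 - p_win - p_loss)

def default_scale(drawelo):
    # Slope-matching factor so that scaled BayesElo and the logistic agree
    # to first order around a zero rating difference.
    x = 10.0 ** (drawelo / 400.0)
    return 4.0 * x / (1.0 + x) ** 2

drawelo = 97.3
scale = default_scale(drawelo)
print(scale)                                     # ~0.925497
print(bayeselo_score(950.0 / scale, drawelo))    # ~0.99686
print(logistic(1000.0))                          # ~0.99685, so 950 scaled ~ 1000 logistic
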

Now note that pitting a 1200 engine A against a 2200 engine B and playing a million games is not a realistic testing scenario.

Such high rating differences are measured through intermediate engines. The matches played by the intermediate engines will give higher log-likelihood contributions than the matches played between A and B (taking into account the expected W/D/L), so they will dominate the final rating computation. The difference in the tails between the logistic and scaled BayesElo should therefore be less important than it appears to be.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ordo vs. Bayeselo

Post by Laskos »

Michel wrote:
You show a compression of (212-191)/191 ~ 10% on the 200 Elo point range,
No, it is (210-200)/200 = 5%, in a situation with a high drawelo (much higher than what would apply to CCRL).

5% is just noise. The Elo model is only approximate; there is no true Elo. It is expected that different ways of estimating Elo will produce slightly different results, since the underlying model is simply incomplete.
Did you take 75% or 200 points as the starting point? 75% is 191 points; 77.17% is 212 points, so even if you took 200 it's 6%. If you took 75%, then it's 11%. Not exactly noise. For drawelo higher than the default, as happens with CCRL databases, the differences in the tails will be much higher than 5%, I guess 10-20% or so. It is true that intermediate results give larger contributions, and I guess that a 10% compression is reasonable (I compared with Ordo ratings and with simple examples which I can compute myself). I am glad that the argument is now reduced to whether it is a 5% effect or a 10% one; I was getting pretty sick of seeing you and Daniel completely dismiss the problem. As for the model being incomplete, that is true, but Adam shows that a logistic with a significantly smaller constant than 400 fits the BayesElo predictions better, so it is a systematic error of BayesElo "default" and not a property of the engine-engine matches. It may well be that the true fit is even Gaussian, but that is not the point.
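
For the record, the score-to-Elo conversion I am using is just the plain logistic inverse, nothing else assumed:

Code:
from math import log10

def logistic_elo(score):
    # Logistic Elo difference corresponding to a given expected score.
    return 400.0 * log10(score / (1.0 - score))

print(logistic_elo(0.75))      # ~190.8 -> the "75% = 191 points" figure
print(logistic_elo(0.7717))    # ~211.6 -> the "77.17% = 212 points" figure
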

Kai