Ordo vs. Bayeselo

Modern Times
Posts: 3706
Joined: Thu Jun 07, 2012 11:02 pm

Re: Ordo vs. Bayeselo

Post by Modern Times »

Daniel Shawul wrote:
Modern Times wrote:
Daniel Shawul wrote: Kai, I think you simply don't want to accept the need for scaling. Here is my post trying to convince CCRL to go back to using a calculated scale.
CCRL are dropping "scale 1" and going back to default scaling.
Thanks Ray for confirming that.

God knows if it is the right thing to do or not. I'm not qualified to judge myself, and even the experts don't agree.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Ordo vs. Bayeselo

Post by Michel »

I see a problem with the "default" ratings. A 200-point difference in BayesElo default ratings does not predict a 75% performance
Stop spouting this nonsense. I just did the calculation for drawelo=200 (fairly large) and the default scale 0.730126. One then obtains that a 200-point rating difference computed by BayesElo corresponds to a score of 0.7717, which for the logistic distribution corresponds to 210 Elo, which is close enough for such a high Elo difference.

For smaller Elo differences or smaller values of drawelo, the differences between BayesElo and the logistic are almost invisible.

I added a graph of the Elo/score relation for default-scaled BayesElo (red) and the logistic (blue) for drawelo=200.

http://hardy.uhasselt.be/Toga/elo_score.png


Now I am wondering why I am wasting all this precious time of mine on you, since I already said that the logistic and the scaled BayesElo performance graphs are almost the same... You simply refuse to read.
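
For anyone who wants to check the arithmetic, here is a minimal Python sketch of how I read the BayesElo draw model (wins and losses as logistic curves shifted by drawelo, draws taking the remainder; this is my own reconstruction, not Rémi's code), and it reproduces the 0.7717 figure:

Code:
from math import log10

def logistic(d):
    # Expected score under the plain logistic model for a rating difference d.
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def bayeselo_score(d_internal, drawelo):
    # Expected score for an *internal* (unscaled) BayesElo rating difference.
    p_win = logistic(d_internal - drawelo)
    p_loss = logistic(-d_internal - drawelo)
    p_draw = 1.0 - p_win - p_loss
    return p_win + 0.5 * p_draw

drawelo, scale = 200.0, 0.730126
score = bayeselo_score(200.0 / scale, drawelo)   # 200 scaled Elo -> internal Elo
print(score)                                     # ~0.7717
print(400.0 * log10(score / (1.0 - score)))      # ~211.6, i.e. roughly 210 logistic Elo
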
Last edited by Michel on Mon Oct 01, 2012 5:35 pm, edited 1 time in total.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

Modern Times wrote:
Daniel Shawul wrote:
Modern Times wrote:
Daniel Shawul wrote: Kai, I think you simply don't want to accept the need for scaling. Here is my post trying to convince CCRL to go back to using a calculated scale.
CCRL are dropping "scale 1" and going back to default scaling.
Thanks Ray for confirming that.

God knows if it is the right thing to do or not. I'm not qualified to judge myself, and even the experts don't agree.
Then why did you change it? Indeed you are not qualified to judge. Let the author of the tool make the judgement; surely he is the expert on his own tool, no? Also, one would expect CCRL to have stayed loyal, since one of the points of the CEGT-CCRL split was to use BayesElo efficiently. CEGT still uses EloStat. I for one think you made the right decision just now, i.e. if you don't wobble and go back again...
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

I respect Kai, but he is deliberately ignoring facts. So it is indeed better not to waste time arguing about this. People want to see what they want to see...
Modern Times
Posts: 3706
Joined: Thu Jun 07, 2012 11:02 pm

Re: Ordo vs. Bayeselo

Post by Modern Times »

BayesElo is a better tool than EloStat. There is no reason whatsoever to use EloStat instead of BayesElo.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ordo vs. Bayeselo

Post by Laskos »

Michel wrote:
I see a problem with the "default" ratings. A 200-point difference in BayesElo default ratings does not predict a 75% performance
Stop spouting this nonsense. I just did the calculation for drawelo=200 (fairly large) and the default scale 0.730126. One then obtains that a 200-point rating difference computed by BayesElo corresponds to a score of 0.7717, which for the logistic distribution corresponds to 210 Elo, which is close enough for such a high Elo difference.

For smaller Elo differences or smaller values of drawelo, the differences between BayesElo and the logistic are almost invisible.

I added a graph of the Elo/score relation for default-scaled BayesElo (red) and the logistic (blue) for drawelo=200.

http://hardy.uhasselt.be/Toga/elo_score.png


Now I am wondering why I am wasting all this precious time of mine on you, since I already said that the logistic and the scaled BayesElo performance graphs are almost the same... You simply refuse to read.
I don't know what you are babbling about there. Adam shows a compression of 10-30% on ranges of 800 Elo points or larger. You show a compression of (212-191)/191 ~ 10% on the 200 Elo point range, a quantity which is still large. My claim is that in CCRL ratings based on BayesElo "default", the difference between Crafty and Houdini is compressed by at least 40 Elo points. A similar compression occurs with EloStat, but for different reasons. Ordo correctly predicts a larger difference between Crafty and Houdini. I hope you will not waste your time showing that the ~10% compression, shown even by you on a smaller range, is irrelevant.

Kai
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ordo vs. Bayeselo

Post by Laskos »

By the way, your plot is a bit misleading: it makes it seem that the difference in the tails vanishes. Plot log(1-score) vs. rating for the right tail to see better that it does not; in fact, the tails are even more distorted in the ratings.
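
Something along these lines (a rough sketch using the same toy draw model and the drawelo=200 numbers from above, not the CCRL data) makes the right tail visible on a log scale:

Code:
from math import log10

def logistic(d):
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def bayeselo_score(d_internal, drawelo):
    # Win/loss logistics shifted by drawelo, draws from the remainder.
    p_win = logistic(d_internal - drawelo)
    p_loss = logistic(-d_internal - drawelo)
    return p_win + 0.5 * (1.0 - p_win - p_loss)

drawelo, scale = 200.0, 0.730126
for d in (200, 400, 600, 800, 1000):
    tail_scaled = 1.0 - bayeselo_score(d / scale, drawelo)   # scaled BayesElo right tail
    tail_logistic = 1.0 - logistic(d)                        # logistic right tail
    print(d, round(log10(tail_scaled), 3), round(log10(tail_logistic), 3))

The two columns drift further apart as the rating difference grows, which is the distortion I mean.
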

Kai
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Ordo vs. Bayeselo

Post by Michel »

You show a compression of (212-191)/191 ~ 10% on the 200 Elo point range,
No, it is (210-200)/200 = 5%, in a situation with a high drawelo (much higher than what would apply to CCRL).

5% is just noise. The Elo model is only approximate; there is no true Elo. It is expected that different ways of estimating Elo will produce slightly different results, since the underlying model is simply incomplete.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Ordo vs. Bayeselo

Post by Michel »

Kai: to reply to your comment on the tails:

Assuming the default drawelo of 97.3 and the default scale 0.925497, one finds that an Elo difference of 1000 in the logistic gives the same score as 950 in scaled BayesElo.

Still only 5%. For larger values of drawelo the difference becomes somewhat larger.
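
As a check, here is a small sketch with the same toy model as before; the 4x/(1+x)^2 factor is my reconstruction of where the default scale comes from (it makes the scaled curve's slope match the logistic at a zero rating difference, and it reproduces both 0.925497 and 0.730126):

Code:
def logistic(d):
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def bayeselo_score(d_internal, drawelo):
    p_win = logistic(d_internal - drawelo)
    p_loss = logistic(-d_internal - drawelo)
    return p_win + 0.5 * (1.0 - p_win - p_loss)

def default_scale(drawelo):
    # Slope-matching factor so that scaled BayesElo and the logistic agree
    # to first order around a zero rating difference.
    x = 10.0 ** (drawelo / 400.0)
    return 4.0 * x / (1.0 + x) ** 2

drawelo = 97.3
scale = default_scale(drawelo)
print(scale)                                     # ~0.925497
print(bayeselo_score(950.0 / scale, drawelo))    # ~0.99686
print(logistic(1000.0))                          # ~0.99685, so 950 scaled ~ 1000 logistic
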

Now note that pitting a 1200 engine A against a 2200 engine B and playing a million games is not a realistic testing scenario.

Such high rating differences are measured through intermediate engines. The matches played by the intermediate engines will give higher log-likelihood contributions than the matches played between A and B (taking into account the expected W/D/L), so they will dominate the final rating computation. The difference in the tails between the logistic and scaled BayesElo should therefore be less important than it appears to be.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ordo vs. Bayeselo

Post by Laskos »

Michel wrote:
You show a compression of (212-191)/191 ~ 10% on the 200 Elo point range,
No, it is (210-200)/200 = 5%, in a situation with a high drawelo (much higher than what would apply to CCRL).

5% is just noise. The Elo model is only approximate; there is no true Elo. It is expected that different ways of estimating Elo will produce slightly different results, since the underlying model is simply incomplete.
Did you take 75% or 200 points as the starting point? 75% is 191 points; 77.17% is 212 points, so even if you took 200 it's 6%. If you took 75%, then it's 11%. Not exactly noise. For drawelo higher than the default, as happens with CCRL databases, the differences in the tails will be much higher than 5%, I guess 10-20% or so. It is true that intermediate results give larger contributions, and I guess that a 10% compression is reasonable (I compared with Ordo ratings and with simple examples which I can compute myself). I am glad that the argument is now reduced to whether it is a 5% effect or a 10% one; I was getting pretty sick of seeing you and Daniel completely dismiss the problem. As for the model being incomplete, that is true, but Adam shows that a logistic with a significantly smaller constant than 400 fits the BayesElo predictions better, so it is a systematic error of BayesElo "default" and not a property of the engine-engine matches. It may well be that the true fit is even Gaussian, but that is not the point.
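
For the record, the score-to-Elo conversion I am using is just the plain logistic inverse, nothing else assumed:

Code:
from math import log10

def logistic_elo(score):
    # Logistic Elo difference corresponding to a given expected score.
    return 400.0 * log10(score / (1.0 - score))

print(logistic_elo(0.75))      # ~190.8 -> the "75% = 191 points" figure
print(logistic_elo(0.7717))    # ~211.6 -> the "77.17% = 212 points" figure
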

Kai