SCCT Rating List - Calculation by EloStat 1.3

Laskos · Post by **Laskos** » Tue Aug 28, 2012 4:57 pm

Daniel Shawul wrote:
Laskos wrote:
Daniel Shawul wrote:Hello Adam
I didn't know you have that problem with the complete/pure list too. Well, in that case I think using scale becomes even more necessary. For other models, where the formulas don't give you Elos close to Arpad's assumption, scaling would be even more appropriate. Also, Remi preferred the use of scaling (he made it the default), so I would think using it is safer. It doesn't change anything other than the magnitude of the Elo differences.
Daniel
I don't understand how it doesn't change anything else. If 200-75% is not preserved, what is the meaning of those numbers given as rating?
When you use scale = 1, you are not preserving the 200-75% assumption.
Bayeselo has eloDraw and eloAdvantage, which you need to add to eloDelta to see the 200-75% assumption. But people want to see that with just eloDelta (i.e. when comparing two engines). That is why an arbitrary factor had to be applied, although it doesn't strictly need to be. It doesn't change anything else in the sense that the order and relative Elo differences are preserved.
Code: Select all
Rating  = scale * Original values + offset
Even the transitivity is not preserved using arbitrary scaling. I understand by Elo rating that if I pick from the rating list an engine rated 2900 and another rated 2700, then the prediction is that the engine rated 2900 will score 75% in a match against the engine rated 2700. Bayeselo default fails in its prediction, or maybe there are secret tables to derive the predictions of which I am not aware. So, tell me, with default Bayeselo, what is the prediction in % for those 2900 and 2700 rated (by Bayeselo default) engines in a match?
Kai
As I explained above bayeselo uses logistic by adding two more parameters
Code: Select all
logistic(-eloDelta - eloAdvantage + eloDraw);
So you need to add and subtract eloAdvantage and eloDraw from the difference to see the 200-75%. With another draw model that uses the logistic like this
Code: Select all
double f = thetaD * sqrt(logistic(eloDelta + eloAdvantage) * logistic(-eloDelta - eloAdvantage));
return logistic(eloDelta + eloHome) / (1 + f);
the default values you get may be even more magnified. But it is better to ask Remi; this is just my opinion from making Elo ratings on CCRL 40/40 with this model. Remi had a reason to apply the scaling by default, from what I read.

Daniel

I don't see why it's necessary to lose the % predictive power by not adjusting the scale. To have some "imponderable" 200 points difference in ratings without any meaning? Better adjust the scale and say what this 200 means as %. Setting "scale 1" I was getting results which were giving very good predictions for individual matches (in 200-75% sense), maybe you can explain why is that, taking into account DrawElo and EloAdvantage. I think Adam is wrong not to follow that "scale 1" procedure just because others either use the wrong EloStat or the default (wrong as % interpretation goes) Bayeselo. Ordo gives good predictions in the 200-75% sense, by the way.

Kai
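As a rough illustration of the question raised in this exchange, the win/draw/loss model Daniel quotes can be sketched in Python. This is a minimal sketch and not Bayeselo's source code: the function names are hypothetical, the sign convention (a positive `elo_delta` favours the first player) is an assumption, and the eloDraw value of 97.3 is only an assumed example value.

```python
def logistic(x: float) -> float:
    """Expected score for an Elo advantage of x under the plain logistic curve."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))


def win_draw_loss(elo_delta: float, elo_advantage: float, elo_draw: float):
    """Win/draw/loss probabilities for the first player.

    Assumed convention: elo_delta > 0 means the first player is rated higher,
    elo_advantage is the first player's bonus, and elo_draw widens the draw
    band. These names are illustrative, not taken from Bayeselo itself.
    """
    p_win = logistic(elo_delta + elo_advantage - elo_draw)
    p_loss = logistic(-elo_delta - elo_advantage - elo_draw)
    p_draw = 1.0 - p_win - p_loss
    return p_win, p_draw, p_loss


def expected_score(elo_delta: float, elo_advantage: float, elo_draw: float) -> float:
    """Expected score of the first player: a win counts 1, a draw counts 0.5."""
    p_win, p_draw, _ = win_draw_loss(elo_delta, elo_advantage, elo_draw)
    return p_win + 0.5 * p_draw


if __name__ == "__main__":
    plain = logistic(200.0)
    with_draws = expected_score(200.0, 0.0, 97.3)
    print(f"plain logistic, 200 points: {plain:.3f}")
    print(f"with eloDraw = 97.3:        {with_draws:.3f}")
```

In this sketch, a non-zero eloDraw makes the expected score at a 200-point eloDelta fall slightly below the plain logistic value, so an unscaled difference of 200 does not carry the usual percentage meaning by itself.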

Daniel Shawul · Post by **Daniel Shawul** » Tue Aug 28, 2012 5:19 pm

Laskos wrote:
I don't see why it's necessary to lose the % predictive power by not adjusting the scale. To have some "imponderable" 200 points difference in ratings without any meaning? Better adjust the scale and say what this 200 means as %. Setting "scale 1" I was getting results which were giving very good predictions for individual matches (in 200-75% sense), maybe you can explain why is that, taking into account DrawElo and EloAdvantage. I think Adam is wrong not to follow that "scale 1" procedure just because others either use the wrong EloStat or the default (wrong as % interpretation goes) Bayeselo. Ordo gives good predictions in the 200-75% sense, by the way.

Kai

Yes, we agree there is a dilemma. The benefit of using a calculated scaling such as 0.834 is that when you say Houdini is +50 Elo stronger than Stockfish in CCRL, it probably means the same difference in CEGT as well. With scale = 1 and a different draw model, for instance, it could be a +60 Elo difference. Use of scaling doesn't change the order, nor does it change the relative Elo differences. By the latter I mean the ratio of the differences is preserved: for ratings x, y, z and scaled ratings mx, my, mz, the ratios of rating differences between the engines are equal, (mz - my) / (my - mx) = (z - y) / (y - x). You are not going to substitute the modified ratings mx directly into the model, so it doesn't lose accuracy. It is just post-processing applied to quench the thirst of some people who want to see a 75% winning percentage for a 200 Elo difference.
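The invariance claimed here can be checked with a few lines of Python. This is only a sketch: the three raw values and the offset of 2800 are made-up numbers, the helper names are hypothetical, and 0.834 is simply the example scale mentioned in the post.

```python
import math


def rescale(ratings, scale, offset):
    """Apply rating = scale * original + offset to every entry."""
    return {name: scale * value + offset for name, value in ratings.items()}


def ranking(ratings):
    """Engine names sorted from strongest to weakest."""
    return sorted(ratings, key=ratings.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical unscaled values for three engines x, y, z.
    raw = {"x": -120.0, "y": 35.0, "z": 210.0}
    scaled = rescale(raw, scale=0.834, offset=2800.0)

    raw_ratio = (raw["z"] - raw["y"]) / (raw["y"] - raw["x"])
    scaled_ratio = (scaled["z"] - scaled["y"]) / (scaled["y"] - scaled["x"])

    print("same order:", ranking(raw) == ranking(scaled))
    print("same ratio of differences:", math.isclose(raw_ratio, scaled_ratio))
```

The order is preserved only for a positive scale; the individual differences themselves shrink or grow by the scale factor, which is exactly the part the two sides of this thread disagree about.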

Laskos · Post by **Laskos** » Tue Aug 28, 2012 5:44 pm

Daniel Shawul wrote:
I don't see why it's necessary to lose the % predictive power by not adjusting the scale. To have some "imponderable" 200 points difference in ratings without any meaning? Better adjust the scale and say what this 200 means as %. Setting "scale 1" I was getting results which were giving very good predictions for individual matches (in 200-75% sense), maybe you can explain why is that, taking into account DrawElo and EloAdvantage. I think Adam is wrong not to follow that "scale 1" procedure just because others either use the wrong EloStat or the default (wrong as % interpretation goes) Bayeselo. Ordo gives good predictions in the 200-75% sense, by the way.

Kai
Yes, we agree there is a dilemma. The benefit of using a calculated scaling such as 0.834 is that when you say Houdini is +50 Elo stronger than Stockfish in CCRL, it probably means the same difference in CEGT as well. With scale = 1 and a different draw model, for instance, it could be a +60 Elo difference. Use of scaling doesn't change the order, nor does it change the relative Elo differences. By the latter I mean the ratio of the differences is preserved: for ratings x, y, z and scaled ratings mx, my, mz, the ratios of rating differences between the engines are equal, (mz - my) / (my - mx) = (z - y) / (y - x). You are not going to substitute the modified ratings mx directly into the model, so it doesn't lose accuracy. It is just post-processing applied to quench the thirst of some people who want to see a 75% winning percentage for a 200 Elo difference.

Yes, my thirst too, folks are thinking that the Elo points they see in CCRL or CEGT lists are interpretable in fact as Elo points. Without adjusting the scale to 200-75% model, one must call them Bayeselo points and EloStat points, and what Adam is proposing is the return of CCRL to the default Bayeselo points (instead of the more or less correct Elo points using "scale 1") just for the sake of comparing many lists.

Kai

Daniel Shawul · Post by **Daniel Shawul** » Tue Aug 28, 2012 7:18 pm

Laskos wrote:
Yes, my thirst too, folks are thinking that the Elo points they see in CCRL or CEGT lists are interpretable in fact as Elo points. Without adjusting the scale to 200-75% model, one must call them Bayeselo points and EloStat points, and what Adam is proposing is the return of CCRL to the default Bayeselo points (instead of the more or less correct Elo points using "scale 1") just for the sake of comparing many lists.

Kai

Well, if it is your thirst too, you should support the switch back to the calculated scale and not scale=1, because that is what gives you 200 Elo-75% for the displayed Elos of engines. Remi had no scaling originally, but he was requested to add it by Leo (and others, I think), so he added a scale to transform the data. So if you want to compare different tools (EloStat, Bayeselo, Ordo), you remove both the scale and the offset. The values clearly don't match without the scaling, but that doesn't make them wrong.

Code: Select all

score = scale * Elo + offset

Comparisons were done without removing the scale, resulting in the confusion called 'compression'. But in reality, not removing the scale makes as much sense as not removing the offset. You wouldn't compare a rating list with Fruit set at 2500 against another with Fruit at 2700, would you? While that seems obvious, the scale should be removed too, IMO. Also, Ordo just added a scale parameter now that it has white advantage, I think. What that tells you is that when you add eloAdvantage, eloDraw and other stuff, you need to adjust your results back to what people are used to seeing. But instead the advanced options of Bayeselo were held against it, and confusion arose that it 'compresses' ratings, etc. EloStat and Ordo (no white advantage or draw models) have simple models, so the fact that Bayeselo has advanced features shouldn't be held against it.
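The post does not say how the scale and offset would be removed in practice. One possible approach, which is an assumption here rather than anything EloStat, Bayeselo or Ordo are known to do, is a least-squares fit over the engines two lists have in common. The function names and all ratings below are hypothetical.

```python
def fit_scale_offset(xs, ys):
    """Least-squares fit of ys ~ scale * xs + offset (needs varied xs)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    scale = sxy / sxx
    offset = mean_y - scale * mean_x
    return scale, offset


def to_common_scale(list_a, list_b):
    """Map list_b onto list_a's scale using the engines both lists share."""
    common = sorted(set(list_a) & set(list_b))
    scale, offset = fit_scale_offset(
        [list_b[name] for name in common],
        [list_a[name] for name in common],
    )
    return {name: scale * value + offset for name, value in list_b.items()}


if __name__ == "__main__":
    # Made-up numbers: list_b is list_a stretched by 1.2 and shifted by 200.
    list_a = {"e1": 2500.0, "e2": 2600.0, "e3": 2750.0}
    list_b = {name: 1.2 * value + 200.0 for name, value in list_a.items()}
    aligned = to_common_scale(list_a, list_b)
    for name in sorted(list_a):
        print(name, round(list_a[name], 1), round(aligned[name], 1))
```

The fit needs at least two shared engines with different ratings. Whatever residual remains after such an alignment would then reflect genuine differences between the lists rather than the choice of scale or anchor.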

Disclaimer: Anything I said here is my opinion and Remi has nothing to do with it. Just want to make it clear.

Adam Hair · Post by **Adam Hair** » Tue Aug 28, 2012 8:39 pm

I have several things to say to go along with the discussion you and Kai are having, but I do not have time at this moment. I did want to say that Ordo does not have a scale factor and it does not take into account draws, yet.

Modern Times · Post by **Modern Times** » Tue Aug 28, 2012 10:16 pm

Daniel Shawul wrote:
Modern Times wrote:Thanks. Adam knows better, but I think it is "mm 1 1" that causes the differences between our pure and complete databases, not scale 1.
I don't think so. That was what was originally suspected for the 'compression' effect, but after Remi pointed out the use of the scaling parameter the problem went away. So CCRL decided to use scale = 1 to avoid compression, but I am not sure it is the right thing to do if you want to compare to other rating lists.

Actually I checked, it does. We ran two lists side by side, the old method and then just with mm 1 1. The difference between the complete lists and pure lists emerges there. The profile of the two databases is very different according to Adam - draw rates and average Elo difference.

Daniel Shawul · Post by **Daniel Shawul** » Tue Aug 28, 2012 10:31 pm

Modern Times wrote:
Actually I checked, it does. We ran two lists side by side, the old method and then just with mm 1 1. The difference between the complete lists and pure lists emerges there. The profile of the two databases is very different according to Adam - draw rates and average Elo difference.

Have you read Adam's reply ?

'mm 1 1' does cause a difference between the complete list and the pure list ratings. However, 'scale 1' is the cause of the large difference. The default method of computing the scale modulates those differences.

So it is the scaling that causes the compression effect observed at that time. Adam said it exists in the pure/complete rating lists, which I didn't know before (note I didn't say anything about that). Using mm 1 1 vs. mm alone brings differences, but that is expected. If your data is big enough, it is better to use mm 1 1 than the default values, which were calculated from WBEC (a smaller database to begin with).
Without the scaling you just cannot compare CCRL Elo differences with CEGT, for instance, which is the point I am trying to make. I think this is easy enough to understand for those with an open mind...

Modern Times · Post by **Modern Times** » Tue Aug 28, 2012 10:55 pm

What is this "open mind" nonsense ????

I know what I'm talking about, I ran the numbers myself. I published quite a few different ratings lists with all of the various combinations of parameters we were looking at. We are talking about different things.

Previously the ratings of engines on the pure list and complete list were usually just a handful of Elo apart, sometimes a little more. The top engine on both lists was roughly the same Elo. With just the mm 1 1 change, suddenly all engines were 20-30 Elo different on the two lists. So I repeat what I said - it is the mm 1 1 change that made the pure lists and complete lists different from each other, *before* we started to play with the scale parameter. Sure, the scale parameter may have had an additional effect, but mm 1 1 on its own made the pure lists and complete lists different. Fact. It shifted the two lists. Clearly I'm not intelligent enough to contribute further, so I won't.

Daniel Shawul · Post by **Daniel Shawul** » Tue Aug 28, 2012 11:06 pm

Modern Times wrote:What is this "open mind" nonsense ????

I know what I'm talking about, I ran the numbers myself. I published quite a few different ratings lists with all of the various combinations of parameters we were looking at. We are talking about different things.

Again, I did not say anything about pure/complete rating list differences. Why you are attributing that to me, I don't know. In fact I clearly expressed my surprise to learn about that... So if you say it is mm 1 1, it is OK with me, because I didn't say that was what is causing the difference. Do you understand that?

Modern Times wrote:
Previously the ratings of engines on the pure list and complete list were usually just a handful of Elo apart, sometimes a little more. The top engine on both lists was roughly the same Elo. With just the mm 1 1 change, suddenly all engines were 20-30 Elo different on the two lists. So I repeat what I said - it is the mm 1 1 change that made the pure lists and complete lists different from each other, *before* we started to play with the scale parameter. Sure, the scale parameter may have had an additional effect, but mm 1 1 on its own made the pure lists and complete lists different. Fact. It shifted the two lists. Clearly I'm not intelligent enough to contribute further, so I won't.

You say mm 1 1 is the major cause, Adam says scale=1 is the major cause, I said exactly _nothing_ about that. And yet you keep telling the wrong person (me) about that. If you back up a bit, I was interested in comparing different rating lists. If you agree that CEGT and CCRL can't be compared if we use scale=1, then that is good enough for me.

Modern Times · Post by **Modern Times** » Tue Aug 28, 2012 11:07 pm

Daniel Shawul wrote:If you agree that CEGT and CCRL can't be compared if we use scale=1 then that is good enough for me.

Agreed.
