Ordo vs. Bayeselo

Discussion of chess software programming and technical issues.

Moderator: Ras

Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

You have been told many times, if you would only listen. Scale is a post-processing parameter, just like offset is. If you had used the default scale calculated by mm, you would find rating scales similar to what everyone expects. You have invented a non-existent problem, 'compression', and used 'scale=1' to 'solve' it. How can you compare a scaled result with a bare elo model and say it compresses it? Scale=1 means not using scales, just as setting offset=0 means not using offset. Would you compare the model with offset set at 1500 against the model and say bayeselo shifts ratings? No, that is ridiculous. But we still use offsets when reporting results, so we should use scales too. Without scales, ratings will not be comparable. I tested with 3 different bayeselo draw models and they all match their corresponding models very well.
Just apologize for causing this chaos with 'scale=1', now that everyone thinks it was the default. FUD at its best.
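[Editor's aside: the scale/offset point can be sketched in a few lines. This is a minimal illustration of post-processing in general, not Bayeselo's actual code; rescaling about the mean is an assumption made for the sketch.]

```python
# Sketch: offset and scale are post-processing transforms applied to
# raw (unscaled) ratings. scale=1 disables the rescaling step, just as
# offset=0 disables the shift; neither changes the fitted model itself.
def postprocess(raw_ratings, offset=0.0, scale=1.0):
    """Shift and rescale raw ratings around their mean (illustrative)."""
    mean = sum(raw_ratings) / len(raw_ratings)
    return [offset + scale * (r - mean) for r in raw_ratings]

raw = [120.0, 0.0, -120.0]          # hypothetical unscaled ("bayeselo") ratings
print(postprocess(raw, offset=0.0, scale=1.0))    # spread unchanged
print(postprocess(raw, offset=0.0, scale=0.76))   # spread shrunk by rescaling
```

Comparing the scale=1 output against a model that expects rescaled ratings (or vice versa) is exactly the apples-to-oranges comparison being argued about here.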
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

Michel wrote:
Adam Hair wrote:
For Bayeselo, the ratings produced using the default values did not match the Bayeselo model. The ratings difference for a particular White score was smaller than expected. And the main reason was the scale parameter. When scale is set to 1 and the maximum likelihood values for White advantage and drawElo for the given database are used, then the ratings produced correspond very well with the Bayeselo model (with the correct White advantage and drawElo values inserted).
I still have difficulty understanding this (but I am trying).

I am ignoring the white/black issue (I am not entirely sure this is legitimate but the focus of this discussion is drawelo).

The unscaled ratings produced by BayesElo should predict correct winning percentages according to its own ("BayesElo") model.

On the other hand the scaled ratings produced by BayesElo should predict correct scores according to the logistic model (the scaling is done to fit the logistic model).

Are you saying that your experiments contradict this? If I read correctly then they don't.
Michel, he just screwed up his comparisons badly. I tested three different bayeselo draw models and they all match their models perfectly, as shown below. There is no contradiction at all if you know what you are doing...
[image]
Note that Davidson (unlike bayeselo's default Rao-Kupper) gives the largest unscaled elos. Now, if I asked people to use this, you can imagine how confused they would be with the uninformed 'scale=1' use we have here.
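[Editor's aside: for readers unfamiliar with the model under discussion, the win/draw/loss probabilities can be written down directly. The sketch below follows the formulas published in the BayesElo documentation; the default eloAdvantage and eloDraw values are the ones quoted there, but treat the exact numbers as illustrative.]

```python
def f(delta):
    """Logistic curve used by the BayesElo model."""
    return 1.0 / (1.0 + 10.0 ** (delta / 400.0))

def bayeselo_probs(elo_white, elo_black, elo_advantage=32.8, elo_draw=97.3):
    """Win/draw/loss probabilities under the BayesElo (Rao-Kupper style)
    model. elo_advantage and elo_draw defaults are the values quoted in
    the BayesElo documentation; treat them as illustrative here."""
    p_white = f(elo_black - elo_white - elo_advantage + elo_draw)
    p_black = f(elo_white - elo_black + elo_advantage + elo_draw)
    p_draw = 1.0 - p_white - p_black
    return p_white, p_draw, p_black

# Two equally rated engines: White still scores above 50% because of
# elo_advantage, and elo_draw > 0 creates a nonzero draw probability.
pw, pd, pb = bayeselo_probs(2800, 2800)
print(f"win {pw:.3f}  draw {pd:.3f}  loss {pb:.3f}")
```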
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Ordo vs. Bayeselo

Post by Adam Hair »

We are not communicating very well. Let me try again.
I am going to speak up here. Not because I am trying to be in opposition to you, nor to promote one rating program over another (I use all three, depending on the situation). In fact, neither of the authors has tried to promote his program over the other.
Daniel Shawul wrote:
Why is it hard to make a fair comparison between Ordo and bayeselo (the state of the art) and list the improvements, rather than leaving testers in endless confusion ('scale 1') in the hope that people choose other tools? There is a name for such intellectual dishonesty: ignoring other people's work altogether and claiming your work started from scratch. The masses would like to know if there are improvements on the state of the art, not whether someone has invented the wheel yet again.
Again, I am the source of the confusion about 'scale 1'. But I have not committed any "intellectual dishonesty" concerning 'scale'. I have no reason to do so, especially since my goal is to determine the best way to compute ratings.

To reiterate, I found that the default parameter values and the default method for determining the scale of Bayeselo ratings produce rating differences that differ from what is expected from the Bayeselo model. Setting scale equal to 1 and using the maximum likelihood values for White advantage and drawElo produces ratings that are in good agreement with the Bayeselo model. Unfortunately, that makes rating differences from different databases much less comparable.
Daniel Shawul wrote:
Adam Hair wrote:
As far as I can tell, the only person who has treated this as a popularity contest is you. If some person, rightly or wrongly, says there is some problem with Bayeselo, you go out of your way to denigrate Ordo in the act of defending Bayeselo. If this is not a popularity contest, then we should be able to talk about the strengths and weaknesses of both tools in a rational manner, without the need to resort to rhetoric. The best way to judge between the two tools is to check the predictive power of both. I started this a few weeks ago, but I have been more interested in studying material imbalances and parameter tuning lately. If necessary, I can start back on it. I can tell you that both tools do well with well-connected sets of games, as any decent rating tool should.
Now you are replying to an old post??? I am not the one who has a webpage promoting ordo, is effectively a co-author, makes unfair comparisons, etc. Someone like me is going to speak up when you base your conclusions on things you introduced yourself (scale=1).

Ordo doesn't have draw models
Ordo doesn't have white advantage (at least before it picked it up from bayeselo)
Ordo doesn't have LOS
Ordo doesn't have error bars
Ordo doesn't have a lot of other features
Ordo has inferior elo algorithms.
Now why would it be any better if it has inferior algorithms? That boggles my mind. If you think it is better, list what makes it better, like I did for the opposite.
At no point have I implied that Bayeselo is inferior to any rating tool. To go further, at no time have I recommended that anyone use Ordo instead of Bayeselo. That includes the CCRL. If you were to privately ask other members of the CCRL, I believe you would find that, in internal discussions, I have been working to find the best way to use Bayeselo to produce the CCRL ratings. In fact, I believe the last mention of Ordo in the internal CCRL forum showed how slow it was in comparison to Bayeselo (that has since changed, for the most part).

I am not certain who you are referring to in relation to the webpage. If you are talking about Miguel, I believe it is only natural for a programmer who shares a program freely to offer it at his web site. If you look at his webpage for Ordo, it is not touted as being better than any other program, and links to Bayeselo and Elostat are provided. If you are talking about me, I have one page that compares the three tools in the manner I keep referring to. In that comparison, Bayeselo is not compared unfavorably to Ordo; in fact, I point out that Bayeselo can produce reasonable ratings when Ordo (that version) and Elostat cannot. I also have a page about computer chess utilities that includes information about Bayeselo, Elostat, and Ordo. Bayeselo is listed first; Ordo is listed last, and the information is old. I truly cannot understand why you insist that I have been promoting Ordo over Bayeselo.

By the way, the list of things Ordo does not have is incorrect. To avoid any charge of promotion, I will not provide a list of corrections. Instead, I will list additional things that Bayeselo can do that Ordo can't:

Bayeselo can produce the match results for each engine
Bayeselo can produce an engine's performance in a particular match
Bayeselo can produce plots of result likelihoods as a function of rating difference and draw frequency as a function of average rating

Daniel Shawul wrote:
Adam Hair wrote:
Getting to Larry's question, the extra parameters of Bayeselo are like a double-edged sword. Using the default values keeps the scale of ratings from different databases nearly the same, which allows ratings from different databases to be compared. But that throws out much of the extra information that Bayeselo's refinements can wring from a database. In other words, using the default values can make the ratings less accurate than those from the simpler logistic model. Using the estimated parameter values may make the estimated ratings more accurate, but then the ratings will be less useful (not comparable to other ratings, due to the dependency on the draw rate).

The best thing to do (if applicable) is to combine all of the games together, use the estimated values for White advantage and drawElo, and set scale=1.

ROFL, you are still pushing scale=1. The best thing for you would be to ask the author and educate yourself. Rémi never said to use that, and in fact explained to you why your comparison was bullshit, but you are still pushing it, since it suits your purpose.
However, if the draw rate varies across the database, it is not clear that whatever drawElo is used, whether the default or the value estimated from the database, produces more accurate ratings than the simpler logistic model without draws. The only way to be certain (as far as I know) is to check the predictive power of the ratings via cross-validation for a particular database.
Rémi explained to me why the so-called compression was probably occurring, which was due to the default method for determining the scale.

How does any of this suit my purpose? The only connection I have to the actual tool is that Miguel is my friend. I have been involved with the more recent development of Ordo solely because my comparison showed some flaws with Ordo and because I have been studying ratings for a while. But I have never promoted Ordo over Bayeselo. I still use Bayeselo. I do use Ordo to produce my rating list, but I do not think those ratings are better than those produced by Bayeselo. I am simply using a friend's program. And if I ever feel that it produces less accurate ratings, I will not use it.
Daniel Shawul wrote:
Adam Hair wrote:
In the end, I think that the best rating tool may depend on the database and the purpose for the ratings.
Nope, they aren't even comparable. This is another promotion which leaves users in doubt. From an algorithm or feature point of view, there is no comparison at all.
Take the blinders off, Daniel. I have raised a legitimate concern about Bayeselo, but I have not proposed that Ordo is the solution. My purpose for including Jesús Munoz's statement and your sharp response was to point out that the only person attacking any person or tool is you. I am not promoting Ordo, nor have I ever attacked or disparaged you, Rémi, or Bayeselo. Reread everything without the preconceived notion that someone is trying to put Bayeselo in a bad light.

In the near future, I hope to produce a study of how well Bayeselo and Ordo ratings predict results. I already know that both do reasonably well with a well-connected database, and that no program can do a great job with something as sparse as ChessWar. So, I will use WBEC and cross-validation to judge their predictive abilities. Since Rémi did this with Bayeselo during its development (it is what led to the use of virtual draws as the prior, replacing a uniform prior distribution), I believe that Bayeselo will probably do better.
Last edited by Adam Hair on Sun Sep 30, 2012 6:40 pm, edited 1 time in total.
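[Editor's aside: the cross-validation Adam describes can be sketched as follows. This is a hypothetical hold-out check, not either tool's actual code; draws are scored as 0.5 in a Bernoulli-style log loss, which is one common simplification.]

```python
import math
import random

def expected_score(diff):
    """Logistic expected score for a rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def log_loss(games, ratings):
    """Mean negative log-likelihood of observed scores (0, 0.5, 1)
    given ratings; lower means better predictions."""
    total = 0.0
    for white, black, score in games:
        p = expected_score(ratings[white] - ratings[black])
        p = min(max(p, 1e-12), 1 - 1e-12)   # numerical guard
        total += -(score * math.log(p) + (1 - score) * math.log(1 - p))
    return total / len(games)

# Hold-out validation: fit ratings on the training half (with whichever
# tool is being judged), then score its predictions on the test half.
games = [("A", "B", 1), ("B", "A", 0.5), ("A", "C", 1), ("C", "B", 0)]
random.seed(0)
random.shuffle(games)
split = len(games) // 2
train, test = games[:split], games[split:]
ratings = {"A": 50.0, "B": 0.0, "C": -50.0}   # stand-in for fitted ratings
print("held-out log loss:", log_loss(test, ratings))
```

Repeating the split several times (k-fold) and comparing the average held-out loss of Bayeselo ratings against Ordo ratings would give the data Larry asked for.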
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Ordo vs. Bayeselo

Post by Michel »

Adam Hair wrote:
I have raised a legitimate concern about Bayeselo,
Adam:

I hope you have read my earlier mail on this, but let me reiterate.

Your concern is not valid (and there is no need for a "solution").

The scaled ratings produced by BayesElo do not fit the BayesElo model (which is your concern). This is normal. They are designed to fit the logistic model.

The unscaled ratings do fit the BayesElo model (as you observed in your experiments) but they are inflated with respect to the usual ratings. This is normal too since they measure something different. One could say they are not expressed in elos but in bayeselos. With this terminology the scale parameter converts bayeselos to elos.
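[Editor's aside: Michel's "bayeselos to elos" conversion can be illustrated with a toy fit: choose the scale so that the logistic model, applied to scaled differences, best reproduces the expected scores of a simplified BayesElo model (no colour advantage, hypothetical drawElo). Bayeselo's actual scaling method differs in detail; this brute-force least-squares search only conveys the idea.]

```python
def f(delta):
    """Logistic curve shared by both models."""
    return 1.0 / (1.0 + 10.0 ** (delta / 400.0))

def bayeselo_score(diff, elo_draw=97.3):
    """Expected score of the stronger side under a simplified BayesElo
    model with no colour advantage: P(win) + P(draw)/2."""
    p_win = f(-diff + elo_draw)
    p_loss = f(diff + elo_draw)
    return p_win + 0.5 * (1.0 - p_win - p_loss)

def logistic_score(diff):
    return f(-diff)

def fit_scale(diffs, step=0.001):
    """Brute-force search for the scale minimising the squared gap
    between logistic_score(scale * d) and bayeselo_score(d)."""
    best, best_err = 1.0, float("inf")
    s = 0.5
    while s <= 1.5:
        err = sum((logistic_score(s * d) - bayeselo_score(d)) ** 2 for d in diffs)
        if err < best_err:
            best, best_err = s, err
        s += step
    return best

# The fitted scale comes out below 1: unscaled "bayeselos" are inflated
# relative to elos, and the scale parameter deflates them.
print("fitted scale:", fit_scale(range(-400, 401, 25)))
```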
lkaufman
Posts: 6279
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Ordo vs. Bayeselo

Post by lkaufman »

I'd like to propose a related question. Let's say that we want to know whether A or B is stronger, i.e. which would win in a direct match. Version A scores 55% against a foreign gauntlet, B scores 56%, so B is 7 elo stronger according to normal Elo calculations and to Ordo. But let's say that A ends up rated 7 elo higher according to Bayeselo (which I believe can and does happen sometimes, due to differing draw rates and to which programs each scored better or worse against). Should you bet your money on A or on B in a direct match? Aside from just expressing opinions, does anyone have any data that would help answer this question?
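[Editor's aside: the 7-elo figure in Larry's question checks out under the plain logistic formula. This is a quick sketch; the gauntlet composition and draw rates that Bayeselo would account for are ignored here.]

```python
import math

def elo_from_score(p):
    """Rating advantage implied by an expected score p under the
    logistic model: p = 1 / (1 + 10^(-d/400))."""
    return 400.0 * math.log10(p / (1.0 - p))

a = elo_from_score(0.55)   # version A's gauntlet performance
b = elo_from_score(0.56)   # version B's gauntlet performance
print(f"A: {a:+.1f}  B: {b:+.1f}  difference: {b - a:.1f}")  # difference: 7.0
```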
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Ordo vs. Bayeselo

Post by Adam Hair »

Michel wrote:
Adam Hair wrote:
For Bayeselo, the ratings produced using the default values did not match the Bayeselo model. The ratings difference for a particular White score was smaller than expected. And the main reason was the scale parameter. When scale is set to 1 and the maximum likelihood values for White advantage and drawElo for the given database are used, then the ratings produced correspond very well with the Bayeselo model (with the correct White advantage and drawElo values inserted).
I still have difficulty understanding this (but I am trying).

I am ignoring the white/black issue (I am not entirely sure this is legitimate but the focus of this discussion is drawelo).

The unscaled ratings produced by BayesElo should predict correct winning percentages according to its own ("BayesElo") model.

On the other hand the scaled ratings produced by BayesElo should predict correct scores according to the logistic model (the scaling is done to fit the logistic model).

Are you saying that your experiments contradict this? If I read correctly then they don't.
No, they do not contradict your statement. And perhaps your wording is how I should be thinking about this. I have been comparing the scaled ratings to the Bayeselo model instead of the logistic model, and I had wrongly thought that scaling affected the predictive properties of Bayeselo. If it does not, then I am a happy man.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Ordo vs. Bayeselo

Post by Adam Hair »

Michel wrote:
Adam Hair wrote:
I have raised a legitimate concern about Bayeselo,
Adam:

I hope you have read my earlier mail on this, but let me reiterate.

Your concern is not valid (and there is no need for a "solution").

The scaled ratings produced by BayesElo do not fit the BayesElo model (which is your concern). This is normal. They are designed to fit the logistic model.

The unscaled ratings do fit the BayesElo model (as you observed in your experiments) but they are inflated with respect to the usual ratings. This is normal too since they measure something different. One could say they are not expressed in elos but in bayeselos. With this terminology the scale parameter converts bayeselos to elos.
I have caught up with your posts and now understand my misconception. Thank you very much, Michel.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

lkaufman wrote:
I'd like to propose a related question. Let's say that we want to know whether A or B is stronger, i.e. which would win in a direct match. Version A scores 55% against a foreign gauntlet, B scores 56%, so B is 7 elo stronger according to normal Elo calculations and to Ordo. But let's say that A ends up rated 7 elo higher according to Bayeselo (which I believe can and does happen sometimes, due to differing draw rates and to which programs each scored better or worse against). Should you bet your money on A or on B in a direct match? Aside from just expressing opinions, does anyone have any data that would help answer this question?
The problem with your question is that you hope Ordo may bring an improvement when it has inferior algorithms. It simply can't. Someone should first analyze what improvements, if any, Ordo brings. Rémi did such a comparison against the state of the art (EloStat at the time) when he first introduced bayeselo: http://remi.coulom.free.fr/Bayesian-Elo/ . The improvements of bayeselo are there for everyone to see. Nothing like that from the Ordo guys, aside from spreading 'misconceptions' (now admitted by Adam) about bayeselo to make Ordo look good. They know it is inferior, so the only option left is FUD (thanks Michel :))
And don't use scale=1.
Last edited by Daniel Shawul on Sun Sep 30, 2012 6:57 pm, edited 1 time in total.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Ordo vs. Bayeselo

Post by Adam Hair »

Daniel Shawul wrote:
You have been told many times, if you would only listen. Scale is a post-processing parameter, just like offset is. If you had used the default scale calculated by mm, you would find rating scales similar to what everyone expects. You have invented a non-existent problem, 'compression', and used 'scale=1' to 'solve' it. How can you compare a scaled result with a bare elo model and say it compresses it? Scale=1 means not using scales, just as setting offset=0 means not using offset. Would you compare the model with offset set at 1500 against the model and say bayeselo shifts ratings? No, that is ridiculous. But we still use offsets when reporting results, so we should use scales too. Without scales, ratings will not be comparable. I tested with 3 different bayeselo draw models and they all match their corresponding models very well.
Just apologize for causing this chaos with 'scale=1', now that everyone thinks it was the default. FUD at its best.
I apologize. I have been apologizing about this all along, while hoping for an explanation. Have you forgotten the earlier threads?

http://talkchess.com/forum/viewtopic.ph ... 68&t=44900

Will you apologize to me for misconstruing my intentions?
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Ordo vs. Bayeselo

Post by Adam Hair »

Daniel Shawul wrote:
lkaufman wrote:
I'd like to propose a related question. Let's say that we want to know whether A or B is stronger, i.e. which would win in a direct match. Version A scores 55% against a foreign gauntlet, B scores 56%, so B is 7 elo stronger according to normal Elo calculations and to Ordo. But let's say that A ends up rated 7 elo higher according to Bayeselo (which I believe can and does happen sometimes, due to differing draw rates and to which programs each scored better or worse against). Should you bet your money on A or on B in a direct match? Aside from just expressing opinions, does anyone have any data that would help answer this question?
The problem with your question is that you hope Ordo may bring an improvement when it has inferior algorithms. It simply can't. Someone should first analyze what improvements, if any, Ordo brings. Rémi did such a comparison against the state of the art (EloStat at the time) when he first introduced bayeselo: http://remi.coulom.free.fr/Bayesian-Elo/ . The improvements of bayeselo are there for everyone to see. Nothing like that from the Ordo guys, aside from spreading 'misconceptions' (now admitted by Adam) about bayeselo to make Ordo look good. They know it is inferior, so the only option left is FUD (thanks Michel :))
And don't use scale=1.
I don't understand you :?