Ordo vs. Bayeselo

lkaufman
Posts: 6227
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Ordo vs. Bayeselo

Post by lkaufman »

We have recently tried rating our matches and tournaments with both Ordo and Bayeselo to compare, with astonishing results. When we run gauntlets comparing two Komodo versions against a set of foreign opponents, the elo differences come out somewhat larger (on the order of 10% or so) with Bayeselo. But when we run direct matches between two versions, the elo differences are often vastly larger (on the order of 50%) with Bayeselo (!). A difference of 50% in scale is ridiculous; it means the two systems are VERY different. Since Ordo seems to correspond perfectly with normal elo differences for the given result, this seems to cast grave doubt on Bayeselo, unless our settings are the problem.
Our settings are mm 1 1 and scale 1, everything else default.
One clue: the highly inflated rating differences with Bayeselo seem to occur when the draw percentage is high. When it is in the thirties, everything looks pretty normal, but when it is over 50% we see these absurd elo differences with Bayeselo.
So, what is causing this behavior? Are the settings the problem? Does this mean that we are better off using Ordo, if we want to see rating differences that bear some resemblance to the Elo rating system?
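For reference, by "normal elo differences" I mean the standard logistic conversion from a score percentage to a rating difference. A minimal Python sketch of that conversion (my own illustration, with made-up numbers); note that only the score enters, not the draw count:

    import math

    def elo_from_score(score):
        # Standard logistic conversion: score fraction -> Elo difference.
        return 400.0 * math.log10(score / (1.0 - score))

    # Example: a 60% score, however many draws it contains.
    wins, draws, losses = 30, 60, 10
    score = (wins + 0.5 * draws) / (wins + draws + losses)
    print(f"score {score:.2f} -> {elo_from_score(score):+.1f} elo")  # about +70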
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

lkaufman wrote:We have recently tried rating our matches and tournaments with both Ordo and Bayeselo to compare, with astonishing results. When we run gauntlets comparing two Komodo versions against a set of foreign opponents, the elo differences come out somewhat larger (on the order of 10% or so) with Bayeselo. But when we run direct matches between two versions, the elo differences are often vastly larger (on the order of 50%) with Bayeselo (!). A difference of 50% in scale is ridiculous; it means the two systems are VERY different. Since Ordo seems to correspond perfectly with normal elo differences for the given result, this seems to cast grave doubt on Bayeselo, unless our settings are the problem.
Our settings are mm 1 1 and scale 1, everything else default.
One clue: the highly inflated rating differences with Bayeselo seem to occur when the draw percentage is high. When it is in the thirties, everything looks pretty normal, but when it is over 50% we see these absurd elo differences with Bayeselo.
So, what is causing this behavior? Are the settings the problem? Does this mean that we are better off using Ordo, if we want to see rating differences that bear some resemblance to the Elo rating system?
Scale 1 is a mole introduced by the KGB to badmouth bayeselo :) Seriously, you should avoid using that and stick with the defaults, or with what Rémi tells you to use. AFAIK Ordo brings nothing to the table algorithm-wise (and is in fact inferior to maximum-likelihood algorithms), plus it lacks so many features. However, they do make lots of noise and unfair comparisons...
Ordo doesn't have draw models, LOS, error bars, or prediction, but it does have inferior algorithms. For me, rating tools should not be like chess engines, where everyone has his own, but more like a group effort refining known rating systems. When bayeselo came in, its improvements were there for everyone to see, documented on its webpage. I encourage anyone interested to do the same thing and compare Ordo with bayeselo in a fair manner. So a better question for you to ask would be: what does Ordo do that gives it such big differences from the state of the art? And is it better for doing so?
lkaufman
Posts: 6227
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Ordo vs. Bayeselo

Post by lkaufman »

Daniel Shawul wrote:
lkaufman wrote:We have recently tried rating our matches and tournaments with both Ordo and Bayeselo to compare, with astonishing results. When we run gauntlets comparing two Komodo versions against a set of foreign opponents, the elo differences come out somewhat larger (on the order of 10% or so) with Bayeselo. But when we run direct matches between two versions, the elo differences are often vastly larger (on the order of 50%) with Bayeselo (!). A difference of 50% in scale is ridiculous; it means the two systems are VERY different. Since Ordo seems to correspond perfectly with normal elo differences for the given result, this seems to cast grave doubt on Bayeselo, unless our settings are the problem.
Our settings are mm 1 1 and scale 1, everything else default.
One clue: the highly inflated rating differences with Bayeselo seem to occur when the draw percentage is high. When it is in the thirties, everything looks pretty normal, but when it is over 50% we see these absurd elo differences with Bayeselo.
So, what is causing this behavior? Are the settings the problem? Does this mean that we are better off using Ordo, if we want to see rating differences that bear some resemblance to the Elo rating system?
Scale 1 is a mole introduced by the KGB to badmouth bayeselo :) Seriously, you should avoid using that and stick with the defaults, or with what Rémi tells you to use. AFAIK Ordo brings nothing to the table algorithm-wise (and is in fact inferior to maximum-likelihood algorithms), plus it lacks so many features. However, they do make lots of noise and unfair comparisons...
Ordo doesn't have draw models, LOS, error bars, or prediction, but it does have inferior algorithms. For me, rating tools should not be like chess engines, where everyone has his own, but more like a group effort refining known rating systems. When bayeselo came in, its improvements were there for everyone to see, documented on its webpage. I encourage anyone interested to do the same thing and compare Ordo with bayeselo in a fair manner. So a better question for you to ask would be: what does Ordo do that gives it such big differences from the state of the art? And is it better for doing so?
Perhaps the specific behavior I noted was from using scale = 1. But more generally, I have grave doubts about the model underlying Bayeselo, which, as HGM has explained, pretty much assumes that one win plus one loss is like one draw, rather than like two draws. When the draw percentage is way above or below the default, I think this causes the rating differences to move away from what the elo system (or Ordo) would give (correct me if this is wrong). Maybe Bayeselo would be fine if it were the rating system used by FIDE and everyone else, but since we expect rating differences to imply certain percentage scores (according to the elo system), if Bayeselo can give results that differ markedly from that, it becomes incompatible with elo ratings. Also, it seems from what I observed (again, subject to scale = 1 being a possible cause) that Bayeselo inflates the rating differences of direct matches (perhaps due to the higher draw rates in self-play), which in my view are already overstated even when rated "properly". So I think Ordo may be a better predictor of ratings on actual rating lists, especially if we are rating self-play matches.
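To make the draw-rate effect concrete, here is a small Python sketch (my own illustration, not Bayeselo's actual code; it ignores the white-advantage term and the scale correction). For a single pairing, a BayesElo-style model with two free parameters (delta and drawElo) can match the observed win/draw/loss frequencies exactly, so the maximum-likelihood fit has a closed form; two matches with the same 60% score but very different draw rates then imply very different rating gaps:

    import math

    def logit400(p):
        # Inverse of the logistic f(x) = 1 / (1 + 10**(-x / 400)).
        return 400.0 * math.log10(p / (1.0 - p))

    def draw_model_fit(wins, draws, losses):
        # With one pairing, two parameters can fit the frequencies exactly:
        #   logit400(P(win))  =  delta - drawElo
        #   logit400(P(loss)) = -delta - drawElo
        n = wins + draws + losses
        a, b = logit400(wins / n), logit400(losses / n)
        return (a - b) / 2.0, -(a + b) / 2.0  # (delta, drawElo)

    for w, d, l in [(55, 10, 35), (22, 76, 2)]:  # both are 60% scores
        delta, draw_elo = draw_model_fit(w, d, l)
        score = (w + 0.5 * d) / (w + d + l)
        print(f"W{w}/D{d}/L{l}: logistic {logit400(score):+.0f} elo, "
              f"unscaled model delta {delta:+.0f} (drawElo {draw_elo:.0f})")

With 10% draws the two conversions agree (about +70 either way), but at 76% draws the unscaled model gap comes out around +230, which is the same direction of blow-up we are seeing.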
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Ordo vs. Bayeselo

Post by Adam Hair »

Daniel Shawul wrote:
lkaufman wrote:We have recently tried rating our matches and tournaments with both Ordo and Bayeselo to compare, with astonishing results. When we run gauntlets comparing two Komodo versions against a set of foreign opponents, the elo differences come out somewhat larger (on the order of 10% or so) with Bayeselo. But when we run direct matches between two versions, the elo differences are often vastly larger (on the order of 50%) with Bayeselo (!). A difference of 50% in scale is ridiculous; it means the two systems are VERY different. Since Ordo seems to correspond perfectly with normal elo differences for the given result, this seems to cast grave doubt on Bayeselo, unless our settings are the problem.
Our settings are mm 1 1 and scale 1, everything else default.
One clue: the highly inflated rating differences with Bayeselo seem to occur when the draw percentage is high. When it is in the thirties, everything looks pretty normal, but when it is over 50% we see these absurd elo differences with Bayeselo.
So, what is causing this behavior? Are the settings the problem? Does this mean that we are better off using Ordo, if we want to see rating differences that bear some resemblance to the Elo rating system?
Scale 1 is a mole introduced by the KGB to badmouth bayeselo :) Seriously, you should avoid using that and stick with the defaults, or with what Rémi tells you to use. AFAIK Ordo brings nothing to the table algorithm-wise (and is in fact inferior to maximum-likelihood algorithms), plus it lacks so many features. However, they do make lots of noise and unfair comparisons...
Ordo doesn't have draw models, LOS, error bars, or prediction, but it does have inferior algorithms. For me, rating tools should not be like chess engines, where everyone has his own, but more like a group effort refining known rating systems. When bayeselo came in, its improvements were there for everyone to see, documented on its webpage. I encourage anyone interested to do the same thing and compare Ordo with bayeselo in a fair manner. So a better question for you to ask would be: what does Ordo do that gives it such big differences from the state of the art? And is it better for doing so?
I am going to speak up here, not because I am trying to be in opposition to you, nor to promote one rating program over another (I use all three, depending on the situation). In fact, neither of the authors has tried to promote his program over the other.

You took offense to an innocent statement by Jesús Munoz:
Ajedrecista wrote:Ratings are less compressed than in BayesElo, as many people have noted since the Ordo 0.2 release; I do not find anything wrong with either algorithm; they are just different, that is all! Just for comparison, here is BayesElo output of the same PGN, courtesy of Adam (download link):
Since that point, you have been quite aggressive in your statements about Ordo. You continually pronounce how inferior Ordo is to Bayeselo. You definitely have the right to your opinion. But to state that "they" make a lot of noise and "unfair" comparisons bewilders me. The only comparison I know of that was unfair was my own in this thread. I had forgotten about the scale parameter, and I corrected my mistake as soon as Rémi Coulom pointed it out.
Daniel Shawul wrote: Well, it is my opinion that bayeselo is far more advanced than the rest, and YET that has been taken against it to claim it compresses ratings etc... User don't know how to use tool, user blames tool. I have nothing against additional rating tools (but I know how this would sound after the whole day I spent arguing against them). It is just that a very wrong notion got into people (as demonstrated by Jesus's post) that bayeselo compresses ratings, which has been used against it to praise others. For me it is not a popularity contest, but objective discussions, like what the algorithmic differences between the two are, would show which is better. Oh, btw, bayeselo has simulation tools and a whole lot of other fancy things...
As far as I can tell, the only person who has treated this as a popularity contest is you. If some person, rightly or wrongly, says there is some problem with Bayeselo, you go out of your way to denigrate Ordo in the act of defending Bayeselo. If this is not a popularity contest, then we should be able to talk about the strengths and weaknesses of both tools in a rational manner without the need to resort to rhetoric. The best way to judge between the two tools is to check the predictive power of both. I had started doing this a few weeks ago, but I have been more interested in studying material imbalances and parameter tuning lately. If necessary, I can start back on that. I can tell you that both tools do well with well-connected sets of games, as any decent rating tool should.

Getting to Larry's question, the extra parameters of Bayeselo are a double-edged sword. Using the default values keeps the scale of ratings from different databases nearly the same, which allows for comparison of ratings from different databases. But that throws out much of the extra information that Bayeselo's refinements can wring from a database. In other words, using the default values can make the ratings less accurate than those from the simpler logistic model. Using the estimated parameter values may make the estimated ratings more accurate, but then the ratings will be less useful (incomparable to other ratings, due to the dependency on the draw rate).

The best thing to do (if applicable) is to combine all of the games together, use the estimated values for White advantage and drawElo, and let scale=1. However, if the draw rate varies across the database, it is not clear that whatever drawElo is used, whether it be the default or estimated from the database, produces more accurate ratings than the simpler logistic model without draws. The only way to be certain (as far as I know) is to check the predictive power of the ratings via cross-validation for a particular database.
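By checking predictive power via cross-validation I mean something like the following rough sketch (my own illustration, not either tool's actual code; the toy game list is purely illustrative, and real use would parse the PGN database): fit ratings on half the games with a crude gradient fit of the plain logistic model, then score the held-out half by log loss. The tool whose ratings achieve the lower held-out loss is predicting better for that database.

    import math, random

    def f(x):
        # Logistic expected score in Elo units.
        return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

    def fit_elo(games, players, rounds=500, lr=5.0):
        # Crude sequential gradient fit of the plain logistic model,
        # counting a draw as half a point.
        elo = {p: 0.0 for p in players}
        for _ in range(rounds):
            for white, black, score in games:
                err = score - f(elo[white] - elo[black])
                elo[white] += lr * err
                elo[black] -= lr * err
        return elo

    def log_loss(games, elo):
        # Mean negative log-likelihood of the games (draws as half points).
        total = 0.0
        for white, black, score in games:
            e = f(elo[white] - elo[black])
            total -= score * math.log(e) + (1 - score) * math.log(1 - e)
        return total / len(games)

    # Toy game list: (white, black, score for white).
    all_games = [("A", "B", 1), ("A", "B", 0.5), ("B", "C", 1),
                 ("A", "C", 0.5), ("B", "C", 0.5), ("A", "C", 1),
                 ("A", "B", 0), ("B", "C", 1)]
    random.shuffle(all_games)
    half = len(all_games) // 2
    players = {p for g in all_games for p in g[:2]}
    elo = fit_elo(all_games[:half], players)
    print(f"held-out log loss: {log_loss(all_games[half:], elo):.3f}")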

In the end, I think that the best rating tool may depend on the database and the purpose of the ratings.
lkaufman
Posts: 6227
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Ordo vs. Bayeselo

Post by lkaufman »

Adam Hair wrote: Getting to Larry's question, the extra parameters of Bayeselo are a double-edged sword. Using the default values keeps the scale of ratings from different databases nearly the same, which allows for comparison of ratings from different databases. But that throws out much of the extra information that Bayeselo's refinements can wring from a database. In other words, using the default values can make the ratings less accurate than those from the simpler logistic model. Using the estimated parameter values may make the estimated ratings more accurate, but then the ratings will be less useful (incomparable to other ratings, due to the dependency on the draw rate).

The best thing to do (if applicable) is to combine all of the games together, use the estimated values for White advantage and drawElo, and let scale=1. However, if the draw rate varies across the database, it is not clear that whatever drawElo is used, whether it be the default or estimated from the database, produces more accurate ratings than the simpler logistic model without draws. The only way to be certain (as far as I know) is to check the predictive power of the ratings via cross-validation for a particular database.

In the end, I think that the best rating tool may depend on the database and the purpose of the ratings.
Thanks for a very good explanation of the relevant issues. I think that for engine testers who, like us, use both self-play matches and gauntlets against foreign programs, and who use widely varying time limits, it is better to use Ordo, because the draw percentages vary quite a bit. As I've discovered, this makes a HUGE difference with Bayeselo if the draw percentages are taken from the data, while using the default should not work well if the draw percentage is far from the default value. I suspect that CCRL and CEGT would also do better to use Ordo, because the wide range of time limits used by both implies a wide range of draw rates, which seems to be crippling for Bayeselo. Maybe for IPON Bayeselo is okay, since only one time limit is used and the test conditions are constant. I greatly admire the work of Rémi; Bayeselo was very well thought out and would be a reasonable alternative to the Elo rating system if the underlying model can be proved to be sound, but I think it falls short as a predictor of Elo ratings in a world of widely varying draw rates. I think we will switch to Ordo for our internal testing because of this problem, unless someone can convince me that this varying draw problem is solvable with Bayeselo.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Ordo vs. Bayeselo

Post by Michel »

I think we will switch to Ordo for our internal testing because of this problem, unless someone can convince me that this varying draw problem is solvable with Bayeselo.
The dependence of the results of BayesElo on the draw ratio is only a second-order effect. So I am surprised you are seeing these huge discrepancies.

I just did a little test where I let drawelo vary from 0 to 200 and it had hardly any effect on ratings.
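Something like the following sketch shows why (a rough illustration of the mechanism, not bayeselo itself; white advantage is ignored and the likelihood is maximized by brute force): for a fixed result, the raw maximum-likelihood difference grows as drawElo grows, but multiplying by the default scale factor brings it back to nearly the same value each time.

    import math

    def f(x):
        return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

    def scale(draw_elo):
        x = 10.0 ** (draw_elo / 400.0)
        return 4.0 * x / (1.0 + x) ** 2

    def ml_delta(wins, draws, losses, draw_elo):
        # Brute-force ML rating difference with drawElo held fixed.
        def ll(delta):
            p_win, p_loss = f(delta - draw_elo), f(-delta - draw_elo)
            p_draw = 1.0 - p_win - p_loss
            if p_draw <= 0.0:
                return -math.inf
            return (wins * math.log(p_win) + draws * math.log(p_draw)
                    + losses * math.log(p_loss))
        return max((i / 10.0 for i in range(-3000, 3001)), key=ll)

    w, d, l = 30, 50, 20                      # a 55% score with 50% draws
    for draw_elo in (20, 60, 100, 140, 200):  # drawElo 0 would force P(draw)=0
        raw = ml_delta(w, d, l, draw_elo)
        print(f"drawElo {draw_elo:3}: raw {raw:6.1f}, "
              f"scaled {raw * scale(draw_elo):5.1f}")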
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Ordo vs. Bayeselo

Post by Evert »

Without wanting to get mixed up in discussions of which is better (I've never really looked into what the different rating programs are doing, so I really don't know), I will say that it sounds to me like you decide which program to use based on which of the two gives you the outcome you like most. That is a bad metric: the one you want is the one that tells you most accurately which version is better.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Ordo vs. Bayeselo

Post by Adam Hair »

Michel wrote:
I think we will switch to Ordo for our internal testing because of this problem, unless someone can convince me that this varying draw problem is solvable with Bayeselo.
The dependence of the results of BayesElo on the draw ratio is only a second-order effect. So I am surprised you are seeing these huge discrepancies.

I just did a little test where I let drawelo vary from 0 to 200 and it had hardly any effect on ratings.
The default scaling will keep the rating spread from changing very much as drawElo varies.

The equation for scale is:

scale = (4 * 10^(drawElo/400)) / (1 + 10^(drawElo/400))^2
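Plugging a few values into this formula shows how strong the compensation is (a quick illustrative computation in Python):

    for draw_elo in (0, 100, 200, 300):
        x = 10.0 ** (draw_elo / 400.0)
        scale = 4.0 * x / (1.0 + x) ** 2
        print(f"drawElo {draw_elo:3}: scale {scale:.3f}")
    # drawElo 0 -> 1.000, 100 -> 0.922, 200 -> 0.730, 300 -> 0.513: the
    # higher the drawElo (i.e. the draw rate), the more the default scaling
    # shrinks the raw rating differences.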
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ordo vs. Bayeselo

Post by Laskos »

Evert wrote:Without wanting to get mixed up in discussions of which is better (I've never really looked into what the different rating programs are doing, so I really don't know), I will say that it sounds to me like you decide which program to use based on which of the two gives you the outcome you like most. That is a bad metric: the one you want is the one that tells you most accurately which version is better.
No, the problem with Bayeselo is that it does not give correct predictions, or even predictions that obey the logistic model used by Bayeselo itself. Larry is right in raising these questions. I am probably missing something; the algorithm seems all right, so maybe something is broken in the scaling of the results.

Kai
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

Perhaps the specific behavior I noted was from using scale = 1. But more generally, I have grave doubts about the model underlying Bayeselo, which, as HGM has explained, pretty much assumes that one win plus one loss is like one draw, rather than like two draws.
No, you are very wrong. If you want the one-win-plus-one-loss-equals-two-draws model, bayeselo has it too, now that I added it. But that barely makes a difference if you use the proper scale. Scale=1 is something others suggested and used in CCRL, thinking they were smart. That is the problem: use the default scale and you should get comparable results.
When the draw percentage is way above or below the default, I think this causes the rating differences to move away from what the elo system (or Ordo) would give (correct me if this is wrong). Maybe Bayeselo would be fine if it were the rating system used by FIDE and everyone else, but since we expect rating differences to imply certain percentage scores (according to the elo system), if Bayeselo can give results that differ markedly from that, it becomes incompatible with elo ratings.
Actually, that is why the scale was introduced, but the Ordo authors changed that and then compared against bayeselo. I have tested three models in bayeselo and their effects are minimal if you use the proper scale. Without the scale, one of them could give very large ratings. And don't think that this is wrong: it is not a must that 200 elo corresponds to 75%. If you use a draw ratio and a white advantage (which Ordo and Elostat lack), then it is eloDelta - eloAdv + eloDraw that will match 75%. This is the correct one. People screamed so much about this that the scale had to be introduced.
Also, it seems from what I observed (again, subject to scale = 1 being a possible cause) that Bayeselo inflates the rating differences of direct matches (perhaps due to the higher draw rates in self-play), which in my view are already overstated even when rated "properly". So I think Ordo may be a better predictor of ratings on actual rating lists, especially if we are rating self-play matches.
I repeat: if you use the default scale you get the same thing. If it has an inferior algorithm it CAN'T be better, simple as that. It is an insult to Rémi, who published so many papers on bayeselo, that now everyone thinks he is an expert and changes parameters at will.
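For reference, here is the three-outcome model from the bayeselo documentation as a small Python sketch (my notation; eloAdv = 32 and eloDraw = 97 are only roughly the default values, and the sign convention here is for White's win probability, so it may differ from the expression above). It shows that it is the combination of eloDelta with eloAdv and eloDraw, not eloDelta alone, that has to reach about 191 for a 75% win probability:

    import math

    def f(x):
        return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

    def bayeselo_probs(delta, elo_adv, elo_draw):
        # delta = eloWhite - eloBlack; win/draw/loss probabilities.
        p_win = f(delta + elo_adv - elo_draw)
        p_loss = f(-delta - elo_adv - elo_draw)
        return p_win, 1.0 - p_win - p_loss, p_loss

    target = 400.0 * math.log10(3.0)      # ~191 elo <-> a 75% win chance
    elo_adv, elo_draw = 32.0, 97.0        # roughly the defaults
    delta = target - elo_adv + elo_draw   # so delta + eloAdv - eloDraw = target
    p_win, p_draw, p_loss = bayeselo_probs(delta, elo_adv, elo_draw)
    print(f"delta {delta:.0f}: win {p_win:.2f}, draw {p_draw:.2f}, "
          f"loss {p_loss:.2f}")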