Ordo vs. Bayeselo

Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

Adam Hair wrote:
Daniel Shawul wrote:You have been told many times, if you would listen. Scale is a post-processing parameter, just like offset is. If you had used the default scale calculated by mm, then you would find rating scales similar to what everyone expects. You have invented a non-existent problem, 'compression', and used 'scale=1' to 'solve' it. How can you compare a scaled result with a bare elo model and say it compresses it? Scale=1 means not using a scale, just as setting offset=0 means not using an offset. Would you compare the model with the offset set at 1500 against the bare model and say bayeselo shifts ratings? No, that is ridiculous. But we are still using offsets when reporting results, and so we should use scales too. Without scales, ratings will not be comparable. I tested with 3 different bayeselo draw models and they all match their corresponding models very well.
Just apologize for causing this chaos with 'scale=1', now that everyone thinks it was the default. FUD at its best.
I apologize. I have been apologizing about this, all the while hoping for an explanation. Have you forgotten about the earlier threads?

http://talkchess.com/forum/viewtopic.ph ... 68&t=44900

Will you apologize to me for misconstruing my intentions?
Then why do you still insist on scale=1, even in your reply to me somewhere in this thread? That is what I thought you believed, and you were still arguing with Michel in the meantime.
I apologize if I misunderstood you, but as of the last time I checked, CCRL is still using scale=1 after your suggestion. CCRL, Sedat, and Larry all think that is the default.
Anyway, this is closed for me now.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

Adam Hair wrote:
Daniel Shawul wrote:
lkaufman wrote:I'd like to propose a related question. Let's say that we want to know whether A or B is stronger, i.e. which would win in a direct match. Version A scores 55% against a foreign gauntlet, B scores 56%, so B is 7 elo stronger according to normal Elo calculations and to Ordo. But let's say that A ends up rated 7 elo higher according to Bayeselo (which I believe can and does happen sometimes, due to differing draw rates and to which programs each scored better or worse against). Should you bet your money on A or on B in a direct match? Aside from just expressing opinions, does anyone have any data that would help answer this question?
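(As a quick check of the arithmetic in the quoted question, using nothing but the standard logistic Elo conversion between score and rating difference:)
Code:
import math

def elo_from_score(p):
    # rating difference implied by an expected score p under the standard logistic Elo curve
    return -400 * math.log10(1 / p - 1)

print(round(elo_from_score(0.55), 1))                          # ~34.9
print(round(elo_from_score(0.56), 1))                          # ~41.9
print(round(elo_from_score(0.56) - elo_from_score(0.55), 1))   # ~7.0, the figure quoted above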
The problem with your question is that you hope Ordo may bring improvement when it has inferior algorithms. It simply can't. Someone should first do an analysis of what improvements, if any, Ordo brings. Remi did such a comparison against the state of the art (EloStat at the time) when he first introduced bayeselo. http://remi.coulom.free.fr/Bayesian-Elo/ . The improvements of bayeselo are there for everyone to see. Nothing like that from the Ordo guys, aside from spreading 'misconceptions' (now admitted by Adam) about bayeselo to look good. They know it is inferior, so their only chance is FUD (thanks Michel :))
And don't use scale=1.
I don't understand you :?
Well, my reply was written before I read your admission that you had apologized for spreading the use of scale=1. But we are cool now.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Ordo vs. Bayeselo

Post by Adam Hair »

It is cool with me, too.
lkaufman
Posts: 6279
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Ordo vs. Bayeselo

Post by lkaufman »

Daniel Shawul wrote:
lkaufman wrote:I'd like to propose a related question. Let's say that we want to know whether A or B is stronger, i.e. which would win in a direct match. Version A scores 55% against a foreign gauntlet, B scores 56%, so B is 7 elo stronger according to normal Elo calculations and to Ordo. But let's say that A ends up rated 7 elo higher according to Bayeselo (which I believe can and does happen sometimes, due to differing draw rates and to which programs each scored better or worse against). Should you bet your money on A or on B in a direct match? Aside from just expressing opinions, does anyone have any data that would help answer this question?
The problem with your question is that you hope Ordo may bring improvement when it has inferior algorithms. It simply can't. Someone should first do an analysis of what improvements, if any, Ordo brings. Remi did such a comparison against the state of the art (EloStat at the time) when he first introduced bayeselo. http://remi.coulom.free.fr/Bayesian-Elo/ . The improvements of bayeselo are there for everyone to see. Nothing like that from the Ordo guys, aside from spreading 'misconceptions' (now admitted by Adam) about bayeselo to look good. They know it is inferior, so their only chance is FUD (thanks Michel :))
And don't use scale=1.
EloStat was no good because it made the unsound assumption that you can average the ratings of opposing engines and get a meaningful number. Ordo (I believe) corrects that flaw. So does Bayeselo. The fact that both are clearly superior to EloStat does not give us any information on which of the two is superior. Bayeselo treats two draws very differently from a win and a loss; Ordo (I believe) does not. The question comes down to whether this different treatment, which is justified by a theoretical model, is actually justified with real-world data. It cannot be answered by abstract arguments, only by actually doing comparisons with real data. I'm asking whether anyone has attempted to do so. My hunch is that the Bayeselo assumption is less correct than the standard (Ordo) one, because I've seen strange results (both in real data and in simulations) for Bayeselo that seem wrong to me intuitively. But I'm perfectly willing to admit that I'm wrong if there is data to prove so.
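(To illustrate that flaw with made-up numbers, assuming only the usual logistic expectation and the averaging approach as I understand EloStat to use it, i.e. performance = average opponent rating plus the difference implied by the score:)
Code:
import math

def expected(delta):
    # standard logistic expected score for a player `delta` Elo above the opponent
    return 1 / (1 + 10 ** (-delta / 400))

def elo_from_score(p):
    return -400 * math.log10(1 / p - 1)

# A true 2700 player facing equal numbers of 2400 and 2800 opponents:
score = (expected(300) + expected(-100)) / 2     # ~0.604
perf = 2600 + elo_from_score(score)              # averaging-style: 2600 is the average opponent
print(round(score, 3), round(perf))              # ~2674, noticeably short of 2700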
I would also like to add that, based on HGM's explanation, if Bayeselo is superior to Ordo, it implies that the scoring system used in chess is wrong. I'm just not sure what scoring system for wins, draws, and losses would be consistent (or most nearly so) with Bayeselo. Does anyone know?
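(For what it's worth, a simulation can at least show that the two treatments genuinely diverge when draws are frequent; it obviously cannot say which one matches real games. A rough sketch, assuming the logistic win/loss curves with a draw margin described on the bayeselo page; the drawelo value and game count are made up:)
Code:
import math, random

def probs(delta, drawelo):
    # win/loss/draw probabilities with the draw margin applied as on the bayeselo page
    # (white advantage ignored for simplicity)
    pw = 1 / (1 + 10 ** (-(delta - drawelo) / 400))
    pl = 1 / (1 + 10 ** ((delta + drawelo) / 400))
    return pw, pl, 1 - pw - pl

def simulate(true_delta, drawelo, n):
    pw, pl, _ = probs(true_delta, drawelo)
    w = d = l = 0
    for _ in range(n):
        r = random.random()
        if r < pw:
            w += 1
        elif r < pw + pl:
            l += 1
        else:
            d += 1
    return w, d, l

def fit_draw_model(w, d, l, drawelo):
    # crude grid-search MLE for the rating difference, drawelo held fixed
    def loglik(delta):
        pw, pl, pd = probs(delta, drawelo)
        return w * math.log(pw) + l * math.log(pl) + d * math.log(max(pd, 1e-12))
    return max((x / 10 for x in range(-4000, 4001)), key=loglik)

def fit_score_model(w, d, l):
    # standard Elo / Ordo-style: the difference implied by the raw score alone
    p = (w + 0.5 * d) / (w + d + l)
    return -400 * math.log10(1 / p - 1)

w, d, l = simulate(true_delta=30, drawelo=200, n=20000)    # a very drawish matchup
print("score-based fit:", round(fit_score_model(w, d, l), 1))      # roughly 22
print("draw-model fit :", round(fit_draw_model(w, d, l, 200), 1))  # roughly 30, the true value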
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

lkaufman wrote:
Daniel Shawul wrote:
lkaufman wrote:I'd like to propose a related question. Let's say that we want to know whether A or B is stronger, i.e. which would win in a direct match. Version A scores 55% against a foreign gauntlet, B scores 56%, so B is 7 elo stronger according to normal Elo calculations and to Ordo. But let's say that A ends up rated 7 elo higher according to Bayeselo (which I believe can and does happen sometimes, due to differing draw rates and to which programs each scored better or worse against). Should you bet your money on A or on B in a direct match? Aside from just expressing opinions, does anyone have any data that would help answer this question?
The problem with your question is that you hope Ordo may bring improvement when it has inferior algorithms. It simply can't. Someone should first do an analysis of what improvements, if any, Ordo brings. Remi did such a comparison against the state of the art (EloStat at the time) when he first introduced bayeselo. http://remi.coulom.free.fr/Bayesian-Elo/ . The improvements of bayeselo are there for everyone to see. Nothing like that from the Ordo guys, aside from spreading 'misconceptions' (now admitted by Adam) about bayeselo to look good. They know it is inferior, so their only chance is FUD (thanks Michel :))
And don't use scale=1.
EloStat was no good because it made the unsound assumption that you can average the ratings of opposing engines and get a meaningful number. Ordo (I believe) corrects that flaw.
Ordo did not correct that error; bayeselo did, as is clearly pointed out on its homepage. So give credit where credit is due! Yes, Ordo has reinvented the wheel yet again, but it did not give you improvements. Why wouldn't the author give clear information if there are any improvements?
So does Bayeselo. The fact that both are clearly superior to EloStat does not give us any information on which of the two is superior. Bayeselo treats two draws very differently from a win and a loss; Ordo (I believe) does not. The question comes down to whether this different treatment, which is justified by a theoretical model, is actually justified with real-world data. It cannot be answered by abstract arguments, only by actually doing comparisons with real data.
You are making way too many uninformed assumptions and conclusions, and probably consider yourself an expert now. News flash: you are not, and please take no offence. You say that the assumption that 1 win and 1 loss were not set equal to 2 draws has an effect, but I have tested that and another model and it barely has an effect. CCRL Blitz, CCRL 40/40, and CEGT data were used.
[Image: plots comparing the three draw models on the CCRL Blitz, CCRL 40/40, and CEGT data]
I'm asking whether anyone has attempted to do so. My hunch is that the Bayeselo assumption is less correct than the standard (Ordo) one, because I've seen strange results (both in real data and in simulations) for Bayeselo that seem wrong to me intuitively. But I'm perfectly willing to admit that I'm wrong if there is data to prove so.
Yes, you are wrong. You claim way too many things you don't understand. Your problem is staring you in the face: scale = 1. If you don't like the draw ratio, you can get rid of it by using mm 0 0... So what now? It should then be the same as Ordo and EloStat, which don't have it.
I would also like to add that, based on HGM's explanation, if Bayeselo is superior to Ordo, it implies that the scoring system used in chess is wrong. I'm just not sure what scoring system for wins, draws, and losses would be consistent (or most nearly so) with Bayeselo. Does anyone know?
Too many questions. First make sure you understand you shouldn't use scale=1. I don't know what HGM explanation you are talking about, but I am sure he said nothing close to "bayeselo scoring system is wrong" or to that effect.
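To spell out what I mean about scale, here is a rough sketch of the slope argument as I understand it (assuming the logistic win/loss curves with a drawelo shift from the bayeselo page; check Remi's source for the exact convention the default scale uses): with a nonzero drawelo, the expected-score curve near equal strength is flatter than the classical Elo curve, so the raw (scale=1) rating differences come out larger than the score-based ones, and the gap grows with drawelo.
Code:
import math

def slope_ratio(drawelo):
    # ratio of the expected-score slope at equal strength under the
    # drawelo-shifted logistic model to the classical Elo slope
    q = 1 / (1 + 10 ** (drawelo / 400))
    return 4 * q * (1 - q)

for drawelo in (0, 97.3, 200, 300):
    r = slope_ratio(drawelo)
    # near equality, raw (scale=1) differences are roughly 1/r times larger
    # than the differences implied by the score alone
    print(drawelo, round(r, 3), round(1 / r, 3))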
lkaufman
Posts: 6279
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Ordo vs. Bayeselo

Post by lkaufman »

Daniel Shawul wrote: Yes, you are wrong. You claim way too many things you don't understand. Your problem is staring you in the face: scale = 1. If you don't like the draw ratio, you can get rid of it by using mm 0 0... So what now? It should then be the same as Ordo and EloStat, which don't have it.


Using mm 0 0 cannot possibly make bayeselo the same as Ordo and EloStat, because those two are very different from each other, due to EloStat's incorrect averaging of opponent ratings. Probably you mean to say that mm 0 0 produces the same ratings as Ordo. Is that actually true? If so, then maybe that's what I might favor. I'm not expressing a preference for the Ordo software over Bayeselo, only questioning the effect of the Bayeselo draw handling.
I would also like to add that, based on HGM's explanation, if Bayeselo is superior to Ordo, it implies that the scoring system used in chess is wrong. I'm just not sure what scoring system for wins, draws, and losses would be consistent (or most nearly so) with Bayeselo. Does anyone know?
Too many questions. First make sure you understand you shouldn't use scale=1.

Yes, I now understand this, thanks to your posts. But some scaling is needed to make the Bayeselo ratings similar in scale to real elo ratings. Perhaps just multiply all ratings by 400/340? Is it that simple?

I don't know what HGM explanation you are talking about, but I am sure he said nothing close to "bayeselo scoring system is wrong" or to that effect.
No, he didn't say that; he explained that bayeselo makes a very different assumption from normal elo, namely that two draws don't equal one win and one loss, but that (at least in close matchups) one draw equals one win and one loss. This assumption, however justified theoretically, seems very questionable to me and needs to be proven practically. Is there any data to support or refute this radical idea?
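(Trying to restate HGM's point as I understand it, at the risk of garbling it: under the draw model I believe bayeselo uses, the probability of a draw is a constant times the product of the win and loss probabilities, so each draw adds the same rating-dependent term to the log-likelihood as one win plus one loss would. A tiny numeric check of that identity, with made-up numbers:)
Code:
delta, drawelo = 73.0, 150.0                      # arbitrary example values
pw = 1 / (1 + 10 ** (-(delta - drawelo) / 400))   # P(win)
pl = 1 / (1 + 10 ** ((delta + drawelo) / 400))    # P(loss)
pd = 1 - pw - pl                                  # P(draw)
t = 10 ** (drawelo / 400)
print(pd, (t * t - 1) * pw * pl)                  # the two numbers agree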
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

No, he didn't say that; he explained that bayeselo makes a very different assumption from normal elo, namely that two draws don't equal one win and one loss, but that (at least in close matchups) one draw equals one win and one loss. This assumption, however justified theoretically, seems very questionable to me and needs to be proven practically. Is there any data to support or refute this radical idea?
Are you even reading what I write?? For the third time, I tested three draw models, including the default one, and found no significant difference in the results. You pick up ideas from here and there and you think you know better than Remi? OK, here is a link to a paper that he started writing, and I did some tests of my own on CCRL/CEGT data. http://www.grappa.univ-lille3.fr/~coulo ... tcomes.pdf . Try to understand it and get a reality check before you open your mouth.
The Davidson model (1 win + 1 loss = 2 draws) was slightly better, but I had a _hard time_ proving it fits the data better. Can you tell from the plots I showed you whether one of the models is better? The Davidson model, by the way, needs a big rescaling: as you can see, the red plot is far from the other two models, including the default model. So you wouldn't like it if I gave you that model, since you use scale=1 anyway. I doubt you understand what I am saying and will not be surprised if you ask for data yet again...
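For reference, here is roughly how the default model and the Davidson model are parameterised, with gamma = 10^(elo/400); I am paraphrasing from memory, so check the paper for the exact notation:
Code:
import math

def rao_kupper(elo_a, elo_b, drawelo):
    # Rao-Kupper form (what the default bayeselo model amounts to, as far as I can tell):
    # the draw parameter acts like a handicap applied to the opponent, and one draw
    # carries the same rating information as one win plus one loss
    ga, gb, t = 10 ** (elo_a / 400), 10 ** (elo_b / 400), 10 ** (drawelo / 400)
    pw = ga / (ga + t * gb)
    pl = gb / (gb + t * ga)
    return pw, 1 - pw - pl, pl

def davidson(elo_a, elo_b, nu):
    # Davidson form (1 win + 1 loss = 2 draws): the draw weight is
    # proportional to the geometric mean of the two strengths
    ga, gb = 10 ** (elo_a / 400), 10 ** (elo_b / 400)
    z = ga + gb + nu * math.sqrt(ga * gb)
    return ga / z, nu * math.sqrt(ga * gb) / z, gb / z

print(rao_kupper(2700, 2650, 97.3))   # (P(win), P(draw), P(loss)) for A vs B
print(davidson(2700, 2650, 1.0))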
lkaufman
Posts: 6279
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Ordo vs. Bayeselo

Post by lkaufman »

Daniel Shawul wrote:
No, he didn't say that; he explained that bayeselo makes a very different assumption from normal elo, namely that two draws don't equal one win and one loss, but that (at least in close matchups) one draw equals one win and one loss. This assumption, however justified theoretically, seems very questionable to me and needs to be proven practically. Is there any data to support or refute this radical idea?
Are you even reading what I write?? For the third time, I tested three draw models, including the default one, and found no significant difference in the results. You pick up ideas from here and there and you think you know better than Remi? OK, here is a link to a paper that he started writing, and I did some tests of my own on CCRL/CEGT data. http://www.grappa.univ-lille3.fr/~coulo ... tcomes.pdf . Try to understand it and get a reality check before you open your mouth.
The Davidson model (1 win + 1 loss = 2 draws) was slightly better, but I had a _hard time_ proving it fits the data better. Can you tell from the plots I showed you whether one of the models is better? The Davidson model, by the way, needs a big rescaling: as you can see, the red plot is far from the other two models, including the default model. So you wouldn't like it if I gave you that model, since you use scale=1 anyway. I doubt you understand what I am saying and will not be surprised if you ask for data yet again...
According to your own words, the normal assumption (Davidson) was "slightly better", if not significantly so. I'm not familiar with how that model differs from normal elo (i.e. Ordo), but the draw assumption is the same. So I have no idea what the significance of scaling is for that model. But I was mainly interested in whether the key assumption of Bayeselo, namely the draw handling, has been shown to be better than the alternate one, and unless I misunderstand you the answer is "no".
My math background, although in the top 1% of the population, is not nearly as sophisticated as yours or Remi's or HGM's. I am not questioning Remi's math or his fine work at all, but only the underlying ASSUMPTION behind Bayeselo, which has to be proven empirically to be better than the standard assumption that it doesn't matter how you score your points (i.e. draws are 0.5, wins 1, losses 0). If it is not better then there is no point to the added complexity of Bayeselo necessary to accommodate draws, or to the failure to match normal elo results. I am on the USCF rating committee and have brought up the idea behind Bayeselo, but unless there is evidence that its assumption is superior to the norm I can't make a case for it.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo vs. Bayeselo

Post by Daniel Shawul »

According to your own words, the normal assumption (Davidson) was "slightly better", if not significantly so. I'm not familiar with how that model differs from normal elo (i.e. Ordo), but the draw assumption is the same. So I have no idea what the significance of scaling is for that model. But I was mainly interested in whether the key assumption of Bayeselo, namely the draw handling, has been shown to be better than the alternate one, and unless I misunderstand you the answer is "no".
Weren't you looking for a 40% or so difference?? You made the obvious mistake of using scale = 1, so that explains your problem. The rest of the stuff you are saying is from bits and pieces you hear from here and there. You were not the first to question the draw model used; Remi had started that comparison himself months ago, and I did the testing months ago. So no, we give you 0 credit for that, and if there are improvements from the draw model used, they are not much. You simply opened your mouth where you have no expertise. Just accept that not using scale=1 will fix your inflation problem.
My math background, although in the top 1% of the population, is not nearly as sophisticated as yours or Remi's or HGM's. I am not questioning Remi's math or his fine work at all, but only the underlying ASSUMPTION behind Bayeselo,
Sorry, that has been thought of and tested way before you suggested it here, so don't try to take the credit for suggesting that. Nor should Ordo's author do that, because they didn't suggest it as an improvement of Ordo over bayeselo. Remi started testing that himself, as I showed you in the paper. I did the test and found the effect is minimal and certainly can't explain the inflated ratings you observed. Michel even tested up to a drawelo of 200 and didn't observe significant differences, so I am not sure what you observed...
which has to be proven empirically to be better than the standard assumption that it doesn't matter how you score your points (i.e. draws are 0.5, wins 1, losses 0). If it is not better then there is no point to the added complexity of Bayeselo necessary to accommodate draws, or to the failure to match normal elo results. I am on the USCF rating committee and have brought up the idea behind Bayeselo, but unless there is evidence that its assumption is superior to the norm I can't make a case for it.
Sorry, but I say you are not qualified enough to make such assessments for the USCF. You are undoubtedly a good chess player, but it takes a good statistician to discern the difference between the 1 win + 1 loss = 1 draw assumption and the 1 win + 1 loss = 2 draws assumption.
lkaufman
Posts: 6279
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Ordo vs. Bayeselo

Post by lkaufman »

Daniel Shawul wrote:
According to your own words, the normal assumption (Davidson) was "slightly better", if not significantly so. I'm not familiar with how that model differs from normal elo (i.e. Ordo), but the draw assumption is the same. So I have no idea what the significance of scaling is for that model. But I was mainly interested in whether the key assumption of Bayeselo, namely the draw handling, has been shown to be better than the alternate one, and unless I misunderstand you the answer is "no".
Weren't you looking for a 40% or so difference?? You made the obvious mistake of using scale = 1, so that explains your problem. The rest of the stuff you are saying is from bits and pieces you hear from here and there. You were not the first to question the draw model used; Remi had started that comparison himself months ago, and I did the testing months ago. So no, we give you 0 credit for that, and if there are improvements from the draw model used, they are not much. You simply opened your mouth where you have no expertise. Just accept that not using scale=1 will fix your inflation problem.


Yes, and you identified the cause as scale = 1, thank you. Without using scale = 1 we have the opposite problem, 15% deflation. Is the cure for that simply to multiply by 40/34? If so why isn't that the default?
I'm not looking for "credit", only answers. Given the choice between using software fully consistent with FIDE/Elo ratings and software that uses an entirely different algorithm, there needs to be evidence of clear superiority of the new algorithm to justify using it. From what I have gleaned, at best it is a toss-up. It is much "cleaner" to use a model that implies that a higher score (in the normal chess scoring sense) means a higher rating, which is not always true with Bayeselo. Still, I would use it if the underlying model were proven to be clearly superior.
My math background, although in the top 1% of the population, is not nearly as sophisticated as yours or Remi's or HGM's. I am not questioning Remi's math or his fine work at all, but only the underlying ASSUMPTION behind Bayeselo,
Sorry, that has been thought of and tested way before you suggested it here, so don't try to take the credit for suggesting that. Nor should Ordo's author do that, because they didn't suggest it as an improvement of Ordo over bayeselo. Remi started testing that himself, as I showed you in the paper. I did the test and found the effect is minimal and certainly can't explain the inflated ratings you observed. Michel even tested up to a drawelo of 200 and didn't observe significant differences, so I am not sure what you observed...

Again, not interested in "credit". I merely pointed out that, at least when using scale = 1, the rating inflation can be enormous given direct matches with high draw percentages. I accept your solution: don't use scale = 1.

which has to be proven empirically to be better than the standard assumption that it doesn't matter how you score your points (i.e. draws are 0.5, wins 1, losses 0). If it is not better then there is no point to the added complexity of Bayeselo necessary to accommodate draws, or to the failure to match normal elo results. I am on the USCF rating committee and have brought up the idea behind Bayeselo, but unless there is evidence that its assumption is superior to the norm I can't make a case for it.
Sorry, but I say you are not qualified enough to make such assessments for the USCF. You are undoubtedly a good chess player, but it takes a good statistician to discern the difference between the 1 win + 1 loss = 1 draw assumption and the 1 win + 1 loss = 2 draws assumption.
I am only one member of the committee. The Chairman and a couple other members do have the necessary statistical background to make this determination. My role would be to propose something and to point to the evidence that might justify it. Since the two assumptions in question are so radically different, I find it surprising that it is so hard to determine which is superior. I would have thought someone would have answered the question definitively by now, but apparently not. Unless there is such evidence in favor of the Bayeselo assumption, no one would even consider complicating things by treating draws differently than the scoring system does.
I would still love to see an answer to the question, "Is there an alternate scoring system for wins, losses, and draws that is consistent with the assumption underlying Bayeselo?" Perhaps Remi or HGM could answer this. Maybe there is no such scoring system possible; I don't know.
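(One partial answer I can imagine, if the draw-likelihood identity mentioned earlier is right: against a single opponent, a Bayeselo-style likelihood behaves as if each draw were one win plus one loss, so the "score" it effectively responds to would be (w + d) / (w + l + 2d) rather than (w + d/2) / (w + l + d). A small illustration with made-up game counts; this is my own speculation, not something Remi or HGM has endorsed:)
Code:
def standard_score(w, d, l):
    return (w + 0.5 * d) / (w + d + l)

def effective_score(w, d, l):
    # what a draw-as-one-win-plus-one-loss likelihood responds to
    # against a single opponent
    return (w + d) / (w + l + 2 * d)

for w, d, l in [(40, 40, 20), (50, 20, 30)]:
    print((w, d, l), round(standard_score(w, d, l), 3), round(effective_score(w, d, l), 3))
# both records score 60% in the normal sense, but the drawish one
# counts for less under the draw-as-win-plus-loss view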