Throwing out draws to calculate Elo

 Posts: 11495
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
Re: Throwing out draws to calculate Elo
The MT (Mersenne Twister) is the best PRNG for non-crypto work.
The reason is that almost all pseudo-random number generators will fragment into planes when you have multiple dimensions. That is why it is now the basis PRNG for many programming languages.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC285899/
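To make the "planes" effect concrete, here is a minimal sketch using RANDU, a classic linear congruential generator. Picking RANDU as the example is an assumption for illustration only; the linked paper covers multiplicative congruential generators in general. Because its multiplier is 2^16 + 3, every three consecutive outputs satisfy 9*x[k] - 6*x[k+1] + x[k+2] = 0 (mod 2^31), so all triples lie on a small number of parallel planes. The function names are hypothetical.

```python
M = 2**31  # RANDU modulus


def randu(seed, count):
    """Return `count` successive RANDU outputs (seed should be odd)."""
    x = seed
    out = []
    for _ in range(count):
        x = (65539 * x) % M
        out.append(x)
    return out


def plane_indices(xs):
    """Index of the plane each consecutive triple falls on.

    For RANDU, 9*a - 6*b + c is always an exact multiple of 2**31,
    and that multiple identifies the plane.
    """
    return {(9 * a - 6 * b + c) // M for a, b, c in zip(xs, xs[1:], xs[2:])}


if __name__ == "__main__":
    xs = randu(1, 5000)
    # The bounds on a, b, c allow at most 15 distinct planes.
    print("distinct planes:", len(plane_indices(xs)))
```

A generator whose triples filled the cube evenly would show no such relation; here thousands of points collapse onto at most 15 planes.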
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

 Posts: 11495
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
Re: Throwing out draws to calculate Elo
If I grab an entry from the database, chosen at random, how close is it to 0.5?
This data is from testing a perfectly fair penny flipper against itself, and so it demonstrates the noise in testing two engines of identical strength against each other. We are asking the question, "Will the LOS algorithm correctly report 'not superior' if the two engines have identical strength?"

 Posts: 111
 Joined: Sun Dec 25, 2016 3:59 pm
Re: Throwing out draws to calculate Elo
Well, before we get wrapped up in the shifting goal posts, let's be clear about something. My previous post was directed towards the claim that the distribution of LOS scores changes (specifically, shifts towards values further from 0.5) as you play more games.
The posted query output shows that the distribution is (to some sensible level of precision) exactly as we'd expect. Are you really claiming that you just happened to post data from the one match length where the distribution is miraculously as we'd expect for matches between two exactly equal "engines"?
If, instead, that is just the distribution of LOS scores in matches between exactly equal "engines" (which, if you give it a little thought, isn't surprising since the metric is based on the likelihood of getting results as extreme as the observed results from two equal engines), then the answer to your new question will also, on average, be the same regardless of match length.
None of this is even a little surprising given what the metric is.
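For anyone who wants to check this without the database, here is a rough simulation sketch. It assumes the common normal-approximation formula LOS = 0.5 * (1 + erf((W - L) / sqrt(2 * (W + L)))), a 50% draw rate, and independent games; the function names and parameters are made up for the example and are not the query that produced the posted data.

```python
import math
import random


def los(wins, losses):
    """Normal-approximation LOS; draws do not enter the formula."""
    decided = wins + losses
    if decided == 0:
        return 0.5  # no decisive games: no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decided)))


def simulate_los(n_games, draw_rate, n_matches, rng):
    """LOS values from n_matches matches between two exactly equal players."""
    values = []
    for _ in range(n_matches):
        wins = losses = 0
        for _ in range(n_games):
            if rng.random() < draw_rate:
                continue  # drawn game
            if rng.random() < 0.5:
                wins += 1
            else:
                losses += 1
        values.append(los(wins, losses))
    return values


def fraction_near_half(values, width=0.25):
    """Share of LOS values within `width` of 0.5."""
    return sum(abs(v - 0.5) <= width for v in values) / len(values)


if __name__ == "__main__":
    rng = random.Random(2020)
    for n_games in (200, 800, 3200):
        vals = simulate_los(n_games, 0.5, 500, rng)
        print(n_games, round(fraction_near_half(vals), 3))
```

If LOS between equal players is roughly uniform on [0, 1], about half of the values should land in [0.25, 0.75] at every match length, up to sampling noise and the discreteness of small samples.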
This is all a bit like abusing p-values, and then deciding they're completely useless measurements because you realize that you can't actually infer anything about the probability of your alternative hypothesis from p-values alone.
My main interest in all of this (after the first couple pages) was to see if your data actually did indicate the things you claimed. Now that I see it doesn't, my interest is a bit diminished.
You clearly don't like some of the properties of LOS. That's fine. There's a rather large difference between your disliking that you can't use a measurement in some way (here that the probability of a measured LOS "near" 0.5 doesn't approach 100% for matches of increasing length between exactly equal engines), and claiming that it is fundamentally flawed and that people who disagree with you aren't thinking people.
Cheers!
Last edited by MonteCarlo on Thu Jul 02, 2020 10:47 pm, edited 2 times in total.

 Posts: 11495
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
Re: Throwing out draws to calculate Elo
Yes, so far we have not even addressed my difficulty with throwing out the draw information.
I do think that the fact that LOS has enormous difficulty determining that two engines of equal strength are equal and the fact that it is based on the spread between wins and losses should give us pause when applying it to very long matches. And I also doubt the accuracy when based on a very short sequence of games (but of course this is not different from any other statistical measure).
At some point, I think it makes sense to address why throwing out an enormous number of data points that describe equality and consuming a tiny number of points that describe inequality makes sense.
Re: Throwing out draws to calculate Elo
Dann Corbit wrote: ↑Thu Jul 02, 2020 10:43 pm
> Yes, so far we have not even addressed my difficulty with throwing out the draw information.
> I do think that the fact that LOS has enormous difficulty determining that two engines of equal strength are equal and the fact that it is based on the spread between wins and losses should give us pause when applying it to very long matches. And I also doubt the accuracy when based on a very short sequence of games (but of course this is not different from any other statistical measure).
> At some point, I think it makes sense to address why throwing out an enormous number of data points that describe equality and consuming a tiny number of points that describe inequality makes sense.
Hi Dann!
The thing is that the enormous number of data points that are draws does not say anything about which of the players is better. And if it says something, it actually says the opposite of what your intuition is telling you. As I have said in some other posts, having lots of draws actually makes the wins count more if you think not all draws are equal, i.e. if you think that some draws are closer to wins than other draws. The reason is that if you assume that the player with no wins is the better one, it would be highly unlikely for him to have played so many draws without a single win.
/Pio

 Posts: 11495
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
Re: Throwing out draws to calculate Elo
Pio wrote: ↑Thu Jul 02, 2020 11:11 pm
> Dann Corbit wrote: ↑Thu Jul 02, 2020 10:43 pm
> > Yes, so far we have not even addressed my difficulty with throwing out the draw information.
> > I do think that the fact that LOS has enormous difficulty determining that two engines of equal strength are equal and the fact that it is based on the spread between wins and losses should give us pause when applying it to very long matches. And I also doubt the accuracy when based on a very short sequence of games (but of course this is not different from any other statistical measure).
> > At some point, I think it makes sense to address why throwing out an enormous number of data points that describe equality and consuming a tiny number of points that describe inequality makes sense.
> Hi Dann!
> The thing is that the enormous number of data points that are draws does not say anything about which of the players is better. And if it says something, it actually says the opposite of what your intuition is telling you. As I have said in some other posts, having lots of draws actually makes the wins count more if you think not all draws are equal, i.e. if you think that some draws are closer to wins than other draws. The reason is that if you assume that the player with no wins is the better one, it would be highly unlikely for him to have played so many draws without a single win.
> /Pio
This is a really interesting idea and I will have to think long and hard about it.
My difficulty is this:
If my opponent is better than me, it is difficult even to achieve a draw, especially if he/she/they/it are a lot better than me.
And so, if I see one hundred draws, that seems to be sending me a big signal of "Equality, equality, equality...", and so collecting an enormous amount of this kind of data indicates equality to me. On the other hand, I must admit that the Monty Hall problem and the birthday paradox were hard for me to understand until I really understood the math behind them. But I cannot accept it until I understand it. If I am wrong, I do hope to somehow understand why the equality data does not matter. My problem is that I suspect the model (not the math). So even though the math works, I do not feel convinced that it is right.
Re: Throwing out draws to calculate Elo
Dann Corbit wrote: ↑Thu Jul 02, 2020 5:19 am
> syzygy wrote: ↑Thu Jul 02, 2020 12:40 am
> > Instead of looking at LOS you could instead test the hypothesis that engines A and B are equal in strength.
> > We run a match until we have 8 decided games.
> > The match results in N games, i.e. N-8 drawn games and 8 decided games. It turns out that A has won all decided games.
> > What are the chances of this if A and B are indeed equal in strength?
> > Clearly, it is 1 in 256. This strongly suggests that the hypothesis that A and B are equal in strength is not correct.
> > Do you agree? (I would hope you do.)
> > Does any of this depend on the value of N?
> I partly agree with it.
> A single experiment has shown a data point that indicates 1/256 chances that they are equal in strength.
> If you run a coin toss tool, you will see that outcome with a fair penny one time out of 256.
> But if I run the test again, it may be the same or it may be different.
> With 8 data points, the error bar is as big as the figure returned.
There is no error bar in 1/256.
But I will repeat my question, which you did not answer: does any of this depend on N? On the number of draws?
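The arithmetic behind the 1/256 figure can be sketched as follows, under the assumptions stated in the quoted example: games are independent, A and B have the same win rate, and the match stops at 8 decided games. The function names and the `draw_rate` parameter are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction


def sweep_probability(draw_rate, decided=8):
    """P(A wins every one of `decided` decisive games) for equal A and B."""
    if not 0 <= draw_rate < 1:
        raise ValueError("draw_rate must be in [0, 1)")
    win = loss = (1 - draw_rate) / 2  # equal strength: same win and loss rate
    p_win_given_decided = win / (win + loss)  # always 1/2, whatever the draw rate
    return p_win_given_decided ** decided


def expected_games(draw_rate, decided=8):
    """Expected match length N needed to collect `decided` decisive games."""
    return decided / (1 - draw_rate)


if __name__ == "__main__":
    for d in (Fraction(0), Fraction(1, 2), Fraction(999, 1000)):
        print(d, sweep_probability(d), expected_games(d))
```

In this model the draw rate changes how long the match is expected to last, but the draw rate cancels out of the sweep probability, which stays at (1/2)^8 = 1/256.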
Re: Throwing out draws to calculate Elo
Dann Corbit wrote: ↑Thu Jul 02, 2020 4:59 am
> syzygy wrote: ↑Thu Jul 02, 2020 12:30 am
> > Dann Corbit wrote: ↑Wed Jul 01, 2020 11:42 pm
> > > syzygy wrote: ↑Wed Jul 01, 2020 9:42 pm
> > > > Dann Corbit wrote: ↑Wed Jul 01, 2020 12:08 am
> > > > > Nice discussion Ovyron, but I don't think anyone understands what I am saying (probably because I am not communicating very effectively). Lots of intelligent people do not understand what I am saying, which means I am not doing a good job explaining.
> > > > No, you are simply making the mistake to think that higher LOS means higher difference in strength and being rather stubborn.
> > > No, I think it means that it is supposed to be more likely that the engine with the bigger LOS is superior.
> > > A LOS of 1 means it is absolutely certain to be superior.
> > > A LOS of .999 means it is almost certainly superior.
> > > A LOS of 0.5 means that it is a coin toss if it is superior or not.
> > And if engine A draws engine B 99.99999999% of the time and beats engine B the remaining 0.0000001% of the time, would you agree that A is superior?
> > If these numbers can be established with 100% certainty, would you agree that the LOS is 1?
> If you run the experiment twice, that is not enough.
> You seem to think that an engine emitting a win is deterministic. It is not.
I mean exactly what I write. I don't care whether my hypothesis applies to your two favorite engines. I am just trying to make it understandable that one engine can be clearly superior to another engine, yet almost equal in strength.
Maybe you are now admitting that this is possible, but in the beginning of this thread you certainly were heavily denying that that made sense.
So maybe there is progress?

 Posts: 11495
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
Re: Throwing out draws to calculate Elo
syzygy wrote: ↑Thu Jul 02, 2020 11:42 pm
> Dann Corbit wrote: ↑Thu Jul 02, 2020 5:19 am
> > syzygy wrote: ↑Thu Jul 02, 2020 12:40 am
> > > Instead of looking at LOS you could instead test the hypothesis that engines A and B are equal in strength.
> > > We run a match until we have 8 decided games.
> > > The match results in N games, i.e. N-8 drawn games and 8 decided games. It turns out that A has won all decided games.
> > > What are the chances of this if A and B are indeed equal in strength?
> > > Clearly, it is 1 in 256. This strongly suggests that the hypothesis that A and B are equal in strength is not correct.
> > > Do you agree? (I would hope you do.)
> > > Does any of this depend on the value of N?
> > I partly agree with it.
> > A single experiment has shown a data point that indicates 1/256 chances that they are equal in strength.
> > If you run a coin toss tool, you will see that outcome with a fair penny one time out of 256.
> > But if I run the test again, it may be the same or it may be different.
> > With 8 data points, the error bar is as big as the figure returned.
> There is no error bar in 1/256.
> But I will repeat my question, which you did not answer: does any of this depend on N? On the number of draws?
The error from one million data points is not the same as the error from ten data points.
If I have 10 data points, I will be surprised very much by three sports.
If I have one million data points, I will be surprised if there are not three sports.
Re: Throwing out draws to calculate Elo
Dann Corbit wrote: ↑Fri Jul 03, 2020 12:05 am
> syzygy wrote: ↑Thu Jul 02, 2020 11:42 pm
> > Dann Corbit wrote: ↑Thu Jul 02, 2020 5:19 am
> > > syzygy wrote: ↑Thu Jul 02, 2020 12:40 am
> > > > Instead of looking at LOS you could instead test the hypothesis that engines A and B are equal in strength.
> > > > We run a match until we have 8 decided games.
> > > > The match results in N games, i.e. N-8 drawn games and 8 decided games. It turns out that A has won all decided games.
> > > > What are the chances of this if A and B are indeed equal in strength?
> > > > Clearly, it is 1 in 256. This strongly suggests that the hypothesis that A and B are equal in strength is not correct.
> > > > Do you agree? (I would hope you do.)
> > > > Does any of this depend on the value of N?
> > > I partly agree with it.
> > > A single experiment has shown a data point that indicates 1/256 chances that they are equal in strength.
> > > If you run a coin toss tool, you will see that outcome with a fair penny one time out of 256.
> > > But if I run the test again, it may be the same or it may be different.
> > > With 8 data points, the error bar is as big as the figure returned.
> > There is no error bar in 1/256.
> > But I will repeat my question, which you did not answer: does any of this depend on N? On the number of draws?
> The error from one million data points is not the same as the error from ten data points.
> If I have 10 data points, I will be surprised very much by three sports.
> If I have one million data points, I will be surprised if there are not three sports.
So I repeat my question again. Does the outcome 1/256 depend on the value of N? On the number of draws?