Throwing out draws to calculate Elo

hgm · Post by **hgm** » Sun Jul 05, 2020 1:41 pm

Dann Corbit wrote: ↑Sun Jul 05, 2020 1:06 pm
you will notice that there are about 1000 items in each bin, except for the first and last bins, which have about 500 each.
That means that it is about equally likely that the LOS algorithm will say any of the possible answers between "I am absolutely sure A is stronger than B" to "I am absolutely sure that A is not stronger than B". We can also turn it around and say the same thing in the other direction.

It is about equally likely that the LOS algorithm will say any of the possible answers between "I am absolutely sure B is stronger than A" to "I am absolutely sure that B is not stronger than A".

I offered this data last week in table form and even as a relational database table, but nobody seemed very interested.

Well, I suppose everyone knew this, and you are just trying to batter down an open door.

The point is that, exactly because this is true, the likelihood that the LOS between equal engines is >99% is 98 times smaller than that the LOS is between 1% and 99%. Because there are 98 bins, and all bins get equal numbers of hits. So if you get a LOS of 0.99+, you are either dealing with a 1-in-a-hundred fluke, or there is some other reason.
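
The "1-in-a-hundred" figure can be checked with a small simulation. This is only a sketch under stated assumptions: the thread does not give the LOS formula, so `los` below uses the common normal approximation based on wins and losses only, and the run count, match length and seed are arbitrary illustrative choices (much smaller than the 100,000-game runs discussed here).

```python
import math
import random


def los(wins, losses):
    """Likelihood of superiority via the normal approximation (assumed formula).

    Draws do not appear: only decisive games enter the calculation.
    """
    return 0.5 * math.erfc(-(wins - losses) / math.sqrt(2.0 * (wins + losses)))


def simulate_self_play(runs=20000, decisive_games=1000, seed=1):
    """Return one LOS value per simulated match between two equal engines."""
    rng = random.Random(seed)
    values = []
    for _ in range(runs):
        # Every decisive game is a fair coin flip; the set bits count as wins.
        wins = bin(rng.getrandbits(decisive_games)).count("1")
        values.append(los(wins, decisive_games - wins))
    return values


values = simulate_self_play()
above_99 = sum(v > 0.99 for v in values) / len(values)
between = sum(0.01 <= v <= 0.99 for v in values) / len(values)
print(f"LOS > 0.99:          {above_99:.2%}")
print(f"0.01 <= LOS <= 0.99: {between:.2%}")
```

With equal engines the fraction of runs above 0.99 should come out near 1%, and nearly all remaining runs land between 0.01 and 0.99, which is the uniform spread over bins described above.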

Anyway, you questioned why I would use a LOS algorithm to test superiority of an engine over itself. I mean, it does sound kind of silly because we already know the answer, "It's not superior to itself." but that is exactly why the experiment is important. If the algorithm claims that the program is superior to itself when it is not superior to itself, then that indicates a problem.
What we see here is that the LOS algorithm coughs up an answer that is bad most of the time.
That is because as we have more and more trials, the wins and losses of the same engine do move towards the mean. However, the raw number of wins and losses that are not exactly on the mean will increase in spread (even though, on average, they compute a better mean). This destroys the LOS calculation.

I don't think any criticism of you doing this is justified; it is a perfectly valid test for the method. It is just that the results you get from this do not reflect badly in any way against the method. They might reflect very badly on your interpretation of the numbers that the method provides, but that is quite another issue.

Now, the LOS calculation is not the worst calculation in the world. It also tells us the same thing that common sense tells us. If engine A has more wins than engine B, it is probably stronger. But because it does not care about draws it is missing important information (including the number of games in total).

Again this weird belief that draws would be able to tell you anything about what happens in the other cases. You still did not tell whether you also have such delusions in other areas of life. If I tell you that 99% of the people that use XBoard only use it for Chess, do you think that this affects whether the other 1% use it for Shogi or Xiangqi? How do you think the ratio of the number of Xiangqi and Shogi users must be affected if more people started to use XBoard for Chess?

Another important reason for testing LOS using an engine against itself is that the only thing I have ever seen it used for is for a tiebreaker. For instance CCRL uses it to tell differences between adjacent engines on their list. That makes them (by definition of the ordered list) fairly close in strength to each other.

In principle they could provide a matrix with the LOS between every pair of engines. But it is of course most interesting for engines that are close. Anyone would be able to guess that Rybka is almost certainly stronger than Fairy-Max. The point is that error bars of the Elo do not give the full story; you would have to know the covariances between all pairs as well to compare ratings. The LOS gives you results of that without you having to do the calculation.

Also note that LOS is partially transitive: if both A > B and B > C with high LOS, you can be sure that A > C with even better LOS. The opposite is not true: if both LOS are close to 50%, the LOS between A and C can still be very large, when B is just poorly determined w.r.t. the pair of them, but happens to fall in between.
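
A toy illustration of the second case, as a sketch only. It assumes each rating estimate is an independent Gaussian, which ignores the covariances mentioned above, and the ratings and error bars are invented for the example.

```python
import math


def los_gaussian(mean_x, sd_x, mean_y, sd_y):
    """P(X > Y) for two independent Gaussian rating estimates (assumed model)."""
    z = (mean_x - mean_y) / math.hypot(sd_x, sd_y)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


# Hypothetical ratings as (mean, standard deviation):
# A and C are well determined, B sits in between but is poorly determined.
a = (5.0, 1.0)
b = (0.0, 50.0)
c = (-5.0, 1.0)

los_ab = los_gaussian(*a, *b)
los_bc = los_gaussian(*b, *c)
los_ac = los_gaussian(*a, *c)
print(f"LOS(A>B) = {los_ab:.3f}")
print(f"LOS(B>C) = {los_bc:.3f}")
print(f"LOS(A>C) = {los_ac:.6f}")
```

Both LOS values involving B stay close to 50% because B's error bar swamps the rating gaps, while the LOS between A and C is close to 1.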

It may be that with only a couple thousand games the error spread has not grown large enough to dominate. And so the answers may be OK. But I also think it is faulty for throwing out draw numbers. Draw numbers impart vital information about strength (as demonstrated by the Elo calculation) and therefore tossing out that information makes the algorithm more prone to bad guesses.

Draws do not provide any evidence for which one is stronger. You are the only one that insists that when the fraction of draws is large, the propensity for winning should be as large as that for losing. The rest of the world believes these to be completely independent properties.

duncan · Post by **duncan** » Sun Jul 05, 2020 6:06 pm

hgm wrote: ↑Sat Jul 04, 2020 1:48 pm It is difficult to deduce what his problem is, as what he writes is self-contradictory. He doesn't disagree with the number, but with its value... What the heck does that mean???

I assume it just means the model is wrong. So while in theory it is 1/256. In practise it is much lower.

Dann Corbit · Post by **Dann Corbit** » Sun Jul 05, 2020 7:04 pm

hgm wrote: ↑Sun Jul 05, 2020 1:41 pm
Dann Corbit wrote: ↑Sun Jul 05, 2020 1:06 pm
you will notice that there are about 1000 items in each bin, except for the first and last bins, which have about 500 each.
That means that it is about equally likely that the LOS algorithm will say any of the possible answers between "I am absolutely sure A is stronger than B" to "I am absolutely sure that A is not stronger than B". We can also turn it around and say the same thing in the other direction.

It is about equally likely that the LOS algorithm will say any of the possible answers between "I am absolutely sure B is stronger than A" to "I am absolutely sure that B is not stronger than A".

I offered this data last week in table form and even as a relational database table, but nobody seemed very interested.
Well, I suppose everyone knew this, and you are just trying to batter down an open door.

The point is that, exactly because this is true, the likelihood that the LOS between equal engines is >99% is 98 times smaller than that the LOS is between 1% and 99%. Because there are 98 bins, and all bins get equal numbers of hits. So if you get a LOS of 0.99+, you are either dealing with a 1-in-a-hundred fluke, or there is some other reason.

Anyway, you questioned why I would use a LOS algorithm to test superiority of an engine over itself. I mean, it does sound kind of silly because we already know the answer, "It's not superior to itself." but that is exactly why the experiment is important. If the algorithm claims that the program is superior to itself when it is not superior to itself, then that indicates a problem.
What we see here is that the LOS algorithm coughs up an answer that is bad most of the time.
That is because as we have more and more trials, the wins and losses of the same engine do move towards the mean. However, the raw number of wins and losses that are not exactly on the mean will increase in spread (even though, on average, they compute a better mean). This destroys the LOS calculation.
I don't think any criticism of you doing this is justified; it is a perfectly valid test for the method. It is just that the results you get from this do not reflect badly in any way against the method. They might reflect very badly on your interpretation of the numbers that the method provides, but that is quite another issue.

Now, the LOS calculation is not the worst calculation in the world. It also tells us the same thing that common sense tells us. If engine A has more wins than engine B, it is probably stronger. But because it does not care about draws it is missing important information (including the number of games in total).
Again this weird belief that draws would be able to tell you anything about what happens in the other cases. You still did not tell whether you also have such delusions in other areas of life. If I tell you that 99% of the people that use XBoard only use it for Chess, do you think that this affects whether the other 1% use it for Shogi or Xiangqi? How do you think the ratio of the number of Xiangqi and Shogi users must be affected if more people started to use XBoard for Chess?

Another important reason for testing LOS using an engine against itself is that the only thing I have ever seen it used for is for a tiebreaker. For instance CCRL uses it to tell differences between adjacent engines on their list. That makes them (by definition of the ordered list) fairly close in strength to each other.
In principle they could provide a matrix with the LOS between every pair of engines. But it is of course most interesting for engines that are close. Anyone would be able to guess that Rybka is almost certainly stronger than Fairy-Max. The point is that error bars of the Elo do not give the full story; you would have to know the covariances between all pairs as well to compare ratings. The LOS gives you results of that without you having to do the calculation.

Also note that LOS is partially transitive: if both A > B and B > C with high LOS, you can be sure that A > C with even better LOS. The opposite is not true: if both LOS are close to 50%, the LOS between A and C can still be very large, when B is just poorly determined w.r.t. the pair of them, but happens to fall in between.

It may be that with only a couple thousand games the error spread has not grown large enough to dominate. And so the answers may be OK. But I also think it is faulty for throwing out draw numbers. Draw numbers impart vital information about strength (as demonstrated by the Elo calculation) and therefore tossing out that information makes the algorithm more prone to bad guesses.
Draws do not provide any evidence for which one is stronger. You are the only one that insists that when the fraction of draws is large, the propensity for winning should be as large as that for losing. The rest of the world believes these to be completely independent properties.

Looking at the start of the list:

pct,count(pct)
100,456
99,996
98,1056

The third number in the list means this:
the count of elements where ( round(los_prob * 100,0)) = 98 is 1056

What the data shows is that when two engines are equal and 100,000 measurements have been made, it is equally likely that the LOS function will return any number between 0 and 1. It has become, in fact, a uniform random number generator. We know the engines are equal and with 100,000 trials one would think we would get a very accurate answer to the question. Certainly Elo would give a very accurate answer.
The LOS algorithm returns .97 exactly as often as it returns 0.5 and exactly as often as it returns every other number between 0 and 1.
This is due to the randomness inherent in data and the way that the LOS algorithm works.

Elo and LOS are measuring approximately the same thing. After all, what is an Elo? It is a guy's name. What do the units mean?
If we have a difference in Elo between two members of the same pool this will translate into a probability of winning.
In fact, Elo is nothing more than a more sophisticated and accurate answer to the question, "who is stronger?"

So now, let's look at the problem from a more realistic dataset with respect to chess. After all, when two equally matched chess engines play each other, especially when they are very strong, we see mostly draws. So instead of the 0 for draws (because the penny never lands on the edge of 0.500... probability) let's see what happens if we add half a million draws to each measurement so that we have 600,000 games in the set now, with 6 out of 7 games being draws. Of course LOS, which does not care about draws will not change its opinion about who might be stronger, so we get back the same set of uniform random numbers between 0 and 1 for LOS. But what about Elo? Now the extreme tail ends have an Elo difference well under 1 Elo. For example:
losses: 49373 wins: 50627 ties: 500000 LOS: 0.999963 Elo diff: 0.396075
...
losses: 50781 wins: 49219 ties: 500000 LOS: 3.9166e-07 Elo diff: -0.493357

So Elo values would say that even at the extreme ends of the tail, the engines are about the same strength (within less than 1/2 Elo difference even at the extreme tail ends of the curve). So Elo ALWAYS gets the right answer. The two engines have the same strength.
The LOS (of course) has not changed its expectation based on the new information. It is still delivering the wrong answer almost all the time.
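
A minimal sketch of why the LOS figures are unchanged by the added draws. It assumes LOS is computed with the common normal approximation on wins and losses; the thread does not state the formula actually used, and the `draws` parameter exists only to make explicit that it is never read.

```python
import math


def los(wins, losses, draws=0):
    """Likelihood of superiority, normal approximation (assumed formula).

    `draws` is accepted but deliberately unused: it does not enter the result.
    """
    return 0.5 * math.erfc(-(wins - losses) / math.sqrt(2.0 * (wins + losses)))


# The two tail examples quoted above, with and without the 500,000 draws.
for wins, losses in [(50627, 49373), (49219, 50781)]:
    for draws in (0, 500000):
        value = los(wins, losses, draws)
        print(f"wins={wins} losses={losses} draws={draws} LOS={value:.6g}")
```

Under this assumption the two examples come out at roughly 0.99996 and 3.9e-07, in line with the LOS values quoted above, regardless of the draw count.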

IOW, if we have a large number of trials between two engines of identical strength and we ask the Elo algorithm which is stronger, Elo always gets the right answer. LOS, on the other hand, is a blind monkey, groping for bananas in the pitch black darkness.

Dann Corbit · Post by **Dann Corbit** » Sun Jul 05, 2020 7:23 pm

And now, for your dining and dancing pleasure:

The point being that values close to the true mean of 0.5 should be at least more probable (one would think) because we are trying to find out if the engines have the same strength or not.

And lest you think I have cherry picked the data, the entire list was posted, so pick your own pairs.

Dann Corbit · Post by **Dann Corbit** » Sun Jul 05, 2020 8:14 pm

That should be "5 out of 6 games being draws"
instead of
"6 out of 7 games being draws"

hgm · Post by **hgm** » Sun Jul 05, 2020 8:21 pm

Dann Corbit wrote: ↑Sun Jul 05, 2020 7:04 pm
Looking at the start of the list:
pct,count(pct)
100,456
99,996
98,1056
The third number in the list means this:
the count of elements where ( round(los_prob * 100,0)) = 98 is 1056

What the data shows is that when two engines are equal and 100,000 measurements have been made, it is equally likely that the LOS function will return any number between 0 and 1. It has become, in fact, a uniform random number generator. We know the engines are equal and with 100,000 trials one would think we would get a very accurate answer to the question.

You do. But you are asking the wrong question. That is YOUR fault, not the fault of the method to answer it.

You ask which of the two engines is stronger, irrespective of how much stronger. For equal engines that is of course undecidable; no matter how many games you play, and how close to equality the results are, there is always the possibility that there is a difference between them that was too small to measure for the number of games you did. So the question will remain undecidable forever. And this is what the LOS distribution tells you. If it would ever say "it is now proven these engines are exactly equal" it would be flawed, because it would be lying: it is impossible to prove such a thing empirically. You can only prove they are different.

So you can only expect convergence of the LOS when there is a difference. Try the same thing putting the threshold at 0.5001 instead of 0.5, and then look how the LOS is distributed.

Certainly Elo would give a very accurate answer.

It would give exactly the same answer. This is another one of your delusions. You have all kinds of misconceptions about Elo calculations, which you then substitute for the real calculations. If you would have put a run from each bin in the Elo calculator, you would have seen that, depending on the bin, you would get Elo's with a difference that can lie far outside each other's error bars, in one direction or the other.

Elo and LOS are measuring approximately the same thing.

No, Elo measures a different thing: it measures the strength on a quantitative scale. The answer to the question which is stronger is indirectly derived from that. For two engines you can directly see it from the error bars on the ratings, and the LOS provides no extra information. For more engines you would not only need error bars but also covariances, and the Elo picture gets rather obfuscated; the LOS then summarizes all that in a single number.

So now, let's look at the problem from a more realistic dataset with respect to chess. After all, when two equally matched chess engines play each other, especially when they are very strong, we see mostly draws. So instead of the 0 for draws (because the penny never lands on the edge of 0.500... probability) let's see what happens if we add half a million draws to each measurement so that we have 600,000 games in the set now, with 6 out of 7 games being draws. Of course LOS, which does not care about draws will not change its opinion about who might be stronger, so we get back the same set of uniform random numbers between 0 and 1 for LOS. But what about Elo? Now the extreme tail ends have an Elo difference well under 1 Elo. For example:
losses: 49373 wins: 50627 ties: 500000 LOS: 0.999963 Elo diff: 0.396075
...
losses: 50781 wins: 49219 ties: 500000 LOS: 3.9166e-07 Elo diff: -0.493357

1) Your numbers are suspect. I always use the rule of thumb "1% = 7 Elo". Your first result has 627 extra wins out of 600K games, or slightly more than 0.1%. That should give an Elo difference of slightly more than 0.7. But you quote only 0.39. I suspect that you are cheating by a factor 2 in your favor, here.
2) You forgot to report the error bars. For an 83% draw rate the standard deviation of the score should be 20%/sqrt(N), or for 600K games 0.025%, which with my rule of thumb translates to 0.18 Elo. The Elo difference was 0.7 Elo, 4 times larger than the standard deviation. The Elo thus says that the winning engine was almost certainly stronger.
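
The arithmetic in points 1) and 2) can be checked with a short sketch. It assumes the standard logistic Elo model (expected score s corresponds to an Elo difference of 400·log10(s/(1−s))) and treats each game as an independent result of 0, ½ or 1; the variable names are illustrative, and this is not necessarily how either poster's tool computes its figures.

```python
import math

# First quoted example: 50627 wins, 49373 losses, 500000 draws.
wins, losses, draws = 50627, 49373, 500000
games = wins + losses + draws

# Score fraction, counting a draw as half a point.
score = (wins + 0.5 * draws) / games

# Elo difference implied by that score under the logistic model.
elo = 400.0 * math.log10(score / (1.0 - score))

# Standard deviation of the mean score: per-game variance of a result
# in {0, 0.5, 1}, divided by the number of games.
mean_square = (wins + 0.25 * draws) / games
sd_score = math.sqrt((mean_square - score ** 2) / games)

# Slope of the Elo curve at this score (about 7 Elo per 1% near 50%),
# used to convert the score error into an Elo error.
slope = 400.0 / (math.log(10) * score * (1.0 - score))
sd_elo = sd_score * slope

print(f"score        = {score:.4%}")
print(f"Elo diff     = {elo:.3f}")
print(f"1 sd in Elo  = {sd_elo:.3f}")
print(f"Elo diff/sd  = {elo / sd_elo:.2f}")
```

Under these assumptions the difference is about 0.73 Elo with a standard deviation of about 0.18 Elo, so the result sits roughly four standard deviations from zero, consistent with the rule-of-thumb estimate above.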

You have no point. The experiment you did refutes everything you claim, and confirms everything the rest of the world says.

So Elo values would say that even at the extreme ends of the tail, the engines are about the same strength (within less than 1/2 Elo difference even at the extreme tail ends of the curve).

And it said that one was almost certainly a bit stronger than the other.

So Elo ALWAYS gets the right answer.

No, that was the wrong answer. They were equally strong.

The two engines have the same strength.

So that the Elo calculation said that one was almost certainly stronger was the wrong answer.

The LOS (of course) has not changed its expectation based on the new information. It is still delivering the wrong answer almost all the time.

It is still delivering the same answer as the Elo. Which is correct most of the time, namely that the result is "too close to call" with any confidence. Only for extreme flukes, like those you cherry-pick, the Elo and the LOS give the wrong impression.

IOW, if we have a large number of trials between two engines of identical strength and we ask the Elo algorithm which is stronger, Elo always gets the right answer. LOS, on the other hand, is a blind monkey, groping for bananas in the pitch black darkness.

No, no, and no. Wrong on all counts. Keep trying.

Dann Corbit · Post by **Dann Corbit** » Sun Jul 05, 2020 8:26 pm

[Moderation]

Due to a stupid error I obliterated Dann's message. My sincerest apologies. By the quotes in the post below you will still be able to see most of it.

H.G.

hgm · Post by **hgm** » Sun Jul 05, 2020 8:44 pm

You are claiming that when two engines have the same strength, the strength difference is undecidable. And yet Elo got the right answer (same strength) every single time, including the extreme tails.

Read again. And keep re-reading it until you understand what was written, instead of replacing it by what you think must be true.

So, apparently, it is not undecidable.

Fact is that it is empirically undecidable. If the Elo calculation would decide it, as you erroneously seem to think, it would only prove the Elo calculation is not reliable, because it gives the wrong answer.

A wise person once said, "Theory and practice sometimes clash. And when that happens, theory loses. Every single time."

That was in fact an idiot. When theory and practice clash, it virtually always means that you goofed in practice. "The map says it is only 40km to Utrecht, and my odometer says I have already traveled 85km now...!". "Get lost!".

That is not because the math was wrong. It is because the theory was wrong.

The only place that LOS is ever used is where the two engines are almost exactly the same strength as a tiebreaker. Thank you for pointing out that it is utterly useless in this regard.

Keep dreaming your delusions. You brought up a stone some time ago. Let me refine it a bit:

You talked to God. And even a stone would have learned more from the experience.

Dann Corbit · Post by **Dann Corbit** » Sun Jul 05, 2020 9:05 pm

"Fact is that it is empirically undecidable. If the Elo calculation would decide it, as you erroneously seem to think, it would only prove the Elo calculation is not reliable, because it gives the wrong answer."

The Elo calculation said that the same engine had the same Elo all the way down to the last decimal digit. This is, in fact, the right answer.
Now, the fractional part is still in question even after half a million trials. But that does not matter, because Elo is (wisely) reported as an integer.
So, the Elo difference of zero is exactly correct for every single measurement. The hypothesis is confirmed. An engine is not stronger than itself.

I think maybe it is a good time to retire the discussion. I think LOS is a bad and misleading statistic, and you think it is a fine and accurate statistic.
I even agree that with some possible sets of data, it might return a pretty good answer. But even in those places, Elo would return a better answer (IMO).

I actually feel very badly about the somewhat acrimonious nature of the debate. That is because I consider hgm (for his work on Winboard) and Ronald (for his work on Syzygy tablebase files and Cfish and other things) to be true heroes of computer chess. I think, as Wesley said, "we are at an impasse" and so future debate will prove equally fruitless.

Dann Corbit · Post by **Dann Corbit** » Sun Jul 05, 2020 9:22 pm

The truncation of the fractional part of the Elo figure is a rough equivalent to "The values are the same within the bounds of experimental uncertainty".
