Throwing out draws to calculate Elo

Rein Halbersma · Post by **Rein Halbersma** » Fri Jul 03, 2020 9:25 am

Dann Corbit wrote: ↑Fri Jul 03, 2020 2:24 am If the draws do not matter in understanding who is stronger, why does the Elo calculation get a totally wrong answer if you set the draws to zero?

I do understand we are looking for a razor turning point and not a magnitude. But I think it should be more obvious which is stronger if we know an engine is ten times stronger instead of .01% stronger. Or it should affect the size of our confidence interval.

Because Elo difference and LOS are two different questions. Elo difference requires taking draws into account. It just translates to the expected score between two players. 8 wins out of 1 million games translates to a tiny Elo difference. LOS on the other hand is merely concerned with which player is the stronger, regardless of how much stronger. With 8 wins and no losses, no matter the number of games, the probability is 1 in 256 that the players are equal. This is just how the math works out.
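A minimal sketch of the two questions, not taken from any poster's code: `elo_diff` uses the standard logistic Elo expectation formula, and `los` uses the normal approximation commonly used for likelihood of superiority, which depends only on wins and losses. The function names and the 8-win, 999,992-draw match are illustrative assumptions.

```python
import math

def elo_diff(wins, draws, losses):
    """Elo difference implied by the match score (logistic model)."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))

def los(wins, losses):
    """Likelihood of superiority, normal approximation; draws do not enter."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(elo_diff(8, 999_992, 0))  # roughly 0.003 Elo: a tiny magnitude
print(los(8, 0))                # roughly 0.998: a clear direction
print(0.5 ** 8)                 # 1/256, chance of 8-0 between equal players
```

The first number answers "how much stronger", the second answers "which one is stronger", and only the first changes when draws are added.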

hgm · Post by **hgm** » Fri Jul 03, 2020 9:32 am

Dann Corbit wrote: ↑Fri Jul 03, 2020 12:01 am If I grab an entry from the database, chosen at random, how close is it to 0.5?

Close enough to reject with 95% confidence the hypothesis that the engines must be unequal, in 95% of the cases.

Close enough to reject it with 99% confidence in 99% of the cases.

etc.

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 9:36 am

Rein Halbersma wrote: ↑Fri Jul 03, 2020 9:25 am
Because Elo difference and LOS are two different questions. Elo difference requires taking draws into account. It just translates to the expected score between two players. 8 wins out of 1 million games translates to a tiny Elo difference. LOS on the other hand is merely concerned with which player is the stronger, regardless of how much stronger. With 8 wins and no losses, no matter the number of games, the probability is 1 in 256 that the players are equal. This is just how the math works out.

Do you actually believe in a file of 8 million games with 8 wins and no losses that the 8 wins are not statistical noise?

hgm · Post by **hgm** » Fri Jul 03, 2020 9:40 am

There is no place for belief in mathematics. I would know that there is only a 1 in 256 likelihood that random noise on equal engines would achieve that result.

But you probably also believe that 8-0 in an 8-game match without draws would just be statistical noise. So the problem you have is not really related to the draws. It is related to your lack of understanding of what is statistically significant and what is not.

Rein Halbersma · Post by **Rein Halbersma** » Fri Jul 03, 2020 9:56 am

Dann Corbit wrote: ↑Fri Jul 03, 2020 9:36 am
Do you actually believe in a file of 8 million games with 8 wins and no losses that the 8 wins are not statistical noise?

For Elo difference, yes, the 8 wins out of a million are tiny compared to the standard error on the mean, about 1/sqrt(1 million), or roughly 1 in a thousand. So Elo-wise, the engines are within a nose-length. But there's no question that the nose-length is almost certainly in favor of the engine with 8 wins.

It's counter-intuitive, I know, but that's the way the math works out.

hgm · Post by **hgm** » Fri Jul 03, 2020 9:58 am

I could add that in the Elo calculation the ratings would be outside each other's error bars. So the Elo says exactly the same thing as the LOS.
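One way to check this claim is to compute the standard error of the mean score from the actual per-game results rather than from a worst-case bound. The sketch below assumes an illustrative match of 8 wins, 0 losses and 999,992 draws; `score_stats` is a hypothetical helper, not part of any rating tool.

```python
import math

def score_stats(wins, draws, losses):
    """Return (mean score, standard error of the mean) over all games."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score, which takes the values 1, 0.5 and 0.
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    return mean, math.sqrt(var / n)

mean, sem = score_stats(8, 999_992, 0)
z = (mean - 0.5) / sem
print(mean, sem, z)  # z is close to sqrt(8), about 2.83 standard errors
```

Because almost every game is a draw, the per-game variance is tiny, so the small excess over 50% still sits nearly three standard errors away from equality.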

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 10:14 am

Rein Halbersma wrote: ↑Fri Jul 03, 2020 9:56 am
For Elo difference, yes, the 8 wins out of a million are tiny compared to the standard error on the mean, about 1/sqrt(1 million), or roughly 1 in a thousand. So Elo-wise, the engines are within a nose-length. But there's no question that the nose-length is almost certainly in favor of the engine with 8 wins.

It's counter-intuitive, I know, but that's the way the math works out.

It isn't the math that bothers me, it is the model.
If I flipped a coin 8 times and then gave you an Elo figure you would tell me it is useless.
If I play a thousand games and you throw out all but 8 and give me an LOS figure, it not only seems as bad as the Elo calculation, it seems worse. Because the Elo calculation uses the draws to determine strength and the LOS simply ignores them. And the randomness for a thousand games gives a large expected deviation. But this too does not matter.

I guess maybe it is just my brain wrestling with what I feel is common sense. But at any rate, I cannot feel convinced until I understand why these things should be so. And the math is not convincing because I feel the model *must* be wrong. We are ignoring the majority of the data, and relying on a tiny cherry-picked sample and assuming that the error bars are as tiny as the sample.

I shudder to think of it.

hgm · Post by **hgm** » Fri Jul 03, 2020 10:31 am

Dann Corbit wrote: ↑Fri Jul 03, 2020 10:14 am If I flipped a coin 8 times and then gave you an Elo figure you would tell me it is useless.

And this misconception of yours is really at the root of your troubles.

Because if the result of those flips were 8-0 we would of course not say at all that it was useless. It would be highly significant, and make it very likely (i.e. >99% confidence) that the coin was not fair. Or, in the case of engines, that there must be a sizable Elo difference. Put it in your Elo calculator if you don't believe it.

Of course if the result had been 5-3, then it would have likely just been statistical noise.
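The contrast between 8-0 and 5-3 can be made concrete with an exact one-sided binomial tail for a fair coin. This is a sketch using only the standard library; `tail_prob` is a hypothetical helper name.

```python
from math import comb

def tail_prob(heads, flips):
    """P(at least `heads` heads in `flips` fair-coin flips)."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips

print(tail_prob(8, 8))  # 1/256 = 0.00390625
print(tail_prob(5, 8))  # 93/256 = 0.36328125
```

A fair coin produces 5 or more heads in 8 flips more than a third of the time, but 8 out of 8 less than half a percent of the time.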

We discussed all this in connection with the WCCC result. And apparently you learned nothing from it!

syzygy · Post by **syzygy** » Fri Jul 03, 2020 10:38 am

Dann Corbit wrote: ↑Fri Jul 03, 2020 3:40 am
syzygy wrote: ↑Fri Jul 03, 2020 3:27 am
Dann Corbit wrote: ↑Fri Jul 03, 2020 2:12 am Yes and no, in that you can call it 1/256, but the width of the error bars on the measurement should be much, much wider. In fact, so wide as to render the data meaningless.
So the number is the same. But the quality of the number is lower. Lower to the point of uselessness.
What do you mean error bar? Is 1/256 somehow more reliable if there were N=10^100 draws than if there were N=3 draws? Do you have any idea what it means you are saying?
It is not more reliable, it is less reliable.

The number itself, 1/256, is the same. Yes, I agree with you about that. But the data spread we expect to see when you have 10^100 draws is enormous. In fact, it is very shocking that we only have 8 wins. So shocking that it makes the data points look much more like sports.

It is only shocking if there is no explanation for the large number of draws. Most likely there is an explanation, such as one of the explanations that were given already in this thread.

Under the assumption that both engines are equally strong (or more precisely: that each game A versus B has probabilities p, (1-2p), p for win, draw, loss), the probability that the first 8 wins go to A is 1/256. This is independent of N. (But N does give an estimate of p.)

You seem to get confused when N is really big (the likely result of p being very small). But under the hypothesis that the outcomes of games between A and B are distributed w,d,l as p,(1-2p),p, for some unknown value of p, the first 8 wins going to A has the same implication on whether or not we should reject the hypothesis, independent of the number of draws N.

However, if you would fix p to be 0.1 in your hypothesis (which is then no longer just about A being as strong as B), then N being 10^100 obviously means that your hypothesis should be rejected.
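The independence from N can be checked numerically. The sketch below, with hypothetical helper names and an arbitrary seed, simulates games with win/draw/loss probabilities p, 1-2p, p and estimates how often the first 8 decisive games all go to A. The draw rate changes how long that takes, not how likely it is.

```python
import random

def first_eight_all_to_a(p, rng):
    """Play games until 8 decisive ones; True if A wins all of them."""
    wins = 0
    while wins < 8:
        r = rng.random()
        if r < p:          # A wins
            wins += 1
        elif r < 2 * p:    # B wins: the streak is broken
            return False
        # otherwise the game is a draw and is simply skipped
    return True

def estimate(p, trials=50_000, seed=1):
    """Monte Carlo estimate of P(first 8 decisive games all go to A)."""
    rng = random.Random(seed)
    return sum(first_eight_all_to_a(p, rng) for _ in range(trials)) / trials

for p in (0.4, 0.1, 0.02):
    print(p, estimate(p))  # each estimate should land near 1/256, about 0.0039
```

Each decisive game goes to A with probability p / (p + p) = 1/2 whatever the value of p, which is why the estimates agree across draw rates up to sampling noise.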

duncan · Post by **duncan** » Fri Jul 03, 2020 10:58 am

Dann Corbit wrote: ↑Fri Jul 03, 2020 9:36 am

Do you actually believe in a file of 8 million games with 8 wins and no losses that the 8 wins are not statistical noise?

I have not been following this properly, but in your opinion what is the maximum number of games you can have for the 8 wins not to be statistical noise?
