Throwing out draws to calculate Elo

syzygy · Post by **syzygy** » Fri Jul 03, 2020 11:39 pm

Dann Corbit wrote: ↑Fri Jul 03, 2020 11:35 pm Suppose you play ten games. The result is 8 wins 2 draws for engine A. A huge LOS for A. Will you decide that the change you made is worthwhile and incorporate the new code into your engine?

If not, why not? The LOS is enormous.

So now the number of draws plays no role anymore? What is your position exactly?

Is there anything left of the title of this thread? We know that your reference to "Elo" is a total mistake. Do you no longer care about the number of draws, either?

Or are you being a moving target on purpose, changing your position with every new post?

syzygy · Post by **syzygy** » Fri Jul 03, 2020 11:49 pm

On the one hand, you are saying that 1000 wins, 0 losses and 10^10^1000-1000 draws means even less than 8 wins, 0 losses and 10^100-8 draws.

On the other hand, you are saying that 8 wins, 0 losses, and 2 draws doesn't mean much.

So apparently the number of draws makes no difference to you (this is actually correct! applause!).

But now the conclusion must be that 1000 wins, 0 losses, and 2 draws means even less than 8 wins, 0 losses, and 2 draws.

Are you sure about that?

Do you realise what a confused impression you are making?

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 11:52 pm

I am addressing the question pieces at a time, and the draws are an important facet. I was just making the point that you don't believe in LOS either, because you do not trust the LOS number.

I mean, if LOS says that the change is almost certain to be an improvement, why wouldn't you use it?

Clearly, you believe that randomness is involved and therefore the 8 wins are not enough data to make a sound decision.

And yet (I believe) the 8-0-2 LOS is more sound than an 8-0-1000 LOS because the first set of measurements is a better indicator that A is probably stronger than B than the second set. We want to know a binary answer "A is stronger, true or false". The draws give evidence that we did not have before we collected all the draws.

We don't trust Elo either, until we have a large collection of games, despite its using all the evidence at hand and despite its giving us a clear measure of not only stronger/weaker but also by how much stronger or weaker. And why don't we trust Elo? Because we need a lot of measurements to reduce the effects of randomness.

In other words, sensible people believe that the outcome of a contest of chess games has a sizeable element of randomness. We want a lot of data before we believe the result. The reason we want a lot of data is because the effects of randomness average out over a long period of trials.

hgm · Post by **hgm** » Sat Jul 04, 2020 12:16 am

Dann Corbit wrote: ↑Fri Jul 03, 2020 11:35 pm Suppose you play ten games. The result is 8 wins 2 draws for engine A. A huge LOS for A. Will you decide that the change you made is worthwhile and incorporate the new code into your engine?

If not, why not? The LOS is enormous.

Of course I would. It is almost certainly an improvement.

syzygy · Post by **syzygy** » Sat Jul 04, 2020 12:20 am

Dann Corbit wrote: ↑Fri Jul 03, 2020 11:52 pm I am addressing the question pieces at a time, and the draws are an important facet. I was just making the point that you don't believe in LOS either, because you do not trust the LOS number.

I mean, if LOS says that the change is almost certain to be an improvement, why wouldn't you use it?

Clearly, you believe that randomness is involved and therefore the 8 wins are not enough data to make a sound decision.

Of course randomness is involved in any match between non-deterministic engines. That does not mean no statistically sound conclusion can be drawn.

If all I know is 8-0-2 (selected without bias) and I can only accept or reject the patch that just changes some parameter values, I would accept it. It is more likely than not an improvement and I don't care whether that parameter is set to 5 or to 6.

However, where did that 8-0-2 come from? If someone is running a test and interrupted it after 10 games BECAUSE he saw 8-0-2, then the result is basically meaningless. Or if someone has been running a variety of tests, then happens to run a 10-game match which just happens to give 8-0-2, I would also decide to run my own tests. It is easy to inadvertently cheat with statistics.
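The early-stopping problem can be illustrated with a small simulation. This is a minimal Python sketch, assuming two equally strong engines, counting only decided games, and using the 8-0 level (1/256) as the significance threshold; the function names, the 100-game cap and the trial count are illustrative choices, not anything from this thread. It scores the same simulated matches two ways: checking once at the end, and checking after every decided game and stopping at the first "significant" result.

```python
import random
from fractions import Fraction
from math import comb

ALPHA = Fraction(1, 256)  # the level of an 8-0 result among decided games


def wins_needed(decided):
    """Smallest win count whose one-sided binomial tail is <= ALPHA, else None."""
    tail, total = 0, 2 ** decided
    needed = None
    for w in range(decided, -1, -1):
        tail += comb(decided, w)
        if Fraction(tail, total) > ALPHA:
            break
        needed = w
    return needed


def simulate(max_decided=100, trials=5000, seed=1):
    """Return (fixed_rate, peek_rate) of false positives for equal engines."""
    rng = random.Random(seed)
    thresholds = [wins_needed(n) for n in range(max_decided + 1)]
    fixed_hits = peek_hits = 0
    for _ in range(trials):
        wins = 0
        crossed_at_any_look = False
        crossed_at_end = False
        for n in range(1, max_decided + 1):
            wins += rng.random() < 0.5  # fair coin: the engines are equal
            need = thresholds[n]
            significant = need is not None and wins >= need
            crossed_at_any_look = crossed_at_any_look or significant
            crossed_at_end = significant  # overwritten until the final game
        fixed_hits += crossed_at_end
        peek_hits += crossed_at_any_look
    return fixed_hits / trials, peek_hits / trials
```

Calling `simulate()` returns the two false-positive rates. Because the peeking rule also looks at the final game, it can never flag fewer matches than the fixed-length rule, and in practice it flags noticeably more, even though every single look uses the same 1/256 threshold.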

And yet (I believe) the 8-0-2 LOS is more sound than an 8-0-1000 LOS because the first set of measurements is a better indicator that A is probably stronger than B than the second set. We want to know a binary answer "A is stronger, true or false". The draws give evidence that we did not have before we collected all the draws.

If the hypothesis is that A and B are equally strong and you run a match until 8 decided games, then the number of draws is no indicator of whether the hypothesis should be rejected or not. Only the W/L ratio is. I have explained this already.

Each decided game ends in win or loss, each with probability 0.5. The probability of 8 wins is 1/256. The number of games does not influence that. You cannot speak of "error margins" with respect to the calculated probability 1/256. (I am ignoring white/black advantage to keep things simple.) Of course 1/256 can happen just by chance, it will happen if you repeat often enough, or it can happen because you first run the match and only after seeing the outcome decided what pattern to look for. Doing statistics right is very tricky. But the number of draws makes no difference. (In particular if you decide upfront to throw out the draws! Obviously it is entirely sound to decide to run an experiment that throws out the draws and try to draw a statistical conclusion based just on the W/L ratio.)
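The 1/256 figure can be checked directly. A minimal Python sketch, assuming the equal-strength hypothesis above (each decided game is a fair coin flip) and ignoring colour advantage; the function names are illustrative, and the LOS formula is the commonly used normal approximation, not something defined in this thread:

```python
from fractions import Fraction
from math import comb, erf, sqrt


def tail_probability(wins, losses):
    """P(at least `wins` wins among the decided games) if both engines are equal."""
    decided = wins + losses
    favourable = sum(comb(decided, k) for k in range(wins, decided + 1))
    return Fraction(favourable, 2 ** decided)


def los_normal_approx(wins, losses, draws=0):
    """Likelihood of superiority, normal approximation.

    `draws` is accepted only to show that it is never used.
    """
    decided = wins + losses
    if decided == 0:
        raise ValueError("LOS is undefined without decided games")
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * decided)))
```

`tail_probability(8, 0)` gives 1/256, and `los_normal_approx(8, 0, 2)` equals `los_normal_approx(8, 0, 1000)` because the draw count never enters either formula.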

Pio · Post by **Pio** » Sat Jul 04, 2020 12:30 am

syzygy wrote: ↑Sat Jul 04, 2020 12:20 am
Dann Corbit wrote: ↑Fri Jul 03, 2020 11:52 pm I am addressing the question pieces at a time, and the draws are an important facet. I was just making the point that you don't believe in LOS either, because you do not trust the LOS number.

I mean, if LOS says that the change is almost certain to be an improvement, why wouldn't you use it?

Clearly, you believe that randomness is involved and therefore the 8 wins are not enough data to make a sound decision.
Of course randomness is involved in any match between non-deterministic engines. That does not mean no statistically sound conclusion can be drawn.

If all I know is 8-0-2 (selected without bias) and I can only accept or reject the patch that just changes some parameter values, I would accept it. It is more likely than not an improvement and I don't care whether that parameter is set to 5 or to 6.

However, where did that 8-0-2 come from? If someone is running a test and interrupted it after 10 games BECAUSE he saw 8-0-2, then the result is basically meaningless. Or if someone has been running a variety of tests, then happens to run a 10-game match which just happens to give 8-0-2, I would also decide to run my own tests. It is easy to inadvertently cheat with statistics.

And yet (I believe) the 8-0-2 LOS is more sound than an 8-0-1000 LOS because the first set of measurements is a better indicator that A is probably stronger than B than the second set. We want to know a binary answer "A is stronger, true or false". The draws give evidence that we did not have before we collected all the draws.
If the hypothesis is that A and B are equally strong and you run a match until 8 decided games, then the number of draws is no indicator of whether the hypothesis should be rejected or not. Only the W/L ratio is. I have explained this already.

If you think about it, and as I have explained earlier, the number of draws actually is an indicator of whether the hypothesis should be rejected or not, even though it is not incorporated in LOS. The indicator, though, is the opposite of what Dann thinks is logical. I think HGM understands my point.

syzygy · Post by **syzygy** » Sat Jul 04, 2020 12:33 am

Pio wrote: ↑Sat Jul 04, 2020 12:30 am If you think about it, and as I have explained earlier, the number of draws actually is an indicator of whether the hypothesis should be rejected or not, even though it is not incorporated in LOS. The indicator, though, is the opposite of what Dann thinks is logical. I think HGM understands my point.

I have read your explanations and I don't really disagree with them, but they do involve making assumptions about why the number of draws is so high. Without such assumptions, I don't think the draws mean anything.

Dann Corbit · Post by **Dann Corbit** » Sat Jul 04, 2020 12:46 am

syzygy wrote: ↑Sat Jul 04, 2020 12:20 am
Dann Corbit wrote: ↑Fri Jul 03, 2020 11:52 pm I am addressing the question pieces at a time, and the draws are an important facet. I was just making the point that you don't believe in LOS either, because you do not trust the LOS number.

I mean, if LOS says that the change is almost certain to be an improvement, why wouldn't you use it?

Clearly, you believe that randomness is involved and therefore the 8 wins are not enough data to make a sound decision.
Of course randomness is involved in any match between non-deterministic engines. That does not mean no statistically sound conclusion can be drawn.

If all I know is 8-0-2 (selected without bias) and I can only accept or reject the patch that just changes some parameter values, I would accept it. It is more likely than not an improvement and I don't care whether that parameter is set to 5 or to 6.

However, where did that 8-0-2 come from? If someone is running a test and interrupted it after 10 games BECAUSE he saw 8-0-2, then the result is basically meaningless. Or if someone has been running a variety of tests, then happens to run a 10-game match which just happens to give 8-0-2, I would also decide to run my own tests. It is easy to inadvertently cheat with statistics.

And yet (I believe) the 8-0-2 LOS is more sound than an 8-0-1000 LOS because the first set of measurements is a better indicator that A is probably stronger than B than the second set. We want to know a binary answer "A is stronger, true or false". The draws give evidence that we did not have before we collected all the draws.
If the hypothesis is that A and B are equally strong and you run a match until 8 decided games, then the number of draws is no indicator of whether the hypothesis should be rejected or not. Only the W/L ratio is. I have explained this already.

Each decided game ends in win or loss, each with probability 0.5. The probability of 8 wins is 1/256. The number of games does not influence that. You cannot speak of "error margins" with respect to the calculated probability 1/256. (I am ignoring white/black advantage to keep things simple.) Of course 1/256 can happen just by chance, it will happen if you repeat often enough, or it can happen because you first run the match and only after seeing the outcome decided what pattern to look for. Doing statistics right is very tricky. But the number of draws makes no difference. (In particular if you decide upfront to throw out the draws! Obviously it is entirely sound to decide to run an experiment that throws out the draws and try to draw a statistical conclusion based just on the W/L ratio.)

As I have said, on many occasions, I do not disagree with your number (in this case 1/256). I disagree with the value of your number. With 8 trials the number is not trustworthy. I guess, if I was at the bottom of a well and someone gave me an instruction and following the instruction gave me a LOS of .51 of living and .49 of dying and if I did not follow the instruction the LOS of dying was 1 I would probably follow it. But I would be very unhappy about it.

I also do not believe you when you say you would change your engine based on 10 games with an 8 wins and 2 draws outcome. That's a lie and you know it.

Dann Corbit · Post by **Dann Corbit** » Sat Jul 04, 2020 12:52 am

Bayesian logic says that we should change our expectation based on new information.
If I have 8 wins and 2 draws LOS gives me a number. If I add a thousand draws, LOS gives me the same number.
Elo, on the other hand, would use the new information to change the expectation. Elo is also really only an indicator of stronger/weaker, when you think about it. The actual numbers (Elo is 2500) are meaningless because the only thing it really tells you is difference in strength with respect to a pool.

The Elo figure, with enough draws, will tell you the change is not worthwhile. It has changed its expectation about who is stronger, based on draws. Is the Elo calculation wrong? If not, why were the draws helpful?
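To make the comparison concrete, here is a small Python sketch of the point estimate that moves when draws are added. It assumes the standard logistic Elo model, where the expected score is 1 / (1 + 10^(-d/400)) for a rating difference d; `elo_diff` is an illustrative helper, not code from any rating tool:

```python
from math import log10


def elo_diff(wins, losses, draws):
    """Elo difference implied by a match score under the logistic model."""
    games = wins + losses + draws
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games  # a draw counts as half a point
    if score <= 0.0 or score >= 1.0:
        raise ValueError("Elo difference is unbounded for a 0% or 100% score")
    return -400.0 * log10(1.0 / score - 1.0)
```

For 8-0-2 this gives roughly +382 Elo, and for 8-0-1002 about +2.8 Elo. The estimated size of the difference shrinks toward zero with every added draw, while its sign stays positive.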

Dann Corbit · Post by **Dann Corbit** » Sat Jul 04, 2020 1:34 am

Dann Corbit wrote: ↑Sat Jul 04, 2020 12:46 am I also do not believe you when you say you would change your engine based on 10 games with an 8 wins and 2 draws outcome. That's a lie and you know it.

Now (having thought about what I said) I am forced to apologize for saying that. I would never make a change based on ten games, and I was projecting that onto you, and I have no right to do that.
I apologize because I cannot read your mind, and that is the basis of my claim that you were lying.
