Throwing out draws to calculate Elo

hgm
Posts: 24651
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller

Re: Throwing out draws to calculate Elo

Post by hgm » Sun Jul 05, 2020 7:58 pm

Dann Corbit wrote:
Sun Jul 05, 2020 7:05 pm
The Elo calculation said that the same engine had the same Elo all the way down to the last decimal digit. This is, in fact, the right answer.
No, it is an approximate, and therefore wrong, answer. And useless for determining which is best if the difference is less than the rounding error.
Now, the fractional part is still in question even after half a million trials. But that does not matter, because Elo is (wisely) reported as an integer.
So, the Elo difference of zero is exactly correct for every single measurement. The hypothesis is confirmed. An engine is not stronger than itself.
And now apply that 'wisdom' to an 'engine' where you set the winning threshold at 0.500001. Or 0.500000001. Now it will also (after rounding) say they are exactly equal. Which they were not. So the Elo answer is wrong. For the single case for which you artificially forced it to be right, it is now wrong for an infinity of other cases. You succeeded in making Elo calculations unsuitable for engines that are different. "Only to be used when beforehand you are absolutely sure the engines are the same, otherwise you will get wrong answers." Congratulations!
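To put numbers on the rounding argument, here is a minimal sketch, assuming the usual logistic Elo model (the thread does not say which formula is meant). The function name `elo_diff` is made up for illustration. It converts an expected score into an Elo difference and shows what is left after rounding to an integer.

```python
import math

def elo_diff(score):
    """Elo difference implied by an expected score (logistic model, an assumption)."""
    return 400.0 * math.log10(score / (1.0 - score))

# The two thresholds are the ones named in the post; 0.5 is the identical-engine case.
for score in (0.5, 0.500001, 0.500000001):
    d = elo_diff(score)
    print(f"score={score!r}  elo={d:.9f}  rounded={round(d)}")
```

Under that model both non-equal thresholds give a small positive Elo difference that disappears once the value is rounded to an integer.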

Why not report all ratings in kilo-Elo, and round? Wouldn't that save an awful lot of test games? Have you already suggested it to the Stockfish developers?
I think maybe it is a good time to retire the discussion. I think LOS is a bad and misleading statistic, and you think it is a fine and accurate statistic.
I even agree that with some possible sets of data, it might return a pretty good answer. But even in those places, Elo would return a better answer (IMO).

I actually feel very badly about the somewhat acrimonious nature of the debate. That is because I consider hgm (for his work on Winboard) and Ronald (for his work on Syzygy tablebase files and Cfish and other things) to be true heroes of computer chess. I think, as Wesley said, "we are at an impasse" and so future debate will prove equally fruitless.
It is as fruitless as you want it to be. You are obviously not willing to listen to reason, and keep chanting the same falsehoods no matter how often they have been refuted. So yes, discussion about this for you personally is utterly pointless.

But that doesn't matter. We discuss in public so others can learn. As long as it is glaringly obvious that literally everything you said is completely detached from any reality, as I think by now it must be 10 times over, we should feel good about it.

Dann Corbit
Posts: 11218
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Throwing out draws to calculate Elo

Post by Dann Corbit » Sun Jul 05, 2020 8:07 pm

Elo figures are already correctly reported with +/- Elo error bounds.
There is no need to ask the Stockfish developers to truncate anything.
However, if they are using LOS in fishtest, they should be warned that they are tuning Stockfish with a uniform PRNG (given enough games)
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

hgm
Posts: 24651
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller

Re: Throwing out draws to calculate Elo

Post by hgm » Sun Jul 05, 2020 8:30 pm

Dann Corbit wrote:
Sun Jul 05, 2020 8:07 pm
Elo figures are already correctly reported with +/- Elo error bounds.
There is no need to ask the Stockfish developers to truncate anything.
However, if they are using LOS in fishtest, they should be warned that they are tuning Stockfish with a uniform PRNG (given enough games)
Yeah. No wonder it is so weak, with all that random tuning...

Robert Pope
Posts: 522
Joined: Sat Mar 25, 2006 7:27 pm

Re: Throwing out draws to calculate Elo

Post by Robert Pope » Sun Jul 05, 2020 11:26 pm

Dann Corbit wrote:
Sun Jul 05, 2020 8:07 pm
Elo figures are already correctly reported with +/- Elo error bounds.
There is no need to ask the Stockfish developers to truncate anything.
However, if they are using LOS in fishtest, they should be warned that they are tuning Stockfish with a uniform PRNG (given enough games)
It's only a uniform generator if the two engines are indeed of equal strength, right? In which case, it doesn't matter which engine is promoted. If one engine is indeed stronger than the other, LOS won't be random uniform, but in favor of the stronger engine.
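This can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: LOS is computed with the common normal-approximation formula on decisive games (draws ignored), the match length, trial count, seed and the 52% win rate are arbitrary choices, and `los` and `simulate_los` are hypothetical helper names.

```python
import math
import random

def los(wins, losses):
    """Likelihood of superiority, normal approximation, draws ignored."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

def simulate_los(p_win, games, trials, rng):
    """LOS values from repeated matches of `games` decisive games each."""
    values = []
    for _ in range(trials):
        wins = sum(rng.random() < p_win for _ in range(games))
        values.append(los(wins, games - wins))
    return values

rng = random.Random(12345)  # fixed seed, arbitrary choice for repeatability
equal = simulate_los(0.50, 2000, 400, rng)   # identical engines
better = simulate_los(0.52, 2000, 400, rng)  # hypothetical 52% winner

share_equal = sum(v > 0.5 for v in equal) / len(equal)
share_better = sum(v > 0.5 for v in better) / len(better)
print("equal strength: share of matches with LOS > 0.5 =", share_equal)
print("52% winner:     share of matches with LOS > 0.5 =", share_better)
```

With equal engines the LOS values land on either side of 0.5 about equally often; with a real edge they pile up above 0.5.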

Pio
Posts: 171
Joined: Sat Feb 25, 2012 9:42 pm
Location: Stockholm

Re: Throwing out draws to calculate Elo

Post by Pio » Sun Jul 05, 2020 11:30 pm

Robert Pope wrote:
Sun Jul 05, 2020 11:26 pm
Dann Corbit wrote:
Sun Jul 05, 2020 8:07 pm
Elo figures are already correctly reported with +/- Elo error bounds.
There is no need to ask the Stockfish developers to truncate anything.
However, if they are using LOS in fishtest, they should be warned that they are tuning Stockfish with a uniform PRNG (given enough games)
It's only a uniform generator if the two engines are indeed of equal strength, right? In which case, it doesn't matter which engine is promoted. If one engine is indeed stronger than the other, LOS won't be random uniform, but in favor of the stronger engine.
+1

Dann Corbit
Posts: 11218
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Throwing out draws to calculate Elo

Post by Dann Corbit » Sun Jul 05, 2020 11:53 pm

Your conjecture is correct. The question is, will the actual data points be buried by the random error that is inherent in the technique? If you look at the data provided, the randomness envelope may or may not overwhelm the actual data. The more observations you have, the less reliable the data becomes, which is the opposite of how most statistics work. So you have a sort of irony. If you have only a few observations, it is less likely that the data was buried under random noise. But at the same time, since you only have a few observations the data is less reliable for that reason.

Now, I also agree that as superiority becomes more and more pronounced, it is more and more likely to show up "piercing the cloud of randomness" as it were and at some point it would take a huge number of observations to bury it (perhaps billions). But at the same time, if there really is an enormous difference in strength, Elo will already have told us this and we will not need any tiebreaker to decide which engine is stronger.

Dann Corbit
Posts: 11218
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Throwing out draws to calculate Elo

Post by Dann Corbit » Mon Jul 06, 2020 12:10 am

Let's examine the first quartile in the data that I provided:
49893 losses and 50107 wins. For the first quartile the imbalance is 107 games out of 50,000 (a very small percentage)

Now, let's suppose that the change actually makes the program weaker, and we just happened to collect the random value from right at the first quartile.

If the superior engine wins 107 games in total more than the weaker, we are still not back to zero yet, because we now have 50,000 to 50,107. We actually need 214 more wins and no losses to pull us back to a LOS of 0.0, and however many more are needed to show a superiority (if we choose a razor's edge decider, then we would need 215 additional wins, caused by the superiority, to set the statistic right and undo the effects of randomness). Keep in mind that the random coin flipper I am using will (at infinity) have a count of heads and tails that (as a ratio) will be indistinguishably close to half wins and half losses. The only reason that the counts are lopsided is that the sequence of heads and tails as generated is randomized by the Mersenne Twister.
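The size of that quartile imbalance can be sanity-checked without running the coin flipper. This is a rough sketch, assuming each sample is 100,000 fair flips (which the counts 49893 + 50107 suggest) and using the normal approximation to the binomial; `quartile_offset` is a made-up name.

```python
import math
from statistics import NormalDist

def quartile_offset(games):
    """Distance of the upper quartile of the win count from games/2, fair coin."""
    sigma = math.sqrt(games) / 2.0      # std. dev. of the win count for p = 0.5
    z75 = NormalDist().inv_cdf(0.75)    # 75th percentile of the standard normal
    return z75 * sigma

n = 100_000  # assumed number of flips per sample
off = quartile_offset(n)
print(f"quartile offset for {n} fair flips: about {off:.1f} wins from the mean")
```

Under those assumptions the offset rounds to 107 wins away from 50,000, in line with the quartile figure quoted above.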

Dann Corbit
Posts: 11218
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Throwing out draws to calculate Elo

Post by Dann Corbit » Mon Jul 06, 2020 12:13 am

Consider also that 50 percent of all measurements will either be above the high quartile or below the low quartile (on average).
So half of the time, the problem will be of that scale.

Dann Corbit
Posts: 11218
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Throwing out draws to calculate Elo

Post by Dann Corbit » Mon Jul 06, 2020 12:42 am

This does not mean, of course, that half of the time you will have to overcome a lopsided advantage of that size. Because about half of the time, the advantage would actually be slanted towards the engine that really was stronger.

However, even when the correct score is pretty close to the true mean, there will still be *something* to overcome half of the time.
For instance, even at 45,000 where we are only 5000 away from the centroid we have this imbalance caused by the randomness of the sequence:
49979 losses and 50021 wins, so the window is 42 games to one side even that close.

Dann Corbit
Posts: 11218
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Throwing out draws to calculate Elo

Post by Dann Corbit » Mon Jul 06, 2020 12:51 am

Reducing the game count does not linearly decrease the error. It is proportional to the square root of the game count.
So 100 times more games only gives an error that is ten times the size, and 10,000 times more games gives one that is 100 times the size (proportionately). {There is also a constant of proportionality involved.}
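A short sketch of the square-root scaling, assuming a fair coin and a hypothetical base match of 1,000 decisive games (`abs_error` and `rel_error` are made-up names). It prints both the absolute error in games, which grows with the square root of the game count, and the error as a fraction of the games played, which shrinks at the same rate.

```python
import math

def abs_error(games):
    """Std. dev. of the win count for a fair coin, measured in games."""
    return math.sqrt(games) / 2.0

def rel_error(games):
    """Std. dev. of the win fraction for a fair coin."""
    return 0.5 / math.sqrt(games)

base = 1_000  # hypothetical base match length
for factor in (1, 100, 10_000):
    n = base * factor
    print(f"{n:>10} games: +/-{abs_error(n):8.1f} games, "
          f"+/-{rel_error(n):.5f} as a fraction")
```

The constant of proportionality here is the 1/2 that comes from the fair-coin variance p(1 - p) = 1/4.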