Throwing out draws to calculate Elo

hgm · Post by **hgm** » Sun Jul 05, 2020 9:58 pm

Dann Corbit wrote: ↑Sun Jul 05, 2020 9:05 pm The Elo calculation said that the same engine had the same Elo all the way down to the last decimal digit. This is, in fact, the right answer.

No, it is an approximate, and therefore wrong answer. And useless for determining which is best if the difference is less than the rounding error.

Now, the fractional part is still in question even after half a million trials. But that does not matter, because Elo is (wisely) reported as an integer.
So, the Elo difference of zero is exactly correct for every single measurement. The hypothesis is confirmed. An engine is not stronger than itself.

And now apply that 'wisdom' to an 'engine' where you set the winning threshold at 0.500001. Or 0.500000001. Now it will also (after rounding) say they are exactly equal. Which they were not. So the Elo answer is wrong. For the single case for which you artificially forced it to be right, it is now wrong for an infinity of other cases. You succeeded in making Elo calculations unsuitable for engines that are different. "Only to be used when beforehand you are absolutely sure the engines are the same, otherwise you will get wrong answers." Congratulations!
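To put numbers on the rounding argument, here is a minimal sketch that converts an expected score into an Elo difference. It assumes the usual logistic Elo model, D = -400·log10(1/s - 1); the function name `elo_diff` is made up for this illustration and is not taken from any rating tool.

```python
import math


def elo_diff(score):
    """Elo difference implied by an expected score in (0, 1).

    Assumes the standard logistic Elo model.
    """
    return -400.0 * math.log10(1.0 / score - 1.0)


# Scores from the post (0.500001, 0.500000001) plus two reference points.
for s in (0.5, 0.500001, 0.500000001, 0.51):
    d = elo_diff(s)
    print(f"score={s!r:>12}  Elo diff={d:.9f}  rounded={round(d)}")
```

Under that model both tiny edges come out as a positive Elo difference that still rounds to 0, so the rounded figure cannot tell them apart from exact equality.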

Why not report all ratings in kilo-Elo, and round? Wouldn't that save an awful lot of test games? Have you already suggested it to the Stockfish developers?

I think maybe it is a good time to retire the discussion. I think LOS is a bad and misleading statistic, and you think it is a fine and accurate statistic.
I even agree that with some possible sets of data, it might return a pretty good answer. But even in those places, Elo would return a better answer (IMO).

I actually feel very badly about the somewhat acrimonious nature of the debate. That is because I consider hgm (for his work on Winboard) and Ronald (for his work on Syzygy tablebase files and Cfish and other things) to be true heroes of computer chess. I think, as Wesley said, "we are at an impasse" and so future debate will prove equally fruitless.

It is as fruitless as you want it to be. You are obviously not willing to listen to reason, and keep chanting the same falsehoods no matter how often they have been refuted. So yes, discussion about this for you personally is utterly pointless.

But that doesn't matter. We discuss in public so others can learn. As long as it is glaringly obvious that literally everything you said is completely detached from any reality, as I think by now it must be 10 times over, we should feel good about it.

Dann Corbit · Post by **Dann Corbit** » Sun Jul 05, 2020 10:07 pm

Elo figures are already correctly reported with +/- Elo error bounds.
There is no need to ask the Stockfish developers to truncate anything.
However, if they are using LOS in fishtest, they should be warned that they are tuning Stockfish with a uniform PRNG (given enough games)

hgm · Post by **hgm** » Sun Jul 05, 2020 10:30 pm

Dann Corbit wrote: ↑Sun Jul 05, 2020 10:07 pm Elo figures are already correctly reported with +/- Elo error bounds.
There is no need to ask the Stockfish developers to truncate anything.
However, if they are using LOS in fishtest, they should be warned that they are tuning Stockfish with a uniform PRNG (given enough games)

Yeah. No wonder it is so weak, with all that random tuning...

Robert Pope · Post by **Robert Pope** » Mon Jul 06, 2020 1:26 am

Dann Corbit wrote: ↑Sun Jul 05, 2020 10:07 pm Elo figures are already correctly reported with +/- Elo error bounds.
There is no need to ask the Stockfish developers to truncate anything.
However, if they are using LOS in fishtest, they should be warned that they are tuning Stockfish with a uniform PRNG (given enough games)

It's only a uniform generator if the two engines are indeed of equal strength, right? In which case, it doesn't matter which engine is promoted. If one engine is indeed stronger than the other, LOS won't be random uniform, but in favor of the stronger engine.
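This can be checked with a small simulation. The sketch below assumes the commonly quoted normal-approximation formula for LOS, 0.5·(1 + erf((W - L)/sqrt(2(W + L)))), which uses decisive games only. The helper names, game counts and seeds are arbitrary choices for illustration; this is not fishtest's code.

```python
import math
import random


def los(wins, losses):
    """Likelihood of superiority from decisive games (normal approximation)."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


def simulate_los(p_win, games, matches, seed):
    """Play `matches` matches of `games` decisive games each; return each LOS."""
    rng = random.Random(seed)
    values = []
    for _ in range(matches):
        wins = sum(rng.random() < p_win for _ in range(games))
        values.append(los(wins, games - wins))
    return values


equal = simulate_los(0.50, games=2000, matches=200, seed=1)
better = simulate_los(0.55, games=2000, matches=200, seed=2)

print("equal engines : mean LOS =", round(sum(equal) / len(equal), 3))
print("55% engine    : mean LOS =", round(sum(better) / len(better), 3))
```

With equal engines the LOS values spread over the whole interval from 0 to 1; with a genuine 55% win rate over 2000 decisive games they pile up near 1.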

Pio · Post by **Pio** » Mon Jul 06, 2020 1:30 am

Robert Pope wrote: ↑Mon Jul 06, 2020 1:26 am
Dann Corbit wrote: ↑Sun Jul 05, 2020 10:07 pm Elo figures are already correctly reported with +/- Elo error bounds.
There is no need to ask the Stockfish developers to truncate anything.
However, if they are using LOS in fishtest, they should be warned that they are tuning Stockfish with a uniform PRNG (given enough games)
It's only a uniform generator if the two engines are indeed of equal strength, right? In which case, it doesn't matter which engine is promoted. If one engine is indeed stronger than the other, LOS won't be random uniform, but in favor of the stronger engine.

+1

Dann Corbit · Post by **Dann Corbit** » Mon Jul 06, 2020 1:53 am

Your conjecture is correct. The question is, will the actual data points be buried by the random error that is inherent in the technique? If you look at the data provided, the randomness envelope may or may not overwhelm the actual data. The more observations you have, the less reliable the data becomes, which is the opposite of how most statistics work. So you have a sort of irony. If you have only a few observations, it is less likely that the data was buried under random noise. But at the same time, since you only have a few observations the data is less reliable for that reason.

Now, I also agree that as superiority becomes more and more pronounced, it is more and more likely to show up "piercing the cloud of randomness" as it were and at some point it would take a huge number of observations to bury it (perhaps billions). But at the same time, if there really is an enormous difference in strength, Elo will already have told us this and we will not need any tiebreaker to decide which engine is stronger.

Dann Corbit · Post by **Dann Corbit** » Mon Jul 06, 2020 2:10 am

Let's examine the first quartile in the data that I provided:
49893 losses and 50107 wins. For the first quartile the imbalance is 107 games out of 50,000 (a very small percentage)
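Those quartile figures are about what a fair coin predicts. The sketch below assumes each trial is 100,000 fair coin flips (which the 49893 + 50107 total suggests) and uses the normal approximation to the binomial distribution; `win_count_quantile` is a hypothetical helper written for this illustration.

```python
import math
from statistics import NormalDist


def win_count_quantile(games, q, p=0.5):
    """Approximate q-quantile of the win count in `games` independent games."""
    mean = games * p
    sd = math.sqrt(games * p * (1.0 - p))
    return mean + NormalDist().inv_cdf(q) * sd


lower = win_count_quantile(100_000, 0.25)
upper = win_count_quantile(100_000, 0.75)
print(f"lower quartile ~ {lower:.1f} wins, upper quartile ~ {upper:.1f} wins")
```

The standard deviation of the win count is sqrt(25,000), about 158 games, and the quartiles sit about 0.674 standard deviations from the mean, so roughly 107 wins either side of 50,000.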

Now, let's suppose that the change actually makes the program weaker, and we just happened to collect the random value from right at the first quartile.

If the superior engine wins 107 games in total more than the weaker, we are still not back to zero yet, because we now have 50,000 to 50,107. We actually need 214 more wins and no losses to pull us back to a LOS of 0.0, and however many more are needed to show a superiority (if we choose a razor's edge decider, then we would need 215 additional wins, caused by the superiority, to set the statistic right and undo the effects of randomness). Keep in mind that the random coin flipper I am using will (at infinity) have a count of heads and tails that (as a ratio) will be indistinguishably close to half wins and half losses. The only reason that the counts are lopsided is that the sequence of heads and tails as generated is randomized by the Mersenne Twister.

Dann Corbit · Post by **Dann Corbit** » Mon Jul 06, 2020 2:13 am

Consider also that 50 percent of all measurements will either be above the high quartile or below the low quartile (on average).
So half of the time, the problem will be of that scale.

Dann Corbit · Post by **Dann Corbit** » Mon Jul 06, 2020 2:42 am

This does not mean, of course, that half of the time you will have to overcome a lopsided advantage of that size. Because about half of the time, the advantage would actually be slanted towards the engine that really was stronger.

However, even when the correct score is pretty close to the true mean, there will still be *something* to overcome half of the time.
For instance, even at 45,000 where we are only 5000 away from the centroid we have this imbalance caused by the randomness of the sequence:
49979 losses 50021 wins, so the window is 42 games to one side even that close.

Dann Corbit · Post by **Dann Corbit** » Mon Jul 06, 2020 2:51 am

Reducing the game count does not linearly decrease the error. It is proportional to the square root of the game count.
So 100 times more games only gives an error that is ten times the size, and 10,000 times more games gives an error that is 100 times the size (there is also a constant of proportionality involved).
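A short sketch of the square-root scaling, assuming independent games with a fixed win probability (a fair coin by default), so that the standard deviation of the win count is sqrt(N·p·(1 - p)). The function name and the base game count of 1,000 are arbitrary choices for illustration.

```python
import math


def win_count_sd(games, p=0.5):
    """Standard deviation of the number of wins in `games` independent games."""
    return math.sqrt(games * p * (1.0 - p))


base = 1_000
for factor in (1, 100, 10_000):
    n = base * factor
    sd = win_count_sd(n)
    print(f"games={n:>10,}  sd of wins={sd:9.1f}  "
          f"x{sd / win_count_sd(base):6.1f} vs base  "
          f"sd as fraction of games={sd / n:.5f}")
```

The absolute spread in the win count grows by a factor of 10 for 100 times the games and by a factor of 100 for 10,000 times the games, while the same spread expressed as a fraction of the games played shrinks by those same factors.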
