Throwing out draws to calculate Elo

Dann Corbit · Post by **Dann Corbit** » Tue Jun 30, 2020 1:54 pm

hgm wrote: ↑Tue Jun 30, 2020 1:40 pm If you do watch until you see a win, will it be engine A or engine B that wins?

Since there is no difference in strength and assuming I did see a win by some fluke, I would expect it to be exactly a coin toss which one it would be. The real difference in strength (if there is one) is so infinitesimally small that 8 byte floating point would not be able to predict which one it would be.

Do you really believe that the likelihood of superiority is the same?

A million draws in a row does not tell you that the opponents are of very similar strength?

Clearly 500 outcomes in 10^100 means that there is no significance to the strength difference.

They are *exactly* the same strength.

Alayan · Post by **Alayan** » Tue Jun 30, 2020 2:08 pm

Likelihood of superiority is almost unrelated to the expected score. High likelihood of superiority doesn't mean high expected score, it means high probability that the engine can score over (50+d)% over a high enough number of games. The d can be as small as desired, but the idea is that for a given pair of unequal engines there exists an intrinsic d, even if we don't know it. With two perfectly equal engines, the score between them will trend towards 50% as the number of games goes to infinity, which means that for any d > 0 the probability of a score over (50+d)% trends towards 0.

Take an engine A, and an engine B that is a modification of engine A that will purposefully throw 1 game in 1 billion but is otherwise identical. Then engine A is provably stronger than engine B. The LoS is 100%. But the expected score of A against B is for all practical purposes 50%.
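To put a number on "for all practical purposes 50%", here is a minimal sketch. It assumes (this is not stated in the post) that the two engines would otherwise score exactly 50% against each other, and that a thrown game is a full loss for B replacing a game worth 0.5 on average. The function names are made up for illustration, and the Elo conversion is the usual logistic formula.

```python
import math

# Assumption: B throws one game in a billion, as in the example above.
THROW_RATE = 1e-9


def expected_score_of_a(throw_rate):
    """Expected score of A per game, assuming 50% in every non-thrown game."""
    # Non-thrown games are worth 0.5 to A, thrown games are worth 1.0.
    return (1.0 - throw_rate) * 0.5 + throw_rate * 1.0


def elo_from_score(score):
    """Logistic Elo difference implied by an expected score strictly in (0, 1)."""
    return 400.0 * math.log10(score / (1.0 - score))


if __name__ == "__main__":
    s = expected_score_of_a(THROW_RATE)
    print(f"expected score of A: {s:.12f}")
    print(f"implied Elo edge:    {elo_from_score(s):.3e}")
```

Under these assumptions the edge is a tiny fraction of one Elo point, yet it is strictly positive, which is all that LoS cares about.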

"High probability" depends on game results and on the prior used. The assumptions made by the prior are in many cases incorrect and using the error bars you get from assuming uniform prior can be misleading.

hgm · Post by **hgm** » Tue Jun 30, 2020 3:10 pm

Dann Corbit wrote: ↑Tue Jun 30, 2020 1:54 pm Since there is no difference in strength and assuming I did see a win by some fluke, I would expect it to be exactly a coin toss which one it would be.

That is a completely wrong expectation. It is (exactly) as misguided as believing the 9th game will be a coin toss after the first 8 resulted in 8-0.

A million draws in a row does not tell you that the opponents are of very similar strength?

Yes, it does. What has that got to do with it?

Clearly 500 outcomes in 10^100 means that there is no significance to the strength difference.

Define 'significant'. I only know the meaning of that when applied to the outcome of a stochastic process: a certain difference in match results can be statistically significant or not. 'Strength' is not the outcome of any stochastic process; it is an intrinsic, fixed property of the engine.

Of course there is no large, or even meaningful difference in strength.

They are *exactly* the same strength.

Then you don't know the meaning of 'exact'.

1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001

is not exactly the same as 1. (Actually the difference is eye catching!) It is, in fact, absolutely certain that it is larger.

Dann Corbit · Post by **Dann Corbit** » Tue Jun 30, 2020 7:35 pm

And you know that the program won the 450 games because the program was stronger and not because one of the two machines was one Hz faster and therefore won more games due to experimental error when the operator did not switch machines routinely during the contests?

And you know that in the eons it took to calculate all those games that the 450 games were actually won and we did not have a cosmic ray hit one of the accumulators (See Akan Ban's experience with cosmic rays in calculation of Perft(15))?

And you know that the program won the 450 extra games rather than an error in accumulation due to doubles eventually not accumulating a sum of 1 because the distance between consecutive numbers becomes larger at some distant point and we therefore had a numerical error?
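The accumulation effect mentioned here is easy to reproduce. A minimal sketch, assuming the tally is kept in an 8-byte IEEE-754 double (the helper name `add_one` is made up for illustration): above 2^53 the gap between consecutive doubles exceeds 1, so adding 1 no longer changes the counter.

```python
import math

# 2**53 is the point where consecutive doubles start to be 2 apart.
LIMIT = 2.0 ** 53


def add_one(counter):
    """Increment a floating-point counter by one game."""
    return counter + 1.0


# Below the limit the increment still works exactly.
still_exact = add_one(LIMIT - 1.0) == LIMIT

# At the limit the increment is rounded away and the counter stalls.
stalled = add_one(LIMIT) == LIMIT

# Spacing between LIMIT and the next representable double (Python 3.9+).
gap = math.ulp(LIMIT)

# Python integers are arbitrary precision, so an integer tally does not stall.
exact_counter = 2 ** 53 + 1

if __name__ == "__main__":
    print(still_exact, stalled, gap, exact_counter - 2 ** 53)
```

A googol is far beyond 2^53 (about 9.0e15), so a double could not count that many games one at a time, while an integer counter has no such limit.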

A million games in a row that are drawn is absurdly strong evidence that the programs are exactly equal.
And 10^100-450 out of 10^100 draws is incredible evidence of identical strength, far greater than anyone could ever hope for in a real experiment.
By contrast 450 out of 10^100 is no evidence at all.

The coin tossing experiment I showed proves that we should not expect exact equality in a long match amongst pure equals. In fact, it is lunacy to do so.

All I can say is, "The emperor is not wearing any clothes!"

MonteCarlo · Post by **MonteCarlo** » Tue Jun 30, 2020 8:02 pm

Again, the question is not whether it is possible for a match of equal engines with a very, very, very high draw rate to end with a high percentage of decisive games being won by one of them.

The question is whether it is more likely to happen when the winning engine is stronger, combined with whatever prior we're using for distribution of engine strengths.

Obviously it's possible for engines with a very high draw rate and equal winning chances in decisive games to play a long match where one of them wins a large majority of the decisive games.

No one has ever denied this. The question is whether this is more likely to have occurred by chance in a match between equal engines or to have occurred as an expected result between unequal engines, which will depend on your prior for distribution of rating differences. Unless your prior makes it much more likely for exactly equal engines to be paired with each other, this result will be more likely to have occurred in a match between unequal engines.

There's simply no good basis to reason from such a result to "the engines are EXACTLY equal", unless you've specified one of those priors that make matchups of exactly equal engines much more likely than matchups between unequal engines.
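A rough sketch of that comparison, under assumptions that are not in the posts: the prior puts mass `prior_equal` on "exactly equal", and otherwise takes the probability q that a decisive game goes to the first engine as uniform on [0, 1]. The draw rate is assumed independent of q, so the draws cancel out of the ratio. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def posterior_exactly_equal(wins, losses, prior_equal):
    """Posterior probability of exact equality given the decisive games.

    Assumed prior: mass `prior_equal` on q = 1/2, the rest spread
    uniformly over q in [0, 1].
    """
    n = wins + losses
    # Likelihood of this win/loss split when q = 1/2.
    like_equal = Fraction(comb(n, wins), 2 ** n)
    # Marginal likelihood under uniform q:
    # integral of C(n, w) q^w (1 - q)^l dq = 1 / (n + 1).
    like_unequal = Fraction(1, n + 1)
    pe = Fraction(prior_equal)
    numerator = pe * like_equal
    return numerator / (numerator + (1 - pe) * like_unequal)


if __name__ == "__main__":
    print(float(posterior_exactly_equal(8, 0, Fraction(1, 2))))
    print(float(posterior_exactly_equal(4, 4, Fraction(1, 2))))
    print(float(posterior_exactly_equal(450, 50, Fraction(999999, 1000000))))
```

With a 50/50 prior, an 8-0 split leaves only a few percent for exact equality, while a 4-4 split favours it. For 450-50 the posterior stays negligible even when the prior gives exact equality 99.9999%, so only a prior far more extreme than that rescues the "exactly equal" conclusion.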

hgm · Post by **hgm** » Tue Jun 30, 2020 8:43 pm

Dann Corbit wrote: ↑Tue Jun 30, 2020 7:35 pm And you know that the program won the 450 games because the program was stronger and not because one of the two machines was one Hz faster and therefore won more games due to experimental error when the operator did not switch machines routinely during the contests?

And you know that in the eons it took to calculate all those games that the 450 games were actually won and we did not have a cosmic ray hit one of the accumulators (See Akan Ban's experience with cosmic rays in calculation of Perft(15))?

And you know that the program won the 450 extra games rather than an error in accumulation due to doubles eventually not accumulating a sum of 1 because the distance between consecutive numbers becomes larger at some distant point and we therefore had a numerical error?

Point 1 is tantamount to claiming the result is not valid in the first place. Of course you cannot draw any conclusion from invalid test scores. If the operator consciously cheats, he can easily make the stronger program lose. Perhaps the engine that won was actually 300 Elo weaker, and the operator just reset the computer in all the 3e100 games that were on the way to being lost or won, to replay them until they happened to be draws. This isn't worth arguing about.

If this is a valid test result (no cheating involved) you would have to explain why it was always the same engine that suffered from 'accidental operator mistakes', cosmic rays, etc. That it happened 8 times to one, and never to the other by accident is quite unlikely.

A million games in a row that are drawn is absurdly strong evidence that the programs are exactly equal.

Not at all. There are infinitely more cases where they are not exactly equal, which could give exactly the same result (if they were nearly perfect, but not quite). Of course it begs for an explanation that they would always draw, and almost never win (approximately equally many times). This suggests they are playing near-perfect chess.

And 10^100-450 out of 10^100 draws is incredible evidence of identical strength, far greater than anyone could ever hope for in a real experiment.
By contrast 450 out of 10^100 is no evidence at all.

Nonsense. 450 is not nothing, and it will never become nothing just because something else is big. 450 wins are 450 wins, and will always stay 450 wins. Extreme ratios of the number of wins and losses do not become any more likely when the win+lose probability goes down. They only become more likely when the ratio of the win vs loss probability increases.

You can easily check that for yourself, if you don't believe it, with your coin-flip engine: give a win if r < epsilon, a loss if r > 1-epsilon, and a draw otherwise. Let each epsilon play matches until there are 8 non-draws, and the result 8-0 has occurred, say, a dozen times. Then look which fraction of the matches did have 8-0 for the non-draws (rather than 7-1 or 4-4 etc.). Then divide epsilon by 10 and do the same. Keep doing that until your patience runs out.
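A minimal sketch of that experiment. For simplicity it runs a fixed number of matches per epsilon instead of stopping after a dozen 8-0 results; the function names, seed and match count are arbitrary choices for illustration.

```python
import random


def decisive_split(epsilon, rng, decisive_needed=8):
    """Play coin-flip games until `decisive_needed` of them are non-draws.

    Win if r < epsilon, loss if r > 1 - epsilon, draw otherwise.
    Returns (wins, losses) for the first engine.
    """
    wins = losses = 0
    while wins + losses < decisive_needed:
        r = rng.random()
        if r < epsilon:
            wins += 1
        elif r > 1.0 - epsilon:
            losses += 1
        # Anything in between is a draw and is simply ignored.
    return wins, losses


def fraction_eight_nil(epsilon, matches, seed=0):
    """Fraction of matches whose decisive games split 8-0 for the first engine."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(matches):
        wins, _losses = decisive_split(epsilon, rng)
        if wins == 8:
            hits += 1
    return hits / matches


if __name__ == "__main__":
    # Each tenfold reduction of epsilon makes the run roughly ten times slower.
    for eps in (0.1, 0.01, 0.001):
        print(eps, fraction_eight_nil(eps, matches=2000))
```

For equal win and loss probabilities the 8-0 fraction should hover around 1/256 regardless of epsilon, up to sampling noise. Only the running time grows as epsilon shrinks.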

The coin tossing experiment I showed proves that we should not expect exact equality in a long match amongst pure equals. In fact, it is lunacy to do so.

If it proved anything, it was that you never got a 100% vs 0% result. 8-0 was such a result, though. So you 'proved' that the engines that were drawing so much could not have been equal.

syzygy · Post by **syzygy** » Tue Jun 30, 2020 9:19 pm

Dann Corbit wrote: ↑Mon Jun 29, 2020 9:37 am It is a generally accepted practice to throw out draws and use only wins and losses to calculate the relative Elo of a set of engines.
So let's do a gedankenexperiment:
Engine A plays Engine B one googol (10^100) times.
There are 10^100 - 8 draws and 8 wins for engine B.
Standard calculation would make engine B much stronger and also give a very large LOS for engine B.
But this is totally absurd.
If we watched games for many lifetimes between engine A and engine B, we would (almost surely) never see anything but a draw, despite engine B's much larger Elo and LOS.
At this point, the 8 wins are clearly random noise.

Opinions?

The 8 wins in your scenario are exactly as much random noise as 8 wins in 8 games total played. The LOS that can be derived from each scenario is certainly the same. The Elo should not be the same, since drawn games should affect Elo calculations as far as I am aware.
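To illustrate why the Elo differs once draws are counted, here is a small sketch using the usual logistic Elo formula on the match score, with a draw counted as half a point. The function name is made up, and `log1p` is used so the googol case does not vanish in floating-point rounding.

```python
import math


def elo_diff(wins, draws, losses):
    """Logistic Elo difference implied by a match score, draws worth 0.5."""
    # score / (1 - score) simplifies to (2W + D) / (2L + D).
    num = 2 * wins + draws
    den = 2 * losses + draws
    if den == 0:
        return math.inf   # 100% score: no finite Elo difference
    if num == 0:
        return -math.inf  # 0% score
    # log1p keeps precision when the ratio is extremely close to 1.
    return 400.0 / math.log(10.0) * math.log1p((num - den) / den)


if __name__ == "__main__":
    print(elo_diff(8, 0, 0))              # 8 wins in 8 games
    print(elo_diff(8, 10**100 - 8, 0))    # 8 wins in a googol games
```

With no draws the 8-0 score has no finite Elo difference, while with a googol minus 8 draws the same 8 wins amount to a vanishingly small one. The wins and losses are identical in both cases; only the draws move the Elo.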

Dann Corbit · Post by **Dann Corbit** » Tue Jun 30, 2020 9:28 pm

syzygy wrote: ↑Tue Jun 30, 2020 9:19 pm
Dann Corbit wrote: ↑Mon Jun 29, 2020 9:37 am It is a generally accepted practice to throw out draws and use only wins and losses to calculate the relative Elo of a set of engines.
So let's do a gedankenexperiment:
Engine A plays Engine B one googol (10^100) times.
There are 10^100 - 8 draws and 8 wins for engine B.
Standard calculation would make engine B much stronger and also give a very large LOS for engine B.
But this is totally absurd.
If we watched games for many lifetimes between engine A and engine B, we would (almost surely) never see anything but a draw, despite engine B's much larger Elo and LOS.
At this point, the 8 wins are clearly random noise.

Opinions?
The 8 wins in your scenario are exactly as much random noise as 8 wins in 8 games total played. The LOS that can be derived from each scenario is certainly the same. The Elo should not be the same, since drawn games should affect Elo calculations as far as I am aware.

Not a chance.
Not a ghost of a chance.
Not even close.

If you made a change to an engine and ran 500 games and the changed engine won 500 games, you would keep the improvement and be fairly certain it was better.

If you made a change to an engine and ran a googol games and it had googol - 500 draws, 450 wins and 50 losses you would conclude that there is no improvement.

Everyone is acting like a draw is not data, but a draw is just as valuable as a win. It tells you the opponents are equal.
You are throwing out (one googol - 500) data points {which are just as valuable as the ones that you are keeping and which show that the engines are evenly matched} from the experiment and basing the result on a tiny bit of fluff at the end of the significance of a real number, smaller than the size of an elementary particle compared to the size of the universe.

syzygy · Post by **syzygy** » Tue Jun 30, 2020 9:41 pm

Dann Corbit wrote: ↑Tue Jun 30, 2020 9:28 pm
syzygy wrote: ↑Tue Jun 30, 2020 9:19 pm
Dann Corbit wrote: ↑Mon Jun 29, 2020 9:37 am It is a generally accepted practice to throw out draws and use only wins and losses to calculate the relative Elo of a set of engines.
So let's do a gedankenexperiment:
Engine A plays Engine B one googol (10^100) times.
There are 10^100 - 8 draws and 8 wins for engine B.
Standard calculation would make engine B much stronger and also give a very large LOS for engine B.
But this is totally absurd.
If we watched games for many lifetimes between engine A and engine B, we would (almost surely) never see anything but a draw, despite engine B's much larger Elo and LOS.
At this point, the 8 wins are clearly random noise.

Opinions?
The 8 wins in your scenario are exactly as much random noise as 8 wins in 8 games total played. The LOS that can be derived from each scenario is certainly the same. The Elo should not be the same, since drawn games should affect Elo calculations as far as I am aware.
Not a chance.
Not a ghost of a chance.
Not even close.

I'm sorry but what I said about LOS is pure objective fact.

There is a world of difference between 8 wins, 1000 draws, 0 losses and 508 wins, 0 draws, 500 losses for A playing B. In the first scenario, it is statistically very very likely that A is superior to B. In the second scenario, it is only slightly likely that A is superior to B. Can you see that this is true?
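The two scenarios can be compared with a short LOS sketch. It assumes a uniform prior on q, the probability that a decisive game goes to A; the posterior is then Beta(W+1, L+1), and P(q > 1/2) reduces to a binomial tail that can be summed exactly. Draws never enter the formula. The function name is hypothetical.

```python
from math import comb


def los(wins, losses):
    """Likelihood of superiority, P(q > 1/2), under a uniform prior on q.

    Uses the identity P(Beta(W+1, L+1) > 1/2) = P(Bin(W+L+1, 1/2) <= W).
    """
    n = wins + losses + 1
    # Exact integer sum, converted to a float only by the final division.
    return sum(comb(n, k) for k in range(wins + 1)) / 2 ** n


if __name__ == "__main__":
    print(los(8, 0))      # 8 wins, 1000 draws, 0 losses (draws ignored)
    print(los(508, 500))  # 508 wins, 0 draws, 500 losses
```

Under this prior the 8-0 result gives an LOS of 511/512, while 508-500 gives only about 0.6, even though both scenarios have the same match score of 508 points out of 1008.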

This discussion is interesting because it shows how totally useless it is in testing to strive for fewer draws.
It would be ideal if the stronger engine would never lose against the weaker engine, since then you only have to wait for the first decided game to know which engine is the stronger engine. Of course you can never be sure in practice that the stronger engine will not lose any games, and in fact in chess this is very unlikely to happen, but certainly it is much preferable to have a test series end in 8-0 with 1000 draws (= the patch should be accepted) than to have it end in 508-500 with 0 draws (= it is only a bit more likely that the patch is more good than bad).

This also shows that a number of games at a longer time control (less randomness, more draws) generally say more about which is stronger than the same number of games at a short time control (more randomness, fewer draws).

Likewise, this also shows that in a testing environment with more random noise (e.g. because 10 games are played simultaneously on 8 cores), more games need to be played to establish which engine is stronger than in a testing environment with less random noise. The noise decreases the number of draws, which is bad. (For most people this is counterintuitive, although it should be intuitively true that noise is bad.)

Everyone is acting like a draw is not data, but a draw is just as valuable as a win. It tells you the opponents are equal.

Nope, it tells you that the weaker side is still able to collect draw points and therefore does not stay behind that much Elo wise.

hgm · Post by **hgm** » Tue Jun 30, 2020 9:44 pm

Dann Corbit wrote: ↑Tue Jun 30, 2020 9:28 pm If you made a change to an engine and ran a googol games and it had googol - 500 draws, 450 wins and 50 losses you would conclude that there is no improvement.

YOU would conclude there is no improvement. Don't try to project your own messed-up thoughts onto others!

I would certainly embrace the change. It would mean I bridged a large part of the remaining gap to perfection. Even more so when the decided games were 8-0. I might actually believe that I bridged 100% of the remaining gap in that case!
