Throwing out draws to calculate Elo

Dann Corbit · Post by **Dann Corbit** » Tue Jun 30, 2020 11:50 pm

Let me be perfectly clear:
If there were no such thing as randomness in engine game outcomes, then my argument would be completely wrong.

hgm · Post by **hgm** » Tue Jun 30, 2020 11:55 pm

Dann Corbit wrote: ↑Tue Jun 30, 2020 11:34 pm But if absolutely identically equal engines play each other thousands of times, the number of wins and losses will not be the same.

They would be if the engines played perfect chess. They would then always be zero, because all games would end in a draw.

You know this; play Stockfish against itself for one hundred games.

But Stockfish does not play perfect chess. It loses games. Play a good Tic-Tac-Toe engine against itself. Then the number of wins will always be the same.

Hence, a small difference in wins and losses does not tell us which engine is stronger. In order to know if it *might* be stronger, it must be outside of the error bands.

The problem is that you don't grasp what is 'small'. We discussed this before in connection with a WCCC result, and it turned out you are really clueless where statistics is concerned. Statistically speaking, 8-0 is NOT a small difference. 508-500 is a small difference, however. Whether a difference is statistically significant or not depends on the ratio of the difference to the standard deviation. If the result is almost always a draw, the standard deviation is almost zero.
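As a rough illustration of the ratio described above, here is a minimal Python sketch. It assumes the normal approximation that is commonly used for likelihood of superiority (LOS), where draws are ignored and the win-loss difference is compared with sqrt(wins + losses). The function names are hypothetical and not taken from any poster's code, and the approximation is crude for as few as 8 decisive games.

```python
import math

def z_score(wins: int, losses: int) -> float:
    """Win-loss difference in units of its standard deviation, under the
    hypothesis of equal strength (decisive games only; draws are ignored)."""
    return (wins - losses) / math.sqrt(wins + losses)

def los_normal(wins: int, losses: int) -> float:
    """Normal-approximation LOS (an assumed formula, not from the thread)."""
    return 0.5 * (1.0 + math.erf(z_score(wins, losses) / math.sqrt(2.0)))

for w, l in [(8, 0), (508, 500)]:
    print(f"{w}-{l}: z = {z_score(w, l):.2f}, LOS ~ {los_normal(w, l):.3f}")
```

Both results have the same raw difference of 8, but the first is about 2.8 standard deviations from equality and the second only about 0.25.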

Your accidental forfeit once in a million games will not show up in a contest of equal engines, because the noise of randomness will far exceed the once in a million loss.

That depends. For engines that always draw there is zero noise; the match results are not random at all.

And an engine with an infinite Elo advantage should not have the same LOS as an engine with a zero Elo advantage. It could still be stronger, but the probability should be different.

So you insist. And you are wrong. Wrong will not turn into right no matter how persistent you are.

Maybe we do have difficulty understanding each other. I guess my problem is that when something makes no sense to me, I don't believe it.

You have difficulty understanding everyone else. We understand perfectly what you are saying, and we know it to be wrong.

I think that model is based only on math and not on probability. Otherwise it would not make absurd predictions.

But it doesn't make absurd predictions; it makes accurate predictions. It is just that your thinking is at odds with reality.

Dann Corbit · Post by **Dann Corbit** » Tue Jun 30, 2020 11:58 pm

hgm wrote: ↑Tue Jun 30, 2020 11:40 pm
Dann Corbit wrote: ↑Tue Jun 30, 2020 10:36 pm Summing a column of numbers in a computer gives two different answers, depending on the direction of summation (numerical calculation error).
Not if they are integers.

And when we divide 917 by 1000 using our integers we get an answer of zero.
That is why we tend to do floating point calculations. And even an 8 byte integer cannot hold a number bigger than 2^64 - 1.
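The two numerical points in this exchange (order-dependent floating-point sums versus exact integer sums, and truncating integer division) can be checked with a short Python sketch. The sample values and the helper name `naive_sum` are assumptions chosen for illustration.

```python
def naive_sum(values):
    """Plain left-to-right accumulation, written as an explicit loop so that
    no compensated summation is applied behind the scenes."""
    total = 0
    for v in values:
        total = total + v
    return total

floats = [1e16, 1.0, 1.0]
forward = naive_sum(floats)          # each 1.0 is lost against 1e16
backward = naive_sum(floats[::-1])   # 1.0 + 1.0 survives as 2.0
print(forward == backward)           # False: the order matters for floats

ints = [10**16, 1, 1]
print(naive_sum(ints) == naive_sum(ints[::-1]))  # True: integers are exact

print(917 // 1000)  # integer (floor) division truncates to 0
print(917 / 1000)   # floating-point division keeps the fraction
```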

The design of the experiment can be imperfect (e.g. one machine is ever so slightly stronger than the other so we are not measuring the strength of the software difference but of the hardware difference)
Flawed experiments can tell you nothing. Whether they have high draw rate or not. You will always measure the sum of the flaws and the engine strength.

All experiments are flawed to some degree, because humans are not perfect and so we design imperfectly. We also cannot measure perfectly or calculate perfectly. And even if we could design a perfect experiment, it is not immune to probability.

Note, however, that the experiment must be very flawed in order to produce an 8-0 result. This would only be possible in the most careless design. Like making sure that each program uses the same machine the same number of times.

The chance of an 8-0 result is 1 out of 256 so if we run an experiment enough times it is sure to happen.
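The 1-in-256 figure, and how quickly such a sweep becomes likely when an experiment is repeated, can be sketched as follows. This assumes two equal engines, independent decisive games that each go 50/50, and a sweep by one specified side (a sweep by either side would be twice as likely). The function names are hypothetical.

```python
from fractions import Fraction

def p_sweep(decisive_games: int) -> Fraction:
    """Probability that one specified side wins every decisive game."""
    return Fraction(1, 2) ** decisive_games

def p_at_least_one_sweep(decisive_games: int, repeats: int) -> float:
    """Probability of seeing at least one such sweep in `repeats` experiments."""
    p = p_sweep(decisive_games)
    return float(1 - (1 - p) ** repeats)

print(p_sweep(8))  # 1/256
for repeats in (1, 256, 1000):
    print(repeats, round(p_at_least_one_sweep(8, repeats), 3))
```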

The one at the end of your long string of zeros is without a doubt, dirt (if it came from empirical experiments and measurements).
Not if the experiment involved 10^100 games. You can really measure things very precisely with such a large number of games. Counting with integers involves no loss of precision, and a counter that can count to 10^100 is actually quite a small machine. You could afford hundreds of them, and cross-check those to see if one is in error or not.
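To give a sense of how small such a counter is, here is a sketch using Python's arbitrary-precision integers. The comparison with a 64-bit float is added only for contrast and is not part of the original argument.

```python
googol = 10 ** 100

bits = googol.bit_length()      # bits needed to hold the value 10^100
bytes_needed = (bits + 7) // 8  # rounded up to whole bytes
print(bits, bytes_needed)

# Integer counting is exact: 8 extra ticks on top of a googol are not lost.
print((googol + 8) - googol)

# A 64-bit float cannot resolve 8 ticks at this magnitude.
print(float(googol) + 8.0 == float(googol))
```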

But the 10^100 was just proverbial, so nitpicking over it makes no sense. In reality you could never play 10^100 games, not even if you turned all matter in the Universe into PCs and set them playing, before the Universe collapsed into a black hole, or all protons in it decayed to positrons. For the argument you are trying to make, having 8 wins + a billion draws (and no losses) would be just as effective. If a billion draws have zero impact on the LOS, 10^100 will have too, right? 10^100 times 0 is still zero.

My googol is covered in another post. The purpose of choosing a googol is that it would make the result obvious to any thinking person.
The same idea works with a million games. It is just that the result is not quite as extreme or obvious. But if two engines have a million minus 8 draws and a single engine wins the 8 non-drawn games, they are still the same strength.

hgm · Post by **hgm** » Wed Jul 01, 2020 12:01 am

Dann Corbit wrote: ↑Tue Jun 30, 2020 11:43 pm I can run a billion, quadrillion, or googol or even googolplex trials in a gedankenexperiment. That's what is so nice about them.
And a googol draws tells us that the engines are equal.
A few wins for either side tells us exactly nothing about superiority in that case.

What is also nice about gedankenexperiments is that you can make them free of technical flaws, run on hardware that never fails, recording the result with counters that never miss a tick. Not that they would suffer much wear, if they only have to tick 8 times...

hgm · Post by **hgm** » Wed Jul 01, 2020 12:15 am

Dann Corbit wrote: ↑Tue Jun 30, 2020 11:58 pm And when we divide 917 by 1000 using our integers we get an answer of zero.
That is why we tend to do floating point calculations. And even an 8 byte integer cannot hold a number bigger than 2^64 - 1.

Well, so use a 100-byte integer then. Even in 1987 on my 6509-based home-built computer I was doing math calculations with a precision of 80 digits.

I cannot speak for you, but when I was in elementary school they taught me this wonderful concept of fractions. Later in high school this was formalized to the theory of rational numbers; and it was proven that this set is closed under division (by anything other than zero). You can do any calculation with +, -, * and / without any loss of precision.
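Exact rational arithmetic of this kind is available in Python's standard `fractions` module. The sketch below keeps 917/1000 as an exact fraction and runs the four basic operations without rounding; the particular numbers in `result` are arbitrary and chosen only for illustration.

```python
from fractions import Fraction

score = Fraction(917, 1000)  # stored exactly as 917/1000, not truncated to 0
print(score)

# +, -, * and / on rationals stay exact.
result = (Fraction(1, 3) + Fraction(1, 6)) * 3 / Fraction(7, 2) - Fraction(1, 14)
print(result)  # 5/14

# For contrast, the same kind of sum in binary floating point is inexact.
print(0.1 + 0.2 == 0.3)                                      # False
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True
```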

All experiments are flawed to some degree, because humans are not perfect and so we design imperfectly. We also cannot measure perfectly or calculate perfectly.

We can in gedanken experiments. If you can play 10^100 games of chess, you should certainly be able to count to 10^100; that is a comparatively extremely simple task. In fact it even requires less logic than playing a single chess game. If you step back into reality, and take a billion games... It is completely trivial to count to a billion without errors.

The chance of an 8-0 result is 1 out of 256 so if we run an experiment enough times it is sure to happen.

Yes, but the other results would happen much more often, if they were random. That is the point: you get a result that is unlikely to be the result of random chance. Which makes it likely that it is the result of something systematic. And if you know your job, that something else could only be the engine performance.
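How much more often the other results would occur can be listed exactly. This sketch assumes 8 independent decisive games between equal engines, each a 50/50 event, so the number of wins follows a binomial distribution. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def score_distribution(decisive_games: int) -> dict:
    """Exact probability of each win count when every decisive game is 50/50."""
    total = 2 ** decisive_games
    return {wins: Fraction(comb(decisive_games, wins), total)
            for wins in range(decisive_games + 1)}

dist = score_distribution(8)
for wins, p in dist.items():
    print(f"{wins}-{8 - wins}: {p} ({float(p):.4f})")
```

Under these assumptions a 4-4 split is 70 times as likely as an 8-0 sweep by one specified side.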

Dann Corbit · Post by **Dann Corbit** » Wed Jul 01, 2020 12:29 am

hgm wrote: ↑Wed Jul 01, 2020 12:15 am
Dann Corbit wrote: ↑Tue Jun 30, 2020 11:58 pm And when we divide 917 by 1000 using our integers we get an answer of zero.
That is why we tend to do floating point calculations. And even an 8 byte integer cannot hold a number bigger than 2^64 - 1.
Well, so use a 100-byte integer then. Even in 1987 on my 6509-based home-built computer I was doing math calculations with a precision of 80 digits.

I cannot speak for you, but when I was in elementary school they taught me this wonderful concept of fractions. Later in high school this was formalized to the theory of rational numbers; and it was proven that this set is closed under division (by anything other than zero). You can do any calculation with +, -, * and / without any loss of precision.

Yes, I have calculated pi to thousands of correct digits using numerical integration using only fractions. I used the MIRACL rational library and the fractional nodes and weights of Recursive Monotone Stable numerical integration.
Even so, experiments have measurement errors, calculation errors, systematic errors, and random errors.

All experiments are flawed to some degree, because humans are not perfect and so we design imperfectly. We also cannot measure perfectly or calculate perfectly.
We can in gedanken experiments. If you can play 10^100 games of chess, you should certainly be able to count to 10^100; that is a comparatively extremely simple task. In fact it even requires less logic than playing a single chess game. If you step back into reality, and take a billion games... It is completely trivial to count to a billion without errors.

But running a billion chess games without randomness has no chance of happening, because the outcomes are somewhat random. Far more random than the 8 events out of a googol we chose to decide an engine was superior, ignoring, all the while, the googol draws that prove beyond any shadow of a doubt that the engines have exactly the same strength.

The chance of an 8-0 result is 1 out of 256 so if we run an experiment enough times it is sure to happen.
Yes, but the other results would happen much more often, if they were random. That is the point: you get a result that is unlikely to be the result of random chance. Which makes it likely that it is the result of something systematic. And if you know your job, that something else could only be the engine performance.

My little coin flipper shows that with millions of games, there will be a large spread of possible outcomes, with exact equality being quite unlikely. The larger the number of games, the bigger the spread of possible outcomes. Of course, as a ratio compared to the entire data set, the outcome will veer towards unity and eventually arrive within an infinitesimal distance of showing the engines are equal (as a ratio). But we would clearly expect to be off by millions in raw outcome numbers one way or another, if we ran a googol trials. That is why 8 wins means nothing.

((one googol / 2) - one million) / (one googol) is so close to one half we should be thrilled that we only had one million to one side or the other.
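The scaling described for pure coin flips (a raw spread that grows with the number of trials while the ratio converges) can be written down directly. The coin flipper's own code is not shown in the thread, so this is only a sketch of the standard binomial result. The `draw_rate` parameter is an assumption added here to connect the coin-flip picture to games that can also be drawn, with a draw counted as half a point.

```python
import math

def diff_std(games: int, draw_rate: float = 0.0) -> float:
    """Standard deviation of (wins - losses) for two equal opponents,
    where each game is a draw with probability `draw_rate`."""
    return math.sqrt(games * (1.0 - draw_rate))

def ratio_std(games: int, draw_rate: float = 0.0) -> float:
    """Standard deviation of the score fraction (draw = half a point)."""
    return 0.5 * math.sqrt((1.0 - draw_rate) / games)

for n in (10**6, 10**9, 10**100):
    print(f"n = {n:.0e}: spread ~ {diff_std(n):.3g}, ratio spread ~ {ratio_std(n):.3g}")

# Same number of games, but 99.9% of them drawn: far fewer coin flips remain.
print(diff_std(10**6, draw_rate=0.999))
```

With no draws, a googol fair flips gives an expected spread of about 10^50. As the draw rate approaches 1 the spread shrinks towards zero, which is the situation hgm describes earlier in the thread.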

Dann Corbit · Post by **Dann Corbit** » Wed Jul 01, 2020 12:48 am

hgm wrote: ↑Tue Jun 30, 2020 11:55 pm But it doesn't make absurd predictions; it makes accurate predictions. It is just that your thinking is at odds with reality.

An engine A that is infinitely stronger than engine B has exactly the same probability of being stronger as an engine with exactly the same strength.
That is absurd.

Either the Elo calculation is wrong, or the LOS calculation is wrong.
Elo says one is infinitely stronger than the other and the other set has exactly the same strength.
Yet LOS says both cases have exactly the same odds that A is stronger than B.

That is absurd.
They cannot both be right.
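To make the Elo side of this comparison concrete, here is a sketch using the conventional logistic Elo formula, 400 * log10(s / (1 - s)), with draws counted as half a point. The thread does not state which formula is meant, so that choice and the function name are assumptions.

```python
import math

def elo_diff(wins: int, losses: int, draws: int) -> float:
    """Elo difference implied by a match score under the logistic model."""
    games = wins + losses + draws
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games
    if score >= 1.0:
        return math.inf   # a 100% score maps to an unbounded rating gap
    if score <= 0.0:
        return -math.inf
    return 400.0 * math.log10(score / (1.0 - score))

print(elo_diff(8, 0, 0))          # draws thrown out: inf
print(elo_diff(8, 0, 10**6 - 8))  # draws kept: a very small positive number
print(elo_diff(5, 5, 0))          # an even score: 0.0
```

The Elo estimate changes enormously depending on whether the draws are counted, which is a separate question from how likely it is that the difference is above zero at all.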

Ovyron · Post by **Ovyron** » Wed Jul 01, 2020 1:08 am

Dann Corbit wrote: ↑Tue Jun 30, 2020 11:34 pm Maybe we do have difficulty understanding each other. I guess my problem is that when something makes no sense to me, I don't believe it.

I could, of course, be wrong. I am wrong a lot. But when the outcome of a model says something stupid, I think the model is wrong.

So this thread becomes about trying to make LOS make sense to you.

First, let's talk about "Superiority". It is a thing that exists that tells you who is factually stronger.

Superiority is either 1 if it exists or 0 if it doesn't. Note 0 can exist on both sides if an engine plays itself (I won't say "identical", because 0 0 superiority is just self-play.)

Now, let's say there are these things:

10^100 entities called "A"

10^100 entities called "B"

As are factually superior to B. The LOS of A over B entities is 100%.

Now, we have them play 10^100 games. Whenever As play each other there are mostly draws, and whenever Bs play each other there are mostly draws; otherwise As always have more wins than Bs when they play each other, but A could be superior to B by just 0.000000...add a bunch of zeroes...00000001 Elo.

AFTER that's in place, you have this result:

(10^100)-8 games were drawn between ENTITY ONE and ENTITY TWO. ENTITY ONE won 8 games.

What LOS is trying to answer is what is the chance that ENTITY ONE is from the As and ENTITY TWO is from the Bs

That's all, and in most scenarios, it'll happen in fact that ENTITY ONE was A and ENTITY TWO was B.

LOS was right most of the time.

A single case where A plays A or B plays B and we get this result and LOS is wrong ignores all the other possible cases.

Hope this clears things up. LOS is trying to guess the chance that superiority exists for one of the sides (which includes all these possibilities), and you don't need to include draws for this.
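One way to make "the chance that ENTITY ONE is from the As" concrete is a Bayesian reading. The sketch below assumes a uniform prior on p, the probability that ENTITY ONE wins a decisive game, and takes LOS to be the posterior probability that p > 1/2. The choice of prior and the function name are assumptions, not something stated in the thread. With integer counts, the resulting Beta posterior reduces to an exact binomial sum.

```python
from fractions import Fraction
from math import comb

def los_exact(wins: int, losses: int) -> Fraction:
    """Posterior P(p > 1/2) given the decisive games, uniform prior on p.

    The posterior is Beta(wins + 1, losses + 1); for integer parameters its
    upper tail at 1/2 equals P(Binomial(wins + losses + 1, 1/2) <= wins).
    """
    n = wins + losses + 1
    favourable = sum(comb(n, k) for k in range(wins + 1))
    return Fraction(favourable, 2 ** n)

print(los_exact(8, 0))      # 511/512
print(los_exact(508, 500))  # a little above 1/2
print(los_exact(5, 5))      # 1/2
```

Under this model the number of draws does not appear as an input, because only the decisive games carry information about which side p favours.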

Pio · Post by **Pio** » Wed Jul 01, 2020 1:30 am

Dann Corbit wrote: ↑Wed Jul 01, 2020 12:48 am
hgm wrote: ↑Tue Jun 30, 2020 11:55 pm But it doesn't make absurd predictions; it makes accurate predictions. It is just that your thinking is at odds with reality.
An engine A that is infinitely stronger than engine B has exactly the same probability of being stronger as an engine with exactly the same strength.
That is absurd.

Either the Elo calculation is wrong, or the LOS calculation is wrong.
Elo says one is infinitely stronger than the other and the other set has exactly the same strength.
Yet LOS says both cases have exactly the same odds that A is stronger than B.

That is absurd.
They cannot both be right.

Hi Dann!

They can, and are, both right.

It is as likely that I would win a marathon as it is a snail would win a marathon even though I am a lot faster than a snail.

100 ten bills are bigger than 100 nine bills and so are 100 billion bills compared to 100 nine bills. It does not matter how much bigger each bill is.

/Pio

Dann Corbit · Post by **Dann Corbit** » Wed Jul 01, 2020 1:48 am

They can both be right, I have no problem with that.
But you do not have the same probability of winning the marathon as the snail does.
You also might get hit by a truck and the snail would beat you.

My problem with the calculation is that it says the odds that one is superior are identical.

Clearly, that is wrong.
And if there are about a googol pieces of evidence saying they are the same strength, I think that is a lot more convincing than one win out of a googol divided by 8.

I don't think the math is wrong (now that I have looked it over). But I am quite sure that the model is wrong.
I don't think I have ever been more sure of anything in my life.
I find it strange that people can think something that draws a googol times and loses only 8 is weaker. Because it's not.
