Throwing out draws to calculate Elo
Moderators: bob, hgm, Harvey Williamson

 Posts: 11265
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
Re: Throwing out draws to calculate Elo
Let me be perfectly clear:
If there were no such thing as randomness in engine game outcomes, then my argument would be completely wrong.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
 hgm
 Posts: 24667
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller
Re: Throwing out draws to calculate Elo
Dann Corbit wrote: ↑Tue Jun 30, 2020 9:34 pm
"But if absolutely identically equal engines play each other thousands of times, the number of wins and losses will not be the same."
They would be if the engines played perfect chess. The numbers of wins and losses would then always be zero, because all games would end in a draw.
"You know this, play Stockfish against itself for one hundred games."
But Stockfish does not play perfect chess. It loses games. Play a good Tic-Tac-Toe engine against itself; then the number of wins will always be the same.
"Hence, a small difference in wins and losses does not tell us which engine is stronger. In order to know if it *might* be stronger, it must be outside of the error bands."
The problem is that you don't grasp what is 'small'. We discussed this before in connection with a WCCC result, and it turned out you are really clueless where statistics is concerned. Statistically speaking, 8-0 is NOT a small difference. 508-500 is a small difference, however. Whether a difference is statistically significant or not depends on the ratio of the difference to the standard deviation. If the result is almost always a draw, the standard deviation is almost zero.
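As a rough illustration of the ratio argument, here is a minimal Python sketch. It assumes independent games scored +1/0/-1 and estimates the per-game variance from the observed win, draw and loss frequencies (a normal approximation); `score_diff_z` is a hypothetical helper, not something posted in the thread.

```python
import math

def score_diff_z(wins: int, losses: int, draws: int) -> float:
    """z-score of (wins - losses), treating each game as +1, 0 or -1.

    The per-game variance is estimated from the observed frequencies, so a
    high draw rate shrinks the standard deviation of the total.
    """
    n = wins + losses + draws
    mean = (wins - losses) / n              # average result per game
    second_moment = (wins + losses) / n     # E[x^2]; draws contribute 0
    var_total = n * (second_moment - mean ** 2)
    if var_total <= 0.0:
        # Every game had the same outcome, so there is no observed spread.
        return 0.0 if wins == losses else math.copysign(math.inf, wins - losses)
    return (wins - losses) / math.sqrt(var_total)

# The same +8 difference, with very different significance.
print(round(score_diff_z(8, 0, 1_000_000), 2))   # 8-0 plus a million draws
print(round(score_diff_z(508, 500, 0), 2))       # 508-500 with no draws
```

Under these assumptions the 8-0 result stays near z = 2.83 however many draws are added, while 508-500 sits near z = 0.25.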
"Your accidental forfeit once in a million games will not show up in a contest of equal engines, because the noise of randomness will far exceed the once-in-a-million loss."
That depends. For engines that always draw there is zero noise; the match results are not random at all.
"And an engine with an infinite Elo advantage should not have the same LOS as an engine with a zero Elo advantage. It could still be stronger, but the probability should be different."
So you insist. And you are wrong. Wrong will not turn into right no matter how persistent you are.
"Maybe we do have difficulty understanding each other. I guess my problem is that when something makes no sense to me, I don't believe it."
You have difficulty understanding everyone else. We understand perfectly what you are saying, and we know it to be wrong.
"I think that model is based only on math and not on probability. Otherwise it would not make absurd predictions."
But it doesn't make absurd predictions; it makes accurate predictions. It is just that your thinking is at odds with reality.

 Posts: 11265
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
Re: Throwing out draws to calculate Elo
hgm wrote: ↑Tue Jun 30, 2020 9:40 pm
Dann Corbit wrote: ↑Tue Jun 30, 2020 8:36 pm
"Summing a column of numbers in a computer gives two different answers, depending on the direction of summation (numerical calculation error)."
"Not if they are integers."
And when we divide 917 by 1000 using our integers we get an answer of zero.
That is why we tend to do floating-point calculations. And even an 8-byte integer cannot hold a number bigger than 2^64 - 1.
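Both claims in this exchange are easy to check. A minimal Python sketch with an arbitrary illustrative column of numbers; `math.fsum` is the standard-library correctly rounded sum:

```python
import math

# Floating-point addition is not associative: the summation order matters.
column = [1.0, 1e17, -1e17]

forward = 0.0
for x in column:              # (1.0 + 1e17) rounds back to 1e17, so the 1.0 is lost
    forward += x

backward = 0.0
for x in reversed(column):    # -1e17 + 1e17 cancels exactly first, then + 1.0
    backward += x

print(forward, backward)      # 0.0 1.0
print(math.fsum(column))      # 1.0: correctly rounded sum, order-independent

# The same column held as Python integers sums exactly in either direction.
int_column = [1, 10**17, -10**17]
print(sum(int_column), sum(reversed(int_column)))   # 1 1

# Integer division truncates, so 917 / 1000 becomes 0.
print(917 // 1000)            # 0
```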
"The design of the experiment can be imperfect (e.g. one machine is ever so slightly stronger than the other, so we are not measuring the strength of the software difference but of the hardware difference)."
"Flawed experiments can tell you nothing, whether they have a high draw rate or not. You will always measure the sum of the flaws and the engine strength."
All experiments are flawed to some degree, because humans are not perfect and so we design imperfectly. We also cannot measure perfectly or calculate perfectly. And even if we could design a perfect experiment, it is not immune to probability.
"Note, however, that the experiment must be very flawed in order to produce an 8-0 result. This would only be possible in the most careless design. Like making sure that each program uses the same machine the same number of times."
The chance of an 8-0 result is 1 out of 256, so if we run an experiment enough times it is sure to happen.
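The 1-in-256 figure is easy to reproduce. A small Python sketch, assuming each decisive game is an independent 50/50 coin flip (the helper names are hypothetical); it also shows that the figure doubles if either engine may be the one sweeping, and how the chance of at least one sweep grows as the experiment is repeated:

```python
from fractions import Fraction

def prob_sweep(n_decisive: int) -> Fraction:
    """Chance that one named engine wins all n decisive games,
    assuming each decisive game is an independent 50/50 coin flip."""
    return Fraction(1, 2) ** n_decisive

def prob_at_least_once(p: Fraction, repeats: int) -> float:
    """Chance that an event of probability p occurs in at least one of
    `repeats` independent repetitions of the experiment."""
    return 1.0 - (1.0 - float(p)) ** repeats

p = prob_sweep(8)
print(p)                                  # 1/256 for one named side
print(2 * p)                              # 1/128 if either side may sweep
print(prob_at_least_once(p, 256))
print(prob_at_least_once(p, 1000))
```

Under this assumption 256 repetitions give roughly a 63% chance of at least one sweep by a named engine; the probability approaches 1 as repetitions grow but never reaches it.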
"The one at the end of your long string of zeros is, without a doubt, dirt (if it came from empirical experiments and measurements)."
"Not if the experiment involved 10^100 games. You can really measure things very precisely with such a large number of games. Counting with integers involves no loss of precision, and a counter that can count to 10^100 is actually quite a small machine. You could afford hundreds of them, and cross-check those to see if one is in error or not.
But the 10^100 was just proverbial, so nitpicking over it makes no sense. In reality you could never play 10^100 games, not even if you turned all matter in the Universe into PCs and set them playing, before the Universe collapsed into a black hole, or all protons in it decayed to positrons. For the argument you are trying to make, having 8 wins + a billion draws (and no losses) would be just as effective. If a billion draws have zero impact on the LOS, 10^100 will have too, right? 10^100 times 0 is still zero."
My googol is covered in another post. The purpose of choosing a googol is that it would make the result obvious to any thinking person.
The same idea works with a million games. It is just that the result is not quite as extreme or obvious. But if two engines have (one million - 8) draws and a single engine wins the 8 non-drawn games, they are still the same strength.
 hgm
 Posts: 24667
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller
Re: Throwing out draws to calculate Elo
Dann Corbit wrote: ↑Tue Jun 30, 2020 9:43 pm
"I can run a billion, quadrillion, googol, or even googolplex trials in a gedankenexperiment. That's what is so nice about them.
And a googol draws tells us that the engines are equal.
A few wins for either side tell us exactly nothing about superiority in that case."
What is also nice about gedankenexperiments is that you can make them free of technical flaws, run on hardware that never fails, recording the result with counters that never miss a tick. Not that they would suffer much wear, if they only have to tick 8 times...
 hgm
 Posts: 24667
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller
Re: Throwing out draws to calculate Elo
Dann Corbit wrote: ↑Tue Jun 30, 2020 9:58 pm
"And when we divide 917 by 1000 using our integers we get an answer of zero.
That is why we tend to do floating-point calculations. And even an 8-byte integer cannot hold a number bigger than 2^64 - 1."
Well, so use a 100-byte integer then. Even in 1987 on my 6509-based home-built computer I was doing math calculations with a precision of 80 digits.
I cannot speak for you, but when I was in elementary school they taught me this wonderful concept of fractions. Later in high school this was formalized into the theory of rational numbers, and it was proven that this set is closed under division. You can do any calculation with +, -, * and / without any loss of precision.
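A minimal Python sketch of this point, using the standard-library `fractions.Fraction` type and Python's arbitrary-precision integers (the example values are illustrative):

```python
from fractions import Fraction

# Rationals are closed under +, -, * and / (by non-zero values): nothing is rounded.
x = Fraction(917, 1000)          # exactly 917/1000, not truncated to 0
print(x, x * 1000)               # 917/1000 917

# Summation order cannot change an exact rational result.
column = [Fraction(1, 3), Fraction(10**17), Fraction(-10**17)]
print(sum(column) == sum(reversed(column)) == Fraction(1, 3))   # True

# Python integers have arbitrary precision, so counts beyond 2**64 - 1 are fine.
googol = 10**100
print(googol > 2**64 - 1, len(str(googol)))   # True 101
```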
"All experiments are flawed to some degree, because humans are not perfect and so we design imperfectly. We also cannot measure perfectly or calculate perfectly."
We can in gedanken experiments. If you can play 10^100 games of chess, you should certainly be able to count to 10^100; that is a comparatively extremely simple task. In fact it even requires less logic than playing a single chess game. If you step back into reality, and take a billion games... It is completely trivial to count to a billion without errors.
"The chance of an 8-0 result is 1 out of 256 so if we run an experiment enough times it is sure to happen."
Yes, but the other results would happen much more often, if they were random. That is the point: you get a result that is unlikely to be the result of random chance, which makes it likely that it is the result of something systematic. And if you know your job, that something else could only be the engine performance.

 Posts: 11265
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
Re: Throwing out draws to calculate Elo
hgm wrote: ↑Tue Jun 30, 2020 10:15 pm
Dann Corbit wrote: ↑Tue Jun 30, 2020 9:58 pm
"And when we divide 917 by 1000 using our integers we get an answer of zero.
That is why we tend to do floating-point calculations. And even an 8-byte integer cannot hold a number bigger than 2^64 - 1."
"Well, so use a 100-byte integer then. Even in 1987 on my 6509-based home-built computer I was doing math calculations with a precision of 80 digits.
I cannot speak for you, but when I was in elementary school they taught me this wonderful concept of fractions. Later in high school this was formalized into the theory of rational numbers, and it was proven that this set is closed under division. You can do any calculation with +, -, * and / without any loss of precision."
Yes, I have calculated pi to thousands of correct digits by numerical integration using only fractions. I used the MIRACL rational library and the fractional nodes and weights of Recursive Monotone Stable numerical integration.
Even so, experiments have measurement errors, calculation errors, systematic errors, and random errors.
"All experiments are flawed to some degree, because humans are not perfect and so we design imperfectly. We also cannot measure perfectly or calculate perfectly."
"We can in gedanken experiments. If you can play 10^100 games of chess, you should certainly be able to count to 10^100; that is a comparatively extremely simple task. In fact it even requires less logic than playing a single chess game. If you step back into reality, and take a billion games... It is completely trivial to count to a billion without errors."
But running a billion chess games without randomness has no chance of happening, because the outcomes are somewhat random. Far more random than the 8 events out of a googol we chose to decide an engine was superior, all the while ignoring the googol draws that prove beyond any shadow of a doubt that the engines have exactly the same strength.
"The chance of an 8-0 result is 1 out of 256 so if we run an experiment enough times it is sure to happen."
"Yes, but the other results would happen much more often, if they were random. That is the point: you get a result that is unlikely to be the result of random chance. Which makes it likely that it is the result of something systematic. And if you know your job, that something else could only be the engine performance."
My little coin flipper shows that with millions of games, there will be a large spread of possible outcomes, with exact equality being quite unlikely. The larger the number of games, the bigger the spread of possible outcomes. Of course, as a ratio compared to the entire data set, the outcome will veer towards unity and eventually arrive within an infinitesimal distance of showing the engines are equal (as a ratio). But we would clearly expect to be off by millions in raw outcome numbers one way or another, if we ran a googol trials. That is why 8 wins means nothing.
((one googol / 2) - one million) / (one googol) is so close to one half that we should be thrilled that we were only one million off to one side or the other.
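The coin flipper itself is not posted in the thread, so the following is only a hypothetical Python reconstruction of the idea: it assumes every game is decisive and decided by a fair coin flip, and measures the typical size of wins minus losses. The last lines compute the ratio from the post exactly with `fractions.Fraction`.

```python
import math
import random
from fractions import Fraction

def decisive_spread(n_games: int, trials: int, seed: int = 1) -> float:
    """Root-mean-square of (wins - losses) over `trials` simulated matches,
    each of n_games decisive games decided by a fair coin flip."""
    rng = random.Random(seed)
    total_sq = 0
    for _ in range(trials):
        wins = bin(rng.getrandbits(n_games)).count("1")   # one random bit per game
        diff = 2 * wins - n_games                          # wins - losses
        total_sq += diff * diff
    return math.sqrt(total_sq / trials)

# The raw spread grows roughly like sqrt(n), so spread / sqrt(n) stays near 1.
for n in (100, 10_000, 1_000_000):
    spread = decisive_spread(n, trials=200)
    print(n, round(spread), round(spread / math.sqrt(n), 2))

# The ratio from the post, computed exactly.
googol = 10**100
ratio = Fraction(googol // 2 - 10**6, googol)
print(float(ratio), Fraction(1, 2) - ratio == Fraction(1, 10**94))   # 0.5 True
```

Under these assumptions the raw difference grows roughly like the square root of the number of decisive games, while the exact ratio is indistinguishable from 1/2 in floating point.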

 Posts: 11265
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
Re: Throwing out draws to calculate Elo
An engine A that is infinitely stronger than engine B has exactly the same probability of being stronger as an engine with exactly the same strength.
That is absurd.
Either the Elo calculation is wrong, or the LOS calculation is wrong.
Elo says one is infinitely stronger than the other and the other set has exactly the same strength.
Yet LOS says both cases have exactly the same odds that A is stronger than B.
That is absurd.
They cannot both be right.
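The two cases can be put side by side in a short Python sketch. It assumes the usual logistic Elo formula applied to the match score, and the normal-approximation LOS formula commonly quoted for engine testing, LOS = 1/2 [1 + erf((W - L) / sqrt(2(W + L)))]; the helper names are hypothetical.

```python
import math

def elo_from_score(wins: int, losses: int, draws: int) -> float:
    """Logistic Elo difference implied by the match score (a draw counts 1/2)."""
    score = (wins + 0.5 * draws) / (wins + losses + draws)
    if score in (0.0, 1.0):
        # A clean sweep: the logistic formula diverges.
        return math.copysign(math.inf, score - 0.5)
    return -400.0 * math.log10(1.0 / score - 1.0)

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority, normal approximation; draws do not enter."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

for draws in (0, 1_000_000):
    print(draws, elo_from_score(8, 0, draws), round(los(8, 0), 4))
```

With 8 wins and 0 losses the Elo estimate is unbounded with no draws and roughly 0.003 Elo with a million draws, while LOS is about 0.9977 in both cases, because this formula only sees W and L.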
Re: Throwing out draws to calculate Elo
Dann Corbit wrote: ↑Tue Jun 30, 2020 9:34 pm
"Maybe we do have difficulty understanding each other. I guess my problem is that when something makes no sense to me, I don't believe it.
I could, of course, be wrong. I am wrong a lot. But when the outcome of a model says something stupid, I think the model is wrong."
So this thread becomes about trying to make LOS make sense to you.
First, let's talk about "Superiority". It is a thing that exists that tells you who is factually stronger.
Superiority is either 1 if it exists or 0 if it doesn't. Note that 0 can exist on both sides if an engine plays itself (I won't say "identical", because 0-0 superiority is just self-play.)
Now, let's say there are these things:
10^100 entities called "A"
10^100 entities called "B"
As are factually superior to Bs. The LOS of the A entities over the B entities is 100%.
Now, we have them play 10^100 games. Whenever As play each other there are mostly draws, and whenever Bs play each other there are mostly draws; otherwise the As always have more wins than the Bs when they play each other, but an A could be superior to a B by just 0.000000...add a bunch of zeroes...00000001 Elo.
AFTER that's in place, you have this result:
(10^100) - 8 games were drawn between ENTITY ONE and ENTITY TWO. ENTITY ONE won 8 games.
What LOS tries to answer is the chance that ENTITY ONE is from the As and ENTITY TWO is from the Bs.
That's all, and in most scenarios ENTITY ONE will in fact turn out to be an A and ENTITY TWO a B.
LOS was right most of the time.
A single case where A plays A or B plays B and we get this result and LOS is wrong ignores all the other possible cases.
Hope this clears things up: LOS is trying to estimate the chance that superiority exists for one of the sides (which includes all these possibilities), and you don't need to include draws for this.
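One way to make "LOS without draws" concrete is a Bayesian sketch. Assume (this is an assumption, not necessarily the formula anyone in the thread used) that each decisive game is won by engine A with some unknown probability p, put a uniform prior on p, and ask for the posterior probability that p > 1/2. Draws never enter the calculation; `los_exact` is a hypothetical helper.

```python
from fractions import Fraction
from math import comb

def los_exact(wins: int, losses: int) -> Fraction:
    """P(p > 1/2 | wins, losses), where p is the chance that a decisive game
    goes to engine A, under a uniform prior on p.

    The posterior is Beta(wins + 1, losses + 1). For integer parameters its
    upper tail at 1/2 equals P(Y <= wins) with Y ~ Binomial(wins + losses + 1, 1/2).
    """
    n = wins + losses + 1
    favourable = sum(comb(n, k) for k in range(wins + 1))
    return Fraction(favourable, 2 ** n)

print(los_exact(8, 0), float(los_exact(8, 0)))   # 511/512 0.998046875
print(los_exact(3, 3))                           # 1/2
```

For 8 wins and 0 losses this gives 511/512, close to, but not the same as, the erf-based normal approximation often quoted for LOS; an even score gives exactly 1/2.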
Re: Throwing out draws to calculate Elo
Hi Dann!
Dann Corbit wrote: ↑Tue Jun 30, 2020 10:48 pm
"An engine A that is infinitely stronger than engine B has exactly the same probability of being stronger as an engine with exactly the same strength.
That is absurd.
Either the Elo calculation is wrong, or the LOS calculation is wrong.
Elo says one is infinitely stronger than the other and the other set has exactly the same strength.
Yet LOS says both cases have exactly the same odds that A is stronger than B.
That is absurd.
They cannot both be right."
They can, and are, both right.
It is as likely that I would win a marathon as it is that a snail would win a marathon, even though I am a lot faster than a snail.
100 ten bills are bigger than 100 nine bills and so are 100 billion bills compared to 100 nine bills. It does not matter how much bigger each bill is.
/Pio

 Posts: 11265
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
Re: Throwing out draws to calculate Elo
They can both be right, I have no problem with that.
But you do not have the same probability of winning the marathon as the snail does.
You also might get hit by a truck and the snail would beat you.
My problem with the calculation is that it says the odds that one is superior are identical.
Clearly, that is wrong.
And if there are about a googol pieces of evidence saying they are the same strength, I think that is a lot more convincing than one win out of a googol divided by 8.
I don't think the math is wrong (now that I have looked it over). But I am quite sure that the model is wrong.
I don't think I have ever been more sure of anything in my life.
I find it strange that people can think something that draws a googol times and loses only 8 is weaker. Because it's not.