Throwing out draws to calculate Elo

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 2:12 am

Yes and no, in that you can call it 1/256, but the width of the error bars on the measurement should be much, much wider. In fact, so wide as to render the data meaningless.
So the number is the same. But the quality of the number is lower. Lower to the point of uselessness.

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 2:24 am

If the draws do not matter in understanding who is stronger, why does the Elo calculation get a totally wrong answer if you set the draws to zero?

I do understand we are looking for a razor turning point and not a magnitude. But I think it should be more obvious which is stronger if we know an engine is ten times stronger instead of .01% stronger. Or it should affect the size of our confidence interval.

Pio · Post by **Pio** » Fri Jul 03, 2020 2:28 am

Dann Corbit wrote: ↑Fri Jul 03, 2020 1:16 am
Pio wrote: ↑Fri Jul 03, 2020 1:11 am
Dann Corbit wrote: ↑Fri Jul 03, 2020 12:43 am Yes, so far we have not even addressed my difficulty with throwing out the draw information.
I do think that the fact that LOS has enormous difficulty determining that two engines of equal strength are equal and the fact that it is based on the spread between wins and losses should give us pause when applying it to very long matches. And I also doubt the accuracy when based on a very short sequence of games (but of course this is not different from any other statistical measure).

At some point, I think it makes sense to address why throwing out an enormous number of data points that describe equality and consuming a tiny number of points that describe inequality makes sense.
Hi Dann!

The thing is that the enormous number of data points that are draws does not say anything about which of the players is better. And if it says something, it actually says the opposite of what your intuition is telling you. As I have said in some other posts, having lots of draws actually makes the wins count more if you think not all draws are equal, i.e. if you think that some draws are closer to wins than others. The reason is that if you assume the player with no wins is the better one, it would be highly unlikely for him to have played so many draws without any wins.

/Pio
This is a really interesting idea and I will have to think long and hard about it.

My difficulty is this:
If my opponent is better than me, it is difficult even to achieve a draw, especially if he/she/they/it are a lot better than me.
And so, if I see one hundred draws, that seems to be sending me a big signal of "Equality, equality, equality..." and so collecting an enormous amount of this kind of data indicates equality to me. On the other hand, I must admit that the Monty Hall problem and the Birthday paradox were hard for me to understand until I really understood properly the math behind them. But I cannot accept it until I understand it. If I am wrong, I do hope to somehow understand why the equality data does not matter. My problem is that I suspect the model (not the math). So even though the math works, I do not feel convinced that it is right.

The model (LOS) does not tell you anything about how much better one of the players is. You can think of the coin flipping as -1 for tails, 1 for heads and 0 if it lands on its side. You can also think that 1 means I won, 0 means a draw and -1 means the opponent won. We now have a one-dimensional discrete Brownian motion. The zeroes will not affect the distance from the starting point, so you can discard those. A Brownian motion will on average move from the centre proportionally to the square root of the total length travelled. That implies that it is not likely that two equal opponents will have LOS very close to 0.5.

If you for example double the number of coin flips, you could first say “Hey, I am so tired of doing double the coin flips, why not be lazy and flip the coin the same number of times as last time, but if it is tails I go -2 steps and if it is heads I go 2 steps.” Now you see that the distribution of the 2-sized steps is exactly like that of the 1-sized steps, i.e. the step size does not matter. Now we have shown that the number of coin flips does not change the LOS. It will have the same distribution.

Now if one player is slightly better, the centre will move proportionally more to either side depending on who is better, and the random effect will still be proportional to the square root. Since sqrt(x) is less than x, the non-random effect will dominate and will pull the result further and further from the centre.
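
The random-walk picture in this post can be checked numerically. What follows is a minimal sketch, not code from anyone in the thread: the names `los` and `margins` are made up for illustration, the win/loss probabilities are arbitrary, and `los` uses the common normal approximation LOS = 0.5 * (1 + erf((W - L) / sqrt(2 * (W + L)))), which is assumed here rather than taken from the posts.

```python
import math
import random


def los(wins, losses):
    """Normal-approximation LOS. Draws never enter the formula."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))


def margins(games, p_win, p_loss, trials, seed=0):
    """Simulate `trials` matches and return the final (wins - losses) of each.

    Every game is +1 with probability p_win, -1 with probability p_loss,
    and 0 (a draw) otherwise, i.e. one step of the walk described above.
    """
    rng = random.Random(seed)
    results = []
    for _trial in range(trials):
        margin = 0
        for _game in range(games):
            r = rng.random()
            if r < p_win:
                margin += 1
            elif r < p_win + p_loss:
                margin -= 1
        results.append(margin)
    return results


def mean(values):
    values = list(values)
    return sum(values) / len(values)


if __name__ == "__main__":
    # Equal players, 80% draws: four times the games, about twice the spread.
    short = mean(abs(m) for m in margins(1000, 0.10, 0.10, 300, seed=1))
    long_ = mean(abs(m) for m in margins(4000, 0.10, 0.10, 300, seed=2))
    print(f"equal players, mean |W-L|: {short:.1f} (1000 games), {long_:.1f} (4000 games)")

    # A small edge: the drift grows linearly with the number of games.
    edge = mean(margins(4000, 0.11, 0.09, 300, seed=3))
    print(f"slight edge, mean W-L after 4000 games: {edge:.1f}")
```

For equal players the typical margin grows like the square root of the number of decisive games, while a small edge adds a drift that grows linearly, which is the effect the post describes.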

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 3:06 am

I agree that this can be true, but the tiny variance when we are right on the razor's edge can very easily be buried in noise.
If I have 100 data points, I will expect that 35 wins and 15 losses is important and indicates something about the difference between the two. But if I have 35 wins, 100 draws, and 15 losses they are now closer and I also expect more variance in the measurements. If I have 35 wins, 1000 draws and 15 losses, now they are very close together and I expect a lot of randomness. So many things are going against ignoring the draws as I see it.

If I know something is very, very close to the same strength, or if I know that something is incredibly mightier, that should affect my confidence about which one is stronger. That is why I do not think that throwing out the draws can be good, because it compresses the size of the apparent error and it also ignores that the real strength difference might be infinitesimal or gigantic.

I don't understand why these things do not bother other people
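
The three matches described in this post (35 wins and 15 losses with 0, 100 and 1000 draws) can be run through the usual formulas to see what each one does with the draws. This is a hedged sketch under stated assumptions: `elo_from_score` is the standard logistic conversion, the 95% interval is a plain normal approximation on the per-game score, `los` is the common erf approximation, and all function names are invented for the example.

```python
import math


def los(wins, losses):
    """Normal-approximation LOS; it only ever sees wins and losses."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


def elo_from_score(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def summarize(wins, draws, losses):
    """Score, Elo estimate, rough 95% Elo interval, and LOS for one match."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Sample variance of the per-game result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)
    return {
        "score": score,
        "elo": elo_from_score(score),
        "elo_low": elo_from_score(score - 1.96 * se),
        "elo_high": elo_from_score(score + 1.96 * se),
        "los": los(wins, losses),
    }


if __name__ == "__main__":
    for draws in (0, 100, 1000):
        s = summarize(35, draws, 15)
        print(f"+35 ={draws:<4} -15  score={s['score']:.3f}  "
              f"elo={s['elo']:6.1f} [{s['elo_low']:6.1f}, {s['elo_high']:6.1f}]  "
              f"los={s['los']:.4f}")
```

Under these assumptions the estimated Elo gap shrinks as draws are added and the Elo interval narrows with it, while `los` returns the same value for all three matches because the draw count is never passed to it.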

Milos · Post by **Milos** » Fri Jul 03, 2020 3:15 am

Dann Corbit wrote: ↑Thu Jul 02, 2020 11:56 pm The MT is the best PRNG for non-crypto work.
The reason is that almost all pseudo random number generators will fragment into planes when you have multiple dimensions. That is why it is not the basis PRNG for many programming languages.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC285899/

You quote a paper from 1968???? Seriously?
Are you even aware that author of the PRNG in the link I posted is the same as of that paper? What does that tell you? It took him almost 40 years to design an excellent PRNG.
Do you even know any state-of-the-art tests for PRNGs?
Seems not. Just go and check MT performance on BigCrush, it's laughable.
Programming languages use it, because they are 1) inert, 2) ppl that make those kind of libs for them are dinosaurs like yourself.
The only on-paper advantage of MT is a huge period. But considering how slow is MT (like a snail really) even if you ran it for 100 years you still wouldn't reach the period of any other modern PRNG.

syzygy · Post by **syzygy** » Fri Jul 03, 2020 3:27 am

Dann Corbit wrote: ↑Fri Jul 03, 2020 2:12 am Yes and no, in that you can call it 1/256, but the width of the error bars on the measurement should be much, much wider. In fact, so wide as to render the data meaningless.
So the number is the same. But the quality of the number is lower. Lower to the point of uselessness.

What do you mean error bar? Is 1/256 somehow more reliable if there were N=10^100 draws than if there were N=3 draws? Do you have any idea what it means you are saying?

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 3:40 am

syzygy wrote: ↑Fri Jul 03, 2020 3:27 am
Dann Corbit wrote: ↑Fri Jul 03, 2020 2:12 am Yes and no, in that you can call it 1/256, but the width of the error bars on the measurement should be much, much wider. In fact, so wide as to render the data meaningless.
So the number is the same. But the quality of the number is lower. Lower to the point of uselessness.
What do you mean error bar? Is 1/256 somehow more reliable if there were N=10^100 draws than if there were N=3 draws? Do you have any idea what it means you are saying?

It is not more reliable, it is less reliable.
The number itself, 1/256, is the same. Yes, I agree with you about that. But the data spread we expect to see when you have 10^100 draws is enormous. In fact, it is very shocking that we only have 8 wins. So shocking that it makes the data points look much more like sports.

If I flip a coin a billion times, I expect the ratio of heads to tails to be extremely close to half heads and half tails. But I will be quite surprised if I do not see a million or so in extra abundance of heads or tails. That would be accurate to one part in a thousand, which is fabulous. If we do not see some sort of spread like that, it makes me think that the billion draws are the real value and all the others are sports. And still leaves me wondering why there are not more sports.
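
For scale, the spread a fair coin is expected to show can be computed directly. This is a small sketch assuming independent fair flips; the function names are illustrative, and reading the 1/256 figure as "eight decisive games all going to one player" is an assumption based on the "8 wins" mentioned in this exchange.

```python
import math


def excess_sd(flips):
    """Standard deviation of (heads - tails) after `flips` fair, independent flips.

    Each flip contributes +1 or -1 with variance 1, so the variance of the sum is `flips`.
    """
    return math.sqrt(flips)


def prob_all_one_way(decisive_games):
    """Chance that one named player wins every decisive game if each is a fair coin flip."""
    return 0.5 ** decisive_games


if __name__ == "__main__":
    n = 1_000_000_000
    sd = excess_sd(n)
    print(f"sd of heads - tails after {n:,} flips: {sd:,.0f}")
    print(f"an excess of 1,000,000 would be {1_000_000 / sd:.1f} standard deviations")
    print(f"P(8 of 8 decisive games to one player) = {prob_all_one_way(8)}")
```

The standard deviation of the excess is the square root of the number of flips, so for a billion flips it is in the tens of thousands rather than the millions.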

Ovyron · Post by **Ovyron** » Fri Jul 03, 2020 7:02 am

Dann Corbit wrote: ↑Thu Jul 02, 2020 9:00 pm You may not trust my code. So write your own simulation. It should take less than half an hour.

Nope. Nobody needs to run any simulations, because we have data from the real world. This link might look familiar:

https://ccrl.chessdom.com/ccrl/4040/index.html

They basically tested all the engines they could get their hands on, threw their old versions and more out, sorted them by elo, and used LOS to tell us how sure they are that their ordering is right.

If LOS was flawed then you could go and pick a pair of engines from there, and point out how LOS's numbers are wrong, and why, and then we'd be able to understand what you're talking about.

Because, so far all I've seen are talks that include numbers of games that can't be actually reached in reality, with draw percentages that can't actually be reached in reality (look at that, with actual real entities draw rate never goes over 90% after enough games are played) and seeking for a tool that would detect when you clone an engine, make it play itself, and tell you there's no superiority, even though you already know there's no superiority, and you ignore all the other cases that would produce the same outcome where engines are different and the superiority predicted by LOS is correct.

The most productive thing you can do is telling us the outcome you want to get with some input from an engine match, that doesn't care about identical engines (there's other tools to detect identical engines and they don't require playing any game), and that predicts superiority better than LOS. Perhaps then after all these discussions someone will design it and it'll replace LOS because it can do what LOS can't do, and doesn't have its flaws.

Until then, my biggest problem is I can't see the problem.

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 7:18 am

You mean you didn't see all those places where an engine is ranked higher than the one under it and yet the LOS is below 50? There are lots of them. That means Elo calculation says it is stronger, but LOS says it is weaker.

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 7:30 am

Milos wrote: ↑Fri Jul 03, 2020 3:15 am
Dann Corbit wrote: ↑Thu Jul 02, 2020 11:56 pm The MT is the best PRNG for non-crypto work.
The reason is that almost all pseudo random number generators will fragment into planes when you have multiple dimensions. That is why it is not the basis PRNG for many programming languages.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC285899/
You quote a paper from 1968???? Seriously?
Are you even aware that author of the PRNG in the link I posted is the same as of that paper? What does that tell you? It took him almost 40 years to design an excellent PRNG.

I have exchanged emails with George Marsaglia many times. I don't know if he is still alive because it is a long time since I last communicated with him. At the time, he was the foremost expert in the world on PRNG and he invented most of the tests to expose problems in PRNGs.

Do you even know any state-of-the-art tests for PRNGs?

Sure, there is an American government agency that has a program suite you can use for that.

Seems not. Just go and check MT performance on BigCrush, it's laughable.
Programming languages use it, because they are 1) inert, 2) ppl that make those kind of libs for them are dinosaurs like yourself.
The only on-paper advantage of MT is a huge period. But considering how slow is MT (like a snail really) even if you ran it for 100 years you still wouldn't reach the period of any other modern PRNG.

It's plenty fast enough and does a wonderful job.
What alternative that does not have a patent or other intellectual property encumbrances would you suggest as a replacement?
