Throwing out draws to calculate Elo

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 11:56 pm

The MT is the best PRNG for non-crypto work.
The reason is that almost all pseudo-random number generators will fragment into planes when you have multiple dimensions. That is why it is now the basis PRNG for many programming languages.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC285899/
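To make "fragment into planes" concrete, here is a minimal sketch using RANDU, a classic linear congruential generator. RANDU is not mentioned in this thread; it is only a convenient illustration, and the function names are made up for the example. Every consecutive triple it produces satisfies one fixed linear relation modulo 2^31, which is why its points in three dimensions lie on a small number of planes.

```python
# RANDU: x_{n+1} = 65539 * x_n mod 2**31 (used here only as an illustration).
M = 2**31

def randu(seed, count):
    """Return `count` successive RANDU outputs starting from an odd seed."""
    values = []
    x = seed
    for _ in range(count):
        x = (65539 * x) % M
        values.append(x)
    return values

def triples_on_planes(values):
    """True if every consecutive triple satisfies
    x[k+2] = 6*x[k+1] - 9*x[k] (mod 2**31)."""
    return all(
        (values[k + 2] - 6 * values[k + 1] + 9 * values[k]) % M == 0
        for k in range(len(values) - 2)
    )

xs = randu(seed=1, count=10_000)
print(triples_on_planes(xs))
```

The relation follows from 65539 = 2^16 + 3, so 65539^2 is congruent to 6 * 65539 - 9 modulo 2^31.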

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 12:01 am

If I grab an entry from the database, chosen at random, how close is it to 0.5?

This data is from testing a perfectly fair penny flipper against itself, and so it demonstrates the noise in testing two engines of identical strength against each other. We are asking the question: "Will the LOS algorithm correctly diagnose the result as 'not superior' if the two engines have identical strength?"
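A minimal sketch of this kind of experiment follows. It assumes the commonly used normal approximation for LOS, which looks only at wins and losses; the thread does not show the exact formula or the database schema, so `los`, `play_match`, the draw rate, and the seed are all illustrative choices.

```python
import math
import random

def los(wins, losses):
    """Approximate likelihood of superiority from decisive games only."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games: no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

def play_match(games, draw_rate, rng):
    """Equal engines: each game is a draw with probability draw_rate,
    otherwise a fair coin flip decides the winner."""
    wins = draws = losses = 0
    for _ in range(games):
        if rng.random() < draw_rate:
            draws += 1
        elif rng.random() < 0.5:
            wins += 1
        else:
            losses += 1
    return wins, draws, losses

rng = random.Random(2020)  # arbitrary seed, for repeatability only
w, d, l = play_match(games=1000, draw_rate=0.6, rng=rng)
print(w, d, l, round(los(w, l), 3))
```

Repeating `play_match` many times and storing each LOS value gives a table like the one being queried here.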

MonteCarlo · Post by **MonteCarlo** » Fri Jul 03, 2020 12:37 am

Well, before we get wrapped up in the shifting goal posts, let's be clear about something. My previous post was directed towards the claim that the distribution of LOS scores changes (specifically, shifts towards values further from 0.5) as you play more games.

The posted query output shows that the distribution is (to some sensible level of precision) exactly as we'd expect. Are you really claiming that you just happened to post data from the one match length where the distribution is miraculously as we'd expect for matches between two exactly equal "engines"?

If, instead, that is just the distribution of LOS scores in matches between exactly equal "engines" (which, if you give it a little thought, isn't surprising since the metric is based on the likelihood of getting results as extreme as the observed results from two equal engines), then the answer to your new question will also, on average, be the same regardless of match length.

None of this is even a little surprising given what the metric is.
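The claim that the distribution does not depend on match length can be checked with a short simulation. This is only a sketch under assumptions: it uses the usual normal-approximation LOS formula (not confirmed by the thread as the one behind the posted query), simulates decisive games only, and the interval [0.4, 0.6] is an arbitrary stand-in for "near 0.5". Under that approximation LOS is close to uniformly distributed for equal engines, so roughly a fifth of matches should land in the interval whatever the length.

```python
import math
import random

def los(wins, losses):
    """Approximate likelihood of superiority from decisive games only."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

def fraction_near_half(decisive_games, trials=4000, low=0.4, high=0.6, seed=1):
    """Share of simulated equal-strength matches whose LOS lands in [low, high]."""
    rng = random.Random(seed)
    near = 0
    for _ in range(trials):
        # Each random bit is one decisive game; a set bit is a win for A.
        wins = bin(rng.getrandbits(decisive_games)).count("1")
        losses = decisive_games - wins
        if low <= los(wins, losses) <= high:
            near += 1
    return near / trials

for n in (100, 1_000, 10_000):
    print(n, fraction_near_half(n))
```

The printed shares stay in the same neighbourhood as the match gets longer; they do not climb towards 1.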

This is all a bit like abusing p-values, and then deciding they're completely useless measurements because you realize that you can't actually infer anything about the probability of your alternate hypothesis from p-values alone.

My main interest in all of this (after the first couple pages) was to see if your data actually did indicate the things you claimed. Now that I see it doesn't, my interest is a bit diminished.

You clearly don't like some of the properties of LOS. That's fine. There's a rather large difference between your disliking that you can't use a measurement in some way (here that the probability of a measured LOS "near" 0.5 doesn't approach 100% for matches of increasing length between exactly equal engines), and claiming that it is fundamentally flawed and that people who disagree with you aren't thinking people.

Cheers!

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 12:43 am

Yes, so far we have not even addressed my difficulty with throwing out the draw information.
I do think that the fact that LOS has enormous difficulty determining that two engines of equal strength are equal and the fact that it is based on the spread between wins and losses should give us pause when applying it to very long matches. And I also doubt the accuracy when based on a very short sequence of games (but of course this is not different from any other statistical measure).

At some point, I think it makes sense to address why throwing out an enormous number of data points that describe equality and consuming a tiny number of points that describe inequality makes sense.

Pio · Post by **Pio** » Fri Jul 03, 2020 1:11 am

Hi Dann!

The thing is that the enormous number of data points that are draws does not say anything about which of the players is better. And if it says anything, it actually says the opposite of what your intuition is telling you. As I have said in some other posts, having lots of draws actually makes the wins count more if you think that not all draws are equal, i.e. that some draws are closer to wins than others. The reason is that if you assume that the player with no wins is the better one, it would be highly unlikely for him to have played so many draws without a single win.

/Pio
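One way to see why the draw count can drop out is a simple Bayesian sketch. It assumes a trinomial model (fixed win, draw, and loss probabilities per game) with a uniform prior over those probabilities; that model is an assumption made for illustration, not something established in this thread, and the function names are invented. Under it, the posterior probability that A's win rate exceeds B's depends only on wins and losses.

```python
import random
from math import comb

def posterior_a_better_exact(wins, losses):
    """P(p_win > p_loss | data) under a uniform prior; draws do not appear."""
    n = wins + losses + 1
    return sum(comb(n, k) for k in range(wins + 1)) / 2**n

def posterior_a_better_sampled(wins, draws, losses, samples=20_000, seed=0):
    """Same probability, estimated by sampling the Dirichlet posterior."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Normalised gamma variates give one Dirichlet(w+1, d+1, l+1) sample.
        g_win = rng.gammavariate(wins + 1, 1.0)
        g_draw = rng.gammavariate(draws + 1, 1.0)
        g_loss = rng.gammavariate(losses + 1, 1.0)
        total = g_win + g_draw + g_loss
        p_win, p_loss = g_win / total, g_loss / total
        hits += p_win > p_loss
    return hits / samples

print(posterior_a_better_exact(12, 8))
print(posterior_a_better_sampled(12, 0, 8))
print(posterior_a_better_sampled(12, 1000, 8))
```

With 12 wins and 8 losses, the sampled values for 0 draws and for 1000 draws should both sit close to the exact figure. Whether the trinomial model itself is adequate is a separate question.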

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 1:16 am

This is a really interesting idea and I will have to think long and hard about it.

My difficulty is this:
If my opponent is better than me, it is difficult even to achieve a draw, especially if he/she/they/it is a lot better than me.
And so, if I see one hundred draws, that seems to be sending me a big signal of "Equality, equality, equality...", and collecting an enormous amount of this kind of data indicates equality to me. On the other hand, I must admit that the Monty Hall problem and the birthday paradox were hard for me to understand until I properly understood the math behind them. But I cannot accept it until I understand it. If I am wrong, I do hope to somehow understand why the equality data does not matter. My problem is that I suspect the model (not the math). So even though the math works, I do not feel convinced that it is right.

syzygy · Post by **syzygy** » Fri Jul 03, 2020 1:42 am

Dann Corbit wrote: ↑Thu Jul 02, 2020 7:19 am
syzygy wrote: ↑Thu Jul 02, 2020 2:40 am Instead of looking at LOS you could instead test the hypothesis that engines A and B are equal in strength.

We run a match until we have 8 decided games.
The match results in N games, i.e. N-8 drawn games and 8 decided games. It turns out that A has won all decided games.

What are the chances of this if A and B are indeed equal in strength?
Clearly, it is 1 in 256. This strongly suggests that the hypothesis that A and B are equal in strength is not correct.

Do you agree? (I would hope you do.)

Does any of this depend on the value of N?
I partly agree with it.
A single experiment has shown a data point that indicates a 1/256 chance that they are equal in strength.
If you run a coin toss tool, you will see that outcome with a fair penny one time out of 256.
But if I run the test again, it may be the same or it may be different.
With 8 data points, the error bar is as big as the figure returned.

There is no error bar in 1/256.

But I will repeat my question, which you did not answer: does any of this depend on N? On the number of draws?
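The 1/256 figure can be worked through for different draw rates. The sketch below assumes independent games in which two equal engines each win with probability (1 - draw_rate) / 2; `prob_a_sweeps` is a name invented for the example. Given that a game is decided, A wins it with probability 1/2 whatever the draw rate, so the draw rate cancels.

```python
from fractions import Fraction

def prob_a_sweeps(draw_rate, decided=8):
    """P(A wins every decided game) when A and B are equally strong.
    draw_rate must be below 1 so that decided games can occur."""
    p_win = (1 - draw_rate) / 2    # A wins a given game
    p_loss = (1 - draw_rate) / 2   # B wins a given game
    p_a_given_decided = p_win / (p_win + p_loss)
    return p_a_given_decided ** decided

for draw_rate in (Fraction(0), Fraction(1, 2), Fraction(99, 100)):
    print(draw_rate, prob_a_sweeps(draw_rate))
```

A higher draw rate makes the match longer (larger N) before 8 decided games are reached, but under these assumptions it leaves the probability of the sweep unchanged.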

syzygy · Post by **syzygy** » Fri Jul 03, 2020 1:48 am

Dann Corbit wrote: ↑Thu Jul 02, 2020 6:59 am
syzygy wrote: ↑Thu Jul 02, 2020 2:30 am
Dann Corbit wrote: ↑Thu Jul 02, 2020 1:42 am
syzygy wrote: ↑Wed Jul 01, 2020 11:42 pm
Dann Corbit wrote: ↑Wed Jul 01, 2020 2:08 am Nice discussion Ovyron, but I don't think anyone understands what I am saying (probably because I am not communicating very effectively). Lots of intelligent people do not understand what I am saying, which means I am not doing a good job explaining.
No, you are simply making the mistake of thinking that higher LOS means higher difference in strength, and being rather stubborn.
No, I think it means that it is supposed to be more likely that the engine with the bigger LOS is superior.

A LOS of 1 means it is absolutely certain to be superior.
A LOS of .999 means it is almost certainly superior
A LOS of 0.5 means that it is a coin toss if it is superior or not
And if engine A draws engine B 99.99999999% of the time and beats engine B the remaining 0.00000001% of the time, would you agree that A is superior?

If these numbers can be established with 100% certainty, would you agree that the LOS is 1?
If you run the experiment twice, that is not enough.
You seem to think that an engine emitting a win is deterministic. It is not.

I mean exactly what I write. I don't care whether my hypothesis applies to your two favorite engines. I am just trying to make it understandable that one engine can be clearly superior to another engine, yet almost equal in strength.

Maybe you are now admitting that this is possible, but in the beginning of this thread you certainly were heavily denying that that made sense.

So maybe there is progress?
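A small numeric sketch of "clearly superior, yet almost equal in strength". The match counts are hypothetical, the Elo difference uses the standard logistic formula on the overall score, and LOS uses the usual normal approximation on decisive games; none of these specifics come from the thread.

```python
import math

def elo_diff(wins, draws, losses):
    """Elo difference implied by the overall score (draws count as half).
    Assumes the score is strictly between 0 and 1."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / score - 1.0)

def los(wins, losses):
    """Approximate likelihood of superiority from decisive games only."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# Hypothetical match: a million draws, 100 wins for A, no losses.
wins, draws, losses = 100, 1_000_000, 0
print(elo_diff(wins, draws, losses))
print(los(wins, losses))
```

For these counts the Elo difference is a small fraction of one point, while the LOS is indistinguishable from 1: the two numbers answer different questions.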

Dann Corbit · Post by **Dann Corbit** » Fri Jul 03, 2020 2:05 am

syzygy wrote: ↑Fri Jul 03, 2020 1:42 am
There is no error bar in 1/256.

But I will repeat my question, which you did not answer: does any of this depend on N? On the number of draws?

The error from one million data points is not the same as the error from ten data points.
If I have 10 data points, I will be surprised very much by three sports.
If I have one million data points, I will be surprised if there are not three sports.

syzygy · Post by **syzygy** » Fri Jul 03, 2020 2:07 am

So I repeat my question again. Does the outcome 1/256 depend on the value of N? On the number of draws?
