Throwing out draws to calculate Elo

Dann Corbit · Post by **Dann Corbit** » Mon Jul 06, 2020 3:24 am

That should be 10,000 games, not 100,000, because sqrt(10,000) = 100.

Ovyron · Post by **Ovyron** » Mon Jul 06, 2020 9:11 am

Dann Corbit wrote: ↑Sun Jul 05, 2020 9:05 pm I think maybe it is a good time to retire the discussion.

No! Don't give up! You don't need to convince anyone that you're right, only to have them see it from your perspective. I can't see it from your perspective at all, but perhaps it's because I lack vision.

What we need to do is find an extreme case where we agree. If we can't find one, then I won't be able to walk in your choices, but if there is one, perhaps we can understand each other's view.

First, there are engine A and engine B, and as a gedankenexperiment they play 830000 games. The result:

Engine A performance:

140000 Wins - 90000 Losses - 600000 Draws

Elo says Engine A has +21 Elo.
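For reference, here is a minimal sketch of how the +21 figure can be reproduced. It assumes the usual logistic Elo model in which a draw counts as half a point; the function name `elo_diff` is made up for illustration.

```python
import math

def elo_diff(wins, losses, draws):
    """Elo difference implied by a match score under the logistic model.

    Assumes a draw counts as half a point and that the score is strictly
    between 0 and 1 (otherwise the logarithm is undefined).
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))

# Engine A's result from the thought experiment above.
print(round(elo_diff(140000, 90000, 600000)))  # 21
```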

LOS says Engine A is more likely superior.

Are you okay with this LOS?

If you're not okay with that LOS I give up, we can't ever agree on these issues.

But if you do, then I reveal that Engine A and Engine B are the same! 600000 games were drawn and 230000 were decided. The decided games were assigned at random because when an engine plays itself it doesn't matter what side counts the wins.

So LOS was wrong, because engines are the same, it should have shown a LOS of 0.5. But it doesn't have a way to tell the difference between this case and the case where Engine A and Engine B are actually different and got this result.
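To put numbers on this thought experiment, here is a rough sketch. It assumes the common normal-approximation formula LOS = Φ((W − L) / sqrt(W + L)), which uses only decisive games, and it assigns 230000 decided games by fair coin flip, which is what the "assigned at random" premise amounts to. The helper names `los` and `random_split` are hypothetical.

```python
import math
import random

def los(wins, losses):
    """Normal-approximation LOS from decisive games only (assumed formula)."""
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def random_split(decided, seed=0):
    """Assign each decided game to one side by a fair coin flip."""
    rng = random.Random(seed)
    # Each of the `decided` random bits is one game; a 1 bit is a win for A.
    wins = bin(rng.getrandbits(decided)).count("1")
    return wins, decided - wins

# The split quoted in the post: 140000 wins vs 90000 losses.
z = (140000 - 90000) / math.sqrt(230000)
print(f"z = {z:.1f}, LOS = {los(140000, 90000):.6f}")

# What a coin-flip assignment of the same 230000 games looks like.
wins, losses = random_split(230000)
print(wins, losses, f"LOS = {los(wins, losses):.3f}")
```

Under these assumptions the standard deviation of W − L for identical engines is sqrt(230000), about 480, so a coin-flip assignment should land within a couple of thousand games of an even split. The 140000 to 90000 split in the post sits roughly 104 standard deviations out.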

I hope this illustrates how you sound to me.

hgm · Post by **hgm** » Mon Jul 06, 2020 9:22 am

Dann Corbit wrote: ↑Mon Jul 06, 2020 1:53 am Your conjecture is correct. The question is, will the actual data points be buried by the random error that is inherent in the technique? If you look at the data provided, the randomness envelope may or may not overwhelm the actual data. The more observations you have, the less reliable the data becomes, which is the opposite of how most statistics work. So you have a sort of irony. If you have only a few observations, it is less likely that the data was buried under random noise. But at the same time, since you only have a few observations the data is less reliable for that reason.

Now, I also agree that as superiority becomes more and more pronounced, it is more and more likely to show up "piercing the cloud of randomness" as it were and at some point it would take a huge number of observations to bury it (perhaps billions). But at the same time, if there really is an enormous difference in strength, Elo will already have told us this and we will not need any tiebreaker to decide which engine is stronger.

Meaningless babble, like all the other postings since you started talking to yourself. Whether or not the randomness overwhelms the measurement is answered by the error bars, and that is well understood. The more observations you have, the more reliable the data becomes. I guess you confuse that with "The more I cheat, the less reliable the data becomes". So your following 'conclusions' are complete bullshit.

And you keep lying about Elo; it would of course have told us which one is stronger in exactly the same way as LOS does.

Michel · Post by **Michel** » Mon Jul 06, 2020 10:35 am

This whole discussion is very strange. Lots of useless words for things which are well understood and well known.

LOS is a Bayesian concept. Hence it is not an empirical probability but a degree of belief. Formally LOS is the probability that one engine is stronger than another assuming a uniform prior. A uniform prior is of course a complete fiction. If LOS was truly what people think it is (the empirical probability that engine A is stronger than engine B) then one could stop any test whenever LOS>95% and by now everyone knows that this does not work.
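The point about stopping whenever LOS>95% can be checked with a small simulation. The sketch below assumes two identical engines, counts only decisive games (each one a fair coin flip), and uses the normal approximation for LOS. The function names, the 1000-game cap, and the 10-game minimum before the first check are arbitrary choices for illustration.

```python
import math
import random

def los_normal(wins, losses):
    """Normal-approximation LOS from decisive games only (assumed formula)."""
    n = wins + losses
    if n == 0:
        return 0.5
    z = (wins - losses) / math.sqrt(n)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def false_positive_rates(trials=2000, max_decisive=1000, min_decisive=10,
                         threshold=0.95, seed=1):
    """Compare 'stop as soon as LOS > threshold' with a fixed-length test.

    Both engines are identical, so every 'A is stronger' verdict is false.
    Returns (rate with peeking after every game, rate at fixed length).
    """
    rng = random.Random(seed)
    stopped_early = 0
    passed_fixed = 0
    for _ in range(trials):
        wins = losses = 0
        crossed = False
        for n in range(1, max_decisive + 1):
            if rng.random() < 0.5:
                wins += 1
            else:
                losses += 1
            # Peeking rule: would we have stopped here and declared A stronger?
            if (not crossed and n >= min_decisive
                    and los_normal(wins, losses) > threshold):
                crossed = True
        stopped_early += crossed
        # Fixed-length rule: look only once, after the last game.
        passed_fixed += los_normal(wins, losses) > threshold
    return stopped_early / trials, passed_fixed / trials

if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"false positives with peeking: {peeking:.3f}, fixed length: {fixed:.3f}")
```

With a single look at the end, the false-positive rate stays near the nominal 5%. With a look after every game it comes out several times higher, which is the sense in which LOS cannot be read as an empirical probability.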

A more useful concept than LOS is p-value. The p-value of a test is the probability of a more extreme result than the observation, assuming the null hypothesis (equality of Elo) is true. If the null hypothesis is true then the p-value is indeed uniformly distributed. For fixed length tests the p-value happens to coincide with LOS but this is far from true for sequential tests.
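As a sketch of how the two quantities relate for a fixed-length test, the code below computes LOS under a uniform prior and a one-sided p-value from the same win and loss counts. It assumes draws are ignored, that the prior is uniform on the probability p of A winning a decisive game (giving a Beta(W+1, L+1) posterior), and that the p-value counts results at least as extreme as the observed one. All function names are hypothetical, and the exact sums are only practical for small counts.

```python
import math

def los_uniform_prior(wins, losses):
    """P(p > 1/2) under a Beta(wins + 1, losses + 1) posterior.

    Uses the identity between the regularized incomplete beta function and
    a binomial tail, so only integer arithmetic is needed.
    """
    n = wins + losses + 1
    return sum(math.comb(n, k) for k in range(wins + 1)) / 2 ** n

def p_value_one_sided(wins, losses):
    """P(at least `wins` wins in wins + losses fair coin flips)."""
    n = wins + losses
    return sum(math.comb(n, k) for k in range(wins, n + 1)) / 2 ** n

def los_normal(wins, losses):
    """Normal approximation Phi((W - L) / sqrt(W + L))."""
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

w, l = 60, 40
print("LOS (uniform prior):", los_uniform_prior(w, l))
print("LOS (normal approx):", los_normal(w, l))
print("1 - p-value        :", 1.0 - p_value_one_sided(w, l))
```

For a fixed number of games the three numbers agree closely, which is the coincidence referred to above. The p-value calculation changes once the stopping rule depends on the data, while the LOS formula does not.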

Dann Corbit · Post by **Dann Corbit** » Mon Jul 06, 2020 11:37 am

Michel wrote: ↑Mon Jul 06, 2020 10:35 am This whole discussion is very strange. Lots of useless words for things which are well understood and well known.

LOS is a Bayesian concept. Hence it is not an empirical probability but a degree of belief. Formally LOS is the probability that one engine is stronger than another assuming a uniform prior. A uniform prior is of course a complete fiction. If LOS was truly what people think it is (the empirical probability that engine A is stronger than engine B) then one could stop any test whenever LOS>95% and by now everyone knows that this does not work.

A more useful concept than LOS is p-value. The p-value of a test is the probability of a more extreme result than the observation, assuming the null hypothesis (equality of Elo) is true. If the null hypothesis is true then the p-value is indeed uniformly distributed. For fixed length tests the p-value happens to coincide with LOS but this is far from true for sequential tests.

I disagree that LOS is Bayesian (in its entirety). I consider my philosophy Bayesian because we should adjust our outlook based on new information. LOS ignores draws, which are important information about strength, as evidenced by the Elo calculation. On the other hand, the math itself is not illogical.

Dann Corbit · Post by **Dann Corbit** » Mon Jul 06, 2020 11:40 am

On the other hand, the p-value explanation is interesting. I am not sure that I understand the nuance of what you are saying, so I will have to think about it.

The most important thing is to choose the tools that work, and then use them to do the job. To know the strengths and weaknesses of the tools and to understand when they help you and when they let you down.

Ovyron · Post by **Ovyron** » Mon Jul 06, 2020 11:47 am

Dann Corbit wrote: ↑Mon Jul 06, 2020 11:40 am The most important thing is to choose the tools that work, and then use them to do the job.

LOS works and does the job for what it does. I still don't know what you want it to do (something about identical engines that you know a priori are the same, so you don't need LOS for the job?)

Dann Corbit · Post by **Dann Corbit** » Mon Jul 06, 2020 11:55 am

I demonstrated with concrete values that it really does not work.

If you were to tune an engine using LOS, it would not work correctly, unlike Elo or other tests that are not destroyed by the ordinary properties of randomness.

I do not believe that in the general case it provides any useful data beyond what Elo gives.
If we need a tiebreaker, I also distrust the standard tiebreakers, because I do not understand their mathematical basis, but they might be better.

I suggest as an alternative the fast strength calculation developed recently by Ed Schroder and Chris Whittington (although it is much more expensive to calculate than LOS, it is still much easier than normal Elo).

I think that the real way to decide the right path is empirically, by measurement. Try some models and find out what works best.

I don't think LOS is used for programming decisions, since the Fishtest builds always report Elo differences and say nothing about LOS, but I could be wrong about that.

I find the strong support for the status quo a little strange when it comes to LOS, not just because I don't think it works well, but because really smart people say strange things about it. I don't understand that.

hgm · Post by **hgm** » Mon Jul 06, 2020 12:48 pm

Dann Corbit wrote: ↑Mon Jul 06, 2020 11:55 am I demonstrated with concrete values that it really does not work.

The same lie again. You demonstrated in fact the opposite.

Dann Corbit wrote: If you were to tune an engine using LOS, it would not work correctly, unlike Elo or other tests that are not destroyed by the ordinary properties of randomness.

Lie!

Dann Corbit wrote: I do not believe that in the general case it provides any useful data beyond what Elo gives.

Only true if you would know all the covariances in the Elo calculations.

Dann Corbit wrote: I don't think LOS is used for programming decisions, since the Fishtest builds always report Elo differences and say nothing about LOS, but I could be wrong about that.

As they are the same thing, you are wrong about that by definition.

Dann Corbit wrote: I find the strong support for the status quo a little strange when it comes to LOS, not just because I don't think it works well, but because really smart people say strange things about it. I don't understand that.

The only person that says anything strange here is you. What all others say is consistent and obvious.

Dann Corbit · Post by **Dann Corbit** » Mon Jul 06, 2020 12:55 pm

The only one that really puzzles me is hgm. I mean, you are a physicist. The Heisenberg uncertainty principle, the laws of thermodynamics, quantum physics, Schrödinger's cat. If there is any person on this board who should understand uncertainty, it is hgm, but you have no understanding of uncertainty whatsoever, or pretend not to have it.

Anyone who says that 8 values out of ten to the one hundredth power are significant is either delusional or deliberately dense. I don't see you as either, so I do not understand your responses.

But hey, to each his own.
