TalkChess.com

Posted: **Mon Jun 29, 2020 10:57 pm**

What state of affairs is most likely will obviously depend on your prior for the distribution of strength differences (even winning 30/30 games in a match wouldn't allow a strong inference to superiority of the winning engine if your prior was that 9,000,000,000,000,000,000 out of 9,000,000,000,000,000,001 engines were exactly the same strength).

Engine A's winning every decisive game against engine B is always more likely when engine A is stronger than it is when engine A and engine B are equal.

A result where engine A wins every decisive game against engine B, then, will only fail to make it more likely that engine A is stronger than engine B if your prior makes pairings of equal engines sufficiently more likely than pairings of unequal engines (sufficiently more likely to offset the improbability of a weaker engine winning all X decisive games, which gets more difficult the higher X is), and this is true no matter the number of draws.
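To make the role of the prior concrete, here is a minimal sketch of the update described above. It assumes a toy two-hypothesis model that is not from the thread: the engines are either equal (A wins each decisive game with probability 0.5) or A is stronger (A wins each decisive game with a hypothetical probability `p_win`). The function name and the value 0.6 are illustrative only, and the "A is weaker" case is ignored for simplicity.

```python
def posterior_stronger(x, prior_stronger, p_win=0.6):
    """Posterior P(A is stronger) after A wins all x decisive games.

    Toy two-hypothesis model (an assumption, not from the thread):
      equal    -> A wins each decisive game with probability 0.5
      stronger -> A wins each decisive game with probability p_win (> 0.5)
    Draws do not appear anywhere in the calculation.
    """
    like_equal = 0.5 ** x
    like_stronger = p_win ** x
    evidence = prior_stronger * like_stronger + (1 - prior_stronger) * like_equal
    return prior_stronger * like_stronger / evidence


# Open-minded prior: 30 straight decisive wins are very convincing.
print(posterior_stronger(30, prior_stronger=0.5))

# Extreme prior: only 1 pairing in 9,000,000,000,000,000,001 is unequal.
print(posterior_stronger(30, prior_stronger=1 / 9_000_000_000_000_000_001))
```

Under these assumptions the likelihood ratio grows geometrically with the number of decisive games, so only a prior that is lopsided enough toward equality keeps the posterior small.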

It's not like LOS and WILO are "wrong" because of this and some other measurement like Elo, Glicko, Ordo, etc. is "right". They just measure different things and have different properties we might think are more or less useful in some circumstances.

Cheers!

Posted: **Mon Jun 29, 2020 11:35 pm**

MonteCarlo wrote: ↑Mon Jun 29, 2020 10:57 pm

> ... (sufficiently more likely to offset the improbability of a weaker engine winning all X decisive games, which gets more difficult the higher X is) ...

This should actually be:

(sufficiently more likely to offset the improbability of an equal engine winning all X decisive games, which gets more difficult the higher X is)

This is what I get for rewording things constantly and ending with a completely different sentence than I started with.

Posted: **Tue Jun 30, 2020 1:44 am**

Whether it is a chess engine, Tic-Tac-Toe, Connect 5, or backgammon is irrelevant.

The zero draws are due to the imaginary engine drawing only when the coin lands on the edge (exactly equal to 0.5).

Here is the point:
There is a big spread in possible outcomes.
That particular code (when compiled with gcc) produces exactly two contests out of 1000 that end tied at 50,000 to 50,000.

So the other 998 contests would tell us that one of the "engines" was superior to the other.
Of course we know that the engines are exactly equal, simply using the top half or the bottom half of the interval (0,1) from a prng.

Here is an experiment where we know the engines are exactly equal, and it generates one thousand contests of 100,000 games each.

Even after a huge number of games, we do not have a certain outcome, and a "result" that says Engine A is stronger or Engine B is stronger is simply wrong. Furthermore, we know that it is wrong, since the "engine" is nothing more than a totally fair coin flipper.

If a methodology is run against a control with a known outcome and produces a wrong result, then what does that tell us about the methodology?

I agree that if there were no such thing as randomness, the "toss out draws" approach would work fine for LOS and Elo. But in real life, chess engines, penny flips, and turning light bulbs on and off all exhibit random behavior to some degree or another. A model that does not take this into account can produce absurd results like "Heads is Stronger Than Tails".
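For readers who want to reproduce the kind of experiment described here, the following is a minimal Python sketch. It is not the program posted in the thread (that one was compiled with gcc and is not shown here); the function name, the seed, and the use of `getrandbits` to flip all coins of a contest at once are assumptions made for illustration.

```python
import random


def simulate_contests(contests=1000, games=100_000, seed=1):
    """Play `contests` matches between two exactly equal coin-flip "engines".

    Each game is one fair bit: 1 means engine A wins, 0 means engine B wins.
    There are no draws in this sketch. Returns the list of margins
    (A wins minus B wins), one per contest.
    """
    rng = random.Random(seed)
    margins = []
    for _ in range(contests):
        # One random integer with `games` bits; counting its 1-bits is
        # equivalent to counting heads in `games` fair coin flips.
        a_wins = bin(rng.getrandbits(games)).count("1")
        margins.append(a_wins - (games - a_wins))
    return margins


margins = simulate_contests()
ties = sum(m == 0 for m in margins)
print("exact ties:", ties, "of", len(margins))
print("largest margin in either direction:", max(abs(m) for m in margins))
```

For a fair coin the chance of an exact tie over 100,000 games is roughly sqrt(2 / (pi * 100000)), about 0.25%, so seeing only two or three ties per 1000 contests is what one would expect.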

Posted: **Tue Jun 30, 2020 2:19 am**

As far as I'm aware no one's claimed that we infer anything with 100% certainty from any results.

Any rating system at all, accounting for draws or not, will sometimes give a higher rating to a weaker or equal engine if the right improbable results occur, so pointing out that this will sometimes happen with LOS or WILO isn't much of a reason to forever ignore them.

That equal engines can occasionally have a result where every decisive game is won by the same engine doesn't change the fact that the same result is more likely to occur if the winning engine is actually stronger.

As I mentioned before, unless your prior tells you that equal pairings are sufficiently more likely than unequal pairings (which is the case in your simulation; you already know that everything is equal by definition, so there's no point to inferring anything from the results), such a result (one side winning every decisive game) makes it more likely that it's the stronger engine than that it's equal or weaker.

Obviously we don't conclude that the winning engine is stronger with 100% certainty, and our confidence will depend both on our prior and on how many decisive games there were, but these sorts of considerations apply similarly to any rating/ranking metric, not just LOS or WILO.

Posted: **Tue Jun 30, 2020 2:49 am**

"Obviously we don't conclude that the winning engine is stronger with 100% certainty, and our confidence will depend both on our prior and on how many decisive games there were, but these sorts of considerations apply similarly to any rating/ranking metric, not just LOS or WILO."

It is mainly this part that I take issue with, because the spread of random decisive games above and below the centroid widens with game count. Run my program with 100, 1000, and 10,000 games per contest. You will see that the band gets smaller as a percentage, but the absolute number of games that varies grows larger and larger with contest size.
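The scaling described here can be checked without running a simulation. The sketch below assumes a fair coin with no draws, so the win count is binomial with standard deviation sqrt(N)/2; the function name is illustrative.

```python
import math


def spread(games):
    """Standard deviation of the win count for a fair coin over `games` flips.

    Returns (sigma in games, sigma as a fraction of all games).
    """
    sigma_games = math.sqrt(games) / 2
    return sigma_games, sigma_games / games


for n in (100, 1000, 10_000):
    sigma_games, sigma_fraction = spread(n)
    print(f"{n:>6} games: sigma = {sigma_games:7.2f} games ({sigma_fraction:.2%})")
```

The absolute band grows like sqrt(N) while the relative band shrinks like 1/sqrt(N), which matches the observation that the band narrows as a percentage but widens in number of games.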

So, if I run a googol of games, and there are only 8 sports, those really are sports. And it does not take a googol of games. I used that as an absurd value so that it would be obvious that we cannot simply throw out draws as the number of games grows larger and larger.

It may be a pretty fair approximation for a normal band of some size, but if there are a huge number of games (most of which are draws) and we throw out the draws, we will increasingly be claiming that randomness is decisive, and it isn't.

Posted: **Tue Jun 30, 2020 10:53 am**

Dann Corbit wrote: ↑Tue Jun 30, 2020 1:44 am

> Whether it is a chess engine, Tic-Tac-Toe, Connect 5, or backgammon is irrelevant.

I think it should be. I mentioned Tic-Tac-Toe because it is entirely realistic to expect an engine to play it perfectly even today. So any loss, even once in a billion games, would be very significant, and would point to some technical flaw, be it software or hardware. While the engine that never loses behaves as expected. It seems entirely reasonable in such a case to consider the engine with the demonstrated flaw inferior.

> The zero draws are due to the imaginary engine drawing only when the coin lands on the edge (exactly equal to 0.5).
>
> Here is the point:
> There is a big spread in possible outcomes.
> That particular code (when compiled with gcc) produces exactly two contests out of 1000 that end tied at 50,000 to 50,000.
>
> So the other 998 contests would tell us that one of the "engines" was superior to the other.

No; that is pure abuse of statistical data. In most of those cases it would only tell you that one of the "engines" has been luckier than the other.

> Of course we know that the engines are exactly equal, simply using the top half or the bottom half of the interval (0,1) from a prng.
>
> Here is an experiment where we know the engines are exactly equal, and it generates one thousand contests of 100,000 games each.
>
> Even after a huge number of games, we do not have a certain outcome, and a "result" that says Engine A is stronger or Engine B is stronger is simply wrong. Furthermore, we know that it is wrong, since the "engine" is nothing more than a totally fair coin flipper.
>
> If a methodology is run against a control with a known outcome and produces a wrong result, then what does that tell us about the methodology?

That the methodology sucks. But it is YOUR methodology that sucks, and this methodology is only a caricature of what statistical analysis is about. The point is that if you really did these coin-flip experiments, the results would have a normal (Gaussian) distribution with a certain (known) spread sigma, and 68% of the match results would lie within 50% +/- sigma, and 95% within 50% +/- 2*sigma. You would hardly be able to derive even a hint of which "engine" was stronger in most (i.e. those 68%) of the cases. Only when a result deviates from 50% by more than 2*sigma does it become unlikely to be just luck, and you would get a reasonable confidence that the winning engine indeed must be superior. But that confidence will not be 100%; there will still be some small probability that it was only luck after all, a statistical fluke.
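As a rough sketch of the sigma test described above, the snippet below expresses a match result as a number of standard deviations from 50% and converts it to the probability of a deviation at least that large from a fair coin. It assumes no draws and uses the normal approximation; the function names are illustrative.

```python
import math


def z_score(wins, games):
    """Deviation of `wins` from the fair-coin expectation, in units of sigma."""
    sigma = math.sqrt(games) / 2  # standard deviation of the win count
    return (wins - games / 2) / sigma


def two_sided_p(z):
    """Probability that a fair coin deviates by at least |z| sigma, either way."""
    return math.erfc(abs(z) / math.sqrt(2))


z = z_score(50_300, 100_000)  # a 50,300 to 49,700 result
print(f"z = {z:.2f}, chance of a deviation this large by luck = {two_sided_p(z):.3f}")
```

A 50,300 to 49,700 result is just under 2 sigma, so on its own it is only weak evidence that the winner is superior.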

Posted: **Tue Jun 30, 2020 12:10 pm**

Hello:

The well-known problem of checking whether a coin is fair might be on topic here, reinforcing HGM's point of view about sigma. I posted a similar thought about the binomial distribution in a recent topic in the General Topics section, which is somewhat related to this thread.

I see it in the following way: after many matches (not just one) with that insane number of games per match, if the same engine keeps winning match after match (even by such narrow margins), my guess is that it should be better than the other engine, even if it takes forever to prove it.

------------

According to this 2009 post by Rémi, draws are not needed to calculate LOS:
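The linked post is not reproduced here. As a hedged illustration only, the approximation commonly quoted in the computer chess community computes LOS from wins and losses alone, using the error function. The function name and the handling of the no-decisive-games case are assumptions of this sketch.

```python
import math


def los(wins, losses):
    """Likelihood of superiority from decisive games only (normal approximation).

    Draws are not an input: adding draws leaves the result unchanged.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # assumption: no decisive games means no evidence either way
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * decisive)))


print(los(8, 0))      # eight wins and no losses
print(los(450, 50))   # the 450 to 50 example from this thread
print(los(10, 10))    # equal score
```

Because the draw count never enters the formula, a match with 450 wins, 50 losses, and any number of draws yields the same LOS.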

------------

Talking about WiLo: apart from the 2017 thread that Richard helpfully brought up in the second post of this thread (which also features Taylor series for small differences), there is another thread from 2013 where Kai and I investigated the win/loss ratio, which is the basis of WiLo:

Scaling at 2x nodes (or doubling time control).

Regards from Spain.

Ajedrecista.

Posted: **Tue Jun 30, 2020 1:18 pm**

There may be good math involved in the formulas.
But apply the formulas to the outcomes of the engine I provided.
If it is compiled with g++ and the width is 1000 games per trial, then exactly two times the LOS algorithm will say that the engines have the same strength. For all the other trials, the formula will say that there are varying probabilities that one engine is better than the other, all the way up to quite large probabilities. The problem is that the engines are never of different strength, even though the model is predicting that they are of different strength.

If I do ten games/flips/trials, I have a small number of outcomes. As I increase the number of games, the number of possible "bad runs" increases. There is always going to be statistical noise in the data. To assume that a very small number of outcomes in a huge, lengthy trial are due to superiority rather than random variation does not make sense to me.

If I play Engine A against Engine B for 500 games and Engine A wins 450 out of 500, it is clearly much stronger. If I play a googol games and engine A wins 450 and B wins 50, then they are the same strength. Throwing out the draws cannot be correct. If the math says it is correct, then there is an error in the math or an error in the model.

With the 500 games, engine A would be dominantly stronger and I would expect watching a match that Engine A would win almost every game.

Wouldn't you expect that?

With the googol games, if I watched for 100 lifetimes, I would never see a win, nor would I expect one.

Would you expect to see engine A beating engine B under those conditions?

Yet these two engines have exactly the same likelihood of superiority?

I find it hard to believe that you gentlemen actually believe what you are saying.

Posted: **Tue Jun 30, 2020 1:40 pm**

Dann Corbit wrote: ↑Tue Jun 30, 2020 1:18 pm

> The problem is that the engines are never of different strength, even though the model is predicting that they are of different strength.

Not true. Again you make your own caricature of 'the model'. The actual model is not predicting any such thing. In fact it doesn't predict anything at all. It just tells you, after the fact, that there might be some (very small) likelihood one is better than the other, and a (very large) likelihood that they are the same strength.

> If I do ten games/flips/trials, I have a small number of outcomes. As I increase the number of games, the number of possible "bad runs" increases. There is always going to be statistical noise in the data. To assume that a very small number of outcomes in a huge, lengthy trial are due to superiority rather than random variation does not make sense to me.

I suppose you refer to the 8 games that were won in your first example. Well, for one, there is nothing random about them; you selected those 8 games on their outcome. As soon as you do that, all bets are off. The other googolplex of games could have been losses, and I suppose that would even convince you the other engine is superior, and you could still pick out the only 8 wins of the inferior one to 'demonstrate its superiority'. If you selected games randomly, it would not matter how long the trial is; 8 games is 8 games, and the probability of a certain result within those 8 games is just given by the WDL probability in a single game.

> If I play Engine A against Engine B for 500 games and Engine A wins 450 out of 500, it is clearly much stronger. If I play a googol games and engine A wins 450 and B wins 50, then they are the same strength.

Wrong again. They are nearly the same strength. But it is very clear which one of the two is infinitesimally stronger.

> Throwing out the draws cannot be correct. If the math says it is correct then there is an error in the math or an error in the model.

The math only says the draws can be thrown away without affecting the expectation for which one is stronger (LOS). It does not say they can be thrown away when determining whether the engines are nearly the same strength or very far apart, which is what you want to use it for.
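The distinction between "which one is stronger" and "by how much" can be illustrated with the usual logistic Elo formula, in which draws count as half a point. This is a sketch under that assumption; the function name is illustrative, and 10**9 draws stand in for the googol in the example because a googol exceeds what floating point can resolve here.

```python
import math


def elo_diff(wins, losses, draws):
    """Elo difference implied by a match score, counting draws as half a point.

    Requires a score strictly between 0 and 1.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400 * math.log10(score / (1 - score))


# Same decisive games (450 wins, 50 losses), very different numbers of draws.
print(elo_diff(450, 50, 0))        # no draws: a large rating gap
print(elo_diff(450, 50, 10**9))    # a billion draws: a tiny but positive gap
```

Both matches contain the same 450 wins and 50 losses, so a measure that looks only at decisive games ranks engine A first in both, while the Elo difference shrinks toward zero as draws are added.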

> With the 500 games, engine A would be dominantly stronger and I would expect watching a match that Engine A would win almost every game.
>
> Wouldn't you expect that?
>
> With the googol games, if I watched for 100 lifetimes, I would never see a win, nor would I expect one.
>
> Would you expect to see engine A beating engine B under those conditions?

That is not the relevant question, as it relates to the magnitude of the strength difference. The relevant question is:

If you do watch until you see a win, will it be engine A or engine B that wins?

> Yet these two engines have exactly the same likelihood of superiority?
>
> I find it hard to believe that you gentlemen actually believe what you are saying.

You would do well to disbelieve what you are hearing. Because that is not what we are saying. The problem is entirely on your end of the transmission line.

Posted: **Tue Jun 30, 2020 1:44 pm**

A tie or a draw is, by its very definition, an indicator of equality.
The more ties between opponents, the more likely that they are of the same strength.
This is obvious by the very definition of a tie or draw: Neither opponent was able to overcome the other.

Throwing out draws to calculate Elo