Throwing out draws to calculate Elo

Ovyron · Post by **Ovyron** » Sat Jul 04, 2020 2:25 am

Dann Corbit wrote: ↑Fri Jul 03, 2020 7:18 am You mean you didn't see all those places where an engine is ranked higher than the one under it and yet the LOS is below 50? There are lots of them. That means Elo calculation says it is stronger, but LOS says it is weaker.

That never happened, not even once.

All the times LOS is below 50, Elo said both engines had the same strength. Check again.

Dann Corbit · Post by **Dann Corbit** » Sat Jul 04, 2020 7:59 am

You don't get it, Elo says they are even. LOS says the second engine is stronger. They don't agree, and I trust Elo, which is accurate. The other places where Elo is equal the LOS will almost always be higher. Again, wrong. Elo is accurate, LOS is not

LOS gives no useful information. It always says that the engine with most wins is stronger *duh* and ignores draws when calculating strength.

Dann Corbit · Post by **Dann Corbit** » Sat Jul 04, 2020 8:01 am

LOS also does not understand the strength of any of the opponents. All it knows is wins and losses, that's it

hgm · Post by **hgm** » Sat Jul 04, 2020 10:30 am

Dann Corbit wrote: ↑Sat Jul 04, 2020 1:34 am
Dann Corbit wrote: ↑Sat Jul 04, 2020 12:46 am I also do not believe you when you say you would change your engine based on 10 games with an 8 wins and 2 draws outcome. That's a lie and you know it
Now (having thought about what I said) I am forced to apologize for saying that. I would never make a change based on ten games, and I was projecting that onto you, and I have no right to do that.
I apologize because I cannot read your mind, and that is the basis of my claim that you were lying

There sure is a lie there, but it is already hidden in the question: you are asking what people will do in hypothetical cases that cannot possibly occur in practice, and then whatever they answer will then always be true. The lie is the assumption that the condition in the question could be fulfilled.

You can only detect improvements of about 500 Elo with 10-game matches, and unless your engine is badly broken to start with, no patch can ever give you that. And if you want to check whether you fixed an engine that is known to be broken, you would not test it against the broken version, but against an engine you know to work. So you would never start 10-game matches, and it is unsound to stop matches earlier because of the score. (Although a careful analysis would show it can be justifiable to allow aborting after very extreme scores, if you adapt your acceptance criterion for that; it would be pretty foolhardy to continue playing 10,000 more games with a version that loses the first 20 games, just because that was what you planned to do in the belief that the patch would have a very small effect.)

The only situation that would be remotely like this would be testing whether a change completely broke the engine, where you are testing for a >500 Elo drop in strength. After a change I always keep watching tests until I see that the new version is able to win games.

The discussion so far has ignored the existence of prior likelihood (even though one person has pointed out its importance): there are things you can consider known even before playing any game at all. Like that tweaking a piece value away from the classical 1:3:3:5:9 ratio will NEVER give you a 500-Elo (or even a 50-Elo) improvement. The LOS (or Elo error bars) is only a likelihood. And, as they say, "when the impossible is eliminated, whatever remains, no matter how unlikely, must be the truth".

So if you do a test after some innocent eval-parameter tweaking, and the Elo calculator says that made you 250 Elo stronger, with a confidence of 99.9% (the error bar being only 100 Elo), the 0.1% remaining probability that this is a statistical fluke in a match between nearly equal engines is not small enough to overturn the prior likelihood against this possibility, which is more than a million to 1 (if not actually infinite). The probability for the fluke is still enormously higher than the probability that the tweak caused a 250 Elo increase.
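For anyone who wants to see that comparison as arithmetic, here is a minimal sketch in Python. The prior odds (a million to one against) and the 0.1% fluke probability are the figures from the paragraph above. The probability of seeing the result if the gain were real is set to 1, which is an assumption made here as the most generous case for the tweak, and the function name is just for illustration.

```python
def posterior_odds(prior_odds, p_data_if_real, p_data_if_fluke):
    """Odds of 'real gain' versus 'fluke' after the test (Bayes' rule in odds form)."""
    return prior_odds * p_data_if_real / p_data_if_fluke

prior_odds = 1e-6        # from the post: a million to one against a 250-Elo gain
p_data_if_fluke = 1e-3   # from the post: 0.1% chance of a fluke between near-equal engines
p_data_if_real = 1.0     # assumption: a real 250-Elo gain would always produce this result

odds = posterior_odds(prior_odds, p_data_if_real, p_data_if_fluke)
print(f"real gain : fluke = 1 : {1 / odds:.0f}")
```

Even with that generous assumption the fluke comes out about a thousand times more likely than the real gain.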

You argue like this is a problem with LOS, but it is just as bad with Elo. An 8:2:0 WDL result doesn't only produce a 99%+ LOS, it also produces an Elo jump that lies far outside the error bars given by the Elo calculator. Obviously you did not bother to calculate that, biased as you are in favor of Elo. But if rejecting the result is taken as proof that you don't trust LOS, then it equally proves you don't trust Elo. It is just a consequence of the larger general picture, that you have prior knowledge about the effect the engine change could have, and that your tests do not just measure engine strength and statistical noise, but also 'operator error': likely you blundered in typing the name of the engine executable, a config file of the opponent engine got inadvertently overwritten by garbage so that it set all its piece values to zero, a virus infected your system and hid itself in the engine executable, etc.
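The Elo side of an 8:2:0 result can be checked with a rough calculation. The sketch below is not the Elo calculator discussed in the thread (none is named); it assumes the standard logistic Elo curve and a plain normal approximation for the error of the score, which is crude for 10 games. The function names are made up for this example, and the lower bound only works while it stays strictly between 0 and 1.

```python
import math

def elo_from_score(score):
    """Elo difference implied by a score fraction (logistic Elo model)."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_with_lower_bound(wins, draws, losses, z=1.96):
    """Elo estimate and the Elo at the lower end of a ~95% interval on the score."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    return elo_from_score(score), elo_from_score(score - z * stderr)

elo, elo_low = elo_with_lower_bound(8, 2, 0)
print(f"Elo difference {elo:.0f}, lower bound {elo_low:.0f}")
```

Under these assumptions the estimate is a few hundred Elo and even the lower bound stays well above zero, so on this evidence Elo is no more cautious about the result than LOS is.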

It is really weird that you think that occurrence of draws would say anything at all about what would happen when there is not a draw.

"My car typically drives 100,000 miles without breaking down".
"Oh, that is quite a lot of miles. It hardly ever breaks down at all! That makes it absolutely sure that you have a flat tire as often as that the roof melts."
"Well, last 8 times it broke down it was a flat. Never seen the roof melt..."
"That can only be because you were very, very lucky. Next 8 times you could easily have to deal with molten roofs!"

Do you recognize anything strange in the above conversation at all?

syzygy · Post by **syzygy** » Sat Jul 04, 2020 1:21 pm

Dann Corbit wrote: ↑Sat Jul 04, 2020 12:46 am As I have said, on many occasions, I do not disagree with your number (in this case 1/256). I disagree with the value of your number. With 8 trials the number is not trustworthy.

That you find 1/256 not a small probability is fine with me. If crossing the street carried a 1/256 probability of being run over by a truck, I would try to avoid having to cross the street.
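The 1/256 appears to be the chance that two equally strong engines split 8 decisive games 8-0: each decisive game is then a coin flip, and 0.5^8 = 1/256. A minimal sketch of that exact binomial calculation, with a hypothetical helper name, draws left out because they do not affect who wins the decisive games:

```python
from fractions import Fraction
from math import comb

def prob_at_least(wins, decisive_games):
    """P(at least `wins` wins in `decisive_games` decisive games) for two equal engines."""
    favourable = sum(comb(decisive_games, k) for k in range(wins, decisive_games + 1))
    return Fraction(favourable, 2 ** decisive_games)

print(prob_at_least(8, 8))   # 1/256
```

The number is exact for any count of decisive games. What grows with more games is how extreme a result you are able to observe, not the reliability of the probability itself.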

But you are continuously going back and forth between "8 is not enough" and "8 is enough but not if there are a lot of draws". That makes you rather unreliable to have such a discussion with and is bringing me very close to the point of just putting you on ignore as a warning not to waste any more time on this and similar threads in the future.

syzygy · Post by **syzygy** » Sat Jul 04, 2020 1:47 pm

Dann Corbit wrote: ↑Sat Jul 04, 2020 12:52 am Bayesian logic says that we should change our expectation based on new information.
If I have 8 wins and 2 draws LOS gives me a number. If I add a thousand draws, LOS gives me the same number.
Elo, on the other hand, would use the new information to change the expectation. Elo is really only an indicator of stronger/weaker also, when you think about it. The actual numbers (Elo is 2500) are meaningless because the only thing it really tells you is difference in strength with respect to a pool.

The Elo figure, with enough draws, will tell you the change is not worthwhile. It has changed its expectation about who is stronger, based on draws. Is the Elo calculation wrong? If not, why were the draws helpful?

Ok, so this thread has no use whatsoever. You are on ignore from now, since there are better things to waste time on.

hgm · Post by **hgm** » Sat Jul 04, 2020 1:48 pm

It is difficult to deduce what his problem is, as what he writes is self-contradictory. He doesn't disagree with the number, but with its value... What the heck does that mean??? That he recognizes the likelihood is represented by some number between 0 and 1, but that 1/256 is not it?

"If I throw a fair die, the probability a 6 will come up is 1/6."
"No, you cannot say that. With so few trials 1/6 is not trustworthy."

It somehow suggests a very deep lack of understanding of what 'probability' means.

Ovyron · Post by **Ovyron** » Sun Jul 05, 2020 7:06 am

Dann Corbit wrote: ↑Sat Jul 04, 2020 7:59 am You don't get it, Elo says they are even. LOS says the second engine is stronger. They don't agree, and I trust Elo, which is accurate. The other places where Elo is equal the LOS will almost always be higher. Again, wrong. Elo is accurate, LOS is not

LOS gives no useful information. It always says that the engine with most wins is stronger *duh* and ignores draws when calculating strength.

This may be the crux of the issue:

Elo says the engines are equal.

LOS tells you the probability that Elo is wrong, and one of the engines is actually stronger than the other. If the test were run again, one of the engines might come out with a higher Elo this time; LOS tells you which one is more likely to end up on top, and that guess doesn't require including draws.

In the end, if you play 10^100 games and they're all drawn except for 8 when A beats B, what LOS is telling you is that if you repeated the test it'd be more likely that you'd get more wins from A than from B.

Every time you've run your gedankenexperiment on this thread, A has beaten B again, so thus far LOS has been right. Their Elo is very close, but as long as A beats B now and then, no matter how rarely, it has a higher LOS (or you'd have been telling us about experiments where B was the one getting those wins).
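To put numbers on that, here is a small Python sketch of what happens to both figures when draws are piled on top of 8 wins and 0 losses. The LOS formula is the commonly quoted normal approximation using only wins and losses, and the Elo difference comes from the plain logistic curve; both are assumptions for illustration, not necessarily what any particular rating tool computes.

```python
import math

def los(wins, losses):
    """Likelihood of superiority (normal approximation); draws do not enter."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

def elo_diff(wins, draws, losses):
    """Elo difference implied by the overall score (logistic Elo model)."""
    score = (wins + 0.5 * draws) / (wins + draws + losses)
    return 400.0 * math.log10(score / (1.0 - score))

for draws in (2, 1000, 1_000_000):
    print(f"+8 ={draws} -0: LOS {los(8, 0):.4f}, Elo difference {elo_diff(8, draws, 0):+.2f}")
```

The Elo difference shrinks toward zero as draws are added but never changes sign, while LOS stays put. The two numbers answer different questions: how large the difference is, and which side it favours.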

Dann Corbit · Post by **Dann Corbit** » Sun Jul 05, 2020 1:06 pm

Edit: I left out an important detail. Rather than simply multiplying by 100, I actually rounded to 0 decimal places. The calculation actually was:
SELECT round(los_prob * 100, 0). Now anyone should be able to replicate exactly.

If you take the output of the program that calculates LOS for two exactly equal opponents so the score should be 50, you will see the following distribution if you multiply the LOS scores by 100 to get bins from 0 to 100:

pct,count(pct)
100,456
99,996
98,1056
97,990
96,961
95,1024
94,945
93,997
92,1016
91,900
90,970
89,1116
88,1004
87,931
86,1013
85,1000
84,931
83,1162
82,1001
81,1075
80,923
79,1164
78,980
77,993
76,958
75,1004
74,1012
73,970
72,1108
71,867
70,1067
69,849
68,1162
67,942
66,941
65,1169
64,980
63,931
62,988
61,970
60,1017
59,1248
58,1007
57,1022
56,1005
55,1001
54,959
53,1025
52,1006
51,961
50,809
49,973
48,999
47,970
46,977
45,1007
44,981
43,1014
42,959
41,1269
40,1018
39,972
38,972
37,932
36,951
35,1171
34,937
33,935
32,1124
31,861
30,1043
29,863
28,1068
27,1050
26,1089
25,993
24,930
23,990
22,984
21,1117
20,887
19,1021
18,896
17,1095
16,938
15,1040
14,961
13,913
12,1041
11,1032
10,961
9,935
8,1057
7,1058
6,1001
5,978
4,953
3,1028
2,1026
1,972
0,476

You will notice that there are about 1000 items in each bin, except for the first and last bins, which have about 500 each.
That means that it is about equally likely that the LOS algorithm will say any of the possible answers between "I am absolutely sure A is stronger than B" and "I am absolutely sure that A is not stronger than B". We can also turn it around and say the same thing in the other direction.

It is about equally likely that the LOS algorithm will say any of the possible answers between "I am absolutely sure B is stronger than A" and "I am absolutely sure that B is not stronger than A".
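The shape of the table can be reproduced with a short simulation. This is only a sketch: the thread does not show the program that produced the data, so the LOS formula below (the commonly quoted normal approximation from wins and losses) is an assumption, and the match count, games per match and seed are arbitrary. Draws are left out because that formula ignores them.

```python
import math
import random
from collections import Counter

def los(wins, losses):
    """Likelihood of superiority (normal approximation); draws do not enter."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

def simulate_bins(matches=20_000, decisive_games=40_000, seed=1):
    """Play matches between two equal engines and count round(LOS * 100) per bin."""
    rng = random.Random(seed)
    bins = Counter()
    for _ in range(matches):
        # Each random bit is one decisive game: 1 = win for A, 0 = win for B.
        wins = bin(rng.getrandbits(decisive_games)).count("1")
        losses = decisive_games - wins
        bins[round(los(wins, losses) * 100)] += 1
    return bins

bins = simulate_bins()
interior = sum(bins[p] for p in range(1, 100)) / 99
print(f"bin 0: {bins[0]}, bin 100: {bins[100]}, interior mean: {interior:.0f}")
```

The half-size end bins come from the rounding: bin 0 only collects values from 0 to 0.5 and bin 100 only those from 99.5 to 100, while every other bin is a full unit wide. A flat distribution is also what a probability of this kind is expected to show when the two engines really are equal: a LOS of 99% or more then turns up in roughly 1% of matches.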

I offered this data last week in table form and even as a relational database table, but nobody seemed very interested.

Anyway, you questioned why I would use a LOS algorithm to test superiority of an engine over itself. I mean, it does sound kind of silly because we already know the answer, "It's not superior to itself." but that is exactly why the experiment is important. If the algorithm claims that the program is superior to itself when it is not superior to itself, then that indicates a problem.
What we see here is that the LOS algorithm coughs up an answer that is bad most of the time.
That is because as we have more and more trials, the wins and losses of the same engine do move towards the mean. However, the raw number of wins and losses that are not exactly on the mean will increase in spread (even though, on average, they compute a better mean). This destroys the LOS calculation.

Now, the LOS calculation is not the worst calculation in the world. It also tells us the same thing that common sense tells us. If engine A has more wins than engine B, it is probably stronger. But because it does not care about draws, it is missing important information (including the number of games in total).

Another important reason for testing LOS using an engine against itself is that the only thing I have ever seen it used for is as a tiebreaker. For instance, CCRL uses it to tell differences between adjacent engines on their list. That makes them (by definition of the ordered list) fairly close in strength to each other.

It may be that with only a couple thousand games the error spread has not grown large enough to dominate. And so the answers may be OK. But I also think it is faulty for throwing out draw numbers. Draw numbers impart vital information about strength (as demonstrated by the Elo calculation) and therefore tossing out that information makes the algorithm more prone to bad guesses.

hgm · Post by **hgm** » Sun Jul 05, 2020 1:14 pm

'Equal' just means equal after rounding to an integer. If two Elos are known very accurately, they can be outside each other's error bars, but still round to the same value. I don't know how large the reported LOS in these cases is. It seems Dann already takes offense when he sees LOS = 60%, or even LOS = 51%, while this just means "too close to call in any meaningful way".

Also note that the error bars in an Elo calculation do not show the full story: they do not show the correlations between the ratings. Two engines can have ratings that are determined very accurately w.r.t. each other (e.g. by playing very many mutual games, or games within a small group), but very poorly w.r.t. the remainder of the list, because they hardly played any games against those. This would give them both large error bars within the list, so that they would be well within each other's error bars, while the error bar on their Elo difference could be very small (and hence the LOS quite high). The LOS is specific for the pair of engines it compares, and takes all this into account.
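The effect of the correlation can be shown with the usual formula for the variance of a difference, Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B). The numbers in this Python sketch are hypothetical: two engines each with a 50-Elo error bar (one standard deviation) within the list, rated 15 Elo apart, once treated as uncorrelated and once with a correlation of 0.98 from having played mostly each other. The LOS here is taken as the normal probability that the true difference is positive, which is again an assumption for illustration.

```python
import math

def sd_of_difference(sd_a, sd_b, correlation):
    """Standard deviation of (rating A - rating B) for correlated rating errors."""
    return math.sqrt(sd_a ** 2 + sd_b ** 2 - 2.0 * correlation * sd_a * sd_b)

def los_from_difference(elo_diff, sd_diff):
    """Probability that the true difference is positive (normal approximation)."""
    return 0.5 * (1.0 + math.erf(elo_diff / (sd_diff * math.sqrt(2.0))))

sd_a = sd_b = 50.0   # hypothetical error bars within the whole list
elo_diff = 15.0      # hypothetical rating difference between the two engines

for rho in (0.0, 0.98):
    sd = sd_of_difference(sd_a, sd_b, rho)
    print(f"correlation {rho}: sd of difference {sd:.1f}, LOS {los_from_difference(elo_diff, sd):.3f}")
```

With no correlation the 15-Elo gap is lost inside a 70-Elo uncertainty and the LOS is close to a coin flip. With strong correlation the uncertainty on the difference drops to about 10 Elo and the LOS is well above 90%, even though the individual error bars have not changed.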
