Throwing out draws to calculate Elo

Dann Corbit · Post by **Dann Corbit** » Mon Jun 29, 2020 9:37 am

It is a generally accepted practice to throw out draws and use only wins and losses to calculate the relative Elo of a set of engines.
So let's do a gedankenexperiment:
Engine A plays Engine B one googol (10^100) times.
There are 10^100 - 8 draws and 8 wins for engine B.
Standard calculation would make engine B much stronger and also give a very large LOS for engine B.
But this is totally absurd.
If we watched games for many lifetimes between engine A and engine B, we would (almost surely) never see anything but a draw, despite engine B's much larger Elo and LOS.
At this point, the 8 wins are clearly random noise.

Opinions?

abulmo2 · Post by **abulmo2** » Mon Jun 29, 2020 10:06 am

If you need a large number of games to demonstrate the superiority of an engine, it usually means the superiority is not significant. Throwing out draws to calculate WILO has a higher sensitivity than regular ELO when the percentage of draws is reasonable, say 60 or 70%, not when it reaches 99,999...99992%. The idea of WILO is to reduce the number of games needed to show an effect:
http://www.talkchess.com/forum3/viewtopic.php?t=63875

Laskos wrote:"Wilo" is the name given by Miguel Ballicora to drawless Elo (draws are discarded). It was shown that rating model based on Wilo is sound, and Wilos are additive over logistic, just as Elos. No draw model is needed, as there are no draws. What I will show empirically from FGRL rating lists of Andreas Strangmüller Top 10 at 60''+0.6'' and 60'+15'' (roughly 60x time factor between the two lists) is:

1/ Wilo rating doesn't compress or dilate ratings from STC to LTC. Elo rating does compress ratings from STC to LTC
2/ Wilo rating give higher LOS (p-values), showing more sensitivity than Elo rating, therefore less games are needed for Wilo rating to show significant differences between engines than with Elo model

Therefore considering draws as non-games is better both as calibration of ratings (no time control dependence) and as number of games needed for significance.

Dann Corbit · Post by **Dann Corbit** » Mon Jun 29, 2020 10:31 am

abulmo2 wrote: ↑Mon Jun 29, 2020 10:06 am If you need a large number of games to demonstrate the superiority of an engine, it usually means the superiority is not significant. Throwing out draws to calculate WILO has a higher sensitivity than regular ELO when the percentage of draws is reasonable, say 60 or 70%, not when it reaches 99,999...99992%. The idea of WILO is to reduce the number of games needed to show an effect:
http://www.talkchess.com/forum3/viewtopic.php?t=63875
Laskos wrote:"Wilo" is the name given by Miguel Ballicora to drawless Elo (draws are discarded). It was shown that rating model based on Wilo is sound, and Wilos are additive over logistic, just as Elos. No draw model is needed, as there are no draws. What I will show empirically from FGRL rating lists of Andreas Strangmüller Top 10 at 60''+0.6'' and 60'+15'' (roughly 60x time factor between the two lists) is:

1/ Wilo rating doesn't compress or dilate ratings from STC to LTC. Elo rating does compress ratings from STC to LTC
2/ Wilo rating give higher LOS (p-values), showing more sensitivity than Elo rating, therefore less games are needed for Wilo rating to show significant differences between engines than with Elo model

Therefore considering draws as non-games is better both as calibration of ratings (no time control dependence) and as number of games needed for significance.

My gedankenexperiment was meant to show that at extreme scale, it simply cannot be correct. So what happens with 8 wins out of 10,000 games? Those engines are still so close to identical that the difference does not matter. There is some point where the draw count means something, perhaps a function is needed to describe it.

It was Albert Einstein who said, "Things should be made as simple as possible but no simpler."

hgm · Post by **hgm** » Mon Jun 29, 2020 10:44 am

Dann Corbit wrote: ↑Mon Jun 29, 2020 9:37 am At this point, the 8 wins are clearly random noise.

Opinions?

The probability that they are statistical noise super-imposed on an expected 4-4 is exactly as large as in an 8-0 match. The probability of 8-0 when there are no draws is 1 in 256. Make it two-sided by throwing in 0-8 as well, and it is 1 in 128.
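
The arithmetic above can be checked with a minimal C++ sketch. The function names are illustrative and not from the thread, and the last function uses the normal-approximation LOS formula that is common in engine testing, which is an assumption about what "LOS" means here.

```cpp
#include <cmath>

// Probability that all n decisive games go to one pre-named engine
// if every decisive game is a fair coin flip (the "pure noise" hypothesis).
double one_sided_sweep_probability(int n)
{
    return std::pow(0.5, n);
}

// Same, but counting a sweep by either engine.
double two_sided_sweep_probability(int n)
{
    return 2.0 * one_sided_sweep_probability(n);
}

// Assumed LOS formula (normal approximation, draws ignored):
// LOS = 0.5 * (1 + erf((W - L) / sqrt(2 * (W + L)))).
// Requires wins + losses > 0.
double los_normal_approx(int wins, int losses)
{
    return 0.5 * (1.0 + std::erf((wins - losses) / std::sqrt(2.0 * (wins + losses))));
}
```

Calling `one_sided_sweep_probability(8)` gives 1/256 and `two_sided_sweep_probability(8)` gives 1/128. The number of draws never enters either calculation, which is why 10^100 - 8 draws change nothing here.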

I would hesitate to say that something that claims with >99% confidence that an event cannot happen must be 'clearly' the cause of that event when it does happen...

It seems you confuse 'likelihood of superiority' with 'amount of superiority'. But they are independent quantities. In your example the amount of superiority is obviously very small. But there is little doubt it exists, small as it is.

Dann Corbit · Post by **Dann Corbit** » Mon Jun 29, 2020 11:38 am

The probability of a draw is 1, and yet you say an engine is superior?
There must be another meaning of superior than the one that I am accustomed to.
I further claim that in a second run of 10^100 games engine A winning 8 and drawing the rest is easily as likely as B.

Somehow, there is an implicit assumption that there is no randomness in these games. Especially with nearly matched opponents, there is a lot of randomness.
If we were measuring anything besides Elo and superiority, we would say they are the same.

Something is definitely cockeyed. In fact, the fit for equality is so good, I think that there are statistical tests that would reject it simply because it is too good.

There is simply no way on the good earth that either engine is superior to the other given those numbers. To imagine that the 8 games are outside of statistical noise is mind boggling. At least for me, maybe you are harder to boggle.

Alayan · Post by **Alayan** » Mon Jun 29, 2020 1:07 pm

You could create an engine that has a 1 in X probability of timing out a game, with X an arbitrarily high number. It would be 100% inferior to the regular version without this weakening code, and provably so, yet no practical test that only looks at game outcomes might ever be able to prove it.
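
To put a number on "no practical test", here is a minimal C++ sketch of the chance of seeing even one timeout in n games. It assumes each game times out independently with probability 1/X; the function name is hypothetical.

```cpp
#include <cmath>

// Probability of observing at least one timeout in n games when each game
// independently times out with probability 1/x (assumes x > 1).
double prob_at_least_one_timeout(double n, double x)
{
    // 1 - (1 - 1/x)^n, written with log1p/expm1 so it stays accurate
    // when 1/x is far below double-precision epsilon relative to 1.
    return -std::expm1(n * std::log1p(-1.0 / x));
}
```

For example, `prob_at_least_one_timeout(1e6, 1e12)` is about one in a million, so a million-game test would almost always show the two versions as identical.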

The idea we'd care only about likelihood of superiority without caring about how much one engine is superior to the other is deeply flawed.

The idea that Wilo is invariant with TC (outside of different scaling properties) is plain wrong because past some strength difference the proportion of weak side wins goes towards 0 as TC increases.

FastGM data with SF11, bullet :

Houdini 6 : 250 ( 122, 117, 11), 72.2 : +125, 4, 100.0
Komodo 13.2 : 250 ( 112, 123, 15), 69.4 : +128, 4, 100.0
Komodo 13.3 : 250 ( 114, 118, 18), 69.2 : +129, 4, 100.0
Komodo 14 : 250 ( 108, 128, 14), 68.8 : +134, 4, 100.0

5.8% weak side wins, 45.6% strong side wins, 7.86 ratio.

60m list :

Komodo 13.3 : 150 ( 50, 100, 0), 66.7 : +94, 6, 100.0
Komodo 13.2 : 150 ( 46, 99, 5), 63.7 : +100, 6, 100.0
Komodo 14 : 150 ( 43, 103, 4), 63.0 : +102, 7, 100.0
Houdini 6 : 150 ( 52, 98, 0), 67.3 : +106, 5, 100.0

1.5% weak side wins, 31.8% strong side wins. 21.22 ratio.

Komodo 13.3 bullet :

Fire 7.1 : 250 ( 118, 114, 18), 70.0 : +140, 3, 100.0
Ethereal 11.75 : 250 ( 122, 112, 16), 71.2 : +149, 4, 100.0
Xiphos 0.6 : 250 ( 100, 124, 26), 64.8 : +158, 3, 100.0
RofChade 2.3 : 250 ( 132, 100, 18), 72.8 : +198, 4, 100.0

7.8% weak side wins, 47.2% strong side wins. 6.05 ratio

60m list :

Ethereal 11.75 : 150 ( 51, 94, 5), 65.3 : +110, 6, 100.0
Fire 7.1 : 150 ( 54, 92, 4), 66.7 : +120, 5, 100.0
Xiphos 0.6 : 150 ( 44, 101, 5), 63.0 : +122, 5, 100.0
RofChade 2.3 : 150 ( 66, 81, 3), 71.0 : +136, 6, 100.0

2.8% weak side wins, 35.8% strong side wins. 12.65 ratio.
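
The win ratios above translate directly into a drawless rating gap. A minimal C++ sketch follows, assuming Wilo is the standard logistic Elo formula applied to decisive games only; the function name is illustrative, and pooling four opponents into one ratio is a simplification.

```cpp
#include <cmath>

// Drawless Elo ("Wilo") difference implied by decisive games only.
// Logistic model: expected score s = W / (W + L), and
// rating difference = 400 * log10(s / (1 - s)) = 400 * log10(W / L).
// Requires both counts to be positive.
double wilo_difference(double strong_wins, double weak_wins)
{
    return 400.0 * std::log10(strong_wins / weak_wins);
}
```

With the pooled SF11 counts, `wilo_difference(456, 58)` is roughly 358 at bullet and `wilo_difference(191, 9)` is roughly 531 on the 60m list. With the Komodo 13.3 counts, `wilo_difference(472, 78)` is roughly 313 and `wilo_difference(215, 17)` is roughly 441. Under this assumed formula, the same pairings show a much larger Wilo gap at the longer time control.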

hgm · Post by **hgm** » Mon Jun 29, 2020 1:08 pm

It is infinitesimally superior. I cannot know what you are accustomed to, but it is a generally accepted mathematical fact that 1+10^-100 > 1. That is not any new meaning of > than what I am accustomed to...

Dann Corbit wrote: ↑Mon Jun 29, 2020 11:38 am I further claim that in a second run of 10^100 games engine A winning 8 and drawing the rest is easily as likely as B.

Good thing that you are not a betting man, then, because the odds that that would happen are worse than 100:1.

I don't see anything strange here. It is just (a quite common) case of two products that are nearly perfect, but one has a tiny flaw that manifests itself only very rarely. The game tree of chess is large enough that it would stumble on something very special only a few times in 10^100 games. Perhaps a hash-key collision very close to the root that makes it play a losing blunder. And the other engine avoids that because it uses a 256-bit hash key.

Pio · Post by **Pio** » Mon Jun 29, 2020 1:55 pm

hgm wrote: ↑Mon Jun 29, 2020 1:08 pm It is infinitesimally superior. I cannot know what you are accustomed to. but it is generally accepted mathematical fact that 1+10^-100 > 1. That is not any new meaning of > than what I am accustomed to...

I further claim that in a second run of 10^100 games engine A winning 8 and drawing the rest is easily as likely as B.
Good thing that you are not a betting man, then, because the odds that that would happen are worse than 100:1.

I don't see anything strange here. It is just (a quite common) case of two products that are nearly perfect, but one has a tiny flaw that manifests itself only very rarely. The game tree of chess is large enough that it would stumble on something very special only a few times in 10^100 games. Perhaps a hash-key collision very close to the root that makes it play a losing blunder. And the other engine avoids that because it uses a 256-bit hash key.

Hi HGM!

I agree with you, the likelihood of superiority is at least a factor 128, but I would guess the likelihood of superiority is infinite since they have played 10^100 games and should thus have played more or less all drawing games there are, and the 8 losses should be because of a very small bug in the losing engine.

It is easy to see that an upper bound on the number of positions in chess is 2^(64*3) (since my board representation is in 3*64 bits), which is roughly 6 * 10^57 positions, and I guess the true number is 10^40+. Of course the number of possible games is a lot larger, but I do not know the number of drawing games.
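
The size of the 2^(64*3) bound in decimal can be checked with a short C++ sketch (the helper name is hypothetical):

```cpp
#include <cmath>

// Base-10 exponent of 2^bits, i.e. log10(2^bits) = bits * log10(2).
double decimal_exponent_of_power_of_two(int bits)
{
    return bits * std::log10(2.0);
}
```

`decimal_exponent_of_power_of_two(64 * 3)` is about 57.8, so the bound is a little under 10^58.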

/Pio

Dann Corbit · Post by **Dann Corbit** » Mon Jun 29, 2020 6:04 pm

Absolutely, positively not a chance that the engines are anything but peers.

Try this program and tell me what you see:

Code:

#include <iostream>
#include <random>
#include <vector>

// Tally for one contest: values below 0.5, above 0.5, and exactly 0.5.
struct results {
    int lt_h = 0;
    int gt_h = 0;
    int ties = 0;
};

int main()
{
    std::mt19937 generator(17);  // fixed seed so the output is reproducible
    std::uniform_real_distribution<double> urd(0.0, 1.0);
    std::vector<results> contests(1000);
    // Each contest is 10000 independent fair coin flips.
    for (results &contest : contests)
    {
        for (int result = 0; result < 10000; result++)
        {
            double value = urd(generator);
            if (value < 0.5) contest.lt_h++;
            else if (value > 0.5) contest.gt_h++;
            else contest.ties++;  // only when value is exactly 0.5, which is vanishingly rare
        }
    }
    for (const results &contest : contests)
    {
        std::cout << "losses: " << contest.lt_h << " wins: " << contest.gt_h << " ties: " << contest.ties << '\n';
    }
    return 0;
}

Now, pick a pair from the output that are very close to each other and tell me what to conclude from it.

hgm · Post by **hgm** » Mon Jun 29, 2020 9:15 pm

Not sure how that is related to the original problem. Now you suddenly generate contests that will have no draws at all.

Now what would you think if these were not Chess engines, but Tic Tac Toe programs? Would you still think they are peers in every respect, or would you think one of them is running on hardware that is more buggy than the other?
