margin of error

Discussion of chess software programming and technical issues.

Moderator: Ras

lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: margin of error

Post by lucasart »

Michel wrote:
You could use cutechess-cli to run one experiment (i.e. a game vs each opponent) and a python script that runs it and tests the stopping rule.
Actually that is a brilliant idea. To put a truncated Wald test in cutechess-cli (I believe cutechess-cli can run tournaments now).

The math for determining the parameters of a truncated Wald test is quite complicated (not the execution of the test itself, which is trivial), but I wrote some notes about it, so I am available for consulting if necessary.
Indeed, the sequential Wald test should ideally be built into cutechess-cli. As I'm working on my own tournament manager, it's definitely on my todo list. I think the proper way of testing for an improvement with, say, an epsilon elo resolution is as follows:
1/ we play 2 games against each program P(1)...P(n), one with white and one with black, from a given position. The results give a sequence of X(t). Note that this avoids the pitfall of self-play, and the X(t) are independent and approximately identically distributed (if positions are chosen to give equal chances and a varied representation of the set of possibilities).
2/ we do a sequential Wald test of elo_diff = 0 vs elo_diff = epsilon, where scores and elos are deduced from one another using the formula from BayesElo (removing the colour-bias correction, which isn't relevant here).
It's just about finding a way that is generic enough not to make the command line ugly and messy. cutechess-cli already plays gauntlets, so it's just a question of sampling double gauntlets and defining a stopping rule. 3 parameters for the stopping rule: the type I and type II error probabilities, and epsilon (the elo resolution, where [0, epsilon] is the "grey zone", as usual with the Wald test, sequential or not).
This would REALLY be a nice feature to have built into cutechess-cli, but unfortunately patching cutechess-cli is not trivial for me, as it requires a lot of Qt knowledge (on top of C++). I've never even figured out how to install qmake on my Fedora 17, to begin with :cry:
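As a side note, the score/elo conversion referred to in step 2/ is, without the colour-bias correction, just the logistic curve. A minimal sketch (the function names are mine, not from cutechess-cli or BayesElo):

```python
import math

# Plain logistic elo <-> expected-score mapping (no colour-bias term).
def elo_to_score(elo_diff):
    """Expected score for a given elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def score_to_elo(score):
    """Inverse mapping: elo difference implied by an expected score."""
    return -400.0 * math.log10(1.0 / score - 1.0)
```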
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Michel wrote:
I'm not even using BayesElo in the simulation, I am simply using the simulation to determine what my ratio of correct vs incorrect results would be given various test setups. So I now know that if we make a 2 ELO improvement, 20,000 head to head games will give me the wrong answer almost 13% of the time. So we would end up keeping a regression in these cases.
If you are doing head to head there is no point in using BayesElo.

But if you test against a pool of foreign opponents, I think you are discarding a lot of information by not using information from old games (which you seem to suggest you are doing).
I'm not playing round robin, so the A vs C results are not redundant. In order to evaluate whether A or B is stronger using method 2, I play each program against program C and then observe which of the two had the better result. That is the original question that generated this discussion, because Larry framed it this way: how many more games are required, 2x or 4x? You say 0.5x, but the right answer is 4x.

If you assume that you have already played the first match, then you only need 40,000 additional games, but you still needed 80,000 games in total. So in practice maybe it's not as bad as 4x. Here is how it works:

Using head to head, we play A vs B and get our answer; let's say we play 20,000 games to measure each potential improvement.

To get the equivalent result with the other system where a third program is always used as the reference to test against, we have to play 40,000 additional games to test each change. So we don't have to retest our current best version each time. But that doesn't change the fact that it required data from 80,000 games to get our answer.

These are idealized conditions; we are ignoring issues such as how you might go wrong with self-testing vs foreign testing - but at least in principle self-testing is twice as efficient, taking into consideration the fact that we are always comparing against a program that we have already done the work to evaluate.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Michel wrote:
Michel wrote:
If you test against a set of engines with known elo you can use the likelihood ratio test (also known as Wald test). This test is easy to use.


I'll look for that - do you have any references?

We don't really test head to head, but each version of Komodo tests against a set of opponents (not Komodo versions.)
The Wald test also works for non head to head tests. The key point is that the elo of the opponents needs to be known (which is the case for a head to head test since you may assume that the elo of the single opponent is zero).

The Wald test is for testing a single parameter (e.g. elo). It is implemented as a random walk where each outcome (win/draw/loss) changes a certain number (the likelihood ratio) by a precomputed amount. Once the likelihood ratio falls outside a precomputed interval you accept either H0 or H1.

In practice you also have a truncation time and a threshold which determines whether you accept H0 or H1 at truncation.

I wrote a python utility that computes the parameters (alas for now only for a head to head test, I could extend it for more opponents).

http://hardy.uhasselt.be/Toga/wald

Code: Select all

Usage: wald [OPTION]...
Generate parameters for a truncated Wald test and optionally
produce an informative graph.

  --alpha       type I error probability when there is no elo difference
  --beta        type II error probability with elo difference equal to epsilon
  --epsilon     see --beta
  --truncation  truncate test after this many observations
  --graph       produce graph
  --draw_ratio  draw ratio when there is no elo difference
  --help        print this message
It will print the parameters of the random walk (i.e. what to add in case of W/D/L) as well as the stopping conditions.

It will also produce a graph which looks as follows

http://hardy.uhasselt.be/Toga/graph3.png
Tell me if I understand this correctly. Given the default setup of wald.py, I should score each win, draw and loss according to the Runtime parameters below, right?

And I STOP the test when my candidate program exceeds the H1 threshold or else it falls below the H0 threshold?

Do I have this right?

And what is the H1 threshold at truncation? Is that a way of saying the result was inconclusive if the score falls below that and the full 27874 games have been played?

I also do not understand type I and type II error probabilities. Are those used to calculate when the test is inconclusive somehow?

Code: Select all

Design parameters
=================
False positives            :   5.0%
False negatives            :  20.0%
Resolution                 :  5.0 Elo
Truncation                 :  27874 games
Draw ratio                 :  27.3%
DrawElo                    :  97.3 Elo

Runtime parameters
==================
Win                        :  +0.0182231820132940
Draw                       :  -0.0001916701998372
Loss                       :  -0.0184148522131312
H1 threshold               :  +3.2793097162108991
H0 threshold               :  -2.0840755492799832
Truncation point           :  27874
H1 threshold at truncation :  +0.9752924937499675

Characteristics
===============
Max expected # of games    :  19653 games
Minimal savings            :  22.4%
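If I read the output above correctly, running the test is then the trivial part Michel mentions: keep a running score and compare it against the thresholds. A minimal sketch, with the constants copied from the wald.py output above and a hypothetical `decide` helper of my own (head to head only):

```python
# Runtime parameters copied from the wald.py output above (head to head).
WIN, DRAW, LOSS = +0.0182231820132940, -0.0001916701998372, -0.0184148522131312
H1, H0 = +3.2793097162108991, -2.0840755492799832
TRUNCATION, H1_AT_TRUNCATION = 27874, +0.9752924937499675

def decide(results):
    """results: iterable of 'w'/'d'/'l' from the candidate's point of view.
    Returns 'H1' (accept the improvement), 'H0' (reject it), or None if the
    games ran out before any stopping condition was reached."""
    score = 0.0
    for n, r in enumerate(results, start=1):
        score += {'w': WIN, 'd': DRAW, 'l': LOSS}[r]
        if score >= H1:
            return 'H1'
        if score <= H0:
            return 'H0'
        if n >= TRUNCATION:          # truncation: use the special threshold
            return 'H1' if score >= H1_AT_TRUNCATION else 'H0'
    return None
```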
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
hgm
Posts: 28387
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: margin of error

Post by hgm »

Note that when you repeatedly test against the same version (e.g. because on average you reject 4 changes before you accept one and promote it to new reference version), you use your games more efficiently when you play some extra games with the reference version. E.g. when A is your reference version, and you want to compare B, D, E and F with it by playing them all against C, and you want to do 5 x 20k = 100k games in total, you could play 20k-N games each of B-C, D-C, E-C and F-C, and 20k+4*N games A-C. The Elo error in A-C is then proportional to 1/sqrt(20k+4*N), and those in the others to 1/sqrt(20k-N), and each of the differences thus has error sqrt(1/(20k-N) + 1/(20k+4*N)).

Now this error is minimal when V = 1/(20k-N) + 1/(20k+4*N) is minimal.

dV/dN = 1/(20k-N)^ 2 - 4/(20k+4*N)^2 = 0
1/(20k-N)^2 = 4/(20k+4*N)^2
1/(20k-N) = 2/(20k+4*N)
20k+4*N = 2*(20k-N)
20k+4*N = 40k-2*N
6*N = 20k
N = 20k/6 ≈ 3,333

So by playing 33k games A-C and 17k games for the others, you get a more accurate value for the difference of B, D, E and F with A. The error would drop from 0.01 to 0.0094868, meaning that you could get the same error with only 90% of the games (30k A-C and 15k for each of the four others).

This goes at the expense of getting a poorer comparison between B, D, E and F. But this is a bit like a beta cutoff: you are not interested in determining which is the poorest of the changes you are going to reject! :lol:
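hgm's optimum is easy to check numerically; a quick sketch (the `err` helper is mine):

```python
import math

# Error of each difference estimate when A-C gets base+4*N games and each
# of B-C, D-C, E-C, F-C gets base-N games (hgm's allocation).
def err(N, base=20_000):
    return math.sqrt(1.0 / (base - N) + 1.0 / (base + 4 * N))

# Brute-force the minimum over integer N; it lands at 3,333, matching
# N = 20k/6, with err(best_N) ~ 0.00949 against err(0) = 0.01.
best_N = min(range(20_000), key=err)
```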
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

Here is an explanation of the terminology; it is standard statistics terminology, and one cannot really discuss tests without it.

H0 is the hypothesis that elodiff <= 0 (the "null hypothesis").
H1 is the hypothesis that elodiff > 0 (i.e. the negation of the null hypothesis).

The outcome of a hypothesis test is either H0 or H1.

A type I error is a false positive (H0 is true but the test yields H1).

A type II error is a false negative (H1 is true but the test yields H0).

To design a test in practice you fix these probabilities for certain values
of elodiff.

alpha is the probability of a type I error if elodiff=0.

beta is the probability of a type II error if elodiff=epsilon

"epsilon" is the resolution of the test (say 5 elo).

This is what the script "wald" does (it designs a truncated sequential Wald test with given alpha, beta, epsilon).

To verify how the test behaves for other values of elodiff, wald can create
the power plot of the test. This is a plot of the probability of accepting H0 versus elodiff.

The graphs produced by wald show that the truncated sequential Wald test is just as powerful as the standard z-test but requires far less testing.


I hope this clarifies things.
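For the untruncated sequential test, alpha and beta translate directly into Wald's classical stopping bounds on the log-likelihood ratio. A sketch with the design values used earlier (truncation changes the design, which is presumably why wald.py prints different thresholds):

```python
import math

# Classical (untruncated) Wald SPRT bounds for given error probabilities.
alpha, beta = 0.05, 0.20               # type I and type II error probabilities

upper = math.log((1 - beta) / alpha)   # accept H1 once the LLR reaches this
lower = math.log(beta / (1 - alpha))   # accept H0 once the LLR drops to this
```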
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

Tell me if I understand this correctly. Given the default setup of wald.py, I should score each win, draw and loss according to the Runtime parameters below, right?

And I STOP the test when my candidate program exceeds the H1 threshold or else it falls below the H0 threshold?

Do I have this right?
But remember that these values are for a head to head test.
And what is the H1 threshold at truncation? Is that a way of saying the result was inconclusive if the score falls below that and the full 27874 games have been played?
If the test is truncated you accept H1 if the score is above the threshold.
Otherwise you accept H0.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Michel wrote:
Tell me if I understand this correctly. Given the default setup of wald.py, I should score each win, draw and loss according to the Runtime parameters below, right?

And I STOP the test when my candidate program exceeds the H1 threshold or else it falls below the H0 threshold?

Do I have this right?
But remember that these values are for a head to head test.
And what is the H1 threshold at truncation? Is that a way of saying the result was inconclusive if the score falls below that and the full 27874 games have been played?
If the test is truncated you accept H1 if the score is above the threshold.
Otherwise you accept H0.
Thanks, I think I understand it now.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Michel wrote:
Tell me if I understand this correctly. Given the default setup of wald.py, I should score each win, draw and loss according to the Runtime parameters below, right?

And I STOP the test when my candidate program exceeds the H1 threshold or else it falls below the H0 threshold?

Do I have this right?
But remember that these values are for a head to head test.
And what is the H1 threshold at truncation? Is that a way of saying the result was inconclusive if the score falls below that and the full 27874 games have been played?
If the test is truncated you accept H1 if the score is above the threshold.
Otherwise you accept H0.
One other question. Am I supposed to set the draw ratio to reflect what I expect to see in the actual games?

When I test Komodo versions head to head on one particular level on one particular machine, I get about 51% draws. Should I set the draw ratio to 0.51?
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
ZirconiumX
Posts: 1359
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: margin of error

Post by ZirconiumX »

Wald.py crashes horribly for me.

Code: Select all

matthew$ ./wald.py
  File "./wald.py", line 238
    return (- 2*math.pi*n*math.exp(gamma*A)*(1 if n%2==0 else -1))/(A**2*gamma**2 + math.pi**2 * n**2)
                                                ^
SyntaxError: invalid syntax
Matthew:out
tu ne cede malis, sed contra audentior ito
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

When I test Komodo versions head to head on one particular level on one particular machine, I get about 51% draws. Should I set the draw ratio to 0.51?
Yes.

Wald uses the BayesElo model for draws. I don't actually know how good that model is.
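For reference, the draw model in question can be sketched as follows (my own rendering of the usual BayesElo formulation, with drawelo d and elo difference x; reassuringly, at x = 0 the drawelo of 97.3 printed by wald.py earlier reproduces its 27.3% draw ratio):

```python
# BayesElo-style outcome probabilities for elo difference x and drawelo d.
def bayeselo_probs(x, d):
    p_win = 1.0 / (1.0 + 10.0 ** ((d - x) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((d + x) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

w, draw, l = bayeselo_probs(0.0, 97.3)   # draw comes out near 0.273
```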