cutechess-cli suggestion (to Ilari)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

you can say with probability at least 1-delta that A is stronger than B (elo >= 0).
You have to be careful with terminology. Either A is stronger than B or it is not. There are no probabilities involved (unless you take a Bayesian viewpoint, but that changes the whole discussion).

The proper way to talk about this is to give the probabilities that a
test gives the wrong conclusion (before you do the test of course, otherwise
the conclusion is either wrong or right).


In our case we are really testing

H0: elo <= 0 against H1: elo > 0 (H0 and H1 should be complementary).

The Wald test with design parameters (alpha,beta,epsilon) gives a test such that

(1) The probability of a Type I error (the worst kind) is <= alpha (unconditionally).
(2) The probability of a Type II error is <= beta, provided elo >= epsilon.

Furthermore:

(3) You can compute the probability of a Type II error for any elo (I think
Wald himself has done this).

(4) You can compute the expected running time for any elo (both are sketched
below for the draw-free case).
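
For the draw-free (Bernoulli) case, (3) and (4) follow from Wald's classical approximations for the operating characteristic and the average sample number. A minimal sketch (the function names are mine, and both formulas degenerate at the win rate where the drift vanishes):

Code: Select all

#include <math.h>
#include <stdio.h>

/* Wald approximations for the SPRT of H0: p = p0 vs H1: p = p1 (p1 > p0),
 * with bounds a = ln(beta/(1-alpha)) and b = ln((1-beta)/alpha).
 * oc(p)  ~ probability of accepting H0 at true win rate p  (point 3)
 * asn(p) ~ expected number of games before stopping        (point 4) */

static double rw, rl;	/* per-game LLR increments for win / loss */

/* f(h) = E_p[exp(h * increment)] - 1: convex, with a trivial root at
 * h = 0 and one nonzero root that drives the operating characteristic */
static double f(double h, double p)
{
	return p * exp(h * rw) + (1 - p) * exp(h * rl) - 1;
}

/* locate the nonzero root of f by bracketing and bisection */
static double solve_h(double p)
{
	double drift = p * rw + (1 - p) * rl;	/* this is f'(0) */
	double s = drift < 0 ? 1 : -1;	/* the nonzero root lies on this side */
	double lo = 1e-9, hi = 1;

	while (f(s * hi, p) < 0)
		hi *= 2;
	for (int i = 0; i < 200; i++) {
		double mid = (lo + hi) / 2;
		if (f(s * mid, p) < 0)
			lo = mid;
		else
			hi = mid;
	}
	return s * (lo + hi) / 2;
}

void wald_oc_asn(double p, double p0, double p1,
                 double alpha, double beta, double *oc, double *asn)
{
	double a = log(beta / (1 - alpha));
	double b = log((1 - beta) / alpha);
	double h;

	rw = log(p1 / p0);
	rl = log((1 - p1) / (1 - p0));
	h = solve_h(p);
	*oc = (exp(h * b) - 1) / (exp(h * b) - exp(h * a));
	/* near zero drift both expressions are 0/0; separate limits apply there */
	*asn = (*oc * a + (1 - *oc) * b) / (p * rw + (1 - p) * rl);
}

int main(void)
{
	double oc, asn;

	wald_oc_asn(0.55, 0.50, 0.55, 0.05, 0.05, &oc, &asn);
	printf("P(accept H0) ~ %.3f, E[games] ~ %.0f\n", oc, asn);
	return 0;
}
At p = p1 the first number reproduces beta; sweeping p traces the Type II error and running time curves mentioned in (3) and (4).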


As to flaws: I see only one. If you set up the test with small epsilon
then the running time will be large _even if the observed win ratio is, say, 100%_.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: cutechess-cli suggestion (to Ilari)

Post by Adam Hair »

lucasart wrote:
Michel wrote:
the log-likelihood ratio is then (since the trinomial coefficients will simplify):
ln(L1/L0)(%w,%d) = (pw1-pw0).ln(%w) + (pd1-pd0).ln(%d) + (pl1-pl0).ln(1-%w-%d)
I have difficulty parsing this formula. I think it should be (this forum should really implement MathML, as all decent forums do these days):

log(pw1/pw0).%w + log(pd1/pd0).%d + log(pl1/pl0).(1-%w-%d).

Here log(pw1/pw0), log(pd1/pd0), log(pl1/pl0) are precomputed numbers.
So you see that you are really doing a one-dimensional random walk. This is the key to analyzing the problem mathematically.
On second thoughts, we should really look at Xi not Xi/n, because the law of Xi is what we want to know, while the law of Xi/n depends on n... So the correct log-likelihood ratio should be

Code: Select all

LLR = ln(pw1/pw0).nb_win + ln(pd1/pd0).nb_draw + ln(pl1/pl0).nb_loss
I implemented the SPRT algorithm and tested it:

=> Code

Code: Select all

#include <math.h>

static double log_likelihood, pw1, pw0, pd1, pd0, pl1, pl0, a, b;

static void proba_elo(double elo, double *pwin, double *pdraw, double *ploss)
/* from BayesElo: computes probabilities of win and draw, given the elo difference. What this does
 * is basically make a realistic assumption on (pwin,pdraw) for a given elo difference, that is for
 * a given theoretical score pwin+pdraw/2 */
{
	static const double draw_elo = 97.3;
	*pwin  = 1 / (1 + pow(10, (-elo + draw_elo) / 400));
	*ploss = 1 / (1 + pow(10, ( elo + draw_elo) / 400));
	*pdraw = 1.0 - *pwin - *ploss;
}

void init_SPRT(double elo0, double elo1, double alpha, double beta)
{
	log_likelihood = 0.0;
	proba_elo(elo0, &pw0, &pd0, &pl0);
	proba_elo(elo1, &pw1, &pd1, &pl1);

	a = log(beta / (1 - alpha));
	b = log((1 - beta) / alpha);
}

int stop_SPRT(int game_result)
{
	/* game_result: 1 = win, 0 = draw, anything else = loss */
	if (1 == game_result)
		log_likelihood += log(pw1/pw0);
	else if (0 == game_result)
		log_likelihood += log(pd1/pd0);
	else
		log_likelihood += log(pl1/pl0);

	if (log_likelihood > b)
		return 1;	// accept H1: elo_diff = elo1
	else if (log_likelihood < a)
		return -1;	// accept H0: elo_diff = elo0
	else
		return 0;	// continue sampling
}
So after a call to init_SPRT(), every time you call stop_SPRT(), it returns:
* 0 if no conclusion can be drawn (continue sampling)
* 1 if H1 is accepted (elo=elo1)
* -1 if H0 is accepted (elo=elo0)

=> Results

In practice, what happens is the following:
1/ when the true elo difference is elo > elo1, H1 is accepted in 100% of cases, which is correct.
2/ when the true elo difference is < elo0, H0 is accepted in 100% of cases, which is correct.
3/ when the true elo difference is between elo0 and elo1, any kind of nonsense can come out. For example when testing elo_diff = 0 vs elo_diff = +10, with a random generator P(win)=0.4 P(draw)=0.2, I get 96% conclusions on H1, i.e. it concludes 96% of the time that B is stronger than A, although they have the exact same elo by construction (a small harness to reproduce this is sketched below).
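
For what it's worth, case 3/ is easy to reproduce with a throwaway harness like this one (it assumes the init_SPRT()/stop_SPRT() functions above; plain rand() is crude but good enough here):

Code: Select all

#include <stdio.h>
#include <stdlib.h>

/* one full SPRT run at true P(win), P(draw); returns the verdict */
static int run_once(double pwin, double pdraw)
{
	init_SPRT(0, 10, 0.05, 0.05);	/* elo0 = 0 vs elo1 = +10 */
	for (;;) {
		double u = (double)rand() / RAND_MAX;
		int result = u < pwin ? 1 : (u < pwin + pdraw ? 0 : -1);
		int verdict = stop_SPRT(result);
		if (verdict)
			return verdict;	/* +1 = accept H1, -1 = accept H0 */
	}
}

int main(void)
{
	const int runs = 10000;
	int h1 = 0;

	for (int i = 0; i < runs; i++)
		if (run_once(0.4, 0.2) == 1)	/* equal strength by construction */
			h1++;
	printf("H1 accepted in %.1f%% of runs\n", 100.0 * h1 / runs);
	return 0;
}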

In terms of speed of convergence:
Cases 1/ and 2/ offer a good speed of convergence. Best performance is achieved by SPRT when close to the bounds elo0 and elo1, while still outside the interval (elo0,elo1). When we are far from the bounds, however, my algorithm is the one that concludes the fastest, and SPRT performs pretty much like the empirical Bernstein algorithm.
Case 3/ SPRT concludes comparatively fast, but it gives any kind of nonsense. EBStop here is the most accurate (no mistake ever, for any elo diff != 0, even a tiny difference), and my hacky algo also makes very few mistakes (typically between 0.0%-0.3% as we get close to elo0) and concludes about twice as fast as EBStop.

Now in real life, I am testing A (my modified program) versus B (my previous version of the program). And I want to know if A is better than B, if it is equal, or if it is worse. With a margin of error delta=5%, let's say. I don't even care by how much elo yet, I just want a definite answer to that simple question, with the least error possible and a fairly small number of games.

If the true difference is between my arbitrary bounds elo0 and elo1, which happens a lot if not most of the time in reality, I'm completely screwed!

The problem is that you have no way of knowing if you are in case 1, 2, or 3 a priori, and even setting a maximum stopping time could certainly lead to a catastrophic number of errors in many cases.

So that's why I insist on saying that discussing things theoretically is not enough. If you don't experiment, you'll never understand if and how it works... (or doesn't work in this case)
The problem that I have with this approach is that it is an image of a 2d walk projected onto a 1d segment. This causes the computed bounds, a and b, to lose connection with α and β, the Type I and Type II errors.

Also, I would not blindly use drawelo = 97.3. Remi Coulom computed that number from WBEC results, which covers a large elo range. It could be much different for testing versions of DoubleCheck (I'm assuming that you do self-testing).
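
As an aside, under the same model as proba_elo() above, drawelo can be fitted to your own balanced self-play results: at elo = 0 we have pdraw = 1 - 2/(1 + 10^(drawelo/400)), which inverts to the helper below (the name is mine):

Code: Select all

#include <math.h>

/* drawelo from the observed draw ratio of (roughly) equal-strength play:
 * drawelo = 400 * log10((1 + pdraw) / (1 - pdraw)).
 * A draw ratio of 27.3% gives back the familiar 97.3. */
double estimate_draw_elo(int wins, int draws, int losses)
{
	double pdraw = (double)draws / (wins + draws + losses);
	return 400 * log10((1 + pdraw) / (1 - pdraw));
}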
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: cutechess-cli suggestion (to Ilari)

Post by Adam Hair »

Michel wrote:As to flaws: I see only one. If you set up the test with small epsilon
then the running time will be large _even if the observed win ratio is, say, 100%_.
That is a flaw that cannot be avoided by any method.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

That is a flaw that cannot be avoided by any method.
On what grounds are you stating this?

To give an extreme example

Suppose we supplement a Wald test with win ratio 0.501 vs win ratio 0.5 (the null hypothesis), alpha=0.05, beta=0.05, with the following rule.

- If A has a score of 100% against B after 50 games then we declare
A stronger than B.

This will not significantly affect the Type I/II error rates (under the null
hypothesis, 50 straight wins have probability 0.5^50, about 1e-15) but it will
reduce the test from 1500 games to 50 games in case A has winrate 100%
against B.

Of course I am not advocating making such random adjustments to the
Wald test (they would be difficult to analyze). It is just an example.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Michel wrote:
you can say with probability at least 1-delta that A is stronger than B (elo >= 0).
You have to be careful with terminology. Either A is stronger than B or it is not. There are no probabilities involved (unless you take a Bayesian viewpoint, but that changes the whole discussion).

The proper way to talk about this is to give the probabilities that a
test gives the wrong conclusion (before you do the test of course, otherwise
the conclusion is either wrong or right).


In our case we are really testing

H0: elo <= 0 against H1: elo > 0 (H0 and H1 should be complementary).

The Wald test with design parameters (alpha,beta,epsilon) gives a test such that

(1) The probability of a Type I error (the worst kind) is <= alpha (unconditionally).
(2) The probability of a Type II error is <= beta, provided elo >= epsilon.

Furthermore:

(3) You can compute the probability of a Type II error for any elo (I think
Wald himself has done this).

(4) You can compute the expected running time for any elo.

As to flaws: I see only one. If you set up the test with small epsilon
then the running time will be large _even if the observed win ratio is, say, 100%_.
Yeah that's a good summary!

Anyway thanks for helping me in my quest for an efficient stopping algorithm. Will be useful to have that in cutechess-cli.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: cutechess-cli suggestion (to Ilari)

Post by Adam Hair »

Michel wrote:
That is a flaw that cannot be avoided by any method.
On what grounds are you stating this?

To give an extreme example

Suppose we supplement a Wald test with win ratio 0.501 vs win ratio 0.5 (the null hypothesis), alpha=0.05, beta=0.05, with the following rule.

- If A has a score of 100% against B after 50 games then we declare
A stronger than B.

This will not significantly affect the Type I/II error rates but it will reduce
the test from 1500 games to 50 games in case A has winrate 100% against
B.

Of course I am not advocating making such random adjustments to the
Wald test (they would be difficult to analyze). It is just an example.
I was being a bit narrow-minded. Any simple hypothesis test with alpha and beta specified will require a minimum number of trials. As the alternative win rate decreases, the minimum number of trials increases. However, if A wins the first 50 games, I would take my chances on assuming A is superior to B. In fact, if A won the first 5 games in a row it would reject the null hypothesis for some alternative win rate (winning the first 4 games would not).

Wald's test could be used for composite hypotheses. In the case of a binomial distribution (no draws), the number of wins necessary to reject the null hypothesis (win rate = 0.5) depends on the alternative win rate and the number of games played (when alpha and beta are fixed). For a fixed number of games played, there exists an alternative win rate such that the necessary number of wins to reject H_0 is a minimum (actually, since the number of wins is an integer, there is an interval of alternative win rates that produce the minimum wins necessary). As the number of games is varied, there is an envelope of necessary numbers of wins that reject H_0. For example, 5-0, 6-0, 7-0, 8-0, 8-1, 9-1, 10-1, 11-1, and 11-2 are all scores that reject H_0. Or, for each number of losses, there is a minimum number of wins that will reject H_0 for some alternative win rate (so 5-0, 8-1, and 11-2 are the most relevant scores). The sketch below reproduces these.
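
The envelope is easy to reproduce: against H0 (win rate 0.5) the log-likelihood ratio is maximized over alternative win rates at the empirical rate w/(w+l), so a score rejects H0 for some alternative exactly when the LLR at that rate crosses Wald's upper bound. A sketch with alpha = beta = 0.05 and no draws (names are mine):

Code: Select all

#include <math.h>
#include <stdio.h>

/* does a w-l score cross b = ln((1-beta)/alpha) for SOME win rate > 0.5?
 * It suffices to evaluate the LLR at the empirical rate w/(w+l). */
static int rejects_h0(int w, int l, double alpha, double beta)
{
	double n = w + l;
	double b = log((1 - beta) / alpha);
	double llr = w * log(2 * w / n) + (l ? l * log(2 * l / n) : 0);

	return llr >= b;
}

int main(void)
{
	/* minimum number of wins per loss count */
	for (int l = 0; l <= 2; l++)
		for (int w = l + 1; w < 20; w++)
			if (rejects_h0(w, l, 0.05, 0.05)) {
				printf("%d-%d rejects H0\n", w, l);	/* 5-0, 8-1, 11-2 */
				break;
			}
	return 0;
}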
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Michel wrote:As to flaws: I see only one. If you set up the test with small epsilon
then the running time will be large _even if the observed win ratio is, say, 100%_.
Here's a dirty way to solve that problem:
* say alpha=beta=delta, and epsilon is rather small.
* every time a game is played, test both SPRT and EBStop; whichever of the two decides to stop first, you stop (a sketch follows below).
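
In code the combination is a one-liner on top of the stop_SPRT() posted earlier; stop_EBStop() is hypothetical here, assumed to use the same -1/0/+1 return convention:

Code: Select all

/* hybrid rule: feed each game result to both tests, stop as soon as
 * either decides; stop_EBStop() is a hypothetical EBStop counterpart */
int stop_hybrid(int game_result)
{
	int s = stop_SPRT(game_result);
	int e = stop_EBStop(game_result);

	return s ? s : e;
}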

EBStop, if it stops, allows you to make a correct conclusion on the sign of mu, with practically zero error (even when I fixed delta = 0.1 I never had any error on 10,000 runs with any kind of P(win) and P(draw) that I threw at it).

SPRT, when it stops, is either correct with error <= delta, or conservative (i.e. it can accept H0 when elo = epsilon/2).

So the guarantee that you get with both algorithms is the lesser of the two, which is the guarantee of the SPRT algorithm.

In terms of stopping time, in most cases SPRT will stop first, especially when elo is close to 0 or epsilon, and even more so when 0 <= elo <= epsilon.
But when elo is rather far from zero, EBStop will often go faster than SPRT, and allow you to draw a conclusion safely, without waiting for SPRT to finish.

How about that ?
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

How about that ?
Well I don't know. I guess it would be difficult to establish rigorous results about the performance of such hybrid tests. But it is probably ok
if you are not overdoing it.

It seems to me that if you have an indication that there is a large elo difference
(a lucky but unfortunately rare case) you may also stop the test and run a new one with a relaxed epsilon.

Now the problem that the expected running time goes up if 0<elo<epsilon
is apparently regarded as much more serious in the literature (in that
case the Wald test is no longer optimal). Indeed you are basically in
a range where you don't care, yet you have to invest a lot of effort.

It seems a general solution to this problem is given by a so-called 2-SPRT
which is supposedly optimal for _all_ values of the parameters, not just
the two selected ones.

The 2-SPRT is introduced in this paper.

http://projecteuclid.org/DPubS?service= ... 1176343407

(scroll down for the open access pdf).

I think I will have to produce some graphs which give the Type II error
ratio and the expected running time versus elo for a set of
specific design parameters alpha,beta,epsilon
(alpha=0.05, beta=0.20, epsilon=5 seem reasonable) in the following cases

(1) A classical fixed length test.
(2) A classical SPRT.
(3) A 2-SPRT.

This will make it clearer what we are talking about.

I am currently busy (deadlines coming up), perhaps in weekend.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

Hmm. The paper about the 2-SPRT appears to be not so easy to read.

Here is the method:

(1) In the classical SPRT we are doing a random walk on an interval and take a decision based on which end point is reached first. The problem is
that for a particular parameter value theta2 between theta1 and
theta0 the "drift" of the random walk (the expected displacement per
time unit) becomes zero. So instead of speeding towards one
of the end points (which happens if theta is different from theta2) the random walk remains undecided and drifts idly around.

(2) In the 2-SPRT we are doing a race between two different random walks and take a decision based on who reaches the finish first (so there is only one boundary). Since one of the drifts is always guaranteed to be non-zero,
we don't have the indecision problem of the SPRT (a rough sketch follows below).
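
To make the structure concrete, here is a rough sketch for the draw-free case. The design point and especially the boundary constants are placeholders (calibrating them is exactly what the paper is about), so treat this as an illustration rather than an implementation:

Code: Select all

#include <math.h>

/* 2-SPRT sketch: two one-sided walks race toward a single boundary each.
 *   s1 = LLR of the design point p* against p0; crossing a1 rejects H0
 *   s0 = LLR of p* against p1;                  crossing a0 accepts H0
 * Whatever the true win rate, at least one walk has positive drift,
 * so the SPRT's idle drifting at the indifference point cannot occur. */

static double s0, s1, a0, a1;
static double inc1_win, inc1_loss, inc0_win, inc0_loss;

void init_2SPRT(double p0, double p1, double alpha, double beta)
{
	double ps = (p0 + p1) / 2;	/* design point between p0 and p1 */

	s0 = s1 = 0;
	inc1_win = log(ps / p0);  inc1_loss = log((1 - ps) / (1 - p0));
	inc0_win = log(ps / p1);  inc0_loss = log((1 - ps) / (1 - p1));
	a1 = log(1 / alpha);	/* placeholder boundaries, not calibrated */
	a0 = log(1 / beta);
}

int stop_2SPRT(int win)
{
	s1 += win ? inc1_win : inc1_loss;
	s0 += win ? inc0_win : inc0_loss;
	if (s1 >= a1)
		return 1;	/* reject H0 */
	if (s0 >= a0)
		return -1;	/* accept H0 */
	return 0;	/* keep playing */
}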

The results in the paper seem to be applicable in the case where alpha=beta and we are estimating the mean of a Gaussian with known variance. The
latter hypothesis does not hold in our case.

Assuming for simplicity there are no draws, then we are estimating
the mean of a Bernoulli random variable. Obviously this is not
a Gaussian distribution. Perhaps this is not a problem as the law of
large numbers will somehow take care of it.

A perhaps more serious problem is that the variance of a Bernoulli random variable depends on its mean (Var = p(1-p), albeit not too strongly for p near 1/2).

So it remains to be seen if the results in the paper can be adapted to our
situation.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: cutechess-cli suggestion (to Ilari)

Post by Ferdy »

lucasart wrote:also another feature, which derives quite simply from the previous one, is the following:

rather than playing 200 games, let's say.

maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelihood of superiority) is above a given threshold, like 95% or something.

the good thing about it is that sometimes when I test, either
* I end up doing too many games, because after only 50 the score is already so much in favor of engine A, that it is already significant to say A beats B
* or, I do 200 games, and even then the result is not significant.

this would be a unique feature, because in terms of automatic engine testing, there's really only one program that is good (cutechess-cli). And programs doing statistics only (bayeselo, elostat, or whatever) don't do engine matches. So combining those in such a way is currently not possible.

Let me know what you think.

Lucas
I made an auto-testing system (composed of batch files and an executable) to run cutechess-cli and bayeselo and, depending on the result of bayeselo, stop the test. Sample bayeselo report.
Rating table

Code: Select all

Rank Name                     Elo    +    - games score oppo. draws 
   1 Crafty_23.4               84   36   35   200   63%    -8   22% 
   2 Deuterium-v11.02.29.82    42   49   49   100   44%    84   24% 
   3 Bison-v9.11               28   49   48   100   63%   -57   29% 
   4 Daydreamer-v1.75         -26   47   47   100   55%   -57   34% 
   5 Deuterium-v11.02.29.87   -57   24   25   400   41%     4   28% 
   6 N2-v0.4                  -70   48   48   100   48%   -57   28% 
LOS is here

Code: Select all

                        Cr De Bi Da De N2
Crafty_23.4                92 90 99 99 99
Deuterium-v11.02.29.82   7    60 90 98 98
Bison-v9.11              9 39    90 99 99
Daydreamer-v1.75         0  9  9    86 86
Deuterium-v11.02.29.87   0  1  0 13    67
N2-v0.4                  0  1  0 13 32   
My base engine is Deuterium-v11.02.29.87 with 400 games (I know this is low, just an experiment), and my test engine is Deuterium-v11.02.29.82. First I played the base engine up to 400 games, then I let the test engine run under the auto-tester with conditions to stop if it is better/worse by the elo criteria* and the number of games (100) against the base engine.

* The elo criteria are calculated as follows:
After running the test engine to 100 games, run bayeselo and get info on the base engine and test engine.
From the rating table,

Code: Select all

te_min_elo = te_base_elo - te_min_margin = 42 - 49 = -7
be_max_elo = be_base_elo + be_max_margin = -57 + 24 =  -33

te = test engine
be = base engine
Since the base engine's max elo (-33) is already below the min elo (-7) of the test engine, testing was stopped. The LOS was showing 98. I would like to ask an expert's opinion: could it be acceptable to stop the games this early? Another stopping condition I have is to stop the test if the max elo of the test engine is below the min elo of the base engine after 100 games.
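
For reference, that stopping criterion amounts to this small predicate (a sketch, names mine):

Code: Select all

/* stop once the bayeselo error intervals separate; with the table above,
 * intervals_separated(42, 49, -57, 24) is true since -7 > -33 */
int intervals_separated(double te_elo, double te_minus,
                        double be_elo, double be_plus)
{
	return te_elo - te_minus > be_elo + be_plus;
}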

I am a little bit unhappy with the 100 games; maybe at least 50% of the base engine's games would be better.

Is there another method to stop testing based on the generated report of bayeselo?
Thanks.