cutechess-cli suggestion (to Ilari)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

you can say with probability at least 1-delta that A is stronger than B (elo >= 0).
You have to be careful with terminology. Either A is stronger than B or it is not. There are no probabilities involved (unless you take a Bayesian viewpoint, but that changes the whole discussion).

The proper way to talk about this is to give the probabilities that a
test gives the wrong conclusion (before you do the test of course, otherwise
the conclusion is either wrong or right).


In our case we are really testing

H0: elo <= 0 against H1: elo > 0 (H0 and H1 should be complementary).

The Wald test with design parameters (alpha,beta,epsilon) gives a test such that

(1) The probability of a Type I error (the worst kind) is <= alpha (unconditionally).
(2) The probability of a Type II error is <= beta, provided elo >= epsilon.

Furthermore:

(3) You can compute the probability of a Type II error for any elo (I think
Wald himself has done this).

(4) You can compute the expected running time for any elo (both are sketched
below for the draw-free case).
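
For the draw-free (Bernoulli) case, (3) and (4) follow from Wald's classical approximations for the operating characteristic and the average sample number. A minimal sketch (the function names are mine, and both formulas degenerate at the win rate where the drift vanishes):

Code: Select all

#include <math.h>
#include <stdio.h>

/* Wald approximations for the SPRT of H0: p = p0 vs H1: p = p1 (p1 > p0),
 * with bounds a = ln(beta/(1-alpha)) and b = ln((1-beta)/alpha).
 * oc(p)  ~ probability of accepting H0 at true win rate p  (point 3)
 * asn(p) ~ expected number of games before stopping        (point 4) */

static double rw, rl;	/* per-game LLR increments for win / loss */

/* f(h) = E_p[exp(h * increment)] - 1: convex, with a trivial root at
 * h = 0 and one nonzero root that drives the operating characteristic */
static double f(double h, double p)
{
	return p * exp(h * rw) + (1 - p) * exp(h * rl) - 1;
}

/* locate the nonzero root of f by bracketing and bisection */
static double solve_h(double p)
{
	double drift = p * rw + (1 - p) * rl;	/* this is f'(0) */
	double s = drift < 0 ? 1 : -1;	/* the nonzero root lies on this side */
	double lo = 1e-9, hi = 1;

	while (f(s * hi, p) < 0)
		hi *= 2;
	for (int i = 0; i < 200; i++) {
		double mid = (lo + hi) / 2;
		if (f(s * mid, p) < 0)
			lo = mid;
		else
			hi = mid;
	}
	return s * (lo + hi) / 2;
}

void wald_oc_asn(double p, double p0, double p1,
                 double alpha, double beta, double *oc, double *asn)
{
	double a = log(beta / (1 - alpha));
	double b = log((1 - beta) / alpha);
	double h;

	rw = log(p1 / p0);
	rl = log((1 - p1) / (1 - p0));
	h = solve_h(p);
	*oc = (exp(h * b) - 1) / (exp(h * b) - exp(h * a));
	/* near zero drift both expressions are 0/0; separate limits apply there */
	*asn = (*oc * a + (1 - *oc) * b) / (p * rw + (1 - p) * rl);
}

int main(void)
{
	double oc, asn;

	wald_oc_asn(0.55, 0.50, 0.55, 0.05, 0.05, &oc, &asn);
	printf("P(accept H0) ~ %.3f, E[games] ~ %.0f\n", oc, asn);
	return 0;
}
At p = p1 the first number reproduces beta; sweeping p traces the Type II error and running time curves mentioned in (3) and (4).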


As to flaws: I see only one. If you set up the test with small epsilon
then the running time will be large _even if the observed win ratio is, say, 100%_.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: cutechess-cli suggestion (to Ilari)

Post by Adam Hair »

lucasart wrote:
Michel wrote:
the log-likelihood ratio is then (since the trinomial coefficients will simplify):
ln(L1/L0)(%w,%d) = (pw1-pw0).ln(%w) + (pd1-pd0).ln(%d) + (pl1-pl0).ln(1-%w-%d)
I have difficulty parsing this formula. I think it should be (this forum should really implement MathML, as all decent forums do these days):

log(pw1/pw0).%w + log(pd1/pd0).%d + log(pl1/pl0).(1-%w-%d).

Here log(pw1/pw0), log(pd1/pd0), log(pl1/pl0) are precomputed numbers.
So you see that you are really doing a one-dimensional random walk. This is the key to analyzing the problem mathematically.
On second thoughts, we should really look at Xi not Xi/n, because the law of Xi is what we want to know, while the law of Xi/n depends on n... So the correct log-likelihood ratio should be

Code: Select all

LLR = ln(pw1/pw0).nb_win + ln(pd1/pd0).nb_draw + ln(pl1/pl0).nb_loss
I implemented the SPRT algorithm and tested it:

=> Code

Code: Select all

#include <math.h>

static double log_likelihood, pw1, pw0, pd1, pd0, pl1, pl0, a, b;

static void proba_elo(double elo, double *pwin, double *pdraw, double *ploss)
/* from BayesElo: computes probabilities of win and draw, given the elo difference. What this does
 * is basically make a realistic assumption on (pwin,pdraw) for a given elo difference, that is for
 * a given theoretical score pwin+pdraw/2 */
{
	static const double draw_elo = 97.3;
	*pwin  = 1 / (1 + pow(10, (-elo + draw_elo) / 400));
	*ploss = 1 / (1 + pow(10, ( elo + draw_elo) / 400));
	*pdraw = 1.0 - *pwin - *ploss;
}

void init_SPRT(double elo0, double elo1, double alpha, double beta)
{
	log_likelihood = 0.0;
	proba_elo(elo0, &pw0, &pd0, &pl0);
	proba_elo(elo1, &pw1, &pd1, &pl1);

	a = log(beta / (1 - alpha));
	b = log((1 - beta) / alpha);
}

int stop_SPRT(int game_result)
{
	/* game_result: 1 = win, 0 = draw, anything else = loss */
	if (1 == game_result)
		log_likelihood += log(pw1/pw0);
	else if (0 == game_result)
		log_likelihood += log(pd1/pd0);
	else
		log_likelihood += log(pl1/pl0);

	if (log_likelihood > b)
		return 1;	// accept H1: elo_diff = elo1
	else if (log_likelihood < a)
		return -1;	// accept H0: elo_diff = elo0
	else
		return 0;	// continue sampling
}
So after a call to init_SPRT(), every time you call stop_SPRT(), it returns:
* 0 if no conclusion can be drawn (continue sampling)
* 1 if H1 is accepted (elo=elo1)
* -1 if H0 is accepted (elo=elo0)

=> Results

In practice, what happens is the following:
1/ when the true elo difference is elo > elo1, H1 is accepted in 100% of cases, which is correct.
2/ when the true elo difference is < elo0, H0 is accepted in 100% of cases, which is correct.
3/ when the true elo difference is between elo0 and elo1, any kind of nonsense can come out. For example when testing elo_diff = 0 vs elo_diff = +10, with a random generator P(win)=0.4 P(draw)=0.2, I get 96% conclusions on H1, i.e. it concludes 96% of the time that B is stronger than A, although they have the exact same elo by construction (a small harness to reproduce this is sketched below).
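
For what it's worth, case 3/ is easy to reproduce with a throwaway harness like this one (it assumes the init_SPRT()/stop_SPRT() functions above; plain rand() is crude but good enough here):

Code: Select all

#include <stdio.h>
#include <stdlib.h>

/* one full SPRT run at true P(win), P(draw); returns the verdict */
static int run_once(double pwin, double pdraw)
{
	init_SPRT(0, 10, 0.05, 0.05);	/* elo0 = 0 vs elo1 = +10 */
	for (;;) {
		double u = (double)rand() / RAND_MAX;
		int result = u < pwin ? 1 : (u < pwin + pdraw ? 0 : -1);
		int verdict = stop_SPRT(result);
		if (verdict)
			return verdict;	/* +1 = accept H1, -1 = accept H0 */
	}
}

int main(void)
{
	const int runs = 10000;
	int h1 = 0;

	for (int i = 0; i < runs; i++)
		if (run_once(0.4, 0.2) == 1)	/* equal strength by construction */
			h1++;
	printf("H1 accepted in %.1f%% of runs\n", 100.0 * h1 / runs);
	return 0;
}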

In terms of speed of convergence:
Cases 1/ and 2/ offer a good speed of convergence. Best performance is achieved by SPRT when close to the bounds elo0 and elo1, while still outside the interval (elo0,elo1). When we are far from the bounds, however, my algorithm is the one that concludes the fastest, and SPRT performs pretty much like the empirical Bernstein algorithm.
Case 3/ SPRT concludes comparatively fast, but it gives any kind of nonsense. EBStop here is the most accurate (no mistake ever, for any elo diff != 0, even a tiny difference), and my hacky algo also makes very few mistakes (typically between 0.0%-0.3% as we get close to elo0) and concludes about twice as fast as EBStop.

Now in real life, I am testing A (my modified program) versus B (my previous version of the program). And I want to know if A is better than B, if it is equal, or if it is worse. With a margin of error delta=5%, let's say. I don't even care by how much elo yet, I just want a definite answer to that simple question, with the least error possible and a fairly small number of games.

If the true difference is between my arbitrary bounds elo0 and elo1, which happens a lot if not most of the time in reality, I'm completely screwed!

The problem is that you have no way of knowing if you are in case 1, 2, or 3 a priori, and even setting a maximum stopping time could certainly lead to a catastrophic number of errors in many cases.

So that's why I insist on saying that discussing things theoretically is not enough. If you don't experiment, you'll never understand if and how it works... (or doesn't work in this case)
The problem that I have with this approach is that it is an image of a 2d walk projected onto a 1d segment. This causes the computed bounds, a and b, to lose connection with α and β, the Type I and Type II errors.

Also, I would not blindly use drawelo = 97.3. Remi Coulom computed that number from WBEC results, which covers a large elo range. It could be much different for testing versions of DoubleCheck (I'm assuming that you do self-testing).
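
As an aside, under the same model as proba_elo() above, drawelo can be fitted to your own balanced self-play results: at elo = 0 we have pdraw = 1 - 2/(1 + 10^(drawelo/400)), which inverts to the helper below (the name is mine):

Code: Select all

#include <math.h>

/* drawelo from the observed draw ratio of (roughly) equal-strength play:
 * drawelo = 400 * log10((1 + pdraw) / (1 - pdraw)).
 * A draw ratio of 27.3% gives back the familiar 97.3. */
double estimate_draw_elo(int wins, int draws, int losses)
{
	double pdraw = (double)draws / (wins + draws + losses);
	return 400 * log10((1 + pdraw) / (1 - pdraw));
}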
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: cutechess-cli suggestion (to Ilari)

Post by Adam Hair »

Michel wrote:As to flaws: I see only one. If you set up the test with small epsilon
then the running time will be large _even if the observed win ratio is, say, 100%_.
That is a flaw that cannot be avoided by any method.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

That is a flaw that cannot be avoided by any method.
On what grounds are you stating this?

To give an extreme example

Suppose we supplement a Wald test with win ratio 0.501 vs win ratio 0.5 (the null hypothesis), alpha=0.05, beta=0.05, with the following rule.

- If A has a score of 100% against B after 50 games then we declare
A stronger than B.

This will not significantly affect the Type I/II error rates (under the null
hypothesis, 50 straight wins have probability 0.5^50, about 1e-15) but it will
reduce the test from 1500 games to 50 games in case A has winrate 100%
against B.

Of course I am not advocating making such random adjustments to the
Wald test (they would be difficult to analyze). It is just an example.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Michel wrote:
you can say with probability at least 1-delta that A is stronger than B (elo >= 0).
You have to be careful with terminology. Either A is stronger than B or it is not. There are no probabilities involved (unless you take a Bayesian viewpoint, but that changes the whole discussion).

The proper way to talk about this is to give the probabilities that a
test gives the wrong conclusion (before you do the test of course, otherwise
the conclusion is either wrong or right).


In our case we are really testing

H0: elo <= 0 against H1: elo > 0 (H0 and H1 should be complementary).

The Wald test with design parameters (alpha,beta,epsilon) gives a test such that

(1) The probability of a Type I error (the worst kind) is <= alpha (unconditionally).
(2) The probability of a Type II error is <= beta, provided elo >= epsilon.

Furthermore:

(3) You can compute the probability of a Type II error for any elo (I think
Wald himself has done this).

(4) You can compute the expected running time for any elo.

As to flaws: I see only one. If you set up the test with small epsilon
then the running time will be large _even if the observed win ratio is, say, 100%_.
Yeah that's a good summary!

Anyway thanks for helping me in my quest for an efficient stopping algorithm. Will be useful to have that in cutechess-cli.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: cutechess-cli suggestion (to Ilari)

Post by Adam Hair »

Michel wrote:
That is a flaw that cannot be avoided by any method.
On what grounds are you stating this?

To give an extreme example

Suppose we supplement a Wald test with win ratio 0.501 vs win ratio 0.5 (the null hypothesis), alpha=0.05, beta=0.05, with the following rule.

- If A has a score of 100% against B after 50 games then we declare
A stronger than B.

This will not significantly affect the Type I/II error rates but it will reduce
the test from 1500 games to 50 games in case A has winrate 100% against
B.

Of course I am not advocating making such random adjustments to the
Wald test (they would be difficult to analyze). It is just an example.
I was being a bit narrow-minded. Any simple hypothesis test with alpha and beta specified will require a minimum number of trials. As the alternative win rate decreases, the minimum number of trials increases. However, if A wins the first 50 games, I would take my chances on assuming A is superior to B. In fact, if A won the first 5 games in a row it would reject the null hypothesis for some alternative win rate (winning the first 4 games would not).

Wald's test could be used for composite hypotheses. In the case of a binomial distribution (no draws), the number of wins necessary to reject the null hypothesis (win rate = 0.5) depends on the alternative win rate and the number of games played (when alpha and beta are fixed). For a fixed number of games played, there exists an alternative win rate such that the necessary number of wins to reject H_0 is a minimum (actually, since the number of wins is an integer, there is an interval of alternative win rates that produce the minimum wins necessary). As the number of games is varied, there is an envelope of necessary numbers of wins that reject H_0. For example, 5-0, 6-0, 7-0, 8-0, 8-1, 9-1, 10-1, 11-1, and 11-2 are all scores that reject H_0. Or, for each number of losses, there is a minimum number of wins that will reject H_0 for some alternative win rate (so 5-0, 8-1, and 11-2 are the most relevant scores). The sketch below reproduces these.
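
The envelope is easy to reproduce: against H0 (win rate 0.5) the log-likelihood ratio is maximized over alternative win rates at the empirical rate w/(w+l), so a score rejects H0 for some alternative exactly when the LLR at that rate crosses Wald's upper bound. A sketch with alpha = beta = 0.05 and no draws (names are mine):

Code: Select all

#include <math.h>
#include <stdio.h>

/* does a w-l score cross b = ln((1-beta)/alpha) for SOME win rate > 0.5?
 * It suffices to evaluate the LLR at the empirical rate w/(w+l). */
static int rejects_h0(int w, int l, double alpha, double beta)
{
	double n = w + l;
	double b = log((1 - beta) / alpha);
	double llr = w * log(2 * w / n) + (l ? l * log(2 * l / n) : 0);

	return llr >= b;
}

int main(void)
{
	/* minimum number of wins per loss count */
	for (int l = 0; l <= 2; l++)
		for (int w = l + 1; w < 20; w++)
			if (rejects_h0(w, l, 0.05, 0.05)) {
				printf("%d-%d rejects H0\n", w, l);	/* 5-0, 8-1, 11-2 */
				break;
			}
	return 0;
}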
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Michel wrote:As to flaws: I see only one. If you set up the test with small epsilon
then the running time will be large _even if the observed win ratio is, say, 100%_.
Here's a dirty way to solve that problem:
* say alpha=beta=delta, and epsilon is rather small.
* every time a game is played, test both SPRT and EBStop; whichever of the two decides to stop first, you stop (a sketch follows below).
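
In code the combination is a one-liner on top of the stop_SPRT() posted earlier; stop_EBStop() is hypothetical here, assumed to use the same -1/0/+1 return convention:

Code: Select all

/* hybrid rule: feed each game result to both tests, stop as soon as
 * either decides; stop_EBStop() is a hypothetical EBStop counterpart */
int stop_hybrid(int game_result)
{
	int s = stop_SPRT(game_result);
	int e = stop_EBStop(game_result);

	return s ? s : e;
}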

EBStop, if it stops, allows you to make a correct conclusion on the sign of mu, with practically zero error (even when I fixed delta = 0.1 I never had any error on 10,000 runs with any kind of P(win) and P(draw) that I threw at it).

SPRT, when it stops, is either correct with error <= delta, or conservative (i.e. it can accept H0 when elo = epsilon/2).

So the guarantee that you get with both algorithms is the lesser of the two, which is the guarantee of the SPRT algorithm.

In terms of stopping time, in most cases SPRT will stop first, especially when elo is close to 0 or epsilon, and even more so when 0 <= elo <= epsilon.
But when elo is rather far from zero, EBStop will often go faster than SPRT, and allow you to draw a conclusion safely, without waiting for SPRT to finish.

How about that ?
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

How about that ?
Well I don't know. I guess it would be difficult to establish rigorous results about the performance of such hybrid tests. But it is probably ok
if you are not overdoing it.

It seems to me that if you have an indication that there is a large elo difference
(a lucky but unfortunately rare case) you may also stop the test and run a new one with a relaxed epsilon.

Now the problem that the expected running time goes up if 0<elo<epsilon
is apparently regarded as much more serious in the literature (in that
case the Wald test is no longer optimal). Indeed you are basically in
a range where you don't care, yet you have to invest a lot of effort.

It seems a general solution to this problem is given by a so-called 2-SPRT
which is supposedly optimal for _all_ values of the parameters, not just
the two selected ones.

The 2-SPRT is introduced in this paper.

http://projecteuclid.org/DPubS?service= ... 1176343407

(scroll down for the open access pdf).

I think I will have to produce some graphs which give the Type II error
ratio and the expected running time versus elo for a set of
specific design parameters alpha,beta,epsilon
(alpha=0.05, beta=0.20, epsilon=5 seem reasonable) in the following cases

(1) A classical fixed length test.
(2) A classical SPRT.
(3) A 2-SPRT.

This will make it clearer what we are talking about.

I am currently busy (deadlines coming up), perhaps in weekend.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

Hmm. The paper about the 2-SPRT appears to be not so easy to read.

Here is the method:

(1) In the classical SPRT we are doing a random walk on an interval and take a decision based on which end point is reached first. The problem is
that for a particular parameter value theta2 between theta1 and
theta0 the "drift" of the random walk (the expected displacement per
time unit) becomes zero. So instead of speeding towards one
of the end points (which happens if theta is different from theta2) the random walk remains undecided and drifts idly around.

(2) In the 2-SPRT we are doing a race between two different random walks and take a decision based on who reaches the finish first (so there is only one boundary). Since one of the drifts is always guaranteed to be non-zero,
we don't have the indecision problem of the SPRT (a rough sketch follows below).
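
To make the structure concrete, here is a rough sketch for the draw-free case. The design point and especially the boundary constants are placeholders (calibrating them is exactly what the paper is about), so treat this as an illustration rather than an implementation:

Code: Select all

#include <math.h>

/* 2-SPRT sketch: two one-sided walks race toward a single boundary each.
 *   s1 = LLR of the design point p* against p0; crossing a1 rejects H0
 *   s0 = LLR of p* against p1;                  crossing a0 accepts H0
 * Whatever the true win rate, at least one walk has positive drift,
 * so the SPRT's idle drifting at the indifference point cannot occur. */

static double s0, s1, a0, a1;
static double inc1_win, inc1_loss, inc0_win, inc0_loss;

void init_2SPRT(double p0, double p1, double alpha, double beta)
{
	double ps = (p0 + p1) / 2;	/* design point between p0 and p1 */

	s0 = s1 = 0;
	inc1_win = log(ps / p0);  inc1_loss = log((1 - ps) / (1 - p0));
	inc0_win = log(ps / p1);  inc0_loss = log((1 - ps) / (1 - p1));
	a1 = log(1 / alpha);	/* placeholder boundaries, not calibrated */
	a0 = log(1 / beta);
}

int stop_2SPRT(int win)
{
	s1 += win ? inc1_win : inc1_loss;
	s0 += win ? inc0_win : inc0_loss;
	if (s1 >= a1)
		return 1;	/* reject H0 */
	if (s0 >= a0)
		return -1;	/* accept H0 */
	return 0;	/* keep playing */
}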

The results in the paper seem to be applicable in the case where alpha=beta and we are estimating the mean of a Gaussian with known variance. The
latter hypothesis does not hold in our case.

Assuming for simplicity there are no draws, then we are estimating
the mean of a Bernoulli random variable. Obviously this is not
a Gaussian distribution. Perhaps this is not a problem as the law of
large numbers will somehow take care of it.

A perhaps more serious problem is that the variance of a Bernoulli random variable depends on its mean (Var = p(1-p), albeit not too strongly for p near 1/2).

So it remains to be seen if the results in the paper can be adapted to our
situation.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: cutechess-cli suggestion (to Ilari)

Post by Ferdy »

lucasart wrote:also another feature, which derives quite simply from the previous one, is the following:

rather than playing 200 games, let's say.

maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelihood of superiority) is above a given threshold, like 95% or something.

the good thing about it is that sometimes when I test, either
* I end up doing too many games, because after only 50 the score is already so much in favor of engine A, that it is already significant to say A beats B
* or, I do 200 games, and even then the result is not significant.

this would be a unique feature, because in terms of automatic engine testing, there's really only one program that is good (cutechess-cli). And programs doing statistics only (bayeselo, elostat, or whatever) don't do engine matches. So combining those in such a way is currently not possible.

Let me know what you think.

Lucas
I made an auto-testing system (composed of batch files and an executable) to run cutechess-cli and bayeselo and, depending on the result of bayeselo, stop the test. Sample bayeselo report.
Rating table

Code: Select all

Rank Name                     Elo    +    - games score oppo. draws 
   1 Crafty_23.4               84   36   35   200   63%    -8   22% 
   2 Deuterium-v11.02.29.82    42   49   49   100   44%    84   24% 
   3 Bison-v9.11               28   49   48   100   63%   -57   29% 
   4 Daydreamer-v1.75         -26   47   47   100   55%   -57   34% 
   5 Deuterium-v11.02.29.87   -57   24   25   400   41%     4   28% 
   6 N2-v0.4                  -70   48   48   100   48%   -57   28% 
LOS is here

Code: Select all

                        Cr De Bi Da De N2
Crafty_23.4                92 90 99 99 99
Deuterium-v11.02.29.82   7    60 90 98 98
Bison-v9.11              9 39    90 99 99
Daydreamer-v1.75         0  9  9    86 86
Deuterium-v11.02.29.87   0  1  0 13    67
N2-v0.4                  0  1  0 13 32   
My base engine is Deuterium-v11.02.29.87 with 400 games (I know this is low, just an experiment), and my test engine is Deuterium-v11.02.29.82. First I played the base engine up to 400 games, then I let the test engine run under the auto-tester with conditions to stop if it is better/worse by the elo criteria* and the number of games (100) against the base engine.

* The elo criteria are calculated as follows:
After running the test engine to 100 games, run bayeselo and get info on the base engine and test engine.
From the rating table,

Code: Select all

te_min_elo = te_base_elo - te_min_margin = 42 - 49 = -7
be_max_elo = be_base_elo + be_max_margin = -57 + 24 =  -33

te = test engine
be = base engine
Since the base engine's max elo (-33) is already below the min elo (-7) of the test engine, testing was stopped. The LOS was showing 98. I would like to ask an expert's opinion: could it be acceptable to stop the games this early? Another stopping condition I have is to stop the test if the max elo of the test engine is below the min elo of the base engine after 100 games.
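
For reference, that stopping criterion amounts to this small predicate (a sketch, names mine):

Code: Select all

/* stop once the bayeselo error intervals separate; with the table above,
 * intervals_separated(42, 49, -57, 24) is true since -7 > -33 */
int intervals_separated(double te_elo, double te_minus,
                        double be_elo, double be_plus)
{
	return te_elo - te_minus > be_elo + be_plus;
}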

I am a little bit unhappy with the 100 games; maybe at least 50% of the base engine's games would be better.

Is there another method to stop testing based on the generated report of bayeselo?
Thanks.