cutechess-cli suggestion (to Ilari)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Ferdy wrote:
lucasart wrote:also another feature, which derives quite simply from the previous one, is the following:

rather than playing a fixed number of games (200, say), maybe I would like to play an unlimited number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* the LOS (likelihood of superiority) is above a given threshold, like 95% or so.

the good thing about it is that, when I test now, sometimes
* I end up playing too many games, because after only 50 the score is already so much in favor of engine A that it is significant to say A beats B;
* or I play 200 games, and even then the result is not significant.

this would be a unique feature, because in terms of automatic engine testing there's really only one program that is good (cutechess-cli), and programs that only do statistics (bayeselo, elostat, and so on) don't run engine matches. So combining the two in this way is currently not possible.

Let me know what you think.

Lucas
I made an auto-testing system (composed of batch files and an executable) that runs cutechess-cli and bayeselo and, depending on the result of bayeselo, stops the test. Sample bayeselo report:
Rating table

Code: Select all

Rank Name                     Elo    +    - games score oppo. draws 
   1 Crafty_23.4               84   36   35   200   63%    -8   22% 
   2 Deuterium-v11.02.29.82    42   49   49   100   44%    84   24% 
   3 Bison-v9.11               28   49   48   100   63%   -57   29% 
   4 Daydreamer-v1.75         -26   47   47   100   55%   -57   34% 
   5 Deuterium-v11.02.29.87   -57   24   25   400   41%     4   28% 
   6 N2-v0.4                  -70   48   48   100   48%   -57   28% 
LOS is here

Code: Select all

                        Cr De Bi Da De N2
Crafty_23.4                92 90 99 99 99
Deuterium-v11.02.29.82   7    60 90 98 98
Bison-v9.11              9 39    90 99 99
Daydreamer-v1.75         0  9  9    86 86
Deuterium-v11.02.29.87   0  1  0 13    67
N2-v0.4                  0  1  0 13 32   
My base engine is Deuterium-v11.02.29.87 with 400 games (I know this is low, just an experiment), and my test engine is Deuterium-v11.02.29.82. First I played the base engine up to 400 games, then I let the test engine run under the auto-tester, with conditions to stop based on the elo criterion* and on the number of games (100) against the base engine.

* The elo criterion is calculated as follows.
After running the test engine to 100 games, run bayeselo and get the info on the base engine and the test engine.
From the rating table,

Code: Select all

te_min_elo = te_base_elo - te_min_margin = 42 - 49 = -7
be_max_elo = be_base_elo + be_max_margin = -57 + 24 =  -33

te = test engine
be = base engine
Since the base engine's max elo (-33) is already below the test engine's min elo (-7), testing was stopped. The LOS was showing 98. I would like to ask an expert's opinion: is it acceptable to stop the games this early? Another stopping condition I have is to stop the test if the test engine's max elo is below the base engine's min elo after 100 games.
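Ferdy's criterion can be written as a tiny check (a sketch; `stop_on_interval_gap` is my own name for it, not part of his batch files):

```python
def stop_on_interval_gap(te_elo, te_minus, be_elo, be_plus):
    """Stop the match once the test engine's lower error bar clears
    the base engine's upper error bar (the rule described above)."""
    te_min_elo = te_elo - te_minus   # e.g. 42 - 49 = -7
    be_max_elo = be_elo + be_plus    # e.g. -57 + 24 = -33
    return be_max_elo < te_min_elo

# With the numbers from the rating table above:
print(stop_on_interval_gap(42, 49, -57, 24))  # prints True: -33 < -7
```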

I am a little bit unhappy with the 100 games; maybe it should be at least 50% of the base engine's games.

Is there another method to stop testing based on the generated report of bayeselo?
Thanks.
I don't know exactly how BayesElo calculates the LOS, so I can't say for sure. But I get the impression that stopping on the LOS is unsound: your algorithm would have a margin of error <= 5% at a fixed time t, that is, when you decide t in advance. But if you are allowed to stop at any t with error <= 5% at that t, you cannot conclude that the overall error is <= 5%.

On the other hand, if d(t) = 5% / (t*(t+1)), so that sum(t=1,infinity) d(t) = 5%, then stopping at time t if LOS >= 1-d(t) or LOS <= d(t) should be theoretically sound.
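To make the error-spending idea concrete, here is a minimal Python sketch (my own illustration, not BayesElo code) of that d(t) schedule; the sum over t telescopes, so the total error spent over all stopping opportunities never exceeds 5%:

```python
# Error-spending schedule: d(t) = 0.05 / (t * (t + 1)).
# Since sum_{t>=1} 1/(t*(t+1)) telescopes to 1, the total error
# budget available across all looks at the data is exactly 0.05.

def d(t):
    """Error allowance for the t-th look at the data (t >= 1)."""
    return 0.05 / (t * (t + 1))

def spent(T):
    """Total error budget spent after T looks: 0.05 * (1 - 1/(T+1)),
    which approaches 0.05 but never exceeds it."""
    return sum(d(t) for t in range(1, T + 1))

if __name__ == "__main__":
    for T in (1, 10, 100, 10000):
        print(T, spent(T))
```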

But again, maybe the LOS calculated by BayesElo isn't what I think it is, so I don't know.

The only way to see if the algo works in practice is to feed random numbers with given probabilities P(win) and P(draw) into BayesElo (which is *much* faster than playing games), and run the stopping algo many times to see what happens, and whether your risk of error is less than what it should be or not. Especially when P(win)+P(draw)/2 is close to 50%...

For that you'd have to take the bit of code in BayesElo that calculates the LOS, and feed in random numbers into it.
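Lacking BayesElo's exact LOS code, here is a rough Python sketch of that experiment, using the normal-approximation LOS = Phi((score - 0.5)/stderr) as a stand-in (that substitution, and the function name `naive_sequential_fpr`, are my assumptions, not BayesElo's method). It simulates two equal engines and counts how often "stop as soon as LOS >= 95% (or <= 5%)" declares a winner, i.e. the false-positive rate of the naive rule:

```python
import math
import random

def naive_sequential_fpr(trials=1000, max_games=400,
                         p_win=0.3636, p_draw=0.2728, seed=42):
    """Fraction of equal-strength matches (mean score exactly 0.5 under
    the draw_elo = 97.3 model) in which stopping at the first game where
    LOS >= 95% or <= 5% declares a winner. A sound 5% rule would keep
    this below 0.05; checking after every game does not."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        total, total_sq, n = 0.0, 0.0, 0
        for _ in range(max_games):
            u = rng.random()
            s = 1.0 if u < p_win else (0.5 if u < p_win + p_draw else 0.0)
            total += s
            total_sq += s * s
            n += 1
            if n < 20:          # same warm-up of 20 games Lucas suggested
                continue
            mean = total / n
            var = max(total_sq / n - mean * mean, 1e-9)
            stderr = math.sqrt(var / n)
            # normal-approximation LOS, a stand-in for BayesElo's value
            los = 0.5 * (1 + math.erf((mean - 0.5) / (stderr * math.sqrt(2))))
            if los >= 0.95 or los <= 0.05:
                false_positives += 1
                break
    return false_positives / trials

if __name__ == "__main__":
    print(naive_sequential_fpr())
```

Running this shows the error rate well above the nominal 5%, which is the repeated-looks problem described above.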

Perhaps Remi Coulom is best placed to give some insights on this.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

Ok here is a quick hack that compares the performance of a Wald sequential test with a traditional test of fixed length.

Image

This is for a test that detects a 5 elo difference with 80% probability with
a less than 5% probability of a false positive.

I must say I am a bit shocked by the number of samples that are necessary
to detect a 5 elo difference, so it would be nice if someone could confirm this.

The code to generate the graph is here

http://alpha.uhasselt.be/Research/Algebra/Toga/test.py
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Michel wrote:Ok here is a quick hack that compares the performance of a Wald sequential test with a traditional test of fixed length.

Image

This is for a test that detects a 5 elo difference with 80% probability with
a less than 5% probability of a false positive.

I must say I am a bit shocked by the number of samples that are necessary
to detect a 5 elo difference, so it would be nice if someone could confirm this.

The code to generate the graph is here

http://alpha.uhasselt.be/Research/Algebra/Toga/test.py
I agree with all your results.
Here's my code (the relevant parts at least):

Code: Select all

static double log_likelihood, pw1, pw0, pd1, pd0, pl1, pl0, a, b;

void proba_elo(double elo, double *pwin, double *pdraw, double *ploss)
/* from BayesElo: computes probabilities of win and draw, given the elo difference. What this does
 * is basically make a realistic assumption on (pwin,pdraw) for a given elo difference, that is for
 * a given theoretical score pwin+pdraw/2 */
{
	static const double draw_elo = 97.3;
	*pwin  = 1 / (1 + pow(10, (-elo + draw_elo) / 400));
	*ploss = 1 / (1 + pow(10, ( elo + draw_elo) / 400));
	*pdraw = 1.0 - *pwin - *ploss;
}

void init_SPRT(double elo0, double elo1, double alpha, double beta)
{
	log_likelihood = 0.0;
	proba_elo(elo0, &pw0, &pd0, &pl0);
	proba_elo(elo1, &pw1, &pd1, &pl1);
	
	a = log(beta / (1 - alpha));
	b = log((1 - beta) / alpha);
}

int stop_SPRT(int game_result)
{
	if (1 == game_result)
		log_likelihood += log(pw1/pw0);
	else if (0 == game_result)
		log_likelihood += log(pd1/pd0);
	else
		log_likelihood += log(pl1/pl0);
	
	if (log_likelihood > b)
		return 1;	// accept H1: elo_diff >= elo1
	else if (log_likelihood < a)
		return -1;	// accept H0: elo_diff <= elo0
	else
		return 0;	// continue sampling
}
Another important thing to notice is how the stopping time is spread out. It would be interesting if you could plot (in a 2nd graph) the 5% and 95% quantiles of the running time of the test... The distributions of the stopping time are very spread out and very skewed.
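A quick way to see that spread without playing games is to port the SPRT above to Python and simulate its stopping time (a sketch under the same draw_elo = 97.3 model; the specific quantiles printed are illustrative only, not Michel's exact numbers):

```python
import math
import random

def proba_elo(elo, draw_elo=97.3):
    """BayesElo-style win/draw/loss probabilities for an elo difference."""
    pwin = 1 / (1 + 10 ** ((-elo + draw_elo) / 400))
    ploss = 1 / (1 + 10 ** ((elo + draw_elo) / 400))
    return pwin, 1 - pwin - ploss, ploss

def run_sprt(true_elo, elo0=0.0, elo1=5.0, alpha=0.05, beta=0.20, rng=None):
    """Simulate one SPRT on random games; return (decision, games played).
    decision is +1 (accept H1) or -1 (accept H0)."""
    rng = rng or random.Random()
    pw0, pd0, pl0 = proba_elo(elo0)
    pw1, pd1, pl1 = proba_elo(elo1)
    pw, pd, _ = proba_elo(true_elo)
    a = math.log(beta / (1 - alpha))    # lower bound: accept H0
    b = math.log((1 - beta) / alpha)    # upper bound: accept H1
    llr, n = 0.0, 0
    while a <= llr <= b:
        u = rng.random()
        if u < pw:
            llr += math.log(pw1 / pw0)
        elif u < pw + pd:
            llr += math.log(pd1 / pd0)
        else:
            llr += math.log(pl1 / pl0)
        n += 1
    return (1 if llr > b else -1), n

if __name__ == "__main__":
    rng = random.Random(1)
    # a few runs at the hard-to-decide midpoint elo; note the spread
    times = sorted(run_sprt(2.5, rng=rng)[1] for _ in range(20))
    print("min:", times[0], "median:", times[10], "max:", times[-1])
```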
User avatar
marcelk
Posts: 348
Joined: Sat Feb 27, 2010 12:21 am

Re: cutechess-cli suggestion (to Ilari)

Post by marcelk »

Michel wrote: Image

This is for a test that detects a 5 elo difference with 80% probability with
a less than 5% probability of a false positive.

I must say I am a bit shocked by the number of samples that are necessary
to detect a 5 elo difference, so it would be nice if someone could confirm this.
~18000 doesn't look out of the ordinary at all. But to be practical: many changes are smaller than 5 elo, so the speedup is not that big.

And then there are the errors that the experiments themselves add, for example biased opening selection and ignoring draws. (OTOH, there is more information we could get out of a sample that gets ignored now: the game length.) So whatever we try to squeeze out, we're likely moving within the noise margin anyway.

Mind that accepting 20% errors means that one out of 5 changes gets misqualified. This leads to the next question: what is the optimal use of hardware resources to vet a sequence of ~5 elo change candidates? A high confidence level means long tests, slowing progress down. A low confidence level means more steps backwards, also slowing progress down. There must be an optimum somewhere in between...
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

Another important thing to notice is how the stopping time is spread out. It would be interesting if you could plot (in a 2nd graph) the 5% and 95% quantiles of the running time of the test... The distributions of the stopping time are very spread out and very skewed.
It seems you are right. I found a formula for the distribution of the stopping time
(as an infinite sum). It seems that in the example I gave, the 95% quantile for the worst possible elo difference (3.12) is around 50000 (the expectation value was 18000). On the other hand, the 50% quantile is 13800, which is well below the expectation value.

I will post a graph but first I have to invert the distribution to get the quantiles.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

Image

Ok here is a similar graph which includes the 95% quantile for the stopping
time for the Wald test. As observed by Lucas sometimes the number of
required samples can be quite high.

The code is still here

http://alpha.uhasselt.be/Research/Algebra/Toga/test.py
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Michel wrote:Image

Ok here is a similar graph which includes the 95% quantile for the stopping
time for the Wald test. As observed by Lucas sometimes the number of
required samples can be quite high.

The code is still here

http://alpha.uhasselt.be/Research/Algebra/Toga/test.py
Beautiful
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

I finally went to the library and got hold of a copy of Wald's book on sequential analysis. Very well written! Lots of nice formulas. A recommended read.

His formula for the stopping time distribution is also an infinite sum, but at first sight it looks rather different from mine. I think that is because I used the continuous approximation to the problem (the diffusion equation) whereas he is doing the discrete case (I am sure my formula is also somewhere in the literature).

Since in the discrete case you have to assume that the random variables are normal, you are making an approximation in that case as well. And of course in the discrete case there is also the "excess over the boundary": the random walk may overshoot the end points of the interval. The excess over the boundary is hard to handle and is usually neglected.

I compared some consequences of the continuous and discrete approximations and the results are the same for all practical purposes.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

To close this discussion: I was curious what the effect of truncating the Wald test would be. So I did the math and extended my script to compute this as well. Here is the result of truncating a test with alpha=0.05, beta=0.20, epsilon=5 at a sample size of 28000.

Image

28000 is pretty much the minimum. With a smaller maximal sample size it becomes impossible to satisfy the design requirements (alpha, beta, epsilon).

Also: I think no test can go below the required sample size for the fixed-length test, which in this case is 25350.
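That 25350 figure can be cross-checked with the standard fixed-length sample-size formula n = ((z_alpha + z_beta) * sigma / delta)^2, where delta is the score gain corresponding to epsilon elo under the draw_elo = 97.3 model and sigma is the per-game score standard deviation (a sketch of my own; the small difference from 25350 comes from where sigma is evaluated):

```python
import math

def proba_elo(elo, draw_elo=97.3):
    """BayesElo-style win/draw/loss probabilities."""
    pwin = 1 / (1 + 10 ** ((-elo + draw_elo) / 400))
    ploss = 1 / (1 + 10 ** ((elo + draw_elo) / 400))
    return pwin, 1 - pwin - ploss, ploss

def fixed_length_n(epsilon=5.0, alpha=0.05, beta=0.20):
    """Games needed for a one-sided fixed-length test of 0 vs epsilon elo."""
    pw, pd, pl = proba_elo(0)
    # per-game score std dev at elo 0 (score in {0, 0.5, 1}, mean 0.5)
    sigma = math.sqrt(pw * 1.0 + pd * 0.25 - 0.5 ** 2)
    pw1, pd1, pl1 = proba_elo(epsilon)
    delta = (pw1 + pd1 / 2) - 0.5        # expected score gain from epsilon elo

    def z(p):
        """Inverse normal CDF by bisection (math has erf but no inverse)."""
        lo, hi = -10.0, 10.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    return ((z(1 - alpha) + z(1 - beta)) * sigma / delta) ** 2

if __name__ == "__main__":
    print(round(fixed_length_n()))  # close to the 25350 quoted above
```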

One more comment. If you truncate the test, you have to decide which of H0, H1 should be accepted at truncation. Wald recommends accepting the hypothesis with the highest likelihood, but this seems to be suboptimal when alpha and beta are different. So my script computes the optimal threshold value for the log likelihood ratio at the truncation point. In the above example it is 0.9753684 (note that a and b are also no longer computed with the usual simple rule; in this case they are a=-2.06889, b=3.2629476).

As before the script is here

http://alpha.uhasselt.be/Research/Algebra/Toga/test.py
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

I cleaned up my script a bit, renamed it "wald", and gave it a command line interface

Code: Select all

vdbergh@pl:~/SRC/TESTS$ ./wald --help
Usage: wald [OPTION]...
Generate parameters for a truncated Wald test and optionally
produce an informative graph.

  --alpha       type I error probability when there is no elo difference
  --beta        type II error probability with elo difference equal to epsilon
  --epsilon     see --beta
  --truncation  truncate test after this many observations
  --graph       produce graph
  --draw_elo    as in BayesElo 
  --help        print this message
All parameters have sensible default values

Code: Select all

vdbergh@pl:~/SRC/TESTS$ ./wald --beta 0.10 --epsilon 1
Design parameters
=================
False positives            :   5.0%
False negatives            :  10.0%
Resolution                 :  1.0 Elo
Truncation                 :  965214 games
DrawElo                    :  97.3 Elo

Runtime parameters
==================
Win                        :  +0.0036600181033351
Draw                       :  -0.0000076670052621
Loss                       :  -0.0036676851085970
H1 threshold               :  +3.4649761213696277
H0 threshold               :  -2.8305309764459525
Truncation point           :  965213
H1 threshold at truncation :  +0.5179613557298325

Characteristics
===============
Max expected # of games    :  688577 games
Savings                    :  21.5%
(the savings can be increased by setting a later truncation point,
at the cost of more uncertainty about the total duration of the test).

The accompanying graph can be produced using the command

Code: Select all

./wald --beta 0.10 --epsilon 1 --graph
This yields

Image

The script is here

http://alpha.uhasselt.be/Research/Algebra/Toga/wald

Don't use it to control a nuclear power plant.