cutechess-cli suggestion (to Ilari)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Michel wrote:Apparently the sequential likelihood ratio test is optimal with respect
to sample size (in a suitable sense).

There is a proof in this document.

http://nowak.ece.wisc.edu/ece830/ece830 ... cture9.pdf
Thank you. That's interesting.

In practice however, I don't understand how SPRT answers the question. All it does is test the hypothesis P = P1 vs P = P0. What I want is to test E(P) = 0 versus E(P) != 0. And there is a big difference between the two problems:
1/ if I use Wald's SPRT, how do I choose P0 and P1? First, there are an infinite number of P0 such that E(P0) = 0, because of drawn games... Second, how do I choose P1?
2/ As I understand it, there would be no way to incorporate draws, so they'd be discarded. The problem is that draws are an important part of the information, and discarding them will bias the result. Also, they reduce the bounds I use, and make the convergence faster on average. In other words, I get a faster stopping when A scores 55% in the form of 40% wins and 30% draws than in the form of 55% wins and 0% draws.

SPRT was already discussed here, but the discussion was rather theoretical and no one actually provided any sample code to demonstrate it. I would be much more interested in sample code for SPRT, where I can change the parameters (P(win), P(draw), delta = error level of the test) and see how fast it converges on average, and how much error it produces.

Has anyone actually managed to get SPRT to do that?
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

In practice however, I don't understand how SPRT answers the question. All it does is test the hypothesis P = P1 vs P = P0. What I want is to test E(P) = 0 versus E(P) != 0.
If you think about it, this is the same thing. Assume that we want to test P = P0 (the null hypothesis) versus P != P0, where P is an unknown parameter (say the elo difference).

Recall that there are two types of error associated with a test:

Type I: you reject the null hypothesis while it is true (this is generally the most serious error).
Type II: you accept the null hypothesis while it is false.

In this case the probability of a Type I error is a unique number associated with the test. Sadly, the probability of a Type II error depends on the unknown value of P. So one way to quantify it is to give the probability of
a Type II error in the case P=P1.

This is what the SPRT does. It fixes the probability of a Type I error
as well as the probability of a Type II error for a fixed value P1 of P.

In the elo example you could for example say: if there is an elo difference
of 10 then I want to detect this with 95% certainty, while if there is no
elo difference I also want to detect this with 95% certainty.

Of course then you can compute what your test will do in case P has
other values and make a nice graph with P on the horizontal axis
and the Type II error probability on the vertical axis.
As I understand, there would be no way to incorporate draws in there, so they'd be discarded.
This is rather incorrect. To compute the likelihood ratio you have to be able
to compute the probability of a draw from P (the elo difference in our example). Bayeselo has the same problem.

While in principle you cannot compute the probability of a draw from the elo difference, Bayeselo uses an approximation which seems to work well. It introduces a fixed constant D="drawelo"=97.3 such that

(1) win.prob = 1/(1+10^((-P+D)/400))
(2) loss.prob = 1/(1+10^((P+D)/400))
(3) draw.prob = 1 - win.prob - loss.prob

Bayeselo incorporates the white/black difference in a similar way.
SPRT was already discussed here, but the discussion was rather theoretical and no one actually provided any sample code to demonstrate it. I would be much more interested in sample code for SPRT, where I can change the parameters (P(win), P(draw), delta = error level of the test) and see how fast it converges on average, and how much error it produces.
I think it is not really necessary to do tests in this case as SPRT can be analyzed theoretically (it is basically the gambler's ruin problem).
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Michel wrote: While in principle you cannot compute the probability of a draw from the elo difference, Bayeselo uses an approximation which seems to work well. It introduces a fixed constant D="drawelo"=97.3 such that

(1) win.prob = 1/(1+10^((-P+D)/400))
(2) loss.prob = 1/(1+10^((P+D)/400))
(3) draw.prob = 1 - win.prob - loss.prob
Ah! This is the real key! Now the distribution of all the X_i is summarized by only one number p. So I set p0 = 0.5 and p1 such that p1 gives +10 elo using the above formulas, for example?

But how would the average stopping time behave if I modify the 10 elo here (for a fixed level of Type I or Type II error)? 10 is completely arbitrary; maybe the true difference is 200 elo, or maybe it's a tiny 3 elo? The problem is that I don't know before the test.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Michel wrote:
In practice however, I don't understand how SPRT answers the question. All it does is test the hypothesis P = P1 vs P = P0. What I want is to test E(P) = 0 versus E(P) != 0.
If you think about it, this is the same thing. Assume that we want to test P = P0 (the null hypothesis) versus P != P0, where P is an unknown parameter (say the elo difference).

Recall that there are two types of error associated with a test:

Type I: you reject the null hypothesis while it is true (this is generally the most serious error).
Type II: you accept the null hypothesis while it is false.

In this case the probability of a Type I error is a unique number associated with the test. Sadly, the probability of a Type II error depends on the unknown value of P. So one way to quantify it is to give the probability of
a Type II error in the case P=P1.

This is what the SPRT does. It fixes the probability of a Type I error
as well as the probability of a Type II error for a fixed value P1 of P.

In the elo example you could for example say: if there is an elo difference
of 10 then I want to detect this with 95% certainty, while if there is no
elo difference I also want to detect this with 95% certainty.

Of course then you can compute what your test will do in case P has
other values and make a nice graph with P on the horizontal axis
and the Type II error probability on the vertical axis.
As I understand, there would be no way to incorporate draws in there, so they'd be discarded.
This is rather incorrect. To compute the likelihood ratio you have to be able
to compute the probability of a draw from P (the elo difference in our example). Bayeselo has the same problem.

While in principle you cannot compute the probability of a draw from the elo difference, Bayeselo uses an approximation which seems to work well. It introduces a fixed constant D="drawelo"=97.3 such that

(1) win.prob = 1/(1+10^((-P+D)/400))
(2) loss.prob = 1/(1+10^((P+D)/400))
(3) draw.prob = 1 - win.prob - loss.prob

Bayeselo incorporates the white/black difference in a similar way.
SPRT was already discussed here, but the discussion was rather theoretical and no one actually provided any sample code to demonstrate it. I would be much more interested in sample code for SPRT, where I can change the parameters (P(win), P(draw), delta = error level of the test) and see how fast it converges on average, and how much error it produces.
I think it is not really necessary to do tests in this case as SPRT can be analyzed theoretically (it is basically the gambler's ruin problem).
OK so let me know if my understanding is correct:
* pw0, pw1 are the win probabilities with P = P0 and P = P1, using the above formula
* same story for pd0, pd1 (draw) and pl0, pl1 (loss)
* the statistic itself is really 2-dimensional: %win, %draw. So I'm considering the random variable S(x1,...,xn) = (nb of xi=1 / n, nb of xi=0.5 / n)
* the log-likelihood ratio is then (since the trinomial coefficients will simplify):
ln(L1/L0)(%w,%d) = (pw1-pw0).ln(%w) + (pd1-pd0).ln(%d) + (pl1-pl0).ln(1-%w-%d)

Now what do I do with that? Let's say I want a Type I error probability of delta1 and a Type II error probability of delta2.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

Well the average stopping time depends on the real value of P of course (the elo difference). There are formulas in the literature for the average stopping time if P=P0 or P=P1 (see e.g. the pdf above). If P<P0 or P>P1 then the test will stop more quickly, so this is not a problem.

It becomes interesting if P0<P<P1. I think there is a value P=P2 where the
drift (the average change of the log likelihood ratio) becomes zero. While stopping is still guaranteed in this case the test will take much longer. (And of course the Type II error probability will be larger than the design value.)

So I guess in practice one has to introduce a cutoff (larger than the average stopping times for P=P0 and P=P1), in which case one accepts the null hypothesis.

I need to analyze this in more detail, but I am a bit busy right now.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

the log-likelihood ratio is then (since the trinomial coefficients will simplify):
ln(L1/L0)(%w,%d) = (pw1-pw0).ln(%w) + (pd1-pd0).ln(%d) + (pl1-pl0).ln(1-%w-%d)
I have difficulty parsing this formula. I think it should be (this forum should really implement MathML, as all decent forums do these days):

log(pw1/pw0).%w + log(pd1/pd0).%d + log(pl1/pl0).(1-%w-%d).

Here log(pw1/pw0), log(pd1/pd0), log(pl1/pl0) are precomputed numbers.
So you see that you are really doing a one-dimensional random walk. This is the key to analyzing the problem mathematically.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Michel wrote:
the log-likelihood ratio is then (since the trinomial coefficients will simplify):
ln(L1/L0)(%w,%d) = (pw1-pw0).ln(%w) + (pd1-pd0).ln(%d) + (pl1-pl0).ln(1-%w-%d)
I have difficulty parsing this formula. I think it should be (this forum should really implement MathML, as all decent forums do these days):

log(pw1/pw0).%w + log(pd1/pd0).%d + log(pl1/pl0).(1-%w-%d).

Here log(pw1/pw0), log(pd1/pd0), log(pl1/pl0) are precomputed numbers.
So you see that you are really doing a one-dimensional random walk. This is the key to analyzing the problem mathematically.
On second thought, we should really look at Xi, not Xi/n, because the law of Xi is what we want to know, while the law of Xi/n depends on n... So the correct log-likelihood ratio should be

Code: Select all

LLR = ln(pw1/pw0).nb_win + ln(pd1/pd0).nb_draw + ln(pl1/pl0).nb_loss
I implemented the SPRT algorithm and tested it:

=> Code

Code: Select all

#include <math.h>

static double log_likelihood, pw1, pw0, pd1, pd0, pl1, pl0, a, b;

static void proba_elo(double elo, double *pwin, double *pdraw, double *ploss)
/* from BayesElo: computes probabilities of win and draw, given the elo difference. What this does
 * is basically make a realistic assumption on (pwin,pdraw) for a given elo difference, that is for
 * a given theoretical score pwin+pdraw/2 */
{
	static const double draw_elo = 97.3;
	*pwin  = 1 / (1 + pow(10, (-elo + draw_elo) / 400));
	*ploss = 1 / (1 + pow(10, ( elo + draw_elo) / 400));
	*pdraw = 1.0 - *pwin - *ploss;
}

void init_SPRT(double elo0, double elo1, double alpha, double beta)
{
	log_likelihood = 0.0;
	proba_elo(elo0, &pw0, &pd0, &pl0);
	proba_elo(elo1, &pw1, &pd1, &pl1);

	a = log(beta / (1 - alpha));
	b = log((1 - beta) / alpha);
}

int stop_SPRT(int game_result)	/* 1 = win, 0 = draw, otherwise loss */
{
	if (1 == game_result)
		log_likelihood += log(pw1/pw0);
	else if (0 == game_result)
		log_likelihood += log(pd1/pd0);
	else
		log_likelihood += log(pl1/pl0);

	if (log_likelihood > b)
		return 1;	// accept H1: elo_diff = elo1
	else if (log_likelihood < a)
		return -1;	// accept H0: elo_diff = elo0
	else
		return 0;	// continue sampling
}
So after a call to init_SPRT(), every time you call stop_SPRT() it returns:
* 0 if no conclusion can be drawn (continue sampling)
* 1 if H1 is accepted (elo = elo1)
* -1 if H0 is accepted (elo = elo0)

=> Results

In practice, what happens is the following:
1/ when the true elo difference is > elo1, H1 is accepted in 100% of cases, which is correct.
2/ when the true elo difference is < elo0, H0 is accepted in 100% of cases, which is correct.
3/ when the true elo difference is between elo0 and elo1, any kind of nonsense can come out. For example, when testing elo_diff = 0 vs elo_diff = +10, with a random generator P(win)=0.4 P(draw)=0.2, I get 96% conclusions on H1, i.e. it concludes 96% of the time that B is stronger than A, although they have the exact same elo by construction.

In terms of speed of convergence:
Cases 1/ and 2/ offer a good speed of convergence. Best performance is achieved by SPRT when close to the bounds elo0 and elo1, while still outside the interval (elo0,elo1). When we are far from the bounds, however, my algorithm is the one that concludes the fastest, and SPRT performs pretty much like the empirical Bernstein algorithm.
Case 3/ SPRT concludes comparatively fast, but it gives any kind of nonsense. EBStop here is the most accurate (no mistake ever, for any elo diff != 0, even a tiny difference), and my hacky algo also makes very very few mistakes (typically between 0.0%-0.3% as we get close to elo0) and concludes about twice as fast as EBStop.

Now in real life, I am testing A (my modified program) versus B (my previous version of the program). And I want to know if A is better than B, if it is equal, or if it is worse. With a margin of error delta=5%, let's say. I don't even care by how much elo yet, I just want a definite answer to that simple question, with the least error possible and a fairly small number of games.

If the true difference is between my arbitrary bounds elo0 and elo1, which happens a lot if not most of the time in reality, I'm completely screwed!

The problem is that you have no way of knowing if you are in case 1, 2 or 3 a priori, and even setting a maximum stopping time could certainly lead to a catastrophic number of errors in many cases.

So that's why I insist on saying that discussing things theoretically is not enough. If you don't experiment, you'll never understand if and how it works... (or doesn't work, in this case)
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

So that's why I insist on saying that discussing things theoretically is not enough. If you don't experiment, you'll never understand if and how it works... (or doesn't work, in this case)
This is simply not true. Having a theoretical answer, you know the outcome
of the experiment, so there is no need to do it (except perhaps for confirmation). The opposite is not true.

In this case you can compute what happens. In fact I predicted the behaviour in 3/ in my previous post. To make this manageable you have to truncate the SPRT.

Now to compare two tests you really need to compare the Type I and Type II error probabilities as a function of the parameter P. For the (truncated) SPRT
you can compute these. Can you do it for your test?
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

1/ when the true elo difference is > elo1, H1 is accepted in 100% of cases, which is correct.
2/ when the true elo difference is < elo0, H0 is accepted in 100% of cases, which is correct.
This cannot be true. No finite test has an error ratio of zero.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Michel wrote:
1/ when the true elo difference is > elo1, H1 is accepted in 100% of cases, which is correct.
2/ when the true elo difference is < elo0, H0 is accepted in 100% of cases, which is correct.
This can not be true. No finite test has an error ratio of zero.
Obviously it's not mathematically zero, but with an error of 2.5% for Type I and 2.5% for Type II, I had zero errors over 1000 runs.