cutechess-cli suggestion (to Ilari)

lucasart · Post by **lucasart** » Sat Nov 12, 2011 7:16 pm

Michel wrote:
So that's why, I insist on saying that discussing things theoretically is not enough. If you don't experiment, you'll never understand if and how it works... (or doesn't work in this case)
This is simply not true. Having a theoretical answer you know the outcome
of the experiment, so no need to do it (except perhaps for confirmation). The opposite is not true.

In this case you can compute what happens. In fact I predicted the behaviour in 3) in my previous post. To make this manageable you have to truncate the SPRT.

Now to compare two test you really need to compare the Type I and Type II error probabilities in function of the parameter P. For the (truncated) SPRT
you can compute these. Can you do it for your test?

I read your PDF and the expected stopping time. The problem is that we need more than the expected value, we need quantiles. Here's a practical demonstration of why this idea cannot work:
* I call my program with 3 params P(win) P(draw) delta (hence delta/2 type I and delta/2 type II error threshold). elo0=0 and elo1=+10, so testing whether A beats B by 10 elo or more is what I'm really looking at.
* Sample runs:

Code: Select all

lucas@megatron&#58;~/Chess/DoubleCheck/StopAlgo/Release$ ./StopAlgo .4 .2 .05
min = 758	max = 30926	avg = 7353	win = 0.038000
lucas@megatron&#58;~/Chess/DoubleCheck/StopAlgo/Release$ ./StopAlgo .4 .228 .05
min = 796	max = 29506	avg = 6429	win = 0.987000
lucas@megatron&#58;~/Chess/DoubleCheck/StopAlgo/Release$ ./StopAlgo .4 .21 .05
min = 802	max = 76301	avg = 12142	win = 0.337000

* first example elo=elo0, and SPTR works like a charm. observe the min and max values over 1000 runs of the stopping time
* second example elo=elo1(+epsilon), and SPTR works fine too. observe the min and max values over 1000 runs of the stopping time
* third example, right in the danger zone elo0 < elo < elo1. SPTR doesn't makle sense here, as discussed. observe the min and max values over 1000 runs of the stopping time.
* now if you look at all these stopping time min/max, you can see that there's far too much overlapping for your idea to work. So really testing elo=0 vs elo!=0 is *not* the same as the sequential Wald test. So there is a good reason why there is so much theory on the test that I'm looking at, and the paper of Volodymyr (on Empirical Bernstein stopping) doesn't even mention the Wald test...

Michel · Post by **Michel** » Sat Nov 12, 2011 8:12 pm

So really testing elo=0 vs elo!=0 is *not* the same as the sequential Wald test.

Theoretically you are right.

However in finite time cannot test for elo!=0 (you would not be able to detect elo=0.00000001 in any sensible time frame). So in practice you will always be testing elo=0 versus elo>=epsilon.

If I understand correctly the sequential Wald test is optimal under these conditions.

I'm looking at, and the paper of Volodymyr (on Empirical Bernstein stopping) doesn't even mention the Wald test...

I looked at that paper. As far as I can tell it seems be about parameter estimation of random variables with unknown distribution. A far more
complicated problem than just testing elo != 0. So the Wald test does not appear relevant for this problem.

Michel · Post by **Michel** » Sat Nov 12, 2011 8:45 pm

I deleted a post of mine because there was a stupid mistake in it.

lucasart · Post by **lucasart** » Sat Nov 12, 2011 9:44 pm

Michel wrote:
So really testing elo=0 vs elo!=0 is *not* the same as the sequential Wald test.
Theoretically you are right.

However in finite time cannot test for elo!=0 (you would not be able to detect elo=0.00000001 in any sensible time frame). So in practice you will always be testing elo=0 versus elo>=epsilon.

If I understand correctly the sequential Wald test is optimal under these conditions.

I'm looking at, and the paper of Volodymyr (on Empirical Bernstein stopping) doesn't even mention the Wald test...
I looked at that paper. As far as I can tell it seems be about parameter estimation of random variables with unknown distribution. A far more
complicated problem than just testing elo != 0. So the Wald test does not appear relevant for this problem.

So we agree on all that:
* when elo is outside (0,epsilon) the Wald test is "reliable" and "fast".
* when elo is between 0 and epsiloin, the Wald test doesn't work. But in practice you would only conclude that elo>=epsilon or elo <=0 ion this case, so in elo terms you'd make an error of at most epsilon.

However, the following question has yet to be answered (at least empirically in the simple trinomial case of chess):
* when you want to make epsilon very small (say 1 elo) to avoid errors in the Wald test. What happens to the stopping average time at the bounds (elo=0 and elo=epsilon) ? Does it perform worse or better than the empirical bernstein stopping rule ? I'll run some tests on that.

One last thing I don't agree with:
* the epsilon delta stopping problem is very relevant here. Testint elo != 0 is exactly finding an (epsilon,delta) stopping algorithm for epsilon = 1, no ?

Michel · Post by **Michel** » Sat Nov 12, 2011 10:14 pm

However, the following question has yet to be answered (at least empirically in the simple trinomial case of chess):
* when you want to make epsilon very small (say 1 elo) to avoid errors in the Wald test. What happens to the stopping average time at the bounds (elo=0 and elo=epsilon) ? Does it perform worse or better than the empirical bernstein stopping rule ? I'll run some tests on that.

This is indeed a defect of the Wald test I noticed. If you set it up to detect small Elo differences then the running time becomes very long even if
the Elo difference is large. (It is very easy to estimate the running time
of the Wald test).

One last thing I don't agree with:
* the epsilon delta stopping problem is very relevant here. Testint elo != 0 is exactly finding an (epsilon,delta) stopping algorithm for epsilon = 1, no ?

I am not saying the paper is not relevant. I am saying the Wald test is not relevant to the purposes of the paper (since you said the Wald test is not
mentioned).

I looked at the paper. I am very surprised that the extremely crude estimates
they make can lead to an efficient algorithm (as you say). I have
not figured out yet how to translate their error estimates in terms
of Type I/II errors (to use common statistics terminology).

I would find it much more fun to evaluate the relevant Wiener integrals directly. Unfortunately it seems the Wald test is the only case that
can be treated exactly.

Michel · Post by **Michel** » Sun Nov 13, 2011 10:10 am

There is something else I do not understand in your approach to EBSTOP.
It is related to the fact that EBSTOP is really about parameter estimation
and not about hypothesis testing.

Under what condition do you decide to accept the null hypothesis (mu=0)?

Furthermore it seems to me that under the null hypothesis (mu=0) the expected runnning time is likely infinite (mu is in the denominator of the bounds).

So it appears one should introduce a cutoff after which one accepts the
null hypothesis. But this changes the resolution of the test. So in order not to compare apples with oranges one should quantify this.

lucasart · Post by **lucasart** » Sun Nov 13, 2011 11:36 am

Michel wrote:There is something else I do not understand in your approach to EBSTOP.
It is related to the fact that EBSTOP is really about parameter estimation
and not about hypothesis testing.

Under what condition do you decide to accept the null hypothesis (mu=0)?

Furthermore it seems to me that under the null hypothesis (mu=0) the expected runnning time is likely infinite (mu is in the denominator of the bounds).

So it appears one should introduce a cutoff after which one accepts the
null hypothesis. But this changes the resolution of the test. So in order not to compare apples with oranges one should quantify this.

An (epsilon,delta) algo will estimate mu with relative erorr epsilon. In practice, with epsilon=1, and scoring win=1, draw=0, loss=-1, we want a (1,delta) stopping algo to estimate mu=E(Xi). It will stop when Yt=(X1+...+Xn)/n verifies |Yt-mu| <= |mu|, that is to say when Yt has the same signum than mu, which precisely answers the question.

Of course, no algo could possibly answer the question mu = 0 in a finite time. I didn't look into mathematical demonstration of this, but "intuitively" it seems rather obvious. I'm sure one of the books in his bibliography proves this.

What is proven in the paper of Volodymyr is that EBStop stops almost surely stop whenever mu != 0. The case mu = 0 is not mentionned very much, other than to say that you can't possible handle it in finite time.

Now let's step back from theory and see things from a practical perspective. When I modify my program and want to test it. Sometimes I get a significant improvement, but often I don't, and there's no telling a priori. For example removing/adding an eval component, will often have a very small elo impact, unless I fixed a major bug in doing so. So I have no idea a priori whether elo = 0, -10, +10, -50, +50 etc.

I then run EBStop (or rather my bastardised version of it which converges faster in practice). I know that if it stops the decision is reliable with 5% error, no matter what elo difference is. Even if elo is only +1 (and in that case it will stop after a lot of games). If my patience is worn out before it stops, at least I will not be drawn into an incorrect conclusion like with SPRT. When EBStops terminates, the result is always very very reliable. On any test I did over 10,000 runs, it never made an error, no matter how close or far I was from mu 0 ( I tried plenty of combinations of P(win) and P(draw) in the random generator). And that was with delta = 5%. So it is conservative, but asymptotically efficient according to the paper of Volodymyr. However efficiency is not practical efficiency. It's theoretical efficiency. For example if algo X is efficient then algo Y which takes asymptotically a time T(Y,N) = 1000*T(X,N) + o(T(X,N)) is also efficient. Although in practice X is 1000 times faster. This is typically the case of EBStop. But in practice it is about as fast as the Wald test for epsilon=10 when elo is far out of (0,epsilon). Then we we get closer to the bounds, Wald is faster. Then inside (0,epsilon) EBStop is the only one that is reliable, so comparing doesn't even make sense. Of course the average stopping goes to infinity as elo reaches 0+ and it accelerates pretty fast. Even on a log scaled graph I have a convex curve when looking at E(stopping time) as a fuinction of mu.

Running SPRT in this example may conclude with a non negligible probability that elo >=10, which is incorrect. I would much prefer an algo that continues sampling than draws random conclusions. And when I'm fed up of sampling, I will stop artificially and say that the difference is too small to be measured in reasonable time, rather than make an erroneous conclusion with the Wald test.

lucasart · Post by **lucasart** » Sun Nov 13, 2011 11:56 am

Michel wrote:So it appears one should introduce a cutoff after which one accepts the
null hypothesis. But this changes the resolution of the test. So in order not to compare apples with oranges one should quantify this.

That's an very relevant point. In order to do this rigourously, I reckon we need:
* to estimate mu0 such that the stopping time quantile 1-delta for mu=m0 and for mu=-mu0 are <= the max number of runs that we allow ourselves.
* when the algo stops we know we have a correct conclusion with at least 1-delta probability.
* when it gets cutoff, we can conclude with proba at least 1-delta that |mu| < mu0.

Michel · Post by **Michel** » Sun Nov 13, 2011 1:14 pm

Running SPRT in this example may conclude with a non negligible probability that elo >=10, which is incorrect.

I really don't understand what you are saying here.

When the SPRT errs it _always_ errs on the safe side. The probability of a false positive (a type I error) will always be smaller than alpha.

The probability of a missed detection (a type II error) will be larger
than beta if 0< elo< elo1. But it can simply not be otherwise. If we approach
the null hypothesis then the probability of a type II error must become large
since we cannot detect elo != 0 in finite time.

lucasart · Post by **lucasart** » Sun Nov 13, 2011 1:18 pm

On second thoughts, despite its flaws, the Wald test is probably the best compromise in practice. Let's say we test H0: elo = 0 against H1: elo = epsilon. Say we fix a proba of error of type I and type II <= delta.

When H1 is accepted:
* you can say with probability at least 1-delta that A is stronger than B (elo >= 0).
* this is because when elo < 0, H1 cannot be accepted with proba more than delta. Between 0 and epsilon anything can be accepted cporrectly or incorrectly with non controlable probability. And above elo>epsilon, H1 will be accepted with probability at least 1-delta.

When H0 is accepted:
* you can say with proba at least 1-delta that program A is not stronger than program B by epsilon elo points or more.
* same story as before, replacing the zone elo>epsilon by the zone elo<0.

I did quite a few tests, and would like to attach some graph here, on average stopping time depending on the elo difference, for different values of epsilon. Any idea how to attach an image here ?[/img]

cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)