cutechess-cli suggestion (to Ilari)

Zlaire · Post by **Zlaire** » Mon Nov 07, 2011 6:26 pm

Lucas, your "warmup" is compared to ngames instead of ngames-(nwim+ndraw).

Does that make a difference in your setup? (your result very close to mine without warmup)

lucasart · Post by **lucasart** » Mon Nov 07, 2011 6:55 pm

Zlaire wrote:Lucas, your "warmup" is compared to ngames instead of ngames-(nwim+ndraw).

Does that make a difference in your setup? (your result very close to mine without warmup)

Yes, that's how I intended to implement it. Just do not check the stopping condition until 20 games have been played. Beware though that the condition ngames -nwin-ndraw is not symetric, whilst the problem a priori is (we're equqlly looking to see if A is better or if B is better. In reality we don't know that a priori).
So basically you're saying your modified stopping condition is: do not use the LOS stop until at least 20 games have been lost for whoever has the lead ?
You're right, empirically this should be better. As for being theoretically sound, no idea.

Thanks for the suggestion, I'll test that.

Lucas

lucasart · Post by **lucasart** » Mon Nov 07, 2011 7:16 pm

lucasart wrote: There are lots of mistakes and imprecisions in this paper

A bit of a harsh statement, mistypes happen, and they had to reduce the whole thesis in a few pages, so it's easy to forget things the reader needs. I found the full thesis, which is much easier to understand, as it is more detailled.
http://www.cs.toronto.edu/~vmnih/

Lucas

Michel · Post by **Michel** » Mon Nov 07, 2011 8:13 pm

Except for negative criticism, I haven't seen any real contribution from you here... So please avoid posting, unless you really have something constructive and inteliggent to say. Just common courtesy.

Speaking of negativity: you were the one that blasted everyone for being interested in perft implementations! That was not "common courtesy" was it? Not to speak of your unfounded and uninformed criticism of polyglot.

The only thing I did in this thread was to illustrate graphically that your idea is wrong (other people had already said this btw, but you asked for a proof). And I pointed out some elementary facts about designing tests which some people perhaps do not know (there is obviously confusion about the meaning of confidence level). So where is the negative criticism?

Zlaire · Post by **Zlaire** » Mon Nov 07, 2011 9:10 pm

So having done some more testing I see the problem quite clearly. It's exactly what Michel said.

With my setup I can get very close to 100% (99.98%) correct conclusions with a win rate of atleast 55% for either engine.

Awesome, so if we can be sure that the win rate will be at least 55% this works perfectly. But that's an assumption that can never be made really.

The problems start when we reduce the win rate.

With two equal engines (50% win rate), this method will conclude that one side has 99% LOS, 30% of the time. For this method to work it needs to be very close to 0%.

So since when you're testing an engine you can't be sure of anything really, you're extremely likely to run into a false positive.

Hence you have no idea what your result actually says. Either there was a true 55% win rate and you hit your 99.98%, or there was a true 50% win rate and you hit those 30%.

I.e. the test in rendered completely useless.

lucasart · Post by **lucasart** » Mon Nov 07, 2011 9:27 pm

lucasart wrote:
lucasart wrote: There are lots of mistakes and imprecisions in this paper
A bit of a harsh statement, mistypes happen, and they had to reduce the whole thesis in a few pages, so it's easy to forget things the reader needs. I found the full thesis, which is much easier to understand, as it is more detailled.
http://www.cs.toronto.edu/~vmnih/

Lucas

I tried the EBStop algorithm detailled in the paper, with
P(win)=0.4
P(draw)=0.3
epsilon = 1
delta = 0.05

and it seems to take a *lot* of games to reach a conclusion... really disapointing. Perhaps epsilon = 1 is not compatible with it, as suggested by a lot of terms in 1-epsilon in the paper (they never explain clearly what values of epsilon are acceptable).

so in the case of chess (trinomial variable, epsilon = 1), the best results I got so far were with the very simple NAS algorithm (detailled in the paper too).

has anyone tested EBStop and NAS too ? Perhaps my implementation is incorrect ?

lucasart · Post by **lucasart** » Mon Nov 07, 2011 9:32 pm

lucasart wrote:Now some GOOD NEWS. I tested the NAS algorithm presented in the paper. There are lots of mistakes and imprecisions in this paper, so I had to correct the stopping condition and assume epsilon = 1 (which is the only case we're interested in here).

=> Code
Code: Select all
Sub test()
    Randomize
    Const delta# = 0.05, p# = 0.4, q# = 0.3, N = 10000&
    Dim t&, X&, SX&, Xbar#, alpha#, Awins&, Bwins&, i&

    Awins = 0: Bwins = 0
    For i = 1 To N
   
        SX = 0: t = 0
        Do
            t = t + 1
            X = generate(p, q)
            SX = SX + X
            Xbar = SX / t
            alpha = Sqr(0.5 / t * Log(t * (t + 1) / delta))
        Loop Until alpha < Abs(Xbar)
   
        If Xbar > 0 Then
            Awins = Awins + 1
        Else
            Bwins = Bwins + 1
        End If
    Next i

    Debug.Print "Awins: " & Awins, "Bwins: " & Bwins
   
End Sub 
=> Results

It works!
Code: Select all
Awins: 9600   Bwins: 400 
but it takes a lot more games to reach a conclusion, and the number of games to reach a conclusion is very very variable.

=> Next step

I'll test the EBStop algorithm. First I need to figure out how to compute the dt, which is not clearly explained in the paper. I'll probably find the answer in the full thesis paper.

Lucas

I made a mistake, it should be

Code: Select all

Loop While Abs(Xbar)<2*alpha

Michel · Post by **Michel** » Fri Nov 11, 2011 11:24 am

It might be worth to have a look at the sequential Wald test.

http://en.wikipedia.org/wiki/Sequential ... ratio_test

The reason for the simple form of this test is that in this case the region on which the decision depends is bounded by two horizontal lines (instead of by something like n|--> a sqrt(n)).

The probability that a random walk crosses one of two horizontal lines can be easily computed in closed form. This seems to be the only type of region for which an exact analysis is possible.

lucasart · Post by **lucasart** » Fri Nov 11, 2011 1:20 pm

Michel wrote:It might be worth to have a look at the sequential Wald test.

http://en.wikipedia.org/wiki/Sequential ... ratio_test
The reason for the simple form of this test is that in this case the region on which the decision depends is bounded by two horizontal lines (instead of by something like n|--> a sqrt(n)).

The probability that a random walk crosses one of two horizontal lines can be easily computed in closed form. This seems to be the only type of region for which an exact analysis is possible.

I did a fair amount of research and experimentation. This problem is typically resolved by taking a sequence d(t) such that sum d(t) <= delta, where delta is the error margin allowed. then you write with proba at least 1-d(t) that:
|mu_hat(t)-mu| < sth depending on d(t)
using for example Hoeffding's inequality, or the new state-of the-art Empirical Bernstein inequality.
at this point you can do an algo that stops at any t such that the inequality doesn't hold. See the thesis paper of Volodymyr for further explanations.

The probklem with this approach is that the inequality used (Hoeffding or Empirical Bernsetin) are way too conservative (because they are true for any distribution almost surely bounded).

So I did a mixed approach, using the principle, but using an inequality based on approximated quantiles using the gaussian asymptotic. I do some sanity checks to ensure the gaussian approximates my quantiles well enough. Really it's a little hacky, but in practice, it beats any algorithm presented in the thesis paper, *in the particular case of chess* (trinomial case). Here's the code

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>

#define max(x,y) ((x) > (y) ? (x) : (y))
#define min(x,y) ((x) < (y) ? (x) : (y))

#define PI 3.14159265358979323846
static const double a = 8*(PI-3)/(3*PI*(4-PI));

double qnorm(double q)
// quantile function of the normal distribution N(0,1)
{
	double x = 2*q-1;
	double y = log(1-x*x);
	double z = 2/(PI*a) + y/2;
	return sqrt(2) * (x>0 ? 1 : -1) * sqrt(sqrt(z*z - y/a) - z);
}

uint64_t rand64()
// 64-bit LCG by Donald Knuth
{
   static uint64_t seed = 0;
   return seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
}

int get_game(double pwin, double pdraw)
{
	double u = (double)(rand64() & 0xffffffff) / 0x100000000;
	return u < pwin ? 1 : (u < pwin+pdraw ? 0 : -1);
}

int my_stop(double pwin, double pdraw, double delta, unsigned *nb_games)
{
	int nwin = 0, ndraw = 0, nloss = 0, t0 = 0;
	
	for (unsigned t = 1; ; t++) {
		int x = get_game(pwin, pdraw);
		if (1 == x)
			nwin++;
		else if (0 == x)
			ndraw++;
		else
			nloss++;
		
		const double mu = (double)(nwin - nloss) / t;
		if (fabs(mu)*t > 1/delta) {
			t0 = t0 ? t0 : t-1;
			const double sigma = sqrt((nwin*(1-mu)*(1-mu) + ndraw*mu*mu + nloss*(1+mu)*(1+mu))/t);
			
			static const double p = 1.05;
			const double d = delta*(p-1)/p*pow(t-t0,-p);
			const double mu_err = qnorm(1-d)*sigma/sqrt(t);
			
			if (fabs(mu) > mu_err) {
				*nb_games = t;
				return mu > 0 ? 1 : -1;
			}
		}
	}
}

int main(int argc, char ** argv)
{
	const int N = 1000;
	unsigned nb_games, total_games = 0;
	int S = 0;
	const double pwin = atof(argv[1]), pdraw = atof(argv[2]), delta = atof(argv[3]);
	
	for (int i = 0; i < N; i++) {
		int x = my_stop(pwin, pdraw, delta, &nb_games);
		S += x; total_games += nb_games;
	}
	
	printf("A wins: %d\tB wins: %d\tavg(games): %d\n", (N+S)/2, (N-S)/2, total_games/N);
}

I also tried a bootstrap method to estimate the confidence interval of mu_hat(t), rather than by using gaussian quantiles. It works very well too (actually even better), but it's really hacky, so I'd rather keep this more purified approach, which is *almost* mathematically sound.

Lucas

Michel · Post by **Michel** » Sat Nov 12, 2011 12:25 pm

Apparently the sequential likelihood ratio test is optimal with respect
to sample size (in a suitable sense).

There is a proof in this document.

http://nowak.ece.wisc.edu/ece830/ece830 ... cture9.pdf

cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)