Lucas, your "warmup" is compared to ngames instead of ngames-(nwim+ndraw).
Does that make a difference in your setup? (your result very close to mine without warmup)
cutechess-cli suggestion (to Ilari)
Moderator: Ras
-
Zlaire
- Posts: 62
- Joined: Mon Oct 03, 2011 9:40 pm
-
lucasart
- Posts: 3243
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
Yes, that's how I intended to implement it. Just do not check the stopping condition until 20 games have been played. Beware though that the condition ngames -nwin-ndraw is not symetric, whilst the problem a priori is (we're equqlly looking to see if A is better or if B is better. In reality we don't know that a priori).Zlaire wrote:Lucas, your "warmup" is compared to ngames instead of ngames-(nwim+ndraw).
Does that make a difference in your setup? (your result very close to mine without warmup)
So basically you're saying your modified stopping condition is: do not use the LOS stop until at least 20 games have been lost for whoever has the lead ?
You're right, empirically this should be better. As for being theoretically sound, no idea.
Thanks for the suggestion, I'll test that.
Lucas
-
lucasart
- Posts: 3243
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
A bit of a harsh statement, mistypes happen, and they had to reduce the whole thesis in a few pages, so it's easy to forget things the reader needs. I found the full thesis, which is much easier to understand, as it is more detailled.lucasart wrote: There are lots of mistakes and imprecisions in this paper
http://www.cs.toronto.edu/~vmnih/
Lucas
-
Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
Speaking of negativity: you were the one that blasted everyone for being interested in perft implementations! That was not "common courtesy" was it? Not to speak of your unfounded and uninformed criticism of polyglot.Except for negative criticism, I haven't seen any real contribution from you here... So please avoid posting, unless you really have something constructive and inteliggent to say. Just common courtesy.
The only thing I did in this thread was to illustrate graphically that your idea is wrong (other people had already said this btw, but you asked for a proof). And I pointed out some elementary facts about designing tests which some people perhaps do not know (there is obviously confusion about the meaning of confidence level). So where is the negative criticism?
-
Zlaire
- Posts: 62
- Joined: Mon Oct 03, 2011 9:40 pm
Re: cutechess-cli suggestion (to Ilari)
So having done some more testing I see the problem quite clearly. It's exactly what Michel said.
With my setup I can get very close to 100% (99.98%) correct conclusions with a win rate of atleast 55% for either engine.
Awesome, so if we can be sure that the win rate will be at least 55% this works perfectly. But that's an assumption that can never be made really.
The problems start when we reduce the win rate.
With two equal engines (50% win rate), this method will conclude that one side has 99% LOS, 30% of the time. For this method to work it needs to be very close to 0%.
So since when you're testing an engine you can't be sure of anything really, you're extremely likely to run into a false positive.
Hence you have no idea what your result actually says. Either there was a true 55% win rate and you hit your 99.98%, or there was a true 50% win rate and you hit those 30%.
I.e. the test in rendered completely useless.
With my setup I can get very close to 100% (99.98%) correct conclusions with a win rate of atleast 55% for either engine.
Awesome, so if we can be sure that the win rate will be at least 55% this works perfectly. But that's an assumption that can never be made really.
The problems start when we reduce the win rate.
With two equal engines (50% win rate), this method will conclude that one side has 99% LOS, 30% of the time. For this method to work it needs to be very close to 0%.
So since when you're testing an engine you can't be sure of anything really, you're extremely likely to run into a false positive.
Hence you have no idea what your result actually says. Either there was a true 55% win rate and you hit your 99.98%, or there was a true 50% win rate and you hit those 30%.
I.e. the test in rendered completely useless.
-
lucasart
- Posts: 3243
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
I tried the EBStop algorithm detailled in the paper, withlucasart wrote:A bit of a harsh statement, mistypes happen, and they had to reduce the whole thesis in a few pages, so it's easy to forget things the reader needs. I found the full thesis, which is much easier to understand, as it is more detailled.lucasart wrote: There are lots of mistakes and imprecisions in this paper
http://www.cs.toronto.edu/~vmnih/
Lucas
P(win)=0.4
P(draw)=0.3
epsilon = 1
delta = 0.05
and it seems to take a *lot* of games to reach a conclusion... really disapointing. Perhaps epsilon = 1 is not compatible with it, as suggested by a lot of terms in 1-epsilon in the paper (they never explain clearly what values of epsilon are acceptable).
so in the case of chess (trinomial variable, epsilon = 1), the best results I got so far were with the very simple NAS algorithm (detailled in the paper too).
has anyone tested EBStop and NAS too ? Perhaps my implementation is incorrect ?
-
lucasart
- Posts: 3243
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
I made a mistake, it should belucasart wrote:Now some GOOD NEWS. I tested the NAS algorithm presented in the paper. There are lots of mistakes and imprecisions in this paper, so I had to correct the stopping condition and assume epsilon = 1 (which is the only case we're interested in here).
=> Code=> ResultsCode: Select all
Sub test() Randomize Const delta# = 0.05, p# = 0.4, q# = 0.3, N = 10000& Dim t&, X&, SX&, Xbar#, alpha#, Awins&, Bwins&, i& Awins = 0: Bwins = 0 For i = 1 To N SX = 0: t = 0 Do t = t + 1 X = generate(p, q) SX = SX + X Xbar = SX / t alpha = Sqr(0.5 / t * Log(t * (t + 1) / delta)) Loop Until alpha < Abs(Xbar) If Xbar > 0 Then Awins = Awins + 1 Else Bwins = Bwins + 1 End If Next i Debug.Print "Awins: " & Awins, "Bwins: " & Bwins End Sub
It works!but it takes a lot more games to reach a conclusion, and the number of games to reach a conclusion is very very variable.Code: Select all
Awins: 9600 Bwins: 400
=> Next step
I'll test the EBStop algorithm. First I need to figure out how to compute the dt, which is not clearly explained in the paper. I'll probably find the answer in the full thesis paper.
Lucas
Code: Select all
Loop While Abs(Xbar)<2*alpha
-
Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
It might be worth to have a look at the sequential Wald test.
The probability that a random walk crosses one of two horizontal lines can be easily computed in closed form. This seems to be the only type of region for which an exact analysis is possible.
The reason for the simple form of this test is that in this case the region on which the decision depends is bounded by two horizontal lines (instead of by something like n|--> a sqrt(n)).
The probability that a random walk crosses one of two horizontal lines can be easily computed in closed form. This seems to be the only type of region for which an exact analysis is possible.
-
lucasart
- Posts: 3243
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
I did a fair amount of research and experimentation. This problem is typically resolved by taking a sequence d(t) such that sum d(t) <= delta, where delta is the error margin allowed. then you write with proba at least 1-d(t) that:Michel wrote:It might be worth to have a look at the sequential Wald test.
The reason for the simple form of this test is that in this case the region on which the decision depends is bounded by two horizontal lines (instead of by something like n|--> a sqrt(n)).
The probability that a random walk crosses one of two horizontal lines can be easily computed in closed form. This seems to be the only type of region for which an exact analysis is possible.
|mu_hat(t)-mu| < sth depending on d(t)
using for example Hoeffding's inequality, or the new state-of the-art Empirical Bernstein inequality.
at this point you can do an algo that stops at any t such that the inequality doesn't hold. See the thesis paper of Volodymyr for further explanations.
The probklem with this approach is that the inequality used (Hoeffding or Empirical Bernsetin) are way too conservative (because they are true for any distribution almost surely bounded).
So I did a mixed approach, using the principle, but using an inequality based on approximated quantiles using the gaussian asymptotic. I do some sanity checks to ensure the gaussian approximates my quantiles well enough. Really it's a little hacky, but in practice, it beats any algorithm presented in the thesis paper, *in the particular case of chess* (trinomial case). Here's the code
Code: Select all
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>
#define max(x,y) ((x) > (y) ? (x) : (y))
#define min(x,y) ((x) < (y) ? (x) : (y))
#define PI 3.14159265358979323846
static const double a = 8*(PI-3)/(3*PI*(4-PI));
double qnorm(double q)
// quantile function of the normal distribution N(0,1)
{
double x = 2*q-1;
double y = log(1-x*x);
double z = 2/(PI*a) + y/2;
return sqrt(2) * (x>0 ? 1 : -1) * sqrt(sqrt(z*z - y/a) - z);
}
uint64_t rand64()
// 64-bit LCG by Donald Knuth
{
static uint64_t seed = 0;
return seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
}
int get_game(double pwin, double pdraw)
{
double u = (double)(rand64() & 0xffffffff) / 0x100000000;
return u < pwin ? 1 : (u < pwin+pdraw ? 0 : -1);
}
int my_stop(double pwin, double pdraw, double delta, unsigned *nb_games)
{
int nwin = 0, ndraw = 0, nloss = 0, t0 = 0;
for (unsigned t = 1; ; t++) {
int x = get_game(pwin, pdraw);
if (1 == x)
nwin++;
else if (0 == x)
ndraw++;
else
nloss++;
const double mu = (double)(nwin - nloss) / t;
if (fabs(mu)*t > 1/delta) {
t0 = t0 ? t0 : t-1;
const double sigma = sqrt((nwin*(1-mu)*(1-mu) + ndraw*mu*mu + nloss*(1+mu)*(1+mu))/t);
static const double p = 1.05;
const double d = delta*(p-1)/p*pow(t-t0,-p);
const double mu_err = qnorm(1-d)*sigma/sqrt(t);
if (fabs(mu) > mu_err) {
*nb_games = t;
return mu > 0 ? 1 : -1;
}
}
}
}
int main(int argc, char ** argv)
{
const int N = 1000;
unsigned nb_games, total_games = 0;
int S = 0;
const double pwin = atof(argv[1]), pdraw = atof(argv[2]), delta = atof(argv[3]);
for (int i = 0; i < N; i++) {
int x = my_stop(pwin, pdraw, delta, &nb_games);
S += x; total_games += nb_games;
}
printf("A wins: %d\tB wins: %d\tavg(games): %d\n", (N+S)/2, (N-S)/2, total_games/N);
}
Lucas
-
Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
Apparently the sequential likelihood ratio test is optimal with respect
to sample size (in a suitable sense).
There is a proof in this document.
http://nowak.ece.wisc.edu/ece830/ece830 ... cture9.pdf
to sample size (in a suitable sense).
There is a proof in this document.
http://nowak.ece.wisc.edu/ece830/ece830 ... cture9.pdf