cutechess-cli suggestion (to Ilari)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

cutechess-cli suggestion (to Ilari)

Post by lucasart »

Hello Ilari,

Thanks again for cutechess-cli. I really like it!

Here's a feature suggestion that could be nice: output some basic statistics. Foe example when you show a result like "44-54-31 [0.46]", we (engine testers) typically want to know if this score is statistically significant to decide whether A is better than B. Or (even better) at what confidence level it would be significant.

Let's say we have
p wins
q draws
N-p-q losses

let's note
P = p/N
Q = q/N

then asymptotically the distribution of the result is a gaussian with
* unbiaised estimator of the mean: mu = P+Q/2
* unbiaised estimator of the variance:sigma^2 = [P(1-mu)^2 + Q(.5-mu)^2]/(|N-1)

then if I chose a confidence level, say alpha=95%, a confidence interval is
[mu (+/-) Phi^{-1}(1+apha/2)*sigma]
where Phi is the cumulated distribution function of the Normal(0,1), and Phi^{-1{ it's inverse (ie. quantile function)

You could either hardcode a confidence level alpha=95% for example. Or make it a parameter, meaning you'd have to code a function Phi^{-1}.

If you're interested I'll send you a C function to compute gaussian quantiles. The code would be very simple, I just need to plough through my books and find it.

Also in the same idea, once you have an asymptotic distribution of the result you can produce a likelyhood of superiority as well.

PS: In fact, I'll send you the whole C code, as a function of (p,q,N) if you're interested. I'll go to my math books, find the formulas and test it.

Lucas
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

also another feature, which derives from the previous one quite simply is the following:

rather than playing 200 games, let's say.

maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.

the good thing about is that, sometimes when I test, either
* I end up doing too many games, because after only 50 the score is already so much in favor of engine A, that it is already significant to say A beats B
* or, I do 200 games, and even then the result is not significant.

this would be a unique feature, because in terms of engine automatic testing, there's really only one program that is good (cutechess-cli). And programs doing statistics only (bayeselo elostat or whatever) don't do engine matches. So combining those in such a way is currently not possible.

Let me know what you think.

Lucas
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Here's all the code required (written in plain C)to do all this. Hope it helps!

Code: Select all

static const double a = 8*(PI-3)/(3*PI*(4-PI));

double erf(double x)
// error function: good approximation especially close 0 and +/-infinity
{
	return (x>0 ? 1 : -1) * sqrt(1-exp(-x*x * (4/PI + a*x*x) / (1 + a*x*x)));
}

double erf_inverse(double x)
// inverted error function
{
	double y = log(1-x*x), z = 2/(PI*a) + y/2;
	return (x>0 ? 1 : -1) * sqrt(sqrt(z*z - y/a) - z);
}

double pnorm(double x)
// cumulated distribution function of the normal distribution N(0,1)
{
	return .5 * (1 + erf(x / sqrt(2)));
}

double qnorm(double q)
// quantile function of the normal distribution N(0,1)
{
	return sqrt(2) * erf_inverse(2*q - 1);
}

void calc_stats(int wins, int draws, int N, double alpha, double &confidence, double &LOS)
/*
 * A plays against B for N games. Take the number of wins and draws for A, and a confidence level
 * alpha (eg. alpha=0.95 to allow 5% error). We compute the confidence interval and the probability
 * that A is better than B (Likelyhood Of Superiority)
 * 
 * Example: 50 wins, 30 draws, 120 games, alpha = 95%
 * mu = 54.17%		score of A)
 * sigma = 3.95%	stdev of the random variable mu)
 * confidence = qnorm(97.5%)*sigma = 7.74%:	with proba 95% the true score of A is in the interval
 * mu +/- confidence
 * LOS = pmorm((mu-.5)/sigma) = 85.42%: probability that A is stronger than B (which is below alpha in
 * this case, so it's not significant)
 * */
{
	double w = wins/N, d = draws/N, l = 1-w-d;
	double mu = w+d/2;
	double sigma = sqrt((w*(1-mu)*(1-mu) + d*(.5-mu)*(.5-mu) + l*(0-mu)*(0-mu))/(N-1));
	
	*confidence = sigma*qnorm(.5*(1+alpha));
	*LOS = pnorm((mu-.5)/sigma);
}
Would be really nice to implement the stop condition based on LOS rather than number of games. That would bring automated engine testing a step further ;-)

Lucas
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

I forgot an important detail: parameter validity
* wins and draws must be positive (obviously)
* N must be >= wins+draws (otherwise there is a negative number of losses)
* N must be >= 2, otherwise the calculation of sigma will crash. In fact, these stats are meaningless until N is big enough for the asymptotic assumption to be precise enough. Let's say N >= 20
* 0.5 <= alpha < 1. alpha = 1 will obviously crash, since qnorm(1) = +infinity. typically statisticians chose alpha = 95%

also, as a test, the qnorm and pnorm functions should give the same results than the functions normsinv and normsdist defined in Libre Office (or MS Excel)
User avatar
Zlaire
Posts: 62
Joined: Mon Oct 03, 2011 9:40 pm

Re: cutechess-cli suggestion (to Ilari)

Post by Zlaire »

lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Wouldn't this result in a too optimistic LOS?

Observing the process and cutting off when you reach some favorable ratio would skew results right?
User avatar
marcelk
Posts: 348
Joined: Sat Feb 27, 2010 12:21 am

Re: cutechess-cli suggestion (to Ilari)

Post by marcelk »

Zlaire wrote:
lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Wouldn't this result in a too optimistic LOS?

Observing the process and cutting off when you reach some favorable ratio would skew results right?
Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: cutechess-cli suggestion (to Ilari)

Post by Michel »

Is there a good reason for cutechess-cli to duplicate the functionality of bayeselo?
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

marcelk wrote:
Zlaire wrote:
lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Wouldn't this result in a too optimistic LOS?

Observing the process and cutting off when you reach some favorable ratio would skew results right?
Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.
Do you know where I can find a mathematical proof of that ? I can understand that one would want to believe this intuitively, but it doesn't seem trivial to me.
Perhaps Remi Coulom would be able to answer that ?
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: cutechess-cli suggestion (to Ilari)

Post by Rémi Coulom »

lucasart wrote:
marcelk wrote:
Zlaire wrote:
lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Wouldn't this result in a too optimistic LOS?

Observing the process and cutting off when you reach some favorable ratio would skew results right?
Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.
Do you know where I can find a mathematical proof of that ? I can understand that one would want to believe this intuitively, but it doesn't seem trivial to me.
Perhaps Remi Coulom would be able to answer that ?
A lecture by a poker programmer on this topic:
http://videolectures.net/icml08_mnih_ebs/
http://icml2008.cs.helsinki.fi/papers/523.pdf

That one is interesting too:
http://videolectures.net/icml09_igel_hbrs/

Rémi
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: cutechess-cli suggestion (to Ilari)

Post by lucasart »

Rémi Coulom wrote:
lucasart wrote:
marcelk wrote:
Zlaire wrote:
lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Wouldn't this result in a too optimistic LOS?

Observing the process and cutting off when you reach some favorable ratio would skew results right?
Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.
Do you know where I can find a mathematical proof of that ? I can understand that one would want to believe this intuitively, but it doesn't seem trivial to me.
Perhaps Remi Coulom would be able to answer that ?
A lecture by a poker programmer on this topic:
http://videolectures.net/icml08_mnih_ebs/
http://icml2008.cs.helsinki.fi/papers/523.pdf

That one is interesting too:
http://videolectures.net/icml09_igel_hbrs/

Rémi
Thank you Remi. That is exactly what I was looking for!

What I really want is:
* output an asymptitic estimate of the LOS and the confidence interval, since it takes very little programming (it's basically my sample code). that's not really crucial, but I thought it would be nice to have.
* determine a stopping strategy. This really has to be implemented in cutechess-cli, and is not duplicating BayesElo as someone suggested. BayesElo takes a PGN and produces statistics. I want to do a test of A vs B, and stop whenever enough games have been played to be able to draw a statistically significant conclusion.

As I understand this is what the Empirical Bernstein Stopping algorithm does, right ? I'll print out the PDF and read it in details tomorrow. But basically I would like to implement this in cutechess-cli, to automate and rationalise my testing. And I *will* ;-)

I may have some questions to ask you along the way, if there are things I don't understand. I will however try to figure things out by myself before asking silly questions.

First question: R = 3 in this case right ? Just want to make sure I'm reading this correctly.

Lucas