interpreting cutechess-cli SPRT + LOS output

jswaff
Posts: 105
Joined: Mon Jun 09, 2014 12:22 am
Full name: James Swafford

Post by jswaff »

I'm trying to introduce a little more rigor into my testing process, but I'm having trouble interpreting the results of a match between two versions of my program.

P1 = newer version, with a small change expected to be a nominal improvement at best, and perhaps a small loss.
P2 = best known version

A 20,000 game match is run with cutechess-cli using -sprt elo0=5 elo1=1 alpha=0.05 beta=0.05

Meaning:
* H1 : "P1 is stronger than P2 by at least 5 Elo points"
* H0 : "P1 is not stronger than P2 by at least 1 Elo point"
* There is a 5% chance of a Type I error and a 5% chance of a Type II error. In other words, there is a 5% chance of rejecting a change worth more than 5 Elo and a 5% chance of accepting a change worth less than 1 Elo.
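As a rough sketch of how alpha and beta turn into stopping thresholds, the usual Wald bounds for an SPRT can be computed as below. The function name `sprt_bounds` is only illustrative; the two formulas are the same ones quoted later in this thread.

```python
import math

def sprt_bounds(alpha, beta):
    """Wald's stopping thresholds for the log-likelihood ratio (LLR)."""
    lower = math.log(beta / (1.0 - alpha))   # at or below this: accept H0
    upper = math.log((1.0 - beta) / alpha)   # at or above this: accept H1
    return lower, upper

lower, upper = sprt_bounds(alpha=0.05, beta=0.05)
print(f"lbound {lower:.2f}, ubound {upper:.2f}")  # lbound -2.94, ubound 2.94
```

With alpha = beta = 0.05 the bounds are symmetric, which is why the SPRT line reports -2.94 and 2.94.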

The match is terminated after 18,379 games, which supports the conclusion that the change is very minor. The match record:
5807 - 5491 - 7081 (0.509)

Code:

SPRT: llr  -2.95  (-100.1%), lbound -2.94, ubound 2.94 - H0 was accepted

I can see that the log likelihood ratio has fallen below the lower bound, triggering a stopping rule and accepting H0. In English that translates into "P1 is not stronger than P2 by at least 1 Elo point." (I think.)

I'm having a hard time squaring that with the match record and this line of output regarding LOS:

Code:

Elo difference: 6.0 +/- 3.9, LOS 99.9%, DrawRatio: 38.5%

If the Elo difference is +6 +/- 3.9 and LOS 99.9%, why was H0 accepted? I'm missing something critical here and can't seem to put my finger on it. Do I have a term flipped? Shouldn't Elo0 be > Elo1? Or am I just interpreting the results the wrong way?
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Interpreting cutechess-cli SPRT + LOS output.

Post by Ajedrecista »

Hello James:

I am not an expert on this matter but I see some things that do not look right to me:
jswaff wrote: Sun Aug 23, 2020 11:44 pm
[...]

A 20,000 game match is run with cutechess-cli using -sprt elo0=5 elo1=1 alpha=0.05 beta=0.05

[...]

An SPRT (Sequential Probability Ratio Test) should not have a fixed number of games (like your 20,000). A proper SPRT stops only when the LLR (log-likelihood ratio) goes out of bounds, regardless of the number of games played.

I thought that elo0 < elo1 in the common notation, but you have elo0 = 5 and elo1 = 1. It looks like something is flipped, hence your doubts with Elo, LOS and H0 accepted. I would go with elo0 < elo1 (elo0 = 1 and elo1 = 5 in your case).
jswaff wrote: Sun Aug 23, 2020 11:44 pm
[...]

The match is terminated after 18,379 games, which supports the conclusion that the change is very minor. The match record:
5807 - 5491 - 7081 (0.509)

Code:

SPRT: llr  -2.95  (-100.1%), lbound -2.94, ubound 2.94 - H0 was accepted

I can see that the log likelihood ratio has fallen below the lower bound, triggering a stopping rule and accepting H0. In English that translates into "P1 is not stronger than P2 by at least 1 Elo point." (I think.)

I'm having a hard time squaring that with the match record and this line of output regarding LOS:

Code:

Elo difference: 6.0 +/- 3.9, LOS 99.9%, DrawRatio: 38.5%

If the Elo difference is +6 +/- 3.9 and LOS 99.9%, why was H0 accepted? I'm missing something critical here and can't seem to put my finger on it. Do I have a term flipped? Shouldn't Elo0 be > Elo1? Or am I just interpreting the results the wrong way?

You answered it yourself: elo0 < elo1. In fact, I am surprised that cutechess-cli accepts elo0 > elo1, assuming it follows the common notation. By the way, you may want to read the cutechess-cli documentation about SPRT, because I am not sure whether the Elo bounds are in logistic Elo (the Elo we are used to in human rankings) or Bayeselo.

Furthermore, this Elo difference is not the real difference from the test. This +6.0 ± 3.9 (with 95% confidence) would be the real difference for a fixed 18,379-game match, which is not the case here. Elo estimates from an SPRT are computed differently, to account for the bias introduced by the stopping rule and so on. I think the right method is implemented in Fishtest. For example:

https://tests.stockfishchess.org/tests/ ... 64a10d84d5
https://tests.stockfishchess.org/html/l ... 64a10d84d5

A classical Elo computation of +1365 =9493 -1502 would give -3.85 ± 2.95 := [-6.80, -0.90] (with 95% confidence) and LOS ~ 0.52% or 0.53%. However, it gives -3.60 [-6.55, -0.63] (with 95% confidence) and LOS ~ 0.9% with the corrected method for SPRT. The correct formulas for SPRT might be in fishtest/stats (GitHub), but I am not sure.
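For reference, here is a minimal Python sketch of the classical (fixed-length match) computation mentioned above: logistic Elo from the score, a 95% margin from the per-game variance, and LOS from a normal approximation on wins minus losses. The function names are mine, and I am assuming this is close to how cutechess-cli produces its summary line. It does not apply any SPRT correction.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def classical_summary(wins, losses, draws):
    games = wins + losses + draws
    w, l, d = wins / games, losses / games, draws / games
    score = w + 0.5 * d
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = w * (1.0 - score) ** 2 + l * score ** 2 + d * (0.5 - score) ** 2
    stderr = math.sqrt(var / games)
    z = 1.959964  # two-sided 95% quantile of the normal distribution
    margin = (elo_from_score(score + z * stderr)
              - elo_from_score(score - z * stderr)) / 2.0
    # LOS ignores draws: normal approximation on (wins - losses).
    los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))
    return elo_from_score(score), margin, los

# James's match record, then the Fishtest example (+1365 =9493 -1502).
for record in [(5807, 5491, 7081), (1365, 1502, 9493)]:
    elo, margin, los = classical_summary(*record)
    print(f"{record}: Elo {elo:+.2f} +/- {margin:.2f}, LOS {100 * los:.2f}%")
```

Under these assumptions the first record should come out near +6.0 ± 3.9 with LOS around 99.9%, and the second near -3.85 ± 2.95 with LOS around 0.5%, in line with the classical figures quoted above.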

Summary: try elo0 < elo1, and do not take seriously Elo differences and LOS computed directly from the {wins, draws, losses} of an SPRT.

Regards from Spain.

Ajedrecista.
jswaff

Re: interpreting cutechess-cli SPRT + LOS output

Post by jswaff »

Hi Ajedrecista,

First, thanks very much for the reply.

Please see the output below from the cutechess-cli 'help'. Note that H1 uses the ELO0 value, whereas the null hypothesis H0 uses the ELO1 value.

Code:

 -sprt elo0=ELO0 elo1=ELO1 alpha=ALPHA beta=BETA
                        Use a Sequential Probability Ratio Test as a termination
                        criterion for the match. This option should only be used
                        in matches between two players to test if engine A is
                        stronger than engine B. Hypothesis H1 is that A is
                        stronger than B by at least ELO0 ELO points, and H0
                        (the null hypothesis) is that A is not stronger than B
                        by at least ELO1 ELO points. The maximum probabilities
                        for type I and type II errors outside the interval
                        [ELO0, ELO1] are ALPHA and BETA. The match is stopped if
                        either H0 or H1 is accepted or if the maximum number of
                        games set by '-rounds' and/or '-games' is reached.

Point 1 - The last sentence is why I said the match is 20k games. If neither H0 nor H1 is accepted after 20k games, the match will stop anyway at that point. I feel OK about that.

Point 2 - Notice that it maps ELO0 to H1 and ELO1 to H0. That's slightly confusing and I wonder if it's a typo. Going back to my hypothesis and the null hypothesis for my test:

* H1 : "P1 is stronger than P2 by at least 5 Elo points"
* H0 : "P1 is not stronger than P2 by at least 1 Elo point"

You can see how I arrive at ELO0=5 and ELO1=1.

I think it's more likely I'm just confused than cutechess has it flipped but I'm having a hard time reconciling that. Is there anything wrong with the way I've phrased my hypothesis?

Then again - https://www.chessprogramming.org/Match_Statistics#SPRT uses ELO0 for H0 and ELO1 for H1.

Hrm.

I understand your point about the Elo difference, thanks for that explanation. I will experiment with flipping the elo0 and elo1 values as you suggested but for my own sanity I need to understand it.
Ajedrecista

Re: Interpreting cutechess-cli SPRT + LOS output.

Post by Ajedrecista »

Hello James:
jswaff wrote: Mon Aug 24, 2020 6:25 pm
[...]

I see the SPRT help is exactly what you copied and pasted:

https://github.com/cutechess/cutechess/ ... c/help.txt

I found the sprt.cpp file on GitHub:

https://github.com/cutechess/cutechess/ ... c/sprt.cpp

I took a look at the code and found the formulas identical to Michel's (Michel was the person who first, and mainly, developed the SPRT formulas for Fishtest, and he is the author of the script posted in the 'Match Statistics' article at CPW). I ported Michel's Python scripts to Fortran 95 for my own use (LLR calculator, SPRT simulator...):

Code:

! MORE CODE.

read(10,*) alpha  ! Maximum value of type I error (reached at bayeselo = bayeselo_0).
read(10,*) beta  ! Maximum value of type II error for bayeselo >= bayeselo_1 (reached at bayeselo = bayeselo_1).

! MORE CODE.

games = wins + draws + loses

W = (wins + 0d0)/(games + 0d0)
D = (draws + 0d0)/(games + 0d0)
L = (loses + 0d0)/(games + 0d0)

bayeselo = 2d2*log10(W*(1d0 - L)/(L*(1d0 - W)))
drawelo = 2d2*log10((1d0 - L)*(1d0 - W)/(L*W))  ! Estimate of drawelo from the sample.

lower_bound = log(beta/(1d0 - alpha))
upper_bound = log((1d0 - beta)/alpha)

P0_W = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo - bayeselo_0)))
P0_L = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo + bayeselo_0)))
P0_D = 1d0 - P0_W - P0_L

P1_W = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo - bayeselo_1)))
P1_L = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo + bayeselo_1)))
P1_D = 1d0 - P1_W - P1_L

LLR_wins = wins*log(P1_W/P0_W)
LLR_draws = draws*log(P1_D/P0_D)
LLR_loses = loses*log(P1_L/P0_L)
LLR = LLR_wins + LLR_draws + LLR_loses

! MORE CODE.

This looks the same as cutechess-cli, except that I used Bayeselo bounds directly, whereas cutechess-cli applies a scale factor to convert between logistic Elo and Bayeselo:

Code:

// MORE CODE.

Sprt::Status Sprt::status() const
{
	Status status = {
		Continue,
		0.0,
		0.0,
		0.0
	};

	if (m_wins <= 0 || m_losses <= 0 || m_draws <= 0)
		return status;

	// Estimate draw_elo out of sample
	const SprtProbability p(m_wins, m_losses, m_draws);
	const BayesElo b(p);

	// Probability laws under H0 and H1
	const double s = b.scale();
	const BayesElo b0(m_elo0 / s, b.drawElo());
	const BayesElo b1(m_elo1 / s, b.drawElo());
	const SprtProbability p0(b0), p1(b1);

	// Log-Likelyhood Ratio
	status.llr = m_wins * std::log(p1.pWin() / p0.pWin()) +
		     m_losses * std::log(p1.pLoss() / p0.pLoss()) +
		     m_draws * std::log(p1.pDraw() / p0.pDraw());

	// Bounds based on error levels of the test
	status.lBound = std::log(m_beta / (1.0 - m_alpha));
	status.uBound = std::log((1.0 - m_beta) / m_alpha);

	if (status.llr > status.uBound)
		status.result = AcceptH1;
	else if (status.llr < status.lBound)
		status.result = AcceptH0;

	return status;
}

// MORE CODE.

Please note that the bounds are the same in the cutechess-cli implementation and mine:

Code:

lower_bound = ln[beta/(1 - alpha)]
upper_bound = ln[(1 - beta)/alpha]

So the roles of alpha, beta, H0 and H1 should be the same. Could it be a typo in the help file after all?
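To see what the order of elo0 and elo1 does in practice, here is a minimal Python sketch of the LLR formulas from the two excerpts above. The function names are mine. `elo_scale` is an assumption: it is the slope relating logistic Elo to Bayeselo near zero for a given drawelo, which I believe corresponds to `b.scale()` in sprt.cpp, but that part of the source is not shown here.

```python
import math

def drawelo_from_sample(p_win, p_loss):
    """Estimate drawelo from the observed win and loss frequencies."""
    return 200.0 * math.log10((1.0 - p_loss) * (1.0 - p_win) / (p_loss * p_win))

def probs_from_bayeselo(bayeselo, drawelo):
    """Win, loss and draw probabilities in the Bayeselo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - bayeselo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + bayeselo) / 400.0))
    return p_win, p_loss, 1.0 - p_win - p_loss

def elo_scale(drawelo):
    # ASSUMPTION: logistic Elo ~= scale * Bayeselo for small differences.
    x = 10.0 ** (-drawelo / 400.0)
    return 4.0 * x / (1.0 + x) ** 2

def sprt_llr(wins, losses, draws, elo0, elo1):
    """LLR of the elo1 model against the elo0 model (bounds in logistic Elo)."""
    games = wins + losses + draws
    drawelo = drawelo_from_sample(wins / games, losses / games)
    s = elo_scale(drawelo)
    p0 = probs_from_bayeselo(elo0 / s, drawelo)
    p1 = probs_from_bayeselo(elo1 / s, drawelo)
    return (wins * math.log(p1[0] / p0[0])
            + losses * math.log(p1[1] / p0[1])
            + draws * math.log(p1[2] / p0[2]))

as_run = sprt_llr(5807, 5491, 7081, elo0=5, elo1=1)
swapped = sprt_llr(5807, 5491, 7081, elo0=1, elo1=5)
print(f"elo0=5 elo1=1: llr {as_run:.2f}")
print(f"elo0=1 elo1=5: llr {swapped:.2f}")
```

If the assumptions hold, the first call should land close to the reported -2.95, and swapping the two values only flips the sign. In this sketch the model passed as elo0 always plays the role of H0, so with elo0=5 a result of about +6 Elo fits the "H0" model better, which would be consistent with "H0 was accepted" in James's run.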

Anyway, there are some people here that can answer your questions better than me. Good luck!

Regards from Spain.

Ajedrecista.