cutechess-cli SPRT: What do elo0 and elo1 parameters actually mean?

mvanthoor · Post by **mvanthoor** » Mon Sep 27, 2021 5:53 pm

Hi

I'd like to clear up a question I've been having for some time with regard to cutechess-cli's SPRT parameter. When searching for an answer, I find old posts across the internet, going back to 2015, that ask this same question but never get a definitive answer. When you search Google for "CuteChess SPRT Testing", my very own site (!) comes up as the first hit, so now I'm determined to get this information correct once and for all.

I hope to be able to clear this up with your help.

This is the SPRT-parameter I mentioned on the site:

Code: Select all

-sprt elo0=1 elo1=5 alpha=0.05 beta=0.05

This makes cutechess run an SPRT-test, with two hypotheses: H0 (fail) and H1 (succeed).

This is how I understand the parameters:
alpha : cutechess will accept "fail" in 5% of tests, while the test actually wasn't a fail.
beta : cutechess will accept "success" in 5% of tests, while the test actually wasn't a success.

Because both percentages are the same, we can say: "cutechess-cli accepts the wrong hypothesis in 5% of tests."

Now for elo0 and elo1, where the confusion lies. Some time ago I got a message telling me that there is a misconception on my site about elo0 and elo1, with an explanation why my version was wrong. I tried to wrap my head around this, but I can't; the explanation on my site comes directly from the cutechess-cli help. So the help is faulty, or I explained it incorrectly. The help says:

Code: Select all

-sprt elo0=ELO0 elo1=ELO1 alpha=ALPHA beta=BETA
                        Use a Sequential Probability Ratio Test as a termination
                        criterion for the match. This option should only be used
                        in matches between two players to test if engine A is
                        stronger than engine B. Hypothesis H1 is that A is
                        stronger than B by at least ELO0 ELO points, and H0
                        (the null hypothesis) is that A is not stronger than B
                        by at least ELO1 ELO points. The maximum probabilities
                        for type I and type II errors outside the interval
                        [ELO0, ELO1] are ALPHA and BETA. The match is stopped if
                        either H0 or H1 is accepted or if the maximum number of
                        games set by '-rounds' and/or '-games' is reached.

So I said on my site, derived from the help above (and the SPRT parameter I gave earlier):

Code: Select all

H1: Engine NEW is at least 1 Elo stronger than engine OLD.
H0: Engine NEW is NOT more than 5 Elo stronger than engine OLD.
Error margin: 5%.

According to the help, elo0 belongs to h1, and elo1 belongs to h0.

So, this seems to be correct:
- If a new engine is 20 elo stronger than the old one, then "is at least 1 elo stronger" (h1) is true, and "is NOT more than 5 elo stronger" (h0) is false.
- If an engine is 10 elo weaker (so -10), then "is at least 1 elo stronger" (h1) is false, and "is not 5 elo stronger" (h0) is true.

The message I received said that this interpretation is incorrect, and elo0 belongs to h0, and elo1 belongs to h1. It also states that elo (1, 5) is very precise, making the test too long, and I should consider (0, 20).

I have looked around and found one interpretation where elo0 belongs to h0 and elo1 belongs to h1 that seemed logical:

- elo0 / h0: accepted if the new engine is not at least elo0 stronger than the old one.
- elo1 / h1: accepted if the new engine is at least elo1 stronger than the old one.
- In between those values, the test keeps running.

Seems logical...

If you put in elo0 = 0, and elo1 = 20, it would mean:
- The test fails (h0 accepted) if the new engine is less than 0 elo stronger than the old one. (i.e.: it is weaker)
- The test succeeds as soon as it's determined that the new engine is at least 20 elo stronger.
- In between 0 and 20, we keep running the test.

However, I have determined that if you make the gap bigger, such as (0, 500), the test ends extremely quickly. That doesn't jibe with the new interpretation I just gave:

- The test fails if the new engine is less than 0 Elo stronger than the old engine.
- The test succeeds if the new engine is at least 500 elo stronger than the old one.
- In between, we keep the test running.

Well, the chance that a new version of the engine is stronger than the old one by, say, 30 Elo is big... but the chance that it is 500 points or more stronger is almost non-existent. So the test would sit somewhere between 0 and 500, and it would need a VERY long time to finish. (It would probably never be able to decide, and would terminate at the set maximum number of games.) However, empirical testing has shown that making the gap between elo0 and elo1 bigger makes the test finish faster, with bigger error margins.

Both of the interpretations that make sense to me could be incorrect. If my first one is incorrect, then either the CuteChess help is incorrect (and has been since at least 2015), or I'm interpreting it in the wrong way. The second interpretation has been found incorrect by empirical testing: elo (0,500) makes the test finish faster, not slower as one would expect if you have a bigger margin to sit in.

Can someone definitively clear this up, so I can update the information on my SPRT page?

Ajedrecista · Post by **Ajedrecista** » Mon Sep 27, 2021 8:17 pm

Hello Marcel:

mvanthoor wrote: ↑Mon Sep 27, 2021 5:53 pm[...]

I am not an expert in this matter, although I was active in some old threads at TalkChess. I suggest taking a look at the CPW article on match statistics, because it has a list of threads and some of them could be of interest to you. There is a lot of reading material, though... There are some scaling differences between usual Elo (logistic Elo in many of those threads) and Bayeselo in SPRT bounds for legacy reasons, since Bayeselo was the first standard implemented at the SF Testing Network, which then moved to logistic Elo after a few years.

Regarding your question about the gap elo1 - elo0 and the duration of the test, there is a formula that supports what you have seen:

SPRT and narrowing of (elo1 - elo0) difference.

<Expected number of games> ~ Constant/[(elo1 - elo0)²] basically.
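
To put rough numbers on that proportionality, here is a small sketch. It is only an illustration: the constant depends on alpha, beta, the draw ratio and the true Elo difference, so I assume it is the same for both settings and compare ratios only. The helper name `relative_games` is made up for this example.

```python
def relative_games(elo0_a, elo1_a, elo0_b, elo1_b):
    """Ratio of expected games for bounds A versus bounds B.

    Uses only N ~ C / (elo1 - elo0)^2 and assumes the constant C is the
    same for both settings (same alpha/beta, draw ratio, true strength).
    """
    gap_a = elo1_a - elo0_a
    gap_b = elo1_b - elo0_b
    return (gap_b / gap_a) ** 2


# (1, 5) versus (0, 20): the narrow window needs about 25 times more games.
print(relative_games(1, 5, 0, 20))  # 25.0

# (0, 500) versus (0, 20): the wide window needs a tiny fraction of the games.
print(f"{relative_games(0, 500, 0, 20):.4f}")  # 0.0016
```

So under this rule of thumb, widening the gap shortens the test, which is consistent with what you observed for (0, 500).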

Other threads that might be useful are:

sprt and margin of error
http://www.talkchess.com/forum3/viewtopic.php?t=54359
Getting SPRT right
A question about SPRT

There are also some scripts with source code that show how SPRT calculations are done, which may give you a hint about what is going on. For example, the one saved at CPW:

Code: Select all

[...]

This function computes the log likelihood ratio of H0:elo_diff=elo0 versus
H1:elo_diff=elo1 under the logistic elo model

[...]
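
For reference, the "logistic elo model" named in that docstring maps an Elo difference to an expected score with the standard logistic formula. A minimal sketch follows; the function names are mine and are not taken from the CPW script.

```python
import math


def expected_score(elo_diff):
    """Expected score (0..1) for a player who is elo_diff Elo stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_from_score(score):
    """Inverse of expected_score; score must be strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


# Show how small the score differences are for typical SPRT bounds.
for elo in (0, 1, 5, 10, 20):
    print(f"elo_diff={elo:3d} -> expected score {expected_score(elo):.4f}")
```

With these definitions, H0: elo_diff=0 versus H1: elo_diff=10 is a comparison between an expected score of 0.5 and one of roughly 0.514.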

Please remember the list of topics at CPW article mentioned above. More expert answers are welcome.

Regards from Spain.

Ajedrecista.

mvanthoor · Post by **mvanthoor** » Mon Sep 27, 2021 10:24 pm

Ajedrecista wrote: ↑Mon Sep 27, 2021 8:17 pm I am not an expert in this matter, although I was active in some old threads at TalkChess. I suggest taking a look at the CPW article on match statistics, because it has a list of threads and some of them could be of interest to you...

Thanks for your answer, but I have already read most, if not all, of those threads. If there were a definitive answer in there (maybe there is, but I fail to understand it), I wouldn't even have posted the question.

I have come up with a third possibility for interpreting elo0 and elo1, after watching a video about hypothesis testing.

The one thing that keeps coming back is:

- H0 = some statement (the currently believed status quo)
- H1 = H0 is not true
- We need a margin to determine at which point we stop testing, and either accept H0 or H1.

So I thought that this might be correct for Cutechess:

- H0: The new engine is the same strength as the old engine.
- H1: H0 is not true: the new engine is either stronger, or weaker.
- Margin: stop the test as soon as the result is within 10 Elo.
- Error level: 5%

Therefore:
elo0 = 0 (H0: there is no difference in strength, so 0 Elo)
elo1 = 10 (the margin)
alpha, beta: 0.05 (5% error either way, so we want 95% confidence that our margin is below 10 and stays there)

I made a tiny change to Rustic, which I wanted to test, but suspected that it would not make a difference. (I initialized a vector at capacity 32, instead of leaving it empty, trying to see if this would negate some vector resizing and thus speed up the engine.)

This was the SPRT-command, with the above parameters:

Code: Select all

cutechess-cli \
-engine conf="Rustic Alpha 3.18.100" \
-engine conf="Rustic Alpha 3.17.100" \
-each \
    tc=inf/10+0.1 \
    book="/home/marcel/Chess/OpeningBooks/gm1950.bin" \
    bookdepth=4 \
-games 2 -rounds 2500 -repeat 2 -maxmoves 200 \
-sprt elo0=0 elo1=10 alpha=0.05 beta=0.05 \
-concurrency 4 \
-ratinginterval 10 \
-pgnout "/home/marcel/Chess/sprt.pgn"

This was the result:

Code: Select all

Score of Rustic Alpha 3.18.100 vs Rustic Alpha 3.17.100: 1447 - 1461 - 1181  [0.498] 4089
...      Rustic Alpha 3.18.100 playing White: 818 - 630 - 597  [0.546] 2045
...      Rustic Alpha 3.18.100 playing Black: 629 - 831 - 584  [0.451] 2044
...      White vs Black: 1649 - 1259 - 1181  [0.548] 4089
Elo difference: -1.2 +/- 9.0, LOS: 39.8 %, DrawRatio: 28.9 %
SPRT: llr -2.95 (-100.1%), lbound -2.94, ubound 2.94 - H0 was accepted
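
As a cross-check of that last line: the lbound/ubound values match Wald's standard SPRT thresholds for alpha = beta = 0.05, and the llr can be reproduced approximately from the win/loss/draw counts. The sketch below uses a normal approximation of the log-likelihood ratio under the logistic Elo model. That choice is an assumption and not necessarily the formula cutechess-cli uses internally, and the function names are made up for illustration.

```python
import math


def expected_score(elo_diff):
    """Logistic Elo model: expected score for an elo_diff advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def sprt_bounds(alpha, beta):
    """Wald's SPRT thresholds for the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def approx_llr(wins, losses, draws, elo0, elo1):
    """Normal approximation of the LLR of H1 (elo1) versus H0 (elo0)."""
    games = wins + losses + draws
    mean = (wins + 0.5 * draws) / games
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    variance = (wins + 0.25 * draws) / games - mean ** 2
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return games * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * variance)


def sprt_decision(llr, lower, upper):
    """Accept H0 at or below the lower bound, H1 at or above the upper."""
    if llr <= lower:
        return "H0 accepted"
    if llr >= upper:
        return "H1 accepted"
    return "continue"


# Counts and parameters taken from the cutechess-cli output above.
lower, upper = sprt_bounds(alpha=0.05, beta=0.05)
llr = approx_llr(wins=1447, losses=1461, draws=1181, elo0=0, elo1=10)
print(f"llr {llr:.2f}, lbound {lower:.2f}, ubound {upper:.2f}")
print(sprt_decision(llr, lower, upper))
```

With these inputs the sketch agrees with the reported llr, lbound and ubound to two decimals. Note that the stopping rule compares the llr with the two bounds, not the measured Elo difference with elo0 and elo1 directly.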

H0 ("There is no difference between the engines", as stated by elo0 = 0) was accepted. The margin is below 10 (as stated by elo1), and -1.2 is between -10 and 10.

I'll need to run some more tests to be certain, but after this outcome I'm leaning towards the following explanation:

elo0: The H0 hypothesis. ("Old engine is X elo different from new engine.") H1 automatically is: "That's not true." (So the difference is either more, or less.)
elo1: We are going to test for H0, and stop as soon as the margin is "elo1".
alpha / beta: But we must be in the alpha-beta error range (if 0.05, then confidence will be 95%).

You could say: I made a patch, and in a rough test of 100 games, I believe the new engine to be 20 elo stronger than the old one. So, the hypothesis H0 is: "My old engine is 20 Elo weaker than the new one." Thus:

elo0 = -20
elo1 = 5 (margin 5 Elo either way)
alpha/beta: 0.05 / 0.05. (confidence level in the result: 95%)

Testing then falls between -25 and -15. If the result is -17, then H0 is accepted: the old engine is indeed 20 elo weaker than the new one within a margin of 5 Elo. If the result is -12 Elo, then the old engine is between -17 and -7 Elo weaker. H1 is accepted, because "The old engine is 20 Elo weaker" is not true. (And thus, the new engine is not 20 Elo stronger.)

I feel that this interpretation of "elo0 = Null hypothesis" and "elo1 = margin" is correct / could be correct, but I'm not sure. It feels logical, though.
