
I'd like to clear up a question I've had for some time about cutechess-cli's SPRT parameter. When searching for an answer, I find old posts across the internet, going back to 2015, which ask this same question but never get a definitive answer. When you search Google for "CuteChess SPRT Testing", my very own site (!) comes up as the first hit, so now I'm determined to get this information correct once and for all.
I hope to be able to clear this up with your help.
This is the SPRT-parameter I mentioned on the site:
[code]
-sprt elo0=1 elo1=5 alpha=0.05 beta=0.05
[/code]
This is how I understand the parameters:
alpha: cutechess will report "fail" in 5% of tests where the test actually wasn't a failure.
beta: cutechess will report "success" in 5% of tests where the test actually wasn't a success.
Because both percentages are the same, we can say: "cutechess-cli accepts the wrong hypothesis in 5% of tests."
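For reference, in Wald's classical SPRT, alpha and beta are not checked directly; they set two stopping thresholds on a running log-likelihood ratio (LLR). Below is a minimal sketch of that calculation. It assumes cutechess-cli follows the textbook construction, which I have not verified against its source, and the function name sprt_bounds is my own.

```python
import math

def sprt_bounds(alpha, beta):
    """Textbook Wald SPRT stopping thresholds on the log-likelihood ratio.

    Assumption: this is the classical formula, not code taken from cutechess-cli.
    """
    lower = math.log(beta / (1.0 - alpha))   # at or below this: accept H0
    upper = math.log((1.0 - beta) / alpha)   # at or above this: accept H1
    return lower, upper

lower, upper = sprt_bounds(0.05, 0.05)
print(f"accept H0 when LLR <= {lower:.3f}, accept H1 when LLR >= {upper:.3f}")
```

With alpha = beta = 0.05 the two thresholds are symmetric, at roughly -2.944 and +2.944.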
Now for elo0 and elo1, where the confusion lies. Some time ago I got a message telling me that there is a misconception on my site about elo0 and elo1, with an explanation why my version was wrong. I tried to wrap my head around this, but I can't; the explanation on my site comes directly from the cutechess-cli help. So the help is faulty, or I explained it incorrectly. The help says:
[code]
-sprt elo0=ELO0 elo1=ELO1 alpha=ALPHA beta=BETA
Use a Sequential Probability Ratio Test as a termination
criterion for the match. This option should only be used
in matches between two players to test if engine A is
stronger than engine B. Hypothesis H1 is that A is
stronger than B by at least ELO0 ELO points, and H0
(the null hypothesis) is that A is not stronger than B
by at least ELO1 ELO points. The maximum probabilities
for type I and type II errors outside the interval
[ELO0, ELO1] are ALPHA and BETA. The match is stopped if
either H0 or H1 is accepted or if the maximum number of
games set by '-rounds' and/or '-games' is reached.
[/code]
[code]
H1: Engine NEW is at least 1 Elo stronger than engine OLD.
H0: Engine NEW is NOT more than 5 Elo stronger than engine OLD.
Error margin: 5%.
[/code]
According to the help, elo0 belongs to h1, and elo1 belongs to h0.
So, this seems to be correct:
- If a new engine is 20 elo stronger than the old one, then: "is at least 1 elo stronger" (h1) is true, and "is NOT more than 5 elo stronger" (h0) is false.
- If an engine is 10 elo weaker (so -10), then "is at least 1 elo stronger" (h1) is false, and "is not 5 elo stronger" (h0) is true.
The message I received said that this interpretation is incorrect, and elo0 belongs to h0, and elo1 belongs to h1. It also states that elo (1, 5) is very precise, making the test too long, and I should consider (0, 20).
I have looked around and found one interpretation where elo0 belongs to h0 and elo1 belongs to h1 that seemed logical:
- elo0 / h0: accepted if the new engine is not at least elo0 stronger than the old one.
- elo1 / h1: accepted if the new engine is at least elo1 stronger than the old one.
- In between those values, the test keeps running.
Seems logical...
If you put in elo0 = 0, and elo1 = 20, it would mean:
- The test fails (h0 accepted) if the new engine is less than 0 elo stronger than the old one. (i.e.: it is weaker)
- The test succeeds as soon as it's determined that the new engine is at least 20 elo stronger.
- In between 0 and 20, we keep running the test.
However, I have found that if you make the gap bigger, such as (0, 500), the test ends extremely quickly. That doesn't jibe with the new interpretation I just gave:
- The test fails if the new engine is less than 0 Elo stronger than the old engine.
- The test succeeds if the new engine is at least 500 elo stronger than the old one.
- In between, we keep the test running.
Well, the chance that a new version of the engine is stronger than the old one by, say, 30 Elo is big, but the chance that it is 500 points or more stronger is almost non-existent. So the test would sit somewhere between 0 and 500, and it would take a VERY long time to finish. (It would probably never be able to decide, and would terminate at the set maximum number of games.) However, empirical testing has shown that making the gap between elo0 and elo1 bigger makes the test finish faster, with bigger error margins.
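The empirical observation can be reproduced with a toy simulation. This is only a sketch under simplifying assumptions, not cutechess-cli's actual code: games are win/loss only (no draws), the Elo-to-score conversion is the usual logistic formula, the test is the textbook Wald SPRT, and the true strength difference of 30 Elo is a made-up value. All function names are my own.

```python
import math
import random

def expected_score(elo):
    """Logistic Elo model: expected score for a given Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def games_until_decision(elo0, elo1, true_elo, rng,
                         alpha=0.05, beta=0.05, max_games=100_000):
    """Play simulated win/loss games until the LLR crosses a Wald bound."""
    p0, p1 = expected_score(elo0), expected_score(elo1)
    p_true = expected_score(true_elo)
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    # How far a single result moves the log-likelihood ratio.
    win_step = math.log(p1 / p0)
    loss_step = math.log((1.0 - p1) / (1.0 - p0))
    llr = 0.0
    for game in range(1, max_games + 1):
        llr += win_step if rng.random() < p_true else loss_step
        if llr <= lower or llr >= upper:
            return game
    return max_games  # undecided within the cap

def average_games(elo0, elo1, true_elo, trials=200, seed=1):
    """Average test length over several simulated SPRT runs."""
    rng = random.Random(seed)
    total = sum(games_until_decision(elo0, elo1, true_elo, rng)
                for _ in range(trials))
    return total / trials

narrow = average_games(0, 20, true_elo=30)
wide = average_games(0, 500, true_elo=30)
print(f"(0, 20):  {narrow:.0f} games on average")
print(f"(0, 500): {wide:.0f} games on average")
```

In this model, each game moves the LLR by a much larger step when elo0 and elo1 are far apart, so a stopping bound is reached after only a handful of games; with (0, 20) each game moves it by less than 0.06, so at least 50 games are needed before either bound can be reached at all.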
Both of the interpretations that make sense to me could be incorrect. If my first one is incorrect, then either the CuteChess help is incorrect (and has been since at least 2015), or I'm interpreting it in the wrong way. The second interpretation has been found incorrect by empirical testing: elo (0, 500) makes the test finish faster, not slower as one would expect if you have a bigger margin to sit in.
Can someone definitively clear this up, so I can update the information on my SPRT page?