algerbrex wrote: ↑Mon Nov 08, 2021 11:09 pm
Hmm, good points. I'll make a note of them, thanks. I'd still be curious to better understand what cutechess
is doing if it isn't standard SPRT testing.
-sprt elo0=ELO0 elo1=ELO1 alpha=ALPHA beta=BETA
Use a Sequential Probability Ratio Test as a termination
criterion for the match. This option should only be used
in matches between two players to test if engine A is
stronger than engine B. Hypothesis H1 is that A is
stronger than B by at least ELO0 ELO points, and H0
(the null hypothesis) is that A is not stronger than B
by at least ELO1 ELO points. The maximum probabilities
for type I and type II errors outside the interval
[ELO0, ELO1] are ALPHA and BETA. The match is stopped if
either H0 or H1 is accepted or if the maximum number of
games set by '-rounds' and/or '-games' is reached.
The CuteChess help ties H1 to elo0, which, according to various sources, does not match what the program does internally. (I have not checked this in the code myself.) The description above is also unclear. Normally you wouldn't even have elo0 and elo1 as two separate parameters.
Normally, you state:
- H1: Engine NEW is at least 10 Elo stronger than OLD.
Then H0 automatically becomes:
- H0: Engine NEW is NOT at least 10 Elo stronger than engine OLD.
So you only have one parameter. Even if the new engine is 7 Elo stronger than the old engine, the test would still fail, because it is NOT at least 10 Elo stronger.
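To make the one-parameter reading concrete, here is a minimal Python sketch. The function name and the idea of passing in a known Elo difference are my own assumptions for illustration; a real test only ever sees game results and has to estimate the difference from them.

```python
def single_threshold_verdict(elo_diff: float, threshold: float = 10.0) -> str:
    """Say which hypothesis a *known* Elo difference satisfies.

    Hypothetical helper, for illustration only:
    H1: NEW is at least `threshold` Elo stronger than OLD.
    H0: NEW is NOT at least `threshold` Elo stronger than OLD.
    """
    return "H1" if elo_diff >= threshold else "H0"


# NEW is 7 Elo stronger: still H0, because 7 is not "at least 10".
print(single_threshold_verdict(7.0))
# NEW is 12 Elo stronger: H1.
print(single_threshold_verdict(12.0))
```

With a single threshold there is no "keep testing" zone in the hypotheses themselves: every possible Elo difference falls under either H1 or H0.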
What CuteChess actually _seems_ to be doing is testing against H1 within an elo0 and elo1 range, where elo0 and elo1 define the H1 hypothesis, but only the "elo0" part is negated:
- elo0: 0
- elo1: 10
So this means:
H1: Engine NEW is at least 0 Elo stronger than engine OLD, AND outside a margin of 10 Elo.
H0: Engine NEW is NOT at least 0 Elo stronger than engine OLD, AND outside a margin of 10 Elo.
So if NEW is 15 Elo stronger, then H1 is true: It's at least 0 Elo stronger, and it's outside a margin of 10 Elo.
If NEW is 15 Elo weaker (or "-15 stronger"), then H0 is true: It's NOT at least 0 Elo stronger (it's 15 weaker), AND it's outside a margin of 10 Elo.
This means that Cutechess will keep testing between a -10 and +10 margin.
So yes, it is an SPRT test as far as I can see (as H1 can be as complex as you want, and H0 then automatically becomes "not H1"), but it's not very well described in the help. It may actually be described WRONG in the help, but it's so... unclear to me that I can't even determine if this is true.
Therefore I just tested what Cutechess actually does, and that is what I described on that page.
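As for ALPHA and BETA from the help text: in Wald's textbook SPRT they set the two stopping bounds for the log-likelihood ratio (LLR). The sketch below shows only those bounds and the stop/continue decision. The function names are mine, and I am deliberately leaving out how the LLR is computed from game results, since I have not checked how cutechess does that.

```python
import math


def wald_bounds(alpha: float, beta: float) -> tuple[float, float]:
    """Return Wald's (lower, upper) stopping bounds for the LLR.

    alpha: maximum type I error probability (accepting H1 wrongly).
    beta:  maximum type II error probability (accepting H0 wrongly).
    """
    if not (0.0 < alpha < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("alpha and beta must lie strictly between 0 and 1")
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_decision(llr: float, alpha: float, beta: float) -> str:
    """Textbook stopping rule: compare the running LLR with the bounds."""
    lower, upper = wald_bounds(alpha, beta)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


lower, upper = wald_bounds(0.05, 0.05)
print(f"lower={lower:.3f} upper={upper:.3f}")
print(sprt_decision(1.2, 0.05, 0.05))
```

With alpha = beta = 0.05 the bounds are symmetric, about -2.944 and +2.944, so the match continues as long as the LLR stays between them.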
This behavior also makes it possible to use asymmetric bounds, for example:
- Elo0: -5
- Elo1: 7
H1: Engine NEW is expected to be at least -5 Elo stronger than OLD, AND outside a 12 Elo margin
H0: Engine NEW is NOT expected to be at least -5 Elo stronger than OLD, AND outside a 12 Elo margin
Thus Cutechess would keep testing if the NEW engine is between -5 and +7. If the engine is +15 Elo, then it is "at least -5 Elo stronger", and it's outside the 12 Elo margin, so CuteChess accepts H1. If the engine is -8 Elo, it is NOT at least -5 Elo stronger (because it's now -8) AND it's outside the 12 Elo margin (+7 - 12 = -5, and -8 is below that), so CuteChess accepts H0.
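The behavior I observed can be summed up in a small Python sketch. This is my reading of the test results, not code taken from cutechess; the function name is made up, and treating the exact boundary values as "keep testing" is an assumption on my part.

```python
def observed_verdict(elo_diff: float, elo0: float, elo1: float) -> str:
    """Model of the observed behavior for a *known* Elo difference.

    Hypothetical helper: above elo1 means H1, below elo0 means H0,
    and anything in [elo0, elo1] means the match keeps running.
    """
    if elo0 >= elo1:
        raise ValueError("elo0 must be smaller than elo1")
    if elo_diff > elo1:
        return "H1"
    if elo_diff < elo0:
        return "H0"
    return "keep testing"


# The example from this post: elo0=-5, elo1=7 (a 12 Elo margin).
for diff in (15.0, -8.0, 3.0):
    print(diff, observed_verdict(diff, elo0=-5.0, elo1=7.0))
```

For +15 this gives H1, for -8 it gives H0, and for +3 it keeps testing. In a real match the Elo difference is not known in advance, so this only shows the decision logic, not how cutechess reaches it from game results.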
This feels very logical, but if this is what the help intends to describe, it does a poor job of it.
PS: If someone can definitively prove me wrong AND clearly explain what cutechess does, I'll gladly change that page again and give proper credit.