P1 = newer version, with a small change expected to yield a nominal improvement at best, perhaps a small loss.
P2 = best known version
A 20,000-game match is run with cutechess-cli using -sprt elo0=5 elo1=1 alpha=0.05 beta=0.05
Meaning:
* H1 : "P1 is stronger than P2 by at least 5 Elo points"
* H0 : "P1 is not stronger than P2 by at least 1 Elo point"
* With alpha = beta = 0.05, there is a 5% chance of a Type I error and a 5% chance of a Type II error. In other words, there is a 5% chance of accepting a change worth less than 1 Elo and a 5% chance of rejecting a change worth more than 5 Elo.
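For reference, the stopping bounds implied by alpha and beta can be sketched with Wald's standard approximation. This is a minimal illustration, not cutechess-cli's code; the function name `sprt_bounds` is made up for this example.

```python
import math

def sprt_bounds(alpha, beta):
    """Wald's approximate SPRT stopping bounds on the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))   # LLR at or below this: accept H0
    upper = math.log((1.0 - beta) / alpha)   # LLR at or above this: accept H1
    return lower, upper

lower, upper = sprt_bounds(alpha=0.05, beta=0.05)
print(f"lbound {lower:.2f}, ubound {upper:.2f}")
```

With alpha = beta = 0.05 both bounds are ln(19) in magnitude, about 2.94.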
The match is terminated after 18,379 games, which supports the conclusion that the change is very minor. The match record:
5807 - 5491 - 7081 (0.509)
SPRT: llr -2.95 (-100.1%), lbound -2.94, ubound 2.94 - H0 was accepted
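To see how an LLR of this size can arise from the match record, here is a rough sketch using a normal approximation to the log-likelihood ratio between two fixed Elo hypotheses. This is an assumption for illustration only: cutechess-cli's internal model may differ, and the helper names `expected_score` and `llr_normal_approx` are hypothetical.

```python
import math

def expected_score(elo):
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def llr_normal_approx(wins, losses, draws, elo0, elo1):
    """Approximate LLR of 'Elo = elo1' versus 'Elo = elo0' (normal approximation)."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    mean = w + 0.5 * d                       # observed score per game
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = w * (1.0 - mean) ** 2 + d * (0.5 - mean) ** 2 + l * mean ** 2
    var_of_mean = var / n
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var_of_mean)

# Values exactly as passed on the command line: elo0=5, elo1=1.
llr = llr_normal_approx(5807, 5491, 7081, elo0=5, elo1=1)
print(f"approximate llr {llr:.2f}")
```

Under this approximation the sign of the LLR depends on which of elo0 and elo1 the observed score is closer to: a positive value favours the elo1 hypothesis, a negative value the elo0 hypothesis.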
I'm having a hard time squaring that with the match record and this line of output regarding LOS:
Elo difference: 6.0 +/- 3.9, LOS 99.9%, DrawRatio: 38.5%
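The figures in that line can be cross-checked from the match record with the usual textbook formulas. This is a sketch under assumptions: a logistic Elo model, a 95% interval from the normal approximation, and LOS computed from wins and losses only. The function name `match_summary` is hypothetical and this is not taken from cutechess-cli's source.

```python
import math

def score_to_elo(score):
    """Invert the logistic expected-score formula."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def match_summary(wins, losses, draws):
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    score = w + 0.5 * d
    # Standard error of the mean score per game.
    var = w * (1.0 - score) ** 2 + d * (0.5 - score) ** 2 + l * score ** 2
    stderr = math.sqrt(var / n)
    # Half-width of the 95% interval, converted to Elo.
    hi = score_to_elo(score + 1.96 * stderr)
    lo = score_to_elo(score - 1.96 * stderr)
    margin = (hi - lo) / 2.0
    # Likelihood of superiority: draws carry no information here.
    los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))
    return score_to_elo(score), margin, los, d

elo, margin, los, draw_ratio = match_summary(5807, 5491, 7081)
print(f"Elo {elo:.1f} +/- {margin:.1f}, LOS {los:.1%}, DrawRatio {draw_ratio:.1%}")
```

Note that LOS only asks whether P1 is stronger than P2 at all (Elo > 0), which is a different question from the one the SPRT hypotheses pose.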