Help Understanding SPRT Results

YumYum
Posts: 7
Joined: Sun Sep 08, 2024 6:23 pm
Full name: Simon Savage

Post by YumYum »

Hi Folks,

I'm currently developing an engine, "YumYum", in C#. You'll see it from time to time as a Lichess bot.

I've read a number of interesting posts here about using SPRT to judge whether changes have improved the strength of an engine and decided to give it a go using CuteChess-CLI (version 1.3.1).

I ran a tournament between two versions of the same engine which have very slightly different move ordering. So I think:
H0 = there is no difference in the performance of the two engines.
H1 = there is a difference in the performance of the two engines.

I'm not sure I fully understand the test results though and was hoping that someone more experienced could help clarify them for me please.

The following was used to start the tournament.

Code:

cutechess-cli.exe -engine name=New cmd="Chess_UCI_v2.exe" proto=uci -engine name=checks cmd="Chess_UCI_v2.exe" proto=uci option."Order on checks"=true -each tc=10+1 -concurrency 7 -sprt elo0=0 elo1=10 alpha=0.05 beta=0.05 -rounds 5000 -openings file="8moves_v3.pgn" format=pgn

and gave the following output:

Code:

Score of New vs checks: 364 - 346 - 218  [0.510] 928
...      New playing White: 181 - 167 - 116  [0.515] 464
...      New playing Black: 183 - 179 - 102  [0.504] 464
...      White vs Black: 360 - 350 - 218  [0.505] 928
Elo difference: 6.7 +/- 19.5, LOS: 75.0 %, DrawRatio: 23.5 %
SPRT: llr 0.174 (5.9%), lbound -2.94, ubound 2.94

My understanding is:
- 928 games were completed out of the maximum of 5000 specified, which indicates it played enough to reach a decision.

- The Elo difference of 6.7 +/- 19.5 suggests that the first engine ("New") is between 12.8 Elo worse and 26.2 Elo better than the engine it was being tested against ("checks"). The error margin does seem quite large in comparison to the Elo difference though. Is there any way to reduce this?

- The LOS of 75% means that the first engine is likely to be better than the second engine. However, I'm not sure what the 75% actually refers to. Clearly it doesn't mean that the first engine is likely to win 75% of matches against the second one.

- The llr is some form of measure of whether the engine is outside the specified Elo. I have no idea though what the specific numbers are telling me.
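
For reference, the Elo figure and its margin in the second bullet can be approximated from the win/loss/draw counts alone. The following is a minimal Python sketch, assuming the usual logistic Elo formula and a normal-approximation 95% interval. The function names are made up for illustration, and cutechess-cli's own calculation may differ slightly in the last digit.

```python
import math
from statistics import NormalDist


def elo_from_score(p):
    """Logistic Elo difference implied by an expected score p (0 < p < 1)."""
    return -400.0 * math.log10(1.0 / p - 1.0)


def elo_with_margin(wins, losses, draws, confidence=0.95):
    """Return (elo, margin) using a normal approximation (an assumption)."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    mu = w + d / 2.0  # mean score per game
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = w * (1.0 - mu) ** 2 + l * (0.0 - mu) ** 2 + d * (0.5 - mu) ** 2
    stderr = math.sqrt(var / n)  # standard error of the mean score
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    lo = elo_from_score(mu - z * stderr)
    hi = elo_from_score(mu + z * stderr)
    return elo_from_score(mu), (hi - lo) / 2.0


elo, margin = elo_with_margin(364, 346, 218)
print(f"Elo difference: {elo:.1f} +/- {margin:.1f}")
```

Because the standard error shrinks with the square root of the number of games, roughly four times as many games are needed to halve the margin.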

Also, I note that the output did not give me a judgment on H0 or H1. Is that to be expected? A number of other posts I've seen included things like "H0 was accepted" in the SPRT output.


As you may have gathered, I'm not a statistician so any thoughts you can offer are most appreciated.

Kind wishes

Simon
JacquesRW
Posts: 103
Joined: Sat Jul 30, 2022 12:12 pm
Full name: Jamie Whiting

Post by JacquesRW »

YumYum wrote: Sun Sep 08, 2024 7:20 pm - The llr is some form of measure of whether the engine is outside the specified Elo. I have no idea though what the specific numbers are telling me.

Also, I note that the output did not give me a judgment on H0 or H1. Is that to be expected? A number of other posts I've seen included things like "H0 was accepted" in the SPRT output.
Simon
The SPRT did not finish, so something else caused the match to end, e.g. an accidental Ctrl+C or a crash (this should be visible in the game results preceding the final output).

You can see this line:

Code:

SPRT: llr 0.174 (5.9%), lbound -2.94, ubound 2.94

This means that the SPRT will fail (H0 accepted) if llr reaches -2.94, and pass (H1 accepted) if +2.94 is reached.
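
The bounds themselves follow from the alpha and beta values given on the command line, via Wald's standard SPRT thresholds. A minimal Python sketch (the helper names are illustrative and not part of cutechess-cli):

```python
import math


def sprt_bounds(alpha, beta):
    """Wald's SPRT thresholds for the log-likelihood ratio (llr)."""
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    return lower, upper


def sprt_status(llr, alpha=0.05, beta=0.05):
    """Classify an llr value against the bounds."""
    lower, upper = sprt_bounds(alpha, beta)
    if llr >= upper:
        return "H1 accepted"
    if llr <= lower:
        return "H0 accepted"
    return "continue"


lower, upper = sprt_bounds(0.05, 0.05)
print(f"lbound {lower:.2f}, ubound {upper:.2f}")
print(sprt_status(0.174))  # test still undecided at this llr
print(f"progress towards ubound: {100 * 0.174 / upper:.1f}%")
```

With alpha = beta = 0.05 this gives bounds of about -2.94 and +2.94. An llr of 0.174 is about 5.9% of the way to the upper bound, which is consistent with the percentage shown in the output line.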
YumYum
Posts: 7
Joined: Sun Sep 08, 2024 6:23 pm
Full name: Simon Savage

Post by YumYum »

JacquesRW wrote: Sun Sep 08, 2024 7:38 pm The SPRT did not finish, so something else caused the match to end, e.g. an accidental Ctrl+C or a crash (this should be visible in the game results preceding the final output).
How interesting, Jacques. I'll need to investigate what is happening there.

Thanks for the details about LLR

Simon
YumYum
Posts: 7
Joined: Sun Sep 08, 2024 6:23 pm
Full name: Simon Savage

Post by YumYum »

Just a quick follow up to this.

Jacques was absolutely right - on very rare occasions the engine was disconnecting or abandoning a game.

Interestingly, recreating the games in my engine (using UCI) didn't throw up any issues.

Anyway, adding the "-recover" flag to my cutechess-cli script has solved the problem and I got this output...

Code:

Score of New vs checks: 1024 - 907 - 483  [0.524] 2414
...      New playing White: 503 - 453 - 251  [0.521] 1207
...      New playing Black: 521 - 454 - 232  [0.528] 1207
...      White vs Black: 957 - 974 - 483  [0.496] 2414
Elo difference: 16.9 +/- 12.4, LOS: 99.6 %, DrawRatio: 20.0 %
SPRT: llr 2.96 (100.6%), lbound -2.94, ubound 2.94 - H1 was accepted

Which indicates that the version of my engine which does not check for checks is slightly superior. I suspect that this is due to the time overhead performing those calculations.

Would someone be able to explain what the LOS: 99.6% means in these results though? Is it that I can have a 99.6% confidence in these results?

Kind regards

Simon
gaard
Posts: 460
Joined: Mon Jun 07, 2010 3:13 am
Location: Holland, MI
Full name: Martin W

Post by gaard »

YumYum wrote: Mon Sep 09, 2024 12:29 pm Would someone be able to explain what the LOS: 99.6% means in these results though? Is it that I can have a 99.6% confidence in these results?
LOS is the cumulative probability of the number of wins being <= 1024 in this case, assuming an even distribution of wins and losses. LOS and Elo aren't as useful in SPRT as they are in a test with a fixed number of games. Your alpha/beta SPRT bounds determine the confidence of this test. Not sure what cutechess-cli defaults to, maybe 95%.
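
As a rough illustration of the LOS figure, the commonly used normal approximation depends only on wins and losses (draws drop out). The sketch below is in Python. The function name is made up for illustration, and the formula is an assumption rather than something taken from the cutechess-cli source.

```python
import math


def los(wins, losses):
    """Likelihood of superiority via the normal approximation (assumed formula).

    Estimates the probability that the first engine is the stronger one,
    given only the decisive games.
    """
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


print(f"LOS: {100 * los(364, 346):.1f} %")    # first, interrupted run
print(f"LOS: {100 * los(1024, 907):.1f} %")   # second, completed run
```

Under that assumption the two results in this thread come out at about 75.0 % and 99.6 %, in line with the reported values. Note that the command shown earlier in the thread passed alpha=0.05 and beta=0.05 explicitly, so the SPRT error rates were set by hand in this test.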
YumYum
Posts: 7
Joined: Sun Sep 08, 2024 6:23 pm
Full name: Simon Savage

Post by YumYum »

gaard wrote: Mon Sep 09, 2024 3:16 pm LOS is the cumulative probability of the number of wins being <= 1024 in this case, assuming an even distribution of wins and losses. LOS and Elo aren't as useful in SPRT as they are in a test with a fixed number of games. Your alpha/beta SPRT bounds determine the confidence of this test. Not sure what cutechess-cli defaults to, maybe 95%.
Thank you that helps.

S