Help Understanding SPRT Results

YumYum · Post by **YumYum** » Sun Sep 08, 2024 7:20 pm

Hi Folks,

I'm currently developing an engine, "YumYum", in C#. You'll see it from time to time as a Lichess bot.

I've read a number of interesting posts here about using SPRT to judge whether changes have improved the strength of an engine and decided to give it a go using CuteChess-CLI (version 1.3.1).

I ran a tournament between two versions of the same engine that have very slightly different move ordering. So I think:
H0 = there is no difference in the performance of the two engines
H1 = there is a difference in the performance of the two engines.

I'm not sure I fully understand the test results though and was hoping that someone more experienced could help clarify them for me please.

The following was used to start the tournament.

cutechess-cli.exe -engine name=New cmd="Chess_UCI_v2.exe" proto=uci -engine name=checks cmd="Chess_UCI_v2.exe" 
proto=uci option."Order on checks"=true -each tc=10+1 -concurrency 7 -sprt elo0=0 elo1=10 alpha=0.05 beta=0.05 -rounds 5000 
-openings file="8moves_v3.pgn" format=pgn

and gave the following output

Score of New vs checks: 364 - 346 - 218  [0.510] 928
...      New playing White: 181 - 167 - 116  [0.515] 464
...      New playing Black: 183 - 179 - 102  [0.504] 464
...      White vs Black: 360 - 350 - 218  [0.505] 928
Elo difference: 6.7 +/- 19.5, LOS: 75.0 %, DrawRatio: 23.5 %
SPRT: llr 0.174 (5.9%), lbound -2.94, ubound 2.94

My understanding is:
- that 928 games were completed out of the maximum of 5000 specified, which indicates it played enough to reach a decision.

- the Elo difference of 6.7 +/- 19.5 suggests that the first engine ("New") is between 12.8 Elo worse and 26.2 Elo better than the engine it was being tested against ("checks"). The variance does seem quite large in comparison to the Elo difference though. Is there any way to reduce this?

- the LOS of 75% means that the first engine is likely to be better than the second engine. However, I'm not sure what the 75% actually refers to. Clearly it doesn't mean that the first engine is likely to win 75% of matches against the second one.

- the llr is some form of measure of whether the engine is outside the specified Elo range. I have no idea, though, what the specific numbers are telling me.
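For reference, the Elo figure and its error margin can be approximately reproduced from the win/loss/draw counts alone. The following is a minimal sketch, assuming the standard logistic Elo formula and a normal-approximation 95% interval on the mean score per game; the function names are made up for illustration, and cutechess-cli's own calculation may differ slightly in rounding.

```python
import math


def elo_from_score(score: float) -> float:
    """Convert an expected score in (0, 1) to a logistic Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (elo, margin) using a normal approximation (z=1.96 ~ 95%)."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    mu = w + 0.5 * d  # mean score per game
    # Variance of a single game's score (win=1, draw=0.5, loss=0).
    var = w * (1.0 - mu) ** 2 + l * (0.0 - mu) ** 2 + d * (0.5 - mu) ** 2
    se = math.sqrt(var / n)  # standard error of the mean score
    low = elo_from_score(mu - z * se)
    high = elo_from_score(mu + z * se)
    return elo_from_score(mu), (high - low) / 2.0


elo, margin = elo_with_margin(364, 346, 218)
print(f"Elo difference: {elo:.1f} +/- {margin:.1f}")
```

With the counts from the output above this lands close to the reported 6.7 +/- 19.5. Because the standard error divides by the square root of the number of games, the margin only narrows by playing more games: roughly four times as many games to halve it.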

Also, I note that the output did not give me a judgment on H0 or H1. Is that to be expected? A number of other posts I've seen included things like "H0 was accepted" in the SPRT output.

As you may have gathered, I'm not a statistician so any thoughts you can offer are most appreciated.

Kind wishes

Simon

JacquesRW · Post by **JacquesRW** » Sun Sep 08, 2024 7:38 pm

YumYum wrote: ↑Sun Sep 08, 2024 7:20 pm - The llr is some form of measure of whether the engine is outside the specified ELO. I have no idea though what the specific numbers are telling me.

Also, I note that the output did not give me a judgment on H0 or H1. Is that to be expected? A number of other posts I've seen included things like "H0 was accepted" in the SPRT output.
Simon

The SPRT did not finish, so something else caused the match to end, e.g. an accidental Ctrl+C or a crash (which should be visible in the game results preceding the final output).

You can see this line:

SPRT: llr 0.174 (5.9%), lbound -2.94, ubound 2.94

This means that the SPRT will fail (H0 accepted) if llr reaches -2.94, and pass (H1 accepted) if +2.94 is reached.
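The +/-2.94 bounds are consistent with the usual Wald thresholds for a sequential test, computed from the alpha=0.05 and beta=0.05 given on the command line. A minimal sketch, assuming that is how the bounds are derived (the function name is hypothetical):

```python
import math


def sprt_bounds(alpha: float, beta: float):
    """Wald's SPRT decision thresholds for the log-likelihood ratio.

    alpha: accepted probability of passing (H1) when H0 is true.
    beta:  accepted probability of failing (H0) when H1 is true.
    """
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    return lower, upper


lower, upper = sprt_bounds(0.05, 0.05)
print(f"lbound {lower:.2f}, ubound {upper:.2f}")
```

The percentage printed next to the llr appears to be the llr as a fraction of the bound it is heading toward: 0.174 / 2.94 is about 5.9%, i.e. the test had barely moved from zero when the match stopped.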

YumYum · Post by **YumYum** » Sun Sep 08, 2024 7:46 pm

JacquesRW wrote: ↑Sun Sep 08, 2024 7:38 pm The SPRT did not finish, so something else caused the match to end, e.g. an accidental Ctrl+C or a crash (which should be visible in the game results preceding the final output).

How interesting, Jacques. I'll need to investigate what is happening there.

Thanks for the details about LLR.

Simon

YumYum · Post by **YumYum** » Mon Sep 09, 2024 12:29 pm

Just a quick follow-up to this.

Jacques was absolutely right - on very rare occasions the engine was disconnecting or abandoning a game.

Interestingly, recreating the games in my engine (using UCI) didn't throw up any issues.

Anyway, adding the "-recover" flag to my cutechess-cli script has solved the problem and I got this output...

Score of New vs checks: 1024 - 907 - 483  [0.524] 2414
...      New playing White: 503 - 453 - 251  [0.521] 1207
...      New playing Black: 521 - 454 - 232  [0.528] 1207
...      White vs Black: 957 - 974 - 483  [0.496] 2414
Elo difference: 16.9 +/- 12.4, LOS: 99.6 %, DrawRatio: 20.0 %
SPRT: llr 2.96 (100.6%), lbound -2.94, ubound 2.94 - H1 was accepted

This indicates that the version of my engine that does not check for checks is slightly superior. I suspect that this is due to the time overhead of performing those calculations.

Would someone be able to explain what the LOS: 99.6% means in these results though? Is it that I can have a 99.6% confidence in these results?

Kind regards

Simon

gaard · Post by **gaard** » Mon Sep 09, 2024 3:16 pm

YumYum wrote: ↑Mon Sep 09, 2024 12:29 pm Just a quick follow up to this.

Jacques was absolutely right - on very rare occasions the engine was disconnecting or abandoning a game.

Interestingly, recreating the games in my engine (using UCI) didn't throw up any issues.

Anyway, adding the "-recover" flag to my cutechess-cli script has solved the problem and I got this output...
Score of New vs checks: 1024 - 907 - 483  [0.524] 2414
...      New playing White: 503 - 453 - 251  [0.521] 1207
...      New playing Black: 521 - 454 - 232  [0.528] 1207
...      White vs Black: 957 - 974 - 483  [0.496] 2414
Elo difference: 16.9 +/- 12.4, LOS: 99.6 %, DrawRatio: 20.0 %
SPRT: llr 2.96 (100.6%), lbound -2.94, ubound 2.94 - H1 was accepted
Which indicates that the version of my engine which does not check for checks is slightly superior. I suspect that this is due to the time overhead performing those calculations.

Would someone be able to explain what the LOS: 99.6% means in these results though? Is it that I can have a 99.6% confidence in these results?

Kind regards

Simon

LOS is the cumulative probability of the number of wins being <= 1024 in this case, assuming an even distribution of wins and losses. LOS and Elo aren't as useful in an SPRT as they are in a test with a fixed number of games. Your alpha/beta SPRT bounds would determine the confidence of this test. Not sure what cutechess-cli defaults to, maybe 95%.
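As a rough illustration of that definition, LOS is commonly approximated from the decisive games only (draws ignored) using the normal distribution. A minimal sketch, assuming that common approximation rather than cutechess-cli's exact code; the function name is made up for the example:

```python
import math


def likelihood_of_superiority(wins: int, losses: int) -> float:
    """Normal approximation to the probability that the first engine is stronger.

    Under the 'equal strength' assumption, wins - losses over the decisive
    games is roughly normal with mean 0 and variance wins + losses.
    """
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


print(f"First run:  {likelihood_of_superiority(364, 346):.1%}")
print(f"Second run: {likelihood_of_superiority(1024, 907):.1%}")
```

So 99.6% is a statement about how unlikely a win/loss split this lopsided would be between equal engines, not a confidence level for the SPRT verdict, which is governed by alpha and beta instead.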

YumYum · Post by **YumYum** » Mon Sep 09, 2024 6:38 pm

gaard wrote: ↑Mon Sep 09, 2024 3:16 pm LOS is the cumulative probability for the number of wins being <= 1024 in this case, assuming an even distribution of wins and losses. LOS and Elo aren't as useful in SPRT as they are in fixed number of games. Your alpha/beta SPRT bounds would determine the confidence of this test. Not sure what cutechess-cli defaults to, maybe 95%.

Thank you that helps.

S
