I'm currently developing an engine, "YumYum", in C#. You'll see it from time to time as a Lichess bot.
I've read a number of interesting posts here about using SPRT to judge whether changes have improved the strength of an engine and decided to give it a go using CuteChess-CLI (version 1.3.1).
I ran a tournament between two versions of the same engine which have very slightly different move ordering. So I think:
H0 = there is no difference in the performance of the two engines
H1 = there is a difference in the performance of the two engines.
I'm not sure I fully understand the test results though and was hoping that someone more experienced could help clarify them for me please.
The following was used to start the tournament.
Code:
cutechess-cli.exe -engine name=New cmd="Chess_UCI_v2.exe" proto=uci -engine name=checks cmd="Chess_UCI_v2.exe" proto=uci option."Order on checks"=true -each tc=10+1 -concurrency 7 -sprt elo0=0 elo1=10 alpha=0.05 beta=0.05 -rounds 5000 -openings file="8moves_v3.pgn" format=pgn
Code:
Score of New vs checks: 364 - 346 - 218 [0.510] 928
... New playing White: 181 - 167 - 116 [0.515] 464
... New playing Black: 183 - 179 - 102 [0.504] 464
... White vs Black: 360 - 350 - 218 [0.505] 928
Elo difference: 6.7 +/- 19.5, LOS: 75.0 %, DrawRatio: 23.5 %
SPRT: llr 0.174 (5.9%), lbound -2.94, ubound 2.94
My understanding is:
- 928 games were completed out of the maximum of 5000 specified, which indicates it played enough to reach a decision.
- The Elo difference of 6.7 +/- 19.5 suggests that the first engine ("New") is between 12.8 Elo worse and 26.2 Elo better than the engine it was being tested against ("checks"). The margin does seem quite large in comparison to the Elo difference though. Is there any way to reduce this?
- The LOS of 75% means that the first engine is likely to be better than the second engine. However, I'm not sure what the 75% actually refers to. Clearly it doesn't mean that the first engine is likely to win 75% of matches against the second one.
- The llr is some form of measure of whether the engine is outside the specified Elo range. I have no idea though what the specific numbers are telling me.
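For anyone wanting to check the Elo and LOS figures, the sketch below recomputes them from the win/loss/draw counts in the output above. It assumes the usual logistic Elo model, a normal approximation for the error margin, and the common erf-based LOS formula; whether CuteChess-CLI 1.3.1 computes its numbers in exactly this way is an assumption, and the function names are made up for illustration.

```python
import math
from statistics import NormalDist


def elo_from_score(score):
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_stats(wins, losses, draws, confidence=0.95):
    """Return (elo, margin, los) for a match result.

    Assumption: normal approximation of the mean score, as commonly
    used by engine-testing tools. Not taken from the cutechess source.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points per game).
    var = (wins * (1.0 - score) ** 2
           + losses * (0.0 - score) ** 2
           + draws * (0.5 - score) ** 2) / n
    stderr = math.sqrt(var / n)  # standard error of the mean score
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    margin = (elo_from_score(score + z * stderr)
              - elo_from_score(score - z * stderr)) / 2.0
    # Likelihood of superiority: draws carry no information here.
    los = 0.5 * (1.0 + math.erf((wins - losses)
                                / math.sqrt(2.0 * (wins + losses))))
    return elo_from_score(score), margin, los


elo, margin, los = match_stats(364, 346, 218)
print(f"Elo difference: {elo:.1f} +/- {margin:.1f}, LOS: {100 * los:.1f} %")
```

Read this way, LOS is the estimated probability that "New" is stronger than "checks" by any amount at all, not a win rate. The margin depends on the standard error, which shrinks with the square root of the number of games, so halving it takes roughly four times as many games.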
Also, I note that the output did not give me a judgment on H0 or H1. Is that to be expected? A number of other posts I've seen included things like "H0 was accepted" in the SPRT output.
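As a rough illustration of the SPRT line, the sketch below computes Wald's stopping bounds from alpha and beta and estimates the log-likelihood ratio (llr) from the game counts. In an SPRT of this kind the two hypotheses are normally the elo0 and elo1 values themselves (here, a difference of 0 Elo versus a difference of 10 Elo). The bounds formula is standard; the llr formula is a normal approximation that is an assumption here and may differ from what CuteChess-CLI 1.3.1 does internally. All function names are hypothetical.

```python
import math


def sprt_bounds(alpha, beta):
    """Wald's SPRT stopping bounds for the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def expected_score(elo):
    """Expected score for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def approx_llr(wins, losses, draws, elo0, elo1):
    """Approximate LLR of H1 (elo1) against H0 (elo0).

    Assumption: normal approximation of the mean score; not the
    exact formula from the cutechess source.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2
           + losses * (0.0 - score) ** 2
           + draws * (0.5 - score) ** 2) / n
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * var)


def sprt_verdict(llr, lower, upper):
    """The test only stops once the LLR crosses one of the bounds."""
    if llr >= upper:
        return "H1 accepted"
    if llr <= lower:
        return "H0 accepted"
    return "undecided"


lower, upper = sprt_bounds(alpha=0.05, beta=0.05)
llr = approx_llr(364, 346, 218, elo0=0, elo1=10)
print(f"llr {llr:.3f}, lbound {lower:.2f}, ubound {upper:.2f}: "
      f"{sprt_verdict(llr, lower, upper)}")
```

Under these assumptions the llr lands close to the reported 0.174, which lies between the two bounds, so no verdict on H0 or H1 would be expected at this point. The "5.9%" in the output appears to be the llr as a fraction of the upper bound (0.174 / 2.94).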
As you may have gathered, I'm not a statistician so any thoughts you can offer are most appreciated.
Kind wishes
Simon