What to do when relative strength between engine versions is inconsistent


Tearth
Posts: 70
Joined: Thu Feb 25, 2021 5:12 pm
Location: Poland
Full name: Pawel Osikowski

Re: What to do when relative strength between engine versions is inconsistent

Post by Tearth »

This is how I test Inanis on Linux:
nice -n -20 ./cutechess-cli \
-concurrency 3 \
-tournament gauntlet -rounds 5000 -games 2 -repeat -ratinginterval 1 -recover \
-engine cmd="./engines/inanis" name="Inanis DEV" proto=uci option."Crash Files"=true \
-engine cmd="./engines/2800-2900/asymptote" proto=uci \
-engine cmd="./engines/2800-2900/gnucheese-1.00-64" proto=uci \
-engine cmd="./engines/2800-2900/daydreamer" name="daydreamer" proto=uci \
-engine cmd="./engines/2800-2900/Weiawaga" proto=uci \
-engine cmd="./engines/2800-2900/zurichess-luzern-linux-amd64" proto=uci \
-engine cmd="./engines/2800-2900/MinkoChess_1.3_x64" proto=uci \
-engine cmd="./engines/2800-2900/inanis_1_1_1" proto=uci \
-engine cmd="./engines/2800-2900/inanis_1_2_0" proto=uci \
-each tc=inf/2+0.1 timemargin=100 book=/home/ubuntu/books/Perfect2019.bin bookdepth=8 option.Hash=32 \
-resign movecount=3 score=300 twosided=true \
-draw movenumber=50 movecount=5 score=50 \
-maxmoves 100 \
-tb /home/ubuntu/syzygy/
I have 4 cores available on my machine, so one of them is dedicated to the lichess bot and the rest to cutechess. Cutechess runs at maximum priority so the engines won't be disrupted by scheduler hiccups, which is very important at such fast time controls, where every millisecond matters for overall Elo stability. "tc=inf/2+0.1" may sound pretty fast, but engines in the 2800-2900 Elo range are speedy enough to reach a reasonable depth in ~100 milliseconds, and none of them loses on time. "Resign" adjudication has an aggressive threshold to save time on already decided games. "twosided" is a very important flag: it prevents one of the engines from "cheating" and declaring a win just by having a much higher evaluation than the opponent when the position is still not that conclusive.
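To make the effect of "twosided" concrete, here is a minimal sketch of the rule as a simplified model. It is not cutechess-cli's actual implementation; the function name `resign_adjudication` and the per-engine score lists are hypothetical, and scores are assumed to be centipawns from each engine's own point of view.

```python
from typing import Optional, Sequence

def resign_adjudication(scores_a: Sequence[int], scores_b: Sequence[int],
                        movecount: int = 3, score: int = 300,
                        twosided: bool = True) -> Optional[str]:
    """Return "A" or "B" for the side adjudicated as lost, or None.

    Simplified model: each list holds one engine's own evaluations in
    centipawns, most recent last.
    """
    def all_recent(own: Sequence[int], predicate) -> bool:
        recent = own[-movecount:]
        return len(recent) == movecount and all(predicate(s) for s in recent)

    def losing(own: Sequence[int]) -> bool:
        return all_recent(own, lambda s: s <= -score)

    def winning(own: Sequence[int]) -> bool:
        return all_recent(own, lambda s: s >= score)

    for loser, own, other in (("A", scores_a, scores_b),
                              ("B", scores_b, scores_a)):
        # One-sided: the pessimistic engine alone ends the game.
        # Two-sided: the opponent must also see a decisive advantage.
        if losing(own) and (not twosided or winning(other)):
            return loser
    return None

# Hypothetical scores: A is very pessimistic, B sees only a small edge.
a = [-350, -400, -420]
b = [120, 150, 180]
print(resign_adjudication(a, b, twosided=False))  # A
print(resign_adjudication(a, b, twosided=True))   # None
```

In this model the game is cut short only when both evaluations agree for three consecutive moves, so a single miscalibrated evaluation cannot decide the result on its own.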

I usually play 20,000 games, which takes around 24 hours to complete, and the result is within ±4 Elo. This is still a lot if the changes are subtle, so sometimes I have to use my intuition to decide whether something was worth it. I never use SPRT, since I generally don't trust self-testing: engines are different, and we can't just assume that a change (especially in the evaluation) will also work against other opponents.
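For anyone who wants to check where a figure like ±4 Elo comes from, here is a rough sketch using the usual logistic Elo model and a normal approximation to the score. The win/draw/loss record is hypothetical (I am assuming an even score and a 30% draw rate just for illustration), and the function names are made up.

```python
import math

def elo_from_score(score: float) -> float:
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_with_margin(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, lower, upper) for a W/D/L record, normal approximation."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    lower = elo_from_score(score - z * stderr)
    upper = elo_from_score(score + z * stderr)
    return elo_from_score(score), lower, upper

# Hypothetical record: 20,000 games, even score, 30% draws.
elo, lower, upper = elo_with_margin(7000, 6000, 7000)
print(f"{elo:+.1f} Elo, 95% interval [{lower:+.1f}, {upper:+.1f}]")
```

With these assumed numbers the 95% interval comes out at roughly ±4 Elo. A higher draw rate narrows it and a lower one widens it; the sketch also assumes the score is not exactly 0 or 1.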
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: What to do when relative strength between engine versions is inconsistent

Post by j.t. »

KhepriChess wrote: Wed Feb 01, 2023 4:05 am "Concurrency 30" - Aren't the processes going to trip over each other fighting for compute time with so many running at once? I've always seen people say to do max concurrency at either the number of threads or cores on the CPU. Which I guess is possible if you have some crazy CPU?
This exact script is for a PC with an i9-13900K which has 24 cores/32 threads. You're of course right, playing more concurrent games than you have threads wouldn't make much sense.
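As a rule of thumb, the limit can be written down in a couple of lines. This is only a sketch: `pick_concurrency` and its `reserved` parameter are hypothetical names, and it assumes pondering is off so that each game keeps at most one hardware thread busy at a time.

```python
import os
from typing import Optional

def pick_concurrency(reserved: int = 1, logical_cpus: Optional[int] = None) -> int:
    """Suggest a -concurrency value: hardware threads minus a few kept free."""
    # os.cpu_count() reports logical CPUs and may return None.
    cpus = logical_cpus if logical_cpus is not None else (os.cpu_count() or 1)
    return max(1, cpus - reserved)

print(pick_concurrency(reserved=2, logical_cpus=32))  # 30
print(pick_concurrency(reserved=1, logical_cpus=4))   # 3
print(pick_concurrency())  # depends on the machine running this
```

The first call reproduces the 30 concurrent games on a 32-thread CPU, the second the 3 games on a 4-core machine with one core kept for something else.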
JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: What to do when relative strength between engine versions is inconsistent

Post by JoAnnP38 »

j.t. wrote: Wed Feb 01, 2023 9:11 pm
KhepriChess wrote: Wed Feb 01, 2023 4:05 am "Concurrency 30" - Aren't the processes going to trip over each other fighting for compute time with so many running at once? I've always seen people say to do max concurrency at either the number of threads or cores on the CPU. Which I guess is possible if you have some crazy CPU?
This exact script is for a PC with an i9-13900K which has 24 cores/32 threads. You're of course right, playing more concurrent games than you have threads wouldn't make much sense.
I am currently running 8 simultaneous games on an AMD Ryzen 9 6900HX with 8 cores and 16 threads, which puts a nearly constant 60-65% load on my system. I could probably run another 4 at my current load. There is a certain amount of time wasted in context switching and waiting on I/O that might as well be used on a few more games. And while I've never found "hyper-threads" to be very effective, they do allow a little more parallelism, just not as much as another core would.
syzygy
Posts: 5695
Joined: Tue Feb 28, 2012 11:56 pm

Re: What to do when relative strength between engine versions is inconsistent

Post by syzygy »

jmcd wrote: Mon Jan 30, 2023 8:47 pm I've been having a problem lately where I make changes to my engine, it beats the previous version by a significant margin, but then when I test it against an even older release, it loses or goes even. Basically a paper scissors rock situation. What is the best way to handle this sort of problem?
If this happens a lot, something may be wrong with the way you are testing. It can of course happen in chess that A>B, B>C and C>A, but it should be rare.
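One quick way to see whether an apparent A>B, B>C, C>A cycle is more than noise is to compute the likelihood of superiority (LOS) for each pairing. The sketch below uses the common approximation based on wins and losses only; the match results in it are hypothetical.

```python
import math

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority; draws carry no information and are ignored."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses)
                                 / math.sqrt(2.0 * (wins + losses))))

# Hypothetical win/loss counts for the three pairings of a suspected cycle.
pairings = {
    "A vs B": (230, 200),
    "B vs C": (215, 205),
    "C vs A": (212, 208),
}
for name, (w, l) in pairings.items():
    print(f"{name}: LOS = {los(w, l):.1%}")
```

If none of the three values is close to 100% (a threshold such as 95% is a common convention, not a rule), the "cycle" is consistent with three versions of nearly equal strength and too few games to separate them.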

Maybe you are testing on openings with insufficient variation, and the weakest of the three versions, when playing the strongest one, simply tends to play into winning variations by "dumb" luck.
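A crude check for this is to count how many distinct opening lines the test games actually went through. The sketch below assumes the games have already been extracted as lists of moves (the parsing step is left out), and the sample data and the name `opening_diversity` are made up.

```python
from collections import Counter

def opening_diversity(games, plies=8):
    """Count how often each opening prefix (first `plies` half-moves) occurs."""
    return Counter(tuple(game[:plies]) for game in games)

# Hypothetical games, each given as a list of moves.
games = [
    ["e4", "e5", "Nf3", "Nc6"],
    ["e4", "e5", "Nf3", "Nc6"],
    ["d4", "d5", "c4", "e6"],
    ["e4", "c5", "Nf3", "d6"],
]
counts = opening_diversity(games, plies=4)
print(len(counts), "distinct openings in", len(games), "games")
print("most common:", counts.most_common(1))
```

If a handful of lines accounts for most of the games, the result says more about those lines than about the engines.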

If your testing method is sound and (therefore) the inconsistencies are rare, I would just ignore them. Don't test against every older version and instead use your resources to test new ideas.