Self testing vs Gauntlets

jackk03
Posts: 14
Joined: Wed Jul 12, 2023 1:38 pm
Full name: Giacomo Porpiglia

Self testing vs Gauntlets

Post by jackk03 »

Hi, I'm currently developing a new version of my engine, and I keep running into the same problem: I don't have much hardware to test with, so I can only run a few hundred games, for example 200 games at 10+0.3.
This should still be enough, though, since I'm not looking for improvements of 3-10 Elo but for larger ones that should be visible even in a few hundred games.
Is it better to test against the previous version or to run a gauntlet against many other engines? I ask because I know a gauntlet is preferable when you can afford one, but with a small number of games its results have huge +- error margins, so I don't know whether I can still trust them.
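To put a number on those +- errors, here is a minimal Python sketch. It assumes a normal approximation on the per-game score and the usual logistic Elo model; the function names are made up for illustration and do not come from any testing tool.

```python
import math

def elo_from_score(score: float) -> float:
    """Convert a score fraction (0 < score < 1) into an Elo difference
    using the logistic Elo model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_margin(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, lower, upper) for a match result.

    Assumption: the mean score is approximately normal, so the interval
    is score +/- z * standard error (z = 1.96 is roughly 95%).
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    # Clamp so the logarithm stays defined for very lopsided results.
    eps = 1e-6
    lo_score = max(eps, score - z * stderr)
    hi_score = min(1.0 - eps, score + z * stderr)
    return elo_from_score(score), elo_from_score(lo_score), elo_from_score(hi_score)

if __name__ == "__main__":
    # Hypothetical 200-game result, not data from this thread.
    elo, lo, hi = elo_with_margin(wins=70, draws=80, losses=50)
    print(f"Elo {elo:+.1f}, interval [{lo:+.1f}, {hi:+.1f}]")
```

For the hypothetical result above (70 wins, 80 draws, 50 losses) this gives roughly +35 Elo with an interval from about -2 to +73, which shows how wide the margin still is at 200 games.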
pgg106
Posts: 25
Joined: Wed Mar 09, 2022 3:40 pm
Full name: . .

Re: Self testing vs Gauntlets

Post by pgg106 »

Test against the previous version, preferably running a proper SPRT instead of eyeballing whether the change is good or not. The fact that you can only run 200 games at "10+0.3" makes me fear you aren't talking about 10 seconds but 10 minutes; if that's the case, I suggest drastically reducing the time control.
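For readers who have not met SPRT (sequential probability ratio test) before, here is a rough Python sketch of the stopping rule. It assumes a normal approximation of the log-likelihood ratio and the logistic Elo model; the bounds elo0=0, elo1=10 and alpha=beta=0.05 are illustrative values chosen here, and real match runners implement their own variants.

```python
import math

def expected_score(elo: float) -> float:
    """Expected score for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins: int, draws: int, losses: int, elo0: float, elo1: float) -> float:
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0).

    Assumption: normal approximation, LLR = N*(s1-s0)*(2*mean-s0-s1)/(2*var).
    """
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * mean ** 2) / n
    if var == 0.0:
        return 0.0  # all games had the same result; no usable estimate yet
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)

def sprt_decision(wins, draws, losses, elo0=0.0, elo1=10.0, alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0' or 'continue' after the games so far."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"
```

The point of the test is that it is re-evaluated as games come in and stops as soon as one bound is crossed, so a clearly good or clearly bad patch needs far fewer games than a fixed-length match.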
To anyone reading this post in the future, don't ask for help on talkchess, it's a dead site where you'll only get led astray, the few people talking sense here come from the Stockfish discord server, just join it and actual devs will help you.
jackk03
Posts: 14
Joined: Wed Jul 12, 2023 1:38 pm
Full name: Giacomo Porpiglia

Re: Self testing vs Gauntlets

Post by jackk03 »

Thanks for the suggestions :)
RRr
Posts: 2
Joined: Thu Aug 03, 2023 1:12 am
Full name: Raoul Voetdijk

Re: Self testing vs Gauntlets

Post by RRr »

Testing against your own engine might lead to narrow strategies that only work against your own engine, potentially resulting in a false sense of progress. This risk might be lower when seeking significant Elo increases, but I've had it happen to me, and I found it quite disheartening.

Besides lowering the time control, I would suggest searching for opponent engines of similar strength, or maybe experimenting with giving stronger engines less time to weaken them. The latter approach might also help reduce compute costs. On CCRL you can look up engines by Elo; the entries often link to a GitHub repository with the source (and, if you are feeling dangerous, an executable).
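One way to make "similar strength" concrete is to keep only opponents against which the expected score is not too lopsided. The sketch below assumes the logistic Elo model; the engine names, ratings and the 25%-75% window are hypothetical placeholders, not CCRL data.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score when we are elo_diff points stronger (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def pick_opponents(my_elo: float, candidates: dict,
                   min_score: float = 0.25, max_score: float = 0.75):
    """Return (name, elo, expected_score) for candidates inside the score window,
    sorted by rating."""
    chosen = []
    for name, elo in candidates.items():
        s = expected_score(my_elo - elo)
        if min_score <= s <= max_score:
            chosen.append((name, elo, round(s, 3)))
    return sorted(chosen, key=lambda item: item[1])

if __name__ == "__main__":
    # Hypothetical rating list for illustration only.
    rating_list = {"EngineA": 2300, "EngineB": 2450, "EngineC": 2550, "EngineD": 2900}
    for name, elo, score in pick_opponents(2500, rating_list):
        print(name, elo, score)
```

Opponents far outside the window mostly produce one-sided results, which carry little information per game.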
Graham Banks
Posts: 41654
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: Self testing vs Gauntlets

Post by Graham Banks »

As a tester, I always estimate the real gain at around 70% of what an engine author claims from self-testing.
gbanksnz at gmail.com
pgg106
Posts: 25
Joined: Wed Mar 09, 2022 3:40 pm
Full name: . .

Re: Self testing vs Gauntlets

Post by pgg106 »

Anything that isn't self-testing breaks the logic and the common tools used to run SPRTs, and anything that isn't an SPRT isn't a reliable way of testing for engine improvements (at least once they stop being three digits big). Doing or not doing a gauntlet isn't a choice: you just can't do it if you want "serious" testing.
As an aside, one might run a gauntlet as a form of progression test before a major release or to estimate an initial Elo rating; that's perfectly valid, but this is about patch-to-patch testing.
Definitely drop the time control to something STC-like; most commonly we devs use time controls in the range from 10s+0.1s down to 5s+0.05s.
JoAnnP38
Posts: 248
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: Self testing vs Gauntlets

Post by JoAnnP38 »

You should use self-testing to determine whether a new feature makes your current engine stronger or weaker than your previous engine, and you should use a diverse gauntlet to estimate the size of that gain. During a development/release cycle, I use SPRT tests with self-testing to determine whether a new feature is good or not (i.e. +Elo or -Elo). If it is good, I commit the change to source control; otherwise I abandon it. Once I have implemented all the features I had planned for the release, I test my engine against a gauntlet of 10-15 other engines, about half of them weaker and half stronger, to estimate the Elo gain against my opponents.
jackk03
Posts: 14
Joined: Wed Jul 12, 2023 1:38 pm
Full name: Giacomo Porpiglia

Re: Self testing vs Gauntlets

Post by jackk03 »

Thanks to everybody. I thought that testing at low times was not a good idea because it's more rare to reach big depths and therefore see some behaviors that only at high depths are visible, but thanks. I'll make more games at low times in self testing :)
pgg106
Posts: 25
Joined: Wed Mar 09, 2022 3:40 pm
Full name: . .

Re: Self testing vs Gauntlets

Post by pgg106 »

jackk03 wrote: Fri Mar 08, 2024 7:30 pm I thought that testing at low times was not a good idea because it's more rare to reach big depths and therefore see some behaviors that only at high depths are visible
When you are testing stuff that is just generally good (e.g. LMR, NMP, SEE), it is so good that the average depth reached doesn't matter; it is simply better.
People with enough hardware who are fairly advanced in the dev lifecycle tend to pair STC tests (8s+0.08s) with LTC tests (40s+0.4s), but that's very prohibitive hardware-wise imo.
Fwiw, more than one engine in the CCRL top 15 makes do with just STCs without particularly egregious scaling behaviour, and even just proper STC SPRTs are better than eyeballing stuff at 100h + 12 days.
Dann Corbit
Posts: 12566
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Self testing vs Gauntlets

Post by Dann Corbit »

jackk03 wrote: Fri Mar 08, 2024 7:30 pm Thanks to everybody. I thought that testing at low times was not a good idea because it's more rare to reach big depths and therefore see some behaviors that only at high depths are visible, but thanks. I'll make more games at low times in self testing :)
The Stockfish team has VLTC and VVLTC tests, which are run only when the change is expected to be depth-related.
Of course, they have extreme resources that normal small teams will not have.

So, for instance, if you make a change to your null move pruning, I would recommend running a long contest (as long as possible given your time constraints). Work out how much time you have (e.g. one two-day weekend ==> 48 hours) and divide it by 400 to get the time budget per game for 400 games. I have seen statistical evidence that 200 games is definitely not enough. Generally, 800 is recommended, but 400 is workable.
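The division above can be sketched in Python as follows. The helper names, the 60 moves per side and the 1/100 increment ratio are assumptions made for illustration (the ratio simply mirrors time controls such as 10s+0.1s), and the budget is an upper bound because games rarely use the full clock.

```python
def per_game_budget(total_hours: float, games: int, concurrency: int = 1):
    """Return (seconds per game, seconds per side) for a fixed wall-clock budget.

    concurrency is the number of games played in parallel (assumed 1 by default).
    """
    total_seconds = total_hours * 3600 * concurrency
    per_game = total_seconds / games
    return per_game, per_game / 2

def suggest_time_control(per_side_seconds: float,
                         moves_per_side: int = 60,
                         inc_ratio: float = 0.01):
    """Split a per-side budget into (base, increment) so that
    base + moves_per_side * increment == per_side_seconds,
    with increment = base * inc_ratio."""
    base = per_side_seconds / (1 + moves_per_side * inc_ratio)
    return base, base * inc_ratio

if __name__ == "__main__":
    per_game, per_side = per_game_budget(total_hours=48, games=400)
    base, inc = suggest_time_control(per_side)
    print(f"{per_game:.0f}s per game, {per_side:.0f}s per side, tc ~ {base:.0f}+{inc:.2f}")
```

For 48 hours and 400 games played one at a time, this comes to 432 seconds per game, or 216 seconds per side; doubling to 800 games halves both figures.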

But if you are tuning a PST, I don't see why you would need really long time control.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.