Self testing vs Gauntlets

jackk03 · Post by **jackk03** » Wed Mar 06, 2024 5:34 pm

Hi, I'm currently developing a new version of my engine, and I keep running into this problem: since I don't have much hardware for testing, I can only run a few hundred games, for example 200 games at 10+0.3.
This should still be enough, though, since I'm not looking for improvements of 3-10 Elo but for larger ones, which should be visible even with a few hundred games.
Is it better to test against the previous version, or to run a gauntlet against many other engines? I ask because I know a gauntlet is better if you can afford one, but with a small number of games the gauntlet results have huge +/- error margins, so I don't know if I can still trust them.
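To put a number on those error margins, here is a minimal sketch that turns a win/draw/loss record into an Elo estimate with an approximate 95% interval. It assumes the usual logistic Elo model and a normal approximation for the mean score; the function names and the example record are made up for illustration and are not taken from any testing tool.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by an average score, using the logistic model."""
    # Clamp so that 0% or 100% scores do not blow up the logarithm.
    score = min(max(score, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, low, high) where low/high is an approximate 95% interval."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - z * stderr),
            elo_from_score(score + z * stderr))


if __name__ == "__main__":
    # Hypothetical 200-game match: 70 wins, 80 draws, 50 losses (55% score).
    elo, low, high = elo_with_margin(70, 80, 50)
    print(f"Elo {elo:+.1f}, 95% interval [{low:+.1f}, {high:+.1f}]")
```

With that hypothetical record the estimate is roughly +35 Elo, but the interval spans several tens of Elo on either side and still includes zero, which is why 200 games can only confirm fairly large gains.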

pgg106 · Post by **pgg106** » Thu Mar 07, 2024 1:27 pm

Test against the previous version, preferably by running a proper SPRT instead of eyeballing whether the change is good or not. The fact that you can only run 200 games at "10+0.3" makes me fear you aren't talking about 10 seconds but 10 minutes; if that's the case, I suggest drastically reducing the time control.

jackk03 · Post by **jackk03** » Thu Mar 07, 2024 5:39 pm

Thanks for the suggestions

RRr · Post by **RRr** » Thu Mar 07, 2024 9:12 pm

Testing against your own engine might lead to narrow strategies that only work against your own engine, potentially resulting in a false sense of progress. This risk might be lower when seeking significant Elo increases, but I've had it happen to me, and I found it quite disheartening.

Besides lowering the time control, I would suggest searching for opponent engines of similar strength, or maybe experimenting with giving stronger engines less time to weaken them. The latter approach might also help reduce compute costs. On CCRL you can look up engines by Elo; the entries often link to a GitHub repository with the source (and, if you are feeling dangerous, an executable).

Graham Banks · Post by **Graham Banks** » Thu Mar 07, 2024 10:02 pm

As a tester, I always assume the real gain is around 70% of what an engine author claims from self-testing.

pgg106 · Post by **pgg106** » Thu Mar 07, 2024 10:09 pm

Anything that isn't self-testing breaks the logic of SPRT and the common tools used to run it, and anything that isn't an SPRT isn't a reliable way of testing for engine improvements (at least once the gains stop being three digits big). Doing or not doing a gauntlet isn't a choice; you just can't use one if you want "serious" testing.
As an aside, one might run a gauntlet as a form of progression test before a major release, or to estimate an initial Elo rating. That's perfectly valid, but this thread is about patch-to-patch testing.
Definitely drop the time control to something STC-like; most commonly we devs use time controls in the range from 10s+0.1s down to 5s+0.05s.
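For readers who haven't seen what an SPRT actually computes, here is a rough sketch. It uses a normal approximation to the log-likelihood ratio (LLR) over win/draw/loss counts, which is not necessarily the exact formula any particular testing tool implements. The hypotheses elo0=0 and elo1=10 and the 5% error rates are illustrative assumptions, and all function names are hypothetical.

```python
import math


def expected_score(elo: float) -> float:
    """Expected score for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_llr(wins: int, draws: int, losses: int,
             elo0: float, elo1: float) -> float:
    """Approximate LLR of H1 (gain = elo1) against H0 (gain = elo0)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the result: E[x^2] - E[x]^2.
    var = (wins + 0.25 * draws) / n - mean ** 2
    if var <= 0.0:
        return 0.0  # Degenerate sample (all games had the same result).
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_decision(wins: int, draws: int, losses: int,
                  elo0: float = 0.0, elo1: float = 10.0,
                  alpha: float = 0.05, beta: float = 0.05) -> str:
    """Return 'accept H1', 'accept H0' or 'continue'."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


if __name__ == "__main__":
    # Hypothetical running totals for new version vs. previous version.
    print(sprt_decision(120, 200, 80))
    print(sprt_decision(200, 300, 100))
```

The point of the sequential test is that you check the LLR as games come in and stop as soon as it crosses either bound, so clearly good or clearly bad patches finish early instead of consuming a fixed game budget.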

JoAnnP38 · Post by **JoAnnP38** » Fri Mar 08, 2024 2:03 am

You should use self-testing to determine whether a new feature makes your current engine stronger or weaker than your previous engine, and you should use a diverse gauntlet to estimate the size of that gain. During a development/release cycle, I use SPRT tests with self-testing to determine whether a new feature is good or not (i.e. +Elo or -Elo). If it is good, I commit the change to source control; otherwise I abandon the change. Once I have implemented all the features I had planned for the release, I test my engine against a gauntlet of 10-15 other engines, about half of them weaker and half stronger, to estimate the Elo gain against my opponents.

jackk03 · Post by **jackk03** » Fri Mar 08, 2024 7:30 pm

Thanks to everybody. I thought that testing at short time controls was not a good idea because the engine rarely reaches large depths, so behaviors that only show up at high depths are missed, but thanks. I'll run more games at short time controls in self-testing.

pgg106 · Post by **pgg106** » Fri Mar 08, 2024 11:55 pm

jackk03 wrote: ↑Fri Mar 08, 2024 7:30 pm I thought that testing at short time controls was not a good idea because the engine rarely reaches large depths, so behaviors that only show up at high depths are missed

When you are testing stuff that is just generally good (e.g. LMR, NMP, SEE), it is so good that the average depth reached doesn't matter; it is simply better.
People with enough hardware who are fairly advanced in the dev lifecycle tend to pair STC tests (8s+0.08s) with LTC tests (40s+0.4s), but that's very prohibitive hardware-wise imo.
Fwiw, more than one engine in the CCRL top 15 makes do with just STC without particularly egregious scaling behaviour, and even just proper STC SPRTs are better than eyeballing stuff at 100h + 12 days.

Dann Corbit · Post by **Dann Corbit** » Sat Mar 16, 2024 5:14 pm

jackk03 wrote: ↑Fri Mar 08, 2024 7:30 pm Thanks to everybody. I thought that testing at short time controls was not a good idea because the engine rarely reaches large depths, so behaviors that only show up at high depths are missed, but thanks. I'll run more games at short time controls in self-testing.

The Stockfish team has VLTC and VVLTC tests, which are run only when the effect of a change is expected to depend on depth.
Of course, they have extreme resources that normal small teams will not have.

So, for instance, if you make a change to your null move pruning, I would recommend running a long contest (as long as possible given your time constraints). You can see how much time you have (e.g. one two-day weekend ==> 48 hours) and then divide that time by 400 to get the time available per game, so that you can fit in 400 games. I have seen statistical evidence that 200 games is definitely not enough. Generally, 800 is recommended, but 400 is workable.
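As a rough illustration of that arithmetic, the sketch below divides a wall-clock budget by the number of games and then backs out a base+increment time control that should fit. The 60 moves per side and the increment of 1% of the base time are assumptions chosen only for the example (the 1% ratio mirrors controls like 10s+0.1s), and the function names are hypothetical.

```python
def per_game_budget_seconds(total_hours: float, games: int,
                            concurrency: int = 1) -> float:
    """Wall-clock seconds available per game, given parallel game slots."""
    return total_hours * 3600.0 * concurrency / games


def suggest_time_control(per_game_seconds: float,
                         moves_per_side: int = 60,
                         inc_ratio: float = 0.01):
    """Return (base, increment) in seconds for one side.

    Assumes each side may use at most half of the per-game budget and that
    a game lasts `moves_per_side` moves, so that
    base + increment * moves_per_side equals the per-side budget.
    """
    per_side = per_game_seconds / 2.0
    base = per_side / (1.0 + inc_ratio * moves_per_side)
    return base, base * inc_ratio


if __name__ == "__main__":
    # One weekend (48 hours), 400 games, one game at a time.
    budget = per_game_budget_seconds(48, 400)
    base, inc = suggest_time_control(budget)
    print(f"{budget:.0f} s per game -> about {base:.0f}s+{inc:.2f}s")
```

Under these assumptions, 48 hours and 400 games leave 432 seconds per game, or something like 135s+1.35s per side. Running several games concurrently raises the per-game budget proportionally, provided each game gets its own core.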

But if you are tuning a PST, I don't see why you would need a really long time control.
