Self testing vs Gauntlets

JacquesRW · Post by **JacquesRW** » Sun Mar 17, 2024 5:59 pm

Dann Corbit wrote: ↑Sat Mar 16, 2024 5:14 pm Generally, 800 is recommended, but 400 is workable.

Recommended by whom? This is terrible advice.
Here's an example for a change after playing 800 games and getting the results wins=270, losses=230, draws=300:

py sprt.py --wins 270 --losses 230 --draws 300 --elo0 0 --elo1 10
ELO: 17.4 +- 19.0 [-1.6, 36.5]
LLR: 1.19 [0.0, 10.0] (-2.94, 2.94)
Continue Playing

This is far from passing an SPRT, even with these very loose bounds; with the standard [0, 3] or [0, 5] it would be further away still. Now imagine a change that is only worth ~10 Elo: what conclusions can you reasonably draw from 800 games?
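For readers who want to check the figures, here is a minimal sketch of how output like the above can be computed. The `sprt.py` script itself is not shown in the thread, so the method here is an assumption: a trinomial log-likelihood ratio under the BayesElo win/draw/loss model, with `elo0` and `elo1` read as BayesElo values and alpha = beta = 0.05. All function names are hypothetical.

```python
import math

def score_to_elo(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins, draws, losses, z=1.96):
    """Point estimate and ~95% interval, treating games as independent."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    margin = z * math.sqrt(var / n)
    return score_to_elo(score), score_to_elo(score - margin), score_to_elo(score + margin)

def bayeselo_probs(elo, drawelo):
    """Win, draw, loss probabilities under the BayesElo model (assumed here)."""
    p_win = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def trinomial_llr(wins, draws, losses, elo0, elo1):
    """Log-likelihood ratio of H1 (elo1) against H0 (elo0) from W/D/L counts."""
    n = wins + draws + losses
    p_win, p_loss = wins / n, losses / n
    # Fit the draw parameter to the observed win and loss frequencies.
    drawelo = 200.0 * math.log10((1 - p_loss) / p_loss * (1 - p_win) / p_win)
    w0, d0, l0 = bayeselo_probs(elo0, drawelo)
    w1, d1, l1 = bayeselo_probs(elo1, drawelo)
    return (wins * math.log(w1 / w0)
            + draws * math.log(d1 / d0)
            + losses * math.log(l1 / l0))

def sprt_bounds(alpha=0.05, beta=0.05):
    """Lower (fail) and upper (pass) LLR thresholds."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)

wins, draws, losses = 270, 300, 230
elo, lo, hi = elo_interval(wins, draws, losses)
llr = trinomial_llr(wins, draws, losses, elo0=0.0, elo1=10.0)
lower, upper = sprt_bounds()
print(f"Elo: {elo:.1f} [{lo:.1f}, {hi:.1f}]")
print(f"LLR: {llr:.2f} ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the sketch should land on the posted figures (an LLR near 1.19 against thresholds of about ±2.94), which is why the test is nowhere near a decision after 800 games.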

pgg106 · Post by **pgg106** » Sun Mar 17, 2024 11:42 pm

Yeah, the idea that a fixed number of games (any fixed number of games) is enough is bogus. If that were the case, every test ever run would be capped at that magic number, and quite clearly they are not.
As I said before, proper STC SPRTs are better than eyeballing stuff at 100h + 12 days. If you don't have the resources to run LTCs, just run STCs or even VSTCs; that will serve you far better in the long run than any 800-game testing session or whatever.

Uri Blass · Post by **Uri Blass** » Mon Mar 18, 2024 8:34 am

pgg106 wrote: ↑Thu Mar 07, 2024 10:09 pm Anything that isn't self testing breaks the logic and the common tools used to run SPRTs, and anything that isn't an SPRT isn't a reliable way of testing for engine improvements (at least once they stop being 3 digits big). Doing or not doing a gauntlet isn't a choice: you just can't do it if you want "serious" testing.
As an aside, one might run a gauntlet as a form of progression test before a major release or to estimate an initial Elo ranking. That's perfectly valid, but this is about patch-to-patch testing.
Definitely drop the time control to anything STC-like; most commonly we devs use time controls in the range from 10s+0.1s down to 5s+0.05s.

I disagree that a fixed number of games is not a reliable way of testing for engine improvement.

In practice you want to know not only whether a new version is better but also by how much, and SPRT does not give a good estimate of that.

It is possible to retest every change that passed SPRT with a fixed number of games afterwards, so that you have an unbiased estimate of the improvement you made with that change.

Progress will be slower, but I think people may decide that understanding the value of every change they make is more important than fast progress.

Understanding the value of every change may also help in deciding which changes to test later.

RubiChess · Post by **RubiChess** » Mon Mar 18, 2024 9:03 am

JacquesRW wrote: ↑Sun Mar 17, 2024 5:59 pm
Dann Corbit wrote: ↑Sat Mar 16, 2024 5:14 pm Generally, 800 is recommended, but 400 is workable.
Recommended by who? This is terrible advice.
Here's an example for a change after playing 800 games and getting the results wins=270, losses=230, draws=300
py sprt.py --wins 270 --losses 230 --draws 300 --elo0 0 --elo1 10
ELO: 17.4 +- 19.0 [-1.6, 36.5]
LLR: 1.19 [0.0, 10.0] (-2.94, 2.94)
Continue Playing
This is far away from an SPRT passing (with very loose bounds, the standard [0, 3] or [0, 5] would be even further away). Now imagine a change that only resulted in an elo increase of ~10 elo, what conclusions can you reasonably draw from 800 games?

Can we make this post sticky?
Even better: Everybody should sign "I understand and agree to this" before posting anything here.

pgg106 · Post by **pgg106** » Mon Mar 18, 2024 9:52 am

Uri Blass wrote: ↑Mon Mar 18, 2024 8:34 am
I disagree that a fixed number of games is not a reliable way of testing for engine improvement.

In practice you want to know not only whether a new version is better but also by how much, and SPRT does not give a good estimate of that.

It is possible to retest every change that passed SPRT with a fixed number of games afterwards, so that you have an unbiased estimate of the improvement you made with that change.

Progress will be slower, but I think people may decide that understanding the value of every change they make is more important than fast progress.

Understanding the value of every change may also help in deciding which changes to test later.

What you are describing is a progression test. Those can and should be run when you have presumably achieved significant progress. Doing one for every patch is just a meaningless waste of cores, and considering we are talking about hardware-constrained testing, it's terrible advice.
People can decide to do it, but that doesn't mean it's good or that it should be suggested.

Uri Blass · Post by **Uri Blass** » Mon Mar 18, 2024 11:49 pm

pgg106 wrote: ↑Mon Mar 18, 2024 9:52 am
What you are describing is a progression test. Those can and should be run when you have presumably achieved significant progress. Doing one for every patch is just a meaningless waste of cores, and considering we are talking about hardware-constrained testing, it's terrible advice.
People can decide to do it, but that doesn't mean it's good or that it should be suggested.

I explained that the goal does not have to be improving the engine as fast as possible; it can also be understanding.
I think it may be interesting to know whether a specific change improved the engine by 1 Elo or by 10 Elo.
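To put a rough number on what a fixed-games measurement costs, here is a small sketch of how many games a given interval width needs. It is an approximation under stated assumptions, not a method anyone in the thread describes: independent games, engines near equal strength, a linearised Elo curve around a 50% score, and the 37.5% draw ratio from the 270/230/300 example. The function name is hypothetical.

```python
import math

def games_for_elo_margin(margin_elo, draw_ratio=0.375, z=1.96):
    """Approximate games needed for a +/- margin_elo interval (~95%) near equal strength."""
    # Per-game score variance at a 50% score: decisive games contribute 0.25, draws 0.
    var = 0.25 * (1.0 - draw_ratio)
    # Slope of the logistic Elo curve at a 50% score, in Elo per unit of score.
    elo_per_score = 400.0 / (math.log(10) * 0.25)
    margin_score = margin_elo / elo_per_score
    return math.ceil(z * z * var / margin_score ** 2)

for margin in (19, 10, 5, 1):
    print(f"+/- {margin} Elo: about {games_for_elo_margin(margin)} games")
```

Under these assumptions, ±19 Elo comes out at roughly 800 games, in line with the example earlier in the thread, while ±1 Elo needs on the order of a few hundred thousand games. The cost grows with the inverse square of the margin, so halving the margin takes about four times as many games.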

Uri Blass · Post by **Uri Blass** » Mon Mar 18, 2024 11:58 pm

RubiChess wrote: ↑Mon Mar 18, 2024 9:03 am
Can we make this post sticky?
Even better: Everybody should sign "I understand and agree to this" before posting anything here.

I understand and disagree.

The SPRT also assumes that games are independent, and that is not the case when you test 400 positions twice with reversed colors.

In practice, if each position is played with both white and black and engine A wins 40 pairs and loses 0 pairs, then I think the result is significant even if it is only 270-230 with 300 draws.
If both engines win many pairs, then the same 270-230 is not significant.

JacquesRW · Post by **JacquesRW** » Tue Mar 19, 2024 12:51 am

Uri Blass wrote: ↑Mon Mar 18, 2024 11:58 pm I understand and disagree.

The SPRT also assumes that games are independent, and that is not the case when you test 400 positions twice with reversed colors.

In practice, if each position is played with both white and black and engine A wins 40 pairs and loses 0 pairs, then I think the result is significant even if it is only 270-230 with 300 draws.
If both engines win many pairs, then the same 270-230 is not significant.

Everyone accounts for this nowadays - see Fishtest or either of the two largest OB instances (https://chess.swehosting.se/ and http://chess.grantnet.us/) - so you have no point. I only used the older SPRT method to demonstrate the issue with 800 games. I felt that bringing up pentanomial SPRT wouldn't have been helpful, and it doesn't change the point: 800 games could be enough in extreme cases, but that doesn't matter for the majority of the tests we run.
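The game-pair effect under discussion can be illustrated with a small sketch that computes the error bar from pair outcomes rather than from individual games. It only shows the variance side of pentanomial accounting, not the pentanomial SPRT used by Fishtest or OpenBench, and both pair distributions are made-up examples chosen to add up to 270 wins, 230 losses and 300 draws. Names are hypothetical.

```python
import math

# Per-game score of each pair outcome: LL, LD, DD-or-WL, WD, WW.
PAIR_SCORES = (0.0, 0.25, 0.5, 0.75, 1.0)

def score_to_elo(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def pair_interval(pair_counts, z=1.96):
    """Elo (low, mid, high) computed from game-pair counts instead of single games."""
    n_pairs = sum(pair_counts)
    mean = sum(c * s for c, s in zip(pair_counts, PAIR_SCORES)) / n_pairs
    var = sum(c * (s - mean) ** 2 for c, s in zip(pair_counts, PAIR_SCORES)) / n_pairs
    margin = z * math.sqrt(var / n_pairs)
    return score_to_elo(mean - margin), score_to_elo(mean), score_to_elo(mean + margin)

# Hypothetical sample 1: 40 pairs won 1.5-0.5, no pair lost, 360 pairs level
# (230 win/loss splits plus 130 double draws).
one_sided = (0, 0, 360, 40, 0)
# Hypothetical sample 2: 135 pairs won 2-0, 115 pairs lost 0-2, 150 double draws.
swingy = (115, 0, 150, 0, 135)

for name, counts in (("one-sided", one_sided), ("swingy", swingy)):
    low, mid, high = pair_interval(counts)
    print(f"{name}: {mid:.1f} Elo [{low:.1f}, {high:.1f}]")
```

Both samples give the same point estimate, but the first interval sits entirely above zero while the second straddles it. That is the extreme case conceded above; typical test results fall between the two.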

pgg106 · Post by **pgg106** » Tue Mar 19, 2024 9:30 am

Uri Blass wrote: ↑Mon Mar 18, 2024 11:49 pm I explained that the goal does not have to be improving the engine as fast as possible; it can also be understanding.
I think it may be interesting to know whether a specific change improved the engine by 1 Elo or by 10 Elo.

You understand it in order to do what? The end result is just merging the improvement so you can start working on the next one right away, whether it's a tweak to the gainer you just merged or a new idea. You aren't just wasting games, you are wasting games on "insights" that don't matter and aren't usable.
No one does this, since it's a pointless compute sink, and it only gets more egregious in the limited-hardware setting we are talking about.

AndrewGrant · Post by **AndrewGrant** » Tue Mar 19, 2024 10:48 am

I would recommend taking the advice of the people here, except for Uri, as the others have experience in the area, and he is lacking.

Self-play is king for testing. It allows you to massively reduce the error bars on your testing, cutting the games you need by at least a factor of 4 if you follow the math, and possibly more depending on what sort of gauntlet you would intend to run. OpenBench, a platform mentioned here, was originally called EtherBench, and derived SPRT-like formulas for testing a single engine against a pool of engines.

As cool as it was, even in the simplest case you must cut all the error bars at least in half to be able to combine them with as much meaning as a self-play SPRT. To cut the error bars in half, you generally need to play 4x as many games, in some N-nomial distribution. This is because the error shrinks with the square root of the sample size.
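One way to arrive at a factor of 4 is sketched below. It is not necessarily the derivation intended above, and it assumes the same per-game standard deviation `sigma` in both setups, independent games, scores near 50%, and a gauntlet that splits its games evenly between the old and new versions. Function names are hypothetical.

```python
import math

def selfplay_error(total_games, sigma=1.0):
    """Standard error of the new-vs-old score when every game is head-to-head."""
    return sigma / math.sqrt(total_games)

def gauntlet_error(total_games, sigma=1.0):
    """Standard error of (new vs pool) minus (old vs pool), games split evenly."""
    per_version = sigma / math.sqrt(total_games / 2)
    # Independent errors add in quadrature when two measurements are subtracted.
    return math.sqrt(2) * per_version

games = 800
print(gauntlet_error(games) / selfplay_error(games))      # ratio of the error bars
print(gauntlet_error(4 * games) / selfplay_error(games))  # gauntlet with 4x the games
```

Under these assumptions the gauntlet's error bar on the difference is twice as wide for the same number of games, so it needs four times the games to match self-play.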

If you have found yourself having infinite resources, then I would suggest you employ the gauntlet approach. Otherwise, you would be hurting yourself to do so.

----

Self-play has proved to be extremely powerful and reliable, allowing engines like Stockfish and Torch to become extremely strong. My own engine, Ethereal, one of the first non-Stockfish/non-Komodo engines to employ self-play SPRT with high computing volume, quickly rose to be the 3rd strongest non-clone engine of its time.

There is a constant fear in self-play -- that some patch will improve self-play but hurt against a 3rd party, or that an Elo-neutral patch would actually have done better against a 3rd party. This fear tends to be unfounded: evidence for the effect is scant, although I would say it does exist.

----

You, as an engine developer, will arrive at some sense of "intuition" as you invest more time into the space, when it comes to understanding how time control conditions might impact the components of your engine that you are tweaking. Sometimes this can come in a more obvious form -- where a slower but smarter eval loses at fast times and wins at longer times -- but sometimes a sort of "artful" insight is needed.

I have a better than "expected" success rate when it comes to predicting that a patch that failed at shorter time controls will indeed pass at longer time controls. But despite my experience, my "better than" is only a fair bit better than random chance.
