SPRT question

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: SPRT question

Post by Ferdy »

wgarvin wrote:Just to speculate for a moment:

I wonder if self-play testing is more problematic for weaker engines, because they are less "well-rounded" in general and they might still have major blind spots. A gauntlet of weak-but-similar-strength engines will have a varied assortment of blind spots--probably very different ones from the engine under test.

So when you test a change to a weaker engine, it is more likely to be a change that addresses one of these "major" blind spots. In self-play, the new version might be able to exploit that against the old version, even if gauntlet opponents would not. Also, whatever other blind spots it has will be part of both versions, and they won't know how to exploit them against each other, even if gauntlet opponents would. So that's at least two reasons why self-play test results might vary from results against other opponents.

But by the time an engine gets up to the strength of Stockfish, it doesn't have much in the way of "major" weaknesses left! And maybe the character of the changes being tested, and the kind of effects they have, is different too. Suppose the strong engines are all pretty well-rounded, and the changes being tested on them are mostly "small tweaks" that help a little bit in a broad variety of positions. Unless a change is so bad as to cripple the engine somehow, it seems likely that it would help or hurt about the same against Komodo or Houdini as it does against Stockfish itself.

Obviously Stockfish could still get different results from self-play compared to doing a gauntlet, but maybe it doesn't happen often.
If you are near or at the top, it is not easy to improve, and self-testing helps to keep you going by small increments of rating points. What I have observed is that once a change is verified as an improvement in self-test by SPRT, or by LOS with a high number of games, you should not subject it to a gauntlet immediately. You need to accumulate more of these little changes, say 4 or more self-test improvements (for example: adjust bishop outpost, pass; increase the penalty for weak pawns, pass; reduce the bonus for a rook on the 7th rank in the ending, pass; increase the penalty for doubled isolated pawns in the ending, pass). After that you are ready to run gauntlet tests comparing the performance of the old version against the incrementally improved version.
Uri Blass
Posts: 10269
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: SPRT question

Post by Uri Blass »

wgarvin wrote:Just to speculate for a moment:

I wonder if self-play testing is more problematic for weaker engines, because they are less "well-rounded" in general and they might still have major blind spots. A gauntlet of weak-but-similar-strength engines will have a varied assortment of blind spots--probably very different ones from the engine under test.

So when you test a change to a weaker engine, it is more likely to be a change that addresses one of these "major" blind spots. In self-play, the new version might be able to exploit that against the old version, even if gauntlet opponents would not. Also, whatever other blind spots it has will be part of both versions, and they won't know how to exploit them against each other, even if gauntlet opponents would. So that's at least two reasons why self-play test results might vary from results against other opponents.

But by the time an engine gets up to the strength of Stockfish, it doesn't have much in the way of "major" weaknesses left! And maybe the character of the changes being tested, and the kind of effects they have, is different too. Suppose the strong engines are all pretty well-rounded, and the changes being tested on them are mostly "small tweaks" that help a little bit in a broad variety of positions. Unless a change is so bad as to cripple the engine somehow, it seems likely that it would help or hurt about the same against Komodo or Houdini as it does against Stockfish itself.

Obviously Stockfish could still get different results from self-play compared to doing a gauntlet, but maybe it doesn't happen often.
I think self-play is good for every engine.
I still have not seen a practical case where a change is good in self-play and counterproductive against other engines.

It would be interesting if Bob could show a single example of a change that is productive in self-play and counterproductive against other engines.

He only needs to release the source of Crafty A and Crafty B, where Crafty B is better than Crafty A in self-play but worse than Crafty A when tested against other programs.

When I talk about better and worse, it should of course be with a large enough number of games that it is clearly not statistical noise.
Uri Blass
Posts: 10269
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: SPRT question

Post by Uri Blass »

I can add that, in my opinion, the difference for weak engines is that it may be better to look for bigger improvements.

My opinion is that if you start a new engine, you should look only for improvements of at least 50 Elo in self-play based on SPRT, and only when you fail to find such improvements should you reduce 50 to 40 and later to lower numbers.
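As an illustration only (not anything Uri posted), here is a minimal Python sketch of such a test. It uses the widely used trinomial approximation to the (G)SPRT log-likelihood ratio, and the default bounds {elo0 = 0, elo1 = 50} correspond to the 50 Elo threshold suggested above. The function names are made up for the sketch, and real testers differ in the details (draw models, pentanomial statistics, etc.).

Code: Select all

import math

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate (G)SPRT log-likelihood ratio from a W/D/L sample."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    w, d = wins / n, draws / n
    score = w + d / 2                     # empirical mean score
    var = (w + d / 4) - score * score     # empirical per-game score variance
    if var <= 0:
        return 0.0                        # degenerate sample, no information yet
    s0 = 1 / (1 + 10 ** (-elo0 / 400))    # expected score under H0
    s1 = 1 / (1 + 10 ** (-elo1 / 400))    # expected score under H1
    return (s1 - s0) * (2 * score - s0 - s1) * n / (2 * var)

def sprt_state(wins, draws, losses, elo0=0, elo1=50, alpha=0.05, beta=0.05):
    """'H1' = accept the change, 'H0' = reject it, 'continue' = keep playing."""
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= math.log((1 - beta) / alpha):
        return 'H1'
    if llr <= math.log(beta / (1 - alpha)):
        return 'H0'
    return 'continue'

After every game (or small batch of games) you recompute sprt_state with the updated win/draw/loss totals and stop as soon as it returns 'H0' or 'H1'.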
User avatar
Fabio Gobbato
Posts: 217
Joined: Fri Apr 11, 2014 10:45 am
Full name: Fabio Gobbato

Re: SPRT question

Post by Fabio Gobbato »

My engine is not as strong as Stockfish, but I have had different results when tuning the futility margin with self-play than with a gauntlet.

With self-play I reached a lower margin, but the gauntlet showed that it was not good against other engines.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SPRT question

Post by bob »

Uri Blass wrote:
wgarvin wrote:Just to speculate for a moment:

I wonder if self-play testing is more problematic for weaker engines, because they are less "well-rounded" in general and they might still have major blind spots. A gauntlet of weak-but-similar-strength engines will have a varied assortment of blind spots--probably very different ones from the engine under test.

So when you test a change to a weaker engine, it is more likely to be a change that addresses one of these "major" blind spots. In self-play, the new version might be able to exploit that against the old version, even if gauntlet opponents would not. Also, whatever other blind spots it has will be part of both versions, and they won't know how to exploit them against each other, even if gauntlet opponents would. So that's at least two reasons why self-play test results might vary from results against other opponents.

But by the time an engine gets up to the strength of Stockfish, it doesn't have much in the way of "major" weaknesses left! And maybe the character of the changes being tested, and the kind of effects they have, is different too. Suppose the strong engines are all pretty well-rounded, and the changes being tested on them are mostly "small tweaks" that help a little bit in a broad variety of positions. Unless a change is so bad as to cripple the engine somehow, it seems likely that it would help or hurt about the same against Komodo or Houdini as it does against Stockfish itself.

Obviously Stockfish could still get different results from self-play compared to doing a gauntlet, but maybe it doesn't happen often.
I think self-play is good for every engine.
I still have not seen a practical case where a change is good in self-play and counterproductive against other engines.

It would be interesting if Bob could show a single example of a change that is productive in self-play and counterproductive against other engines.

He only needs to release the source of Crafty A and Crafty B, where Crafty B is better than Crafty A in self-play but worse than Crafty A when tested against other programs.

When I talk about better and worse, it should of course be with a large enough number of games that it is clearly not statistical noise.
I gave some specifics in the extension thread. I had several cases where a variation of the threat or singular extension idea looked better than the version without it in self-test, but not in the gauntlet test. I am not big on self-testing because the idea is intuitively problematic when you think about it. But I did decide to run some tests. I was not happy with the results. YMMV of course.

As far as testing numbers go, I never release "noisy data". Not when 30K games only requires an hour or so to complete.

There are some arguments here that are unconvincing to me. Generally of the form "but it works". Lots of sub-optimal things "work", but not as well as a more optimal solution.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: SPRT question

Post by jdart »

bob wrote:I.e., I simply collect wins/draws/losses (total for all opponents) from Crafty's perspective and then feed that into the SPRT calculation, just as I do when I try Crafty vs Crafty'?
In the gauntlet case, not all wins/draws/losses are the same, because the relative rating of the opponents varies (possibly by a lot). So I think in this case you need to feed the game results into BayesElo or Ordo, which will do a more correct calculation of the rating and error bars.

--Jon
Uri Blass
Posts: 10269
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: SPRT question

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
wgarvin wrote:Just to speculate for a moment:

I wonder if self-play testing is more problematic for weaker engines, because they are less "well-rounded" in general and they might still have major blind spots. A gauntlet of weak-but-similar-strength engines will have a varied assortment of blind spots--probably very different ones from the engine under test.

So when you test a change to a weaker engine, it is more likely to be a change that addresses one of these "major" blind spots. In self-play, the new version might be able to exploit that against the old version, even if gauntlet opponents would not. Also, whatever other blind spots it has will be part of both versions, and they won't know how to exploit them against each other, even if gauntlet opponents would. So that's at least two reasons why self-play test results might vary from results against other opponents.

But by the time an engine gets up to the strength of Stockfish, it doesn't have much in the way of "major" weaknesses left! And maybe the character of the changes being tested, and the kind of effects they have, is different too. Suppose the strong engines are all pretty well-rounded, and the changes being tested on them are mostly "small tweaks" that help a little bit in a broad variety of positions. Unless a change is so bad as to cripple the engine somehow, it seems likely that it would help or hurt about the same against Komodo or Houdini as it does against Stockfish itself.

Obviously Stockfish could still get different results from self-play compared to doing a gauntlet, but maybe it doesn't happen often.
I think self-play is good for every engine.
I still have not seen a practical case where a change is good in self-play and counterproductive against other engines.

It would be interesting if Bob could show a single example of a change that is productive in self-play and counterproductive against other engines.

He only needs to release the source of Crafty A and Crafty B, where Crafty B is better than Crafty A in self-play but worse than Crafty A when tested against other programs.

When I talk about better and worse, it should of course be with a large enough number of games that it is clearly not statistical noise.
I gave some specifics in the extension thread. I had several cases where a variation of the threat or singular extension idea looked better than the version without it in self-test, but not in the gauntlet test. I am not big on self-testing because the idea is intuitively problematic when you think about it. But I did decide to run some tests. I was not happy with the results. YMMV of course.

As far as testing numbers go, I never release "noisy data". Not when 30K games only requires an hour or so to complete.

There are some arguments here that are unconvincing to me. Generally of the form "but it works". Lots of sub-optimal things "work", but not as well as a more optimal solution.
Note only that if you get a better result in self-testing and no significant difference against other opponents, then I prefer to believe the self-testing, because self-testing tends to increase the measured value of changes, and it is possible that the change is also positive against other opponents but below the statistical noise.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SPRT question

Post by bob »

jdart wrote:
bob wrote:I.e., I simply collect wins/draws/losses (total for all opponents) from Crafty's perspective and then feed that into the SPRT calculation, just as I do when I try Crafty vs Crafty'?
In the gauntlet case, not all wins/draws/losses are the same, because the relative rating of the opponents varies (possibly by a lot). So I think in this case you need to feed the game results into BayesElo or Ordo, which will do a more correct calculation of the rating and error bars.

--Jon
I always do that anyway. But the point of SPRT is an "early stopping point" that doesn't unnecessarily bias the test...
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: SPRT question

Post by Michel »

This is an interesting problem. The solution is a "Generalized Sequential Probability Ratio Test". It is discussed here http://stat.columbia.edu/~jcliu/paper/GSPRT_SQA3.pdf .

In the case of testing for small elo improvements I think it translates to the following practical implementation: assume first that the elo of the foreign opponents is known, and assume you want to test for a difference of epsilon between eloA and eloB (the elos of the old engine A and the new engine B). Then the procedure is to do an SPRT using the LLR for H0: eloA = eloB = elo versus H1: eloA = elo - epsilon/2, eloB = elo + epsilon/2, with elo computed from the sample under H0 (computing the MLE of elo under H1 is a bit harder, but I think it will be very close to the elo under H0).

If the elos of the foreign opponents are not known, then it is OK to compute them from the sample using the maximum likelihood estimator (e.g. BayesElo) under H0 (eloA = eloB). The fact that we shifted eloA and eloB symmetrically implies that the MLE for the foreign gauntlet participants will shift very little under H1.

If the elos of the foreign opponents are estimated from the sample, the expected running time of the test will of course be longer.
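To make this concrete, here is a rough Python sketch of that procedure (an illustration only, not code from the paper or from any existing tester). It assumes the foreign opponents' elos are known, ignores a proper draw model by simply scoring draws as half a point (a pseudo-likelihood), and estimates the common elo under H0 by maximum likelihood; the helper names are made up for the sketch.

Code: Select all

import math
from scipy.optimize import minimize_scalar

def expected_score(elo_diff):
    """Logistic expected score for a given elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def log_lik(elo_a, elo_b, games):
    """games: list of (player, opponent_elo, score), player 'A' or 'B',
    score in {0, 0.5, 1}.  Draws count as half a point (pseudo-likelihood)."""
    total = 0.0
    for player, opp_elo, score in games:
        p = expected_score((elo_a if player == 'A' else elo_b) - opp_elo)
        total += score * math.log(p) + (1.0 - score) * math.log(1.0 - p)
    return total

def gsprt_llr(games, epsilon):
    """LLR of H1 (eloA = elo - eps/2, eloB = elo + eps/2) against
    H0 (eloA = eloB = elo), with elo the MLE computed under H0."""
    res = minimize_scalar(lambda e: -log_lik(e, e, games),
                          bounds=(-1000.0, 1000.0), method='bounded')
    elo = res.x
    return (log_lik(elo - epsilon / 2.0, elo + epsilon / 2.0, games)
            - log_lik(elo, elo, games))

As in an ordinary SPRT, the LLR would be recomputed after each game and compared with the stopping bounds log(beta/(1-alpha)) and log((1-beta)/alpha).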
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: SPRT question

Post by Michel »

I thought some more about computing "elo" in the previous post in case A and B have not played the same number of games.

Assume that A and B are close together in strength and are playing in gauntlet fashion against the same foreign opponents. The foreign opponents are also allowed to play each other, but the approximation below assumes A and B do not play each other.

Assume that currently eloA' and eloB' are the elos computed by BayesElo (which computes the maximum likelihood estimate). Then "elo" under H0 is the weighted average (with respect to the number of games played by A and B) of eloA' and eloB'. Under H1 it is the weighted average of eloA'+epsilon/2 and eloB'-epsilon/2 (if the number of games is the same, then "elo" is the same under H0 and H1).
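As a small illustration (not from the post), the pooled "elo" under the two hypotheses could be computed like this, with nA and nB the number of games played by A and B; when nA = nB the epsilon terms cancel, which is the remark in parentheses above.

Code: Select all

def pooled_elo(eloA, nA, eloB, nB, epsilon=0.0):
    """Weighted average of the BayesElo estimates.  With epsilon = 0 this is
    the common elo under H0; with epsilon > 0 it is the value under H1
    (eloA' shifted up and eloB' shifted down by epsilon/2 before averaging)."""
    return ((eloA + epsilon / 2) * nA + (eloB - epsilon / 2) * nB) / (nA + nB)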

It is not clear to me if and when the SPRT will terminate with probability one if the number of games of A is kept fixed (it seems like an easy problem, but I have not taken the time to consider it properly). Obviously, if eloA is only vaguely known and the difference between eloA and eloB is small, one will never be able to prove there is a difference no matter how many games B plays (recall that A and B are not playing each other).

Without knowing an answer to this problem it is probably best to keep adding games for A as well.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.