Hybrid replacement strategy worse than always-replace

Moderators: hgm, Rebel, chrisw

Ciekce
Posts: 127
Joined: Sun Oct 30, 2022 5:26 pm
Full name: Conor Anstey

Post by Ciekce »

hgm wrote: Thu Apr 25, 2024 4:25 pm Of course it is only better in the sense that on average you would need fewer games; if a pre-determined number of games eventually produces a LOS > 99%, you would have reached the same conclusion with an SPRT by setting the confidence level to, say, 95%. It would just have stopped somewhat earlier.
LOS is a useless metric.
hgm wrote: Thu Apr 25, 2024 4:25 pm A disadvantage of SPRT is that it only tests a single feature at a time, and is not easy to adapt to testing multiple changes simultaneously. That can make it unnecessarily slow.
Nor can any other testing methodology - testing multiple changes at once tells you nothing about the individual changes.

Why are you, in a position of authority as a moderator of this forum, using it to spread misinformation?
pgg106
Posts: 25
Joined: Wed Mar 09, 2022 3:40 pm
Full name: . .

Post by pgg106 »

You can bulk test changes; everyone just knows better. Mixing a +30 and a -10 Elo change will give you a green and happy SPRT totaling "+20" Elo and a landmine in your code that didn't need to be there.
SPRT is better in the sense that it guarantees you (enough) statistical significance at the cheapest cost possible. There is no scenario where a fixed number of games is better than it (when trying to test if something gains Elo).
To anyone reading this post in the future, don't ask for help on talkchess, it's a dead site where you'll only get led astray, the few people talking sense here come from the Stockfish discord server, just join it and actual devs will help you.
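
For readers who have not seen how an SPRT stopping rule is evaluated, here is a minimal sketch. It uses a normal approximation to the log-likelihood ratio and Wald's stopping bounds; it is not the exact code of any particular testing framework. The function names, the default hypotheses (elo0=0, elo1=5) and the default error rates are assumptions chosen for illustration.

```python
import math

def elo_to_score(elo):
    """Expected score for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) versus H0 (elo0).

    Normal approximation: per-game scores are treated as having the
    empirical mean and variance of the observed W/D/L sample.
    """
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean * mean
    if var <= 0.0:
        return 0.0  # degenerate sample, no usable information yet
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)

def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's lower and upper stopping bounds for the LLR."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)

def sprt_decision(wins, draws, losses, elo0=0.0, elo1=5.0,
                  alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0' or 'continue'."""
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    lower, upper = sprt_bounds(alpha, beta)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"
```

The test stops as soon as the LLR crosses either bound, which is why the number of games is not fixed in advance.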
Viz
Posts: 64
Joined: Tue Apr 09, 2024 6:24 am
Full name: Michael Chaly

Post by Viz »

LOS is such stupid BS, and does so much more harm than good, that it's actually hard to believe.
I recall having a patch that was at +13 Elo after some 1500 games, only for it to fail red, i.e. losing more games than it won. Of course its LOS was > 99.9%.
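
For reference, LOS (likelihood of superiority) is usually computed from wins and losses alone, via a normal approximation. The sketch below uses that common formula; the function name is an assumption for illustration. Note that draws do not enter the calculation, and that the result says nothing about how large the Elo difference is.

```python
import math

def los(wins, losses):
    """Likelihood of superiority from decisive games only.

    Normal approximation: LOS = Phi((W - L) / sqrt(W + L)),
    written here with the error function.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, no evidence either way
    z = (wins - losses) / math.sqrt(2.0 * decisive)
    return 0.5 * (1.0 + math.erf(z))
```

A high value only says the observed lead is unlikely under equal strength for that one snapshot of results. Reading it repeatedly while games are still coming in, and stopping when it looks good, inflates the false-positive rate, which would be consistent with the kind of reversal described above.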
Viz
Posts: 64
Joined: Tue Apr 09, 2024 6:24 am
Full name: Michael Chaly

Post by Viz »

pgg106 wrote: Thu Apr 25, 2024 4:58 pm You can bulk test changes; everyone just knows better. Mixing a +30 and a -10 Elo change will give you a green and happy SPRT totaling "+20" Elo and a landmine in your code that didn't need to be there.
SPRT is better in the sense that it guarantees you (enough) statistical significance at the cheapest cost possible. There is no scenario where a fixed number of games is better than it (when trying to test if something gains Elo).
There is one scenario where it can be useful: when you have a patch that tests at about -15 Elo at 10+0.1 and you want to see whether it is anywhere near -15 at 60+0.6. For that you need a fixed number of games. If it turns out to be much better, you can then run 180+1.8 and bigger SPRTs.
But this is a really specific scenario, for really good engines and a lot of hardware.
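
A fixed-games run like this is about estimating the Elo difference rather than accepting or rejecting a hypothesis. A minimal sketch of such an estimate with an approximate 95% interval follows; the function names and the clamping constant are assumptions for illustration, and the interval uses a plain normal approximation on the score.

```python
import math

def score_to_elo(score):
    """Elo difference implied by an expected score (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_estimate(wins, draws, losses, z=1.96):
    """Return (elo, lower, upper) from a fixed-games W/D/L result.

    z = 1.96 gives an approximate 95% interval on the score, which is
    then mapped to Elo. Scores are clamped away from 0 and 1 so the
    conversion stays finite.
    """
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")
    eps = 1e-9
    mean = (wins + 0.5 * draws) / n
    var = max((wins + 0.25 * draws) / n - mean * mean, 0.0)
    stderr = math.sqrt(var / n)

    def clamp(s):
        return min(max(s, eps), 1.0 - eps)

    return (score_to_elo(clamp(mean)),
            score_to_elo(clamp(mean - z * stderr)),
            score_to_elo(clamp(mean + z * stderr)))
```

Running the same number of games at 10+0.1 and at 60+0.6 and comparing the two intervals is one way to judge whether the patch scales with time control.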
hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Post by hgm »

Ciekce wrote: Thu Apr 25, 2024 4:51 pm LOS is a useless metric.
If you think that, you don't understand much about statistics.
hgm wrote: Thu Apr 25, 2024 4:25 pm A disadvantage of SPRT is that it only tests a single feature at a time, and is not easy to adapt to testing multiple changes simultaneously. That can make it unnecessarily slow.
Ciekce wrote: Thu Apr 25, 2024 4:51 pm Nor can any other testing methodology - testing multiple changes at once tells you nothing about the individual changes.
Of course you can. You just don't know any. The trick is that you test the alternatives created by the changes in various combinations. If you want to test two changes A and B, you test pure A, pure B and A and B together in comparison with the unchanged version. By adding the results of the tests that only differed in the presence of B, you get twice the number of games that test the effect of A, and by adding the results that only differed in the presence of A you get twice the number of games that test the effect of B. You effectively doubled the number of games per test, with the corresponding increase in significance.
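
One way to read the scheme described above is as a 2x2 factorial design. The sketch below assumes that each of the four versions (unchanged, A only, B only, A and B) plays the same number of games against a common set of opponents, and that the inputs are the resulting score fractions. That setup, and the function name, are assumptions for illustration and not something stated in the post.

```python
def factorial_effects(score_base, score_a, score_b, score_ab):
    """Main effects and interaction in a 2x2 factorial test.

    Inputs are score fractions (0..1) of the four versions against
    the same opponents. Each main effect pools both pairs of versions
    that differ only in that change, so it uses every game played.
    """
    effect_a = 0.5 * ((score_a + score_ab) - (score_base + score_b))
    effect_b = 0.5 * ((score_b + score_ab) - (score_base + score_a))
    # How much the gain from A changes when B is present.
    interaction = (score_ab - score_b) - (score_a - score_base)
    return effect_a, effect_b, interaction

# Made-up scores, purely to show the call.
effect_a, effect_b, interaction = factorial_effects(0.50, 0.54, 0.49, 0.53)
```

The pooled main effects are only easy to interpret when the interaction term is close to zero. If A behaves differently in the presence of B, the averaged effect of A describes neither configuration well.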
pgg106
Posts: 25
Joined: Wed Mar 09, 2022 3:40 pm
Full name: . .

Post by pgg106 »

Or you just SPRT A, merge or don't merge, SPRT B, merge or don't merge, and you are done. It works, and it's been significant enough for more than a decade.
hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Post by hgm »

Sure, everything works, if you throw enough games at it. :lol:
pgg106
Posts: 25
Joined: Wed Mar 09, 2022 3:40 pm
Full name: . .

Post by pgg106 »

Except that SPRT is the most sound way to throw the smallest number of games possible at it, so your remark makes absolutely 0 sense.
hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Post by hgm »

Well, as I just explained, it is not.
pgg106
Posts: 25
Joined: Wed Mar 09, 2022 3:40 pm
Full name: . .

Post by pgg106 »

I'm not sure why a wrong explanation that ignores how A interacting with B might fudge the results, and doesn't suggest a way to magically (and soundly) add the results of 2 different tests, should convince anyone of that.
But I'll entertain you and assume such a magical way of testing exists. The new conclusion is that no one should bother doing such a thing, because SPRTs at STC will be far more foolproof and don't require unthinkable amounts of hardware.
There's no use pointing aspiring new devs towards hypothetical magical testing scenarios that 100% outperform the state of the art (but that no one uses) when OpenBench exists and is best by test.