Hybrid replacement strategy worse than always-replace

Ciekce · Post by **Ciekce** » Thu Apr 25, 2024 4:51 pm

hgm wrote: ↑Thu Apr 25, 2024 4:25 pm Of course it is only better in the sense that on average you would need fewer games; if a pre-determined number of games eventually produces a LOS > 99%, you would have reached the same conclusion with an SPRT by setting the confidence level to, say, 95%. It would just have stopped somewhat earlier.

LOS is a useless metric.

hgm wrote: ↑Thu Apr 25, 2024 4:25 pm A disadvantage of SPRT is that it only tests a single feature at a time, and is not easy to adapt to testing multiple changes simultaneously. That can make it unnecessarily slow.

Nor can any other testing methodology - testing multiple changes at once tells you nothing about the individual changes.

Why are you, in a position of authority as a moderator of this forum, using it to spread misinformation?

pgg106 · Post by **pgg106** » Thu Apr 25, 2024 4:58 pm

You can bulk test changes, everyone just knows better. Mixing a +30 and a -10 Elo change will give you a green and happy sprt totaling "+20" Elo and a landmine in your code that didn't need to be there.
SPRT is better in the sense it guarantees you (enough) statistical significance at the cheapest cost possible, there is no scenario where a fixed number of games is better than it (when trying to test if something gains Elo).
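
For readers new to the method, here is a minimal sketch of how an SPRT stopping rule can be evaluated from a win/draw/loss record. It assumes the normal-approximation log-likelihood ratio that is commonly used for engine testing, which is not necessarily the exact formula of any particular framework. The function names, the Elo bounds [0, 5] and alpha = beta = 0.05 are example assumptions, not values from this thread.

```python
import math

def expected_score(elo: float) -> float:
    """Logistic Elo model: expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins: int, draws: int, losses: int, elo0: float, elo1: float) -> float:
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    if var == 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var / n)

def sprt_bounds(alpha: float = 0.05, beta: float = 0.05) -> tuple[float, float]:
    """Lower (accept H0) and upper (accept H1) LLR thresholds."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)

# Hypothetical record: keep playing until the LLR crosses one of the bounds.
lower, upper = sprt_bounds()
llr = sprt_llr(400, 400, 200, elo0=0.0, elo1=5.0)
if llr >= upper:
    print("pass (H1 accepted)")
elif llr <= lower:
    print("fail (H0 accepted)")
else:
    print("continue testing")
```

The test stops as soon as the LLR leaves the interval between the two bounds, which is why the number of games is not fixed in advance.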

Viz · Post by **Viz** » Thu Apr 25, 2024 4:59 pm

LOS is such stupid BS, and does so much more harm than good, that it's actually hard to believe.
I recall having a patch that was at +13 Elo performance after some 1500 games, only for it to fail red, i.e. end up losing more games than it won. Ofc its LOS was > 99.9%.
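
For context on the metric being argued about: LOS (likelihood of superiority) is commonly computed from wins and losses only, under a normal approximation. The sketch below assumes that standard formula. The function name and the sample counts are illustrative and are not the numbers from the patch mentioned above.

```python
import math

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority from decisive games only (draws carry no information here)."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# Hypothetical snapshot of a running match.
print(f"LOS = {los(200, 150):.4f}")
```

Note that this value describes one snapshot of the match. It says nothing about the size of the gain, and reading it repeatedly while games are still coming in makes a lucky streak look more convincing than it is.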

Viz · Post by **Viz** » Thu Apr 25, 2024 5:00 pm

pgg106 wrote: ↑Thu Apr 25, 2024 4:58 pm You can bulk test changes, everyone just knows better. Mixing a +30 and a -10 Elo change will give you a green and happy sprt totaling "+20" Elo and a landmine in your code that didn't need to be there.
SPRT is better in the sense it guarantees you (enough) statistical significance at the cheapest cost possible, there is no scenario where a fixed number of games is better than it (when trying to test if something gains Elo).

There is one scenario where it can be useful: when you have a test that is around -15 Elo at 10+0.1 and want to see whether it is anywhere near -15 at 60+0.6, you will need fixed games. If that shows it's much better, you can then run 180+1.8 and bigger SPRTs.
But this is a really specific scenario, for really good engines and a lot of hardware.
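
A fixed-games check like the one described above amounts to estimating the Elo difference with an error bar instead of running a sequential test. The following is a rough sketch, assuming a normal approximation on the per-game score and the logistic Elo model. The function names and the sample record are hypothetical.

```python
import math

def score_to_elo(score: float) -> float:
    """Invert the logistic Elo model; score must be strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (low, estimate, high) Elo for a fixed-length match; z = 1.96 is roughly 95%."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    se = math.sqrt(var / n)

    def clamp(s: float) -> float:
        # Keep scores inside (0, 1) so the Elo conversion stays finite.
        return min(max(s, 1e-6), 1.0 - 1e-6)

    return (score_to_elo(clamp(mean - z * se)),
            score_to_elo(clamp(mean)),
            score_to_elo(clamp(mean + z * se)))

# Hypothetical 1000-game result at a longer time control.
low, est, high = elo_interval(300, 400, 300)
print(f"{est:+.1f} Elo, interval [{low:+.1f}, {high:+.1f}]")
```

If the interval at the longer time control clearly excludes the short time control result, that is the signal that scaling behaviour differs and a follow-up SPRT may be worth the hardware.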

hgm · Post by **hgm** » Thu Apr 25, 2024 6:36 pm

Ciekce wrote: ↑Thu Apr 25, 2024 4:51 pm LOS is a useless metric.

If you think that, you don't understand much about statistics.

hgm wrote: ↑Thu Apr 25, 2024 4:25 pm A disadvantage of SPRT is that it only tests a single feature at a time, and is not easy to adapt to testing multiple changes simultaneously. That can make it unnecessarily slow.
Ciekce wrote: ↑Thu Apr 25, 2024 4:51 pm Nor can any other testing methodology - testing multiple changes at once tells you nothing about the individual changes.

Of course you can. You just don't know any. The trick is that you test the alternatives created by the changes in various combinations. If you want to test two changes A and B, you test pure A, pure B and A and B together in comparison with the unchanged version. By adding the results of the tests that only differed in the presence of B, you get twice the number of games that test the effect of A, and by adding the results that only differed in the presence of A you get twice the number of games that test the effect of B. You effectively doubled the number of games per test, with the corresponding increase in significance.
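
One possible reading of the scheme described above is a 2x2 factorial design: the effect of A is estimated from both pairs of versions that differ only in A, and likewise for B. The sketch below assumes each of the four versions has a measured Elo against a common reference. The function name, the dictionary keys and the numbers are hypothetical, and the interaction term is an addition that shows when simple pooling would be misleading.

```python
def factorial_effects(elo: dict[str, float]) -> tuple[float, float, float]:
    """Estimate main effects of A and B, plus their interaction.

    `elo` maps 'base', 'A', 'B', 'AB' to the Elo of that version
    measured against a common reference.
    """
    a_without_b = elo["A"] - elo["base"]   # effect of A when B is absent
    a_with_b = elo["AB"] - elo["B"]        # effect of A when B is present
    b_without_a = elo["B"] - elo["base"]   # effect of B when A is absent
    b_with_a = elo["AB"] - elo["A"]        # effect of B when A is present

    effect_a = (a_without_b + a_with_b) / 2.0
    effect_b = (b_without_a + b_with_a) / 2.0
    # Zero when the changes are independent; non-zero means A behaves
    # differently depending on whether B is present.
    interaction = a_with_b - a_without_b
    return effect_a, effect_b, interaction

# Hypothetical measurements: a +30 change and a -10 change that do not interact.
print(factorial_effects({"base": 0.0, "A": 30.0, "B": -10.0, "AB": 20.0}))
```

Averaging the two differences only doubles the useful games per effect when the interaction is close to zero. A large interaction means the combined version has to be judged on its own.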

pgg106 · Post by **pgg106** » Thu Apr 25, 2024 7:03 pm

Or you just sprt A, merge/not merge, sprt B, merge/not merge and you are done, and it works, and it's been significant enough for more than a decade.

hgm · Post by **hgm** » Thu Apr 25, 2024 7:09 pm

Sure, everything works, if you throw enough games at it.

pgg106 · Post by **pgg106** » Thu Apr 25, 2024 7:11 pm

Except that sprt is the most sound way to throw the smallest amount of games possible at it, so your remark makes absolutely 0 sense.

hgm · Post by **hgm** » Thu Apr 25, 2024 7:19 pm

Well, as I just explained, it is not.

pgg106 · Post by **pgg106** » Thu Apr 25, 2024 7:23 pm

I'm not sure why a wrong explanation that ignores how A interacting with B might fudge the results, and doesn't suggest a way to magically (and soundly) add the results of 2 different tests, should convince anyone of that.
But I'll entertain you and assume such a magical way of testing exists; the new conclusion is that no one should bother doing such a thing, because SPRTs at STC will be far more foolproof and don't require unthinkable amounts of hardware.
There's no use pointing aspiring new devs towards hypothetical magical testing scenarios that 100% outperform the state of the art (but no one uses) when OpenBench exists and it's best by test.
