Stockfish 14.1 regression

carldaman · Post by **carldaman** » Sun Nov 07, 2021 10:08 pm

I know this possibility has been brought up since the release of SF14.1, but with CCRL and CEGT results out, it's now looking more like a certainty. It's not too surprising, either, given the odd buggy behavior where SF struggles to win while a queen up(!), and I wonder how much of this disappointing regression comes from that flaw alone.

Still, one also has to wonder about the testing procedures that the SF framework is using. It all seems to be geared towards quantity (of games) rather than quality, or else something like this wouldn't have been released so quickly.

If I were the Komodo developers, I would be all smiles.

It seems that commercial engines aren't dead after all!

connor_mcmonigle · Post by **connor_mcmonigle** » Sun Nov 07, 2021 10:34 pm

You seem to have misinterpreted CCRL's results. After 697 SF 14.1 games, CCRL Blitz has SF 14.1 at 3684 Elo (while SF 14 is at 3656 Elo on the CCRL Blitz list). Even with 697 games, the CI remains too wide to really say anything with confidence, though it does seem likely SF 14.1 is a fair bit stronger. CCRL 40/15 places SF 14.1 at 3503 Elo after only 330 games and SF 14 at 3505 Elo. Suggesting a -2 Elo regression when the 95% CI is at +/-30 is ridiculous.

Please try to do more research prior to posting misinformation.
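To make the confidence-interval argument concrete, here is a minimal Python sketch of the check being described. It assumes the two rating errors are independent and roughly normal, which is a simplification of how rating lists actually compute their margins. The function names are hypothetical, and because SF 14's own margin is not quoted in the thread, it is set to 0 as the most favourable case for detecting a difference.

```python
import math

def combined_margin(margin_a: float, margin_b: float) -> float:
    """95% margin on the difference of two ratings, assuming independent errors."""
    return math.hypot(margin_a, margin_b)

def is_distinguishable(elo_a: float, margin_a: float,
                       elo_b: float, margin_b: float) -> bool:
    """True only if the rating gap exceeds the combined 95% margin."""
    return abs(elo_a - elo_b) > combined_margin(margin_a, margin_b)

# CCRL 40/15 figures quoted above: SF 14.1 at 3503 (+/-30), SF 14 at 3505.
# ASSUMPTION: SF 14's margin is not given in the thread; 0.0 is the most
# favourable value for the regression claim, and the gap still is not resolved.
diff = 3503 - 3505
margin = combined_margin(30.0, 0.0)
significant = is_distinguishable(3503, 30.0, 3505, 0.0)
print(diff, margin, significant)
```

Any realistic nonzero margin for SF 14 only widens the combined interval, so the conclusion does not depend on the placeholder value.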

Werner · Post by **Werner** » Sun Nov 07, 2021 11:42 pm

CEGT results
40/20
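| Rank | Engine | Elo | + | - | Games | Score | Av. Op. | Draws |
|---|---|---|---|---|---|---|---|---|
| 2 | Stockfish 14.0NNUE x64 1CPU | 3583 | 14 | 14 | 1921 | 65.4% | 3468 | 67.0% |
| 3 | Stockfish 14.1NNUE x64 1CPU | 3578 | 16 | 16 | 1161 | 60.8% | 3498 | 76.1% |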

40/4
Stockfish 14.1 NNUE x64 1CPU = ca. ELO 3685 out of 2000 games (+1 / +27)
Stockfish 20211021 NNUE x64 1CPU = ca. ELO 3684 out of 2000 games
Stockfish 14.0 NNUE x64 1CPU = ELO 3658 out of 3500 games
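As a rough cross-check of the 40/20 rows, the sketch below applies the standard logistic Elo formula to each engine's score and average opposition. It assumes the first percentage in each row is the score and the four-digit number after it is the average opponent rating; CEGT's own rating calculation is more involved, so this is only an approximation. The function name is hypothetical.

```python
import math

def performance_rating(avg_opponent: float, score: float) -> float:
    """Logistic Elo performance estimate from average opposition and score fraction."""
    return avg_opponent + 400.0 * math.log10(score / (1.0 - score))

# ASSUMPTION: columns read as (score %, average opponent Elo) for each row.
sf14_0 = performance_rating(3468, 0.654)  # Stockfish 14.0 row
sf14_1 = performance_rating(3498, 0.608)  # Stockfish 14.1 row
print(round(sf14_0, 1), round(sf14_1, 1))
```

Both estimates land within about five Elo of the listed ratings. Under this reading the two versions also faced different average opposition, which is one more reason a gap of a few Elo between them is hard to interpret.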

carldaman · Post by **carldaman** » Mon Nov 08, 2021 12:42 am

connor_mcmonigle wrote: ↑Sun Nov 07, 2021 10:34 pm You seem to have misinterpreted CCRL's results. After 697 SF 14.1 games, CCRL Blitz has SF 14.1 at 3684 Elo (while SF 14 is at 3656 Elo on the CCRL Blitz list). Even with 697 games, the CI remains too wide to really say anything with confidence, though it does seem likely SF 14.1 is a fair bit stronger. CCRL 40/15 places SF 14.1 at 3503 Elo after only 330 games and SF 14 at 3505 Elo. Suggesting a -2 Elo regression when the 95% CI is at +/-30 is ridiculous.

Please try to do more research prior to posting misinformation.

A regression is suggested by the numbers as well as the engine's behavior. I'm only calling it like I see it.
Of course, it could be a slight and negligible regression in the end, or maybe none at all at faster time controls.
I'm more interested in LTC, anyway, and those are the results I checked.

We are so quick-triggered these days to label things as 'misinformation'. Let's see if the claim is wrong first!
CCRL only requires 200 LTC games to make a rating 'official' and 330 is much more than that, while CEGT has even more games.

carldaman · Post by **carldaman** » Mon Nov 08, 2021 12:53 am

carldaman wrote: ↑Mon Nov 08, 2021 12:42 am
connor_mcmonigle wrote: ↑Sun Nov 07, 2021 10:34 pm You seem to have misinterpreted CCRL's results. After 697 SF 14.1 games, CCRL Blitz has SF 14.1 at 3684 Elo (while SF 14 is at 3656 Elo on the CCRL Blitz list). Even with 697 games, the CI remains too wide to really say anything with confidence, though it does seem likely SF 14.1 is a fair bit stronger. CCRL 40/15 places SF 14.1 at 3503 Elo after only 330 games and SF 14 at 3505 Elo. Suggesting a -2 Elo regression when the 95% CI is at +/-30 is ridiculous.

Please try to do more research prior to posting misinformation.
A regression is suggested by the numbers as well as the engine's behavior. I'm only calling it like I see it.
Of course, it could be a slight and negligible regression in the end, or maybe none at all at faster time controls.
I'm more interested in LTC, anyway, and those are the results I checked. [Edit: I should have made that clear in the original post.]

We are so quick-triggered these days to label things as 'misinformation'. Let's see if the claim is wrong first!
CCRL only requires 200 LTC games to make a rating 'official' and 330 is much more than that, while CEGT has even more games.

DrCliche · Post by **DrCliche** » Mon Nov 08, 2021 1:05 am

carldaman wrote: ↑Mon Nov 08, 2021 12:42 am We are so quick-triggered these days to label things as 'misinformation'. Let's see if the claim is wrong first!

Well, you made wildly exaggerated claims that appear to have no basis in reality. That sounds like misinformation to me ¯\_(ツ)_/¯ .

"Hey now, I can't prove my hyperbolic bullshit, which is objectively almost certainly wrong, but there's a vanishingly small chance I'm right, so let's just wait and see you trigger happy cancelbots!!!!"

carldaman · Post by **carldaman** » Mon Nov 08, 2021 1:13 am

SF14.1 appears slightly lower rated at LTC on two rating lists and is exhibiting buggy behavior on top of that.
It's not unwarranted to call that out. It doesn't sound like wild claims at all to me.
We should monitor the situation to see how things develop. Where there's smoke, there's also often fire.

Being oblivious to such things is ostrich-like behavior, imo.

Graham Banks · Post by **Graham Banks** » Mon Nov 08, 2021 1:32 am

What concerns me (and it could well change) is that after 196 games for the 40/15 list, SF 14.1 64-bit 4CPU is -29 Elo to SF14 64-bit 4CPU.

Yes, the engine is set up correctly, and it's running on the 5950x.

carldaman · Post by **carldaman** » Mon Nov 08, 2021 1:36 am

Thanks for that info, Graham.

Also, I came across this interesting recent post from fishcooking, about regressive patches:

https://groups.google.com/g/fishcooking/c/ucZK0gAGJ68

connor_mcmonigle · Post by **connor_mcmonigle** » Mon Nov 08, 2021 1:46 am

carldaman wrote: ↑Mon Nov 08, 2021 1:13 am SF14.1 appears slightly lower rated at LTC on two rating lists and is exhibiting buggy behavior on top of that.
It's not unwarranted to call that out. It doesn't sound like wild claims at all to me.
We should monitor the situation to see how things develop. Where there's smoke, there's also often fire.

Being oblivious to such things is ostrich-like behavior, imo.

CEGT's results show -5 Elo with a +/-16 Elo CI at LTC. Therefore, we can't come close to claiming a regression given that data. Statistics doesn't care how many games CCRL/CEGT considers sufficient for assigning a rating to an engine. Therefore, the following quote constitutes misinformation in my opinion:

I know this possibility has been brought up since the release of SF14.1, but with CCRL and CEGT results out, it's now looking more like a certainty.

It's far from a certainty.
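One way to put a number on "far from a certainty" is to turn the CEGT 40/20 figures into a rough probability that the true difference is negative. The sketch below assumes the listed margins are 95% intervals, that the two rating errors are independent, and that the difference is approximately normal; none of this is confirmed by CEGT, and the function name is hypothetical.

```python
import math

def prob_true_diff_negative(diff: float, margin_a: float, margin_b: float,
                            z95: float = 1.96) -> float:
    """P(true difference < 0) under a normal model centred on the observed diff."""
    # Convert the combined 95% margin back to one standard deviation.
    sigma = math.hypot(margin_a, margin_b) / z95
    # Normal CDF evaluated at 0 for a N(diff, sigma^2) distribution.
    return 0.5 * (1.0 + math.erf((0.0 - diff) / (sigma * math.sqrt(2.0))))

# CEGT 40/20 figures from the thread: SF 14.1 at 3578 (+/-16), SF 14.0 at 3583 (+/-14).
p_regression = prob_true_diff_negative(3578 - 3583, 16.0, 14.0)
print(f"{p_regression:.2f}")
```

Under these assumptions the probability comes out at roughly two in three: the data leans slightly towards a small regression, but it leaves a substantial chance that there is none.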
