The Stockfish ELO problem

Rebel · Post by **Rebel** » Sat Aug 06, 2022 10:04 pm

Looking at the CCRL and CEGT rating lists I started to wonder why there is so little progress from version 13 to 15 and decided to do a scaling comparison between Stockfish and Komodo.

CCRL 40m/15m                   CEGT 40m/20m
Stockfish 15   4CPU   3538     Stockfish 15   1CPU   3592
Stockfish 14   4CPU   3537     Stockfish 14.1 1CPU   3578
Stockfish 13   4CPU   3536     Stockfish 14   1CPU   3575
Stockfish 14.1 4CPU   3522     Stockfish 13   1CPU   3563

CCRL 40m/2m                    CEGT 40m/2m
Stockfish 15   1CPU   3691     Stockfish 14.1 1CPU   3649
Stockfish 14.1 1CPU   3679     Stockfish 14   1CPU   3646
Stockfish 14   1CPU   3650     Stockfish 15   1CPU   3645
Stockfish 13   1CPU   3624     Stockfish 13   1CPU   3600

|-------------------------|----------------|----------------|----------------|-----------------|
|         TC=40/10    ELO | TC=40/10   ELO | TC=40/20   ELO | TC=40/40   ELO | TC=40/80   ELO  | 
|-------------------------|----------------|----------------|----------------|-----------------|
| Engine     1 CPU        |    20 CPU      |    20 CPU      |    20 CPU      |    20 CPU       | 
|-------------------------|----------------|----------------|----------------|-----------------|
| SF15       61.7%    +82 |   58.3%    +58 |   54.8%    +33 |   55.1%    +35 |   53.7%    +26  |
| SF14.1     54.1%    +31 |   53.3%    +23 |   51.8%    +12 |   51.3%     +9 |   50.3%     +2  |
| SF14       48.5%    -10 |   46.3%    -26 |   48.7%     -9 |   47.6%    -16 |   48.7%     -9  |
| SF13       35.7%   -100 |   42.0%    -56 |   44.7%    -37 |   46.0%    -28 |   47.3%    -19  |
|-------------------------|----------------|----------------|----------------|-----------------|
| Dragon 2.5 66.2%   +113 |   58.0%    +56 |   58.7%    +61 |   58.5%    +59 |  Done, Komodo   |
| Dragon 2.0 33.8%   -113 |   42.0%    -56 |   41.3%    -61 |   41.5%    -59 |  scales well    |
|-------------------------|----------------|----------------|----------------|-----------------|
|                         |     Equals     |     Equals     |     Equals     |     Equals      |
|                         |  40m in 3m/20s |  40m in 6m/40s |  40m in 13m    |  40m in 26m     |
|-------------------------|----------------|----------------|----------------|-----------------|
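
For readers who want to check the table, the ELO columns can be approximated from the score percentages with the standard logistic Elo formula. This is only a minimal sketch, not the tool used to produce the table: the function name `elo_from_score` is made up here, and the table's values presumably come from a rating program, so small differences (for example at 66.2%) are to be expected.

```python
import math

def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction in (0, 1), logistic model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

# Score percentages copied from the SF15 row of the table above.
for pct in (61.7, 58.3, 54.8, 55.1, 53.7):
    print(f"{pct:5.1f}% -> {elo_from_score(pct / 100.0):+6.1f} Elo")
```

The formula is symmetric, so a 40% score gives the same magnitude as a 60% score with the opposite sign.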

Some remarks:
1. Komodo scales extremely well (+56, +61, +59).
2. SF15 went down from +82 to +26 (the last SF run, equal to 40 moves in 26 minutes, is about CCRL 40/15 | CEGT 40/20 at 2 CPUs).
3. SF13 went up from -100 to -19.
4. The draw rate in the last SF run was 91.7%, but SF15 never lost a game.

Advice for the SF team: work on scaling.
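
Whether a gap such as +26 is meaningful depends on the number of games, which the post does not give. The sketch below computes an Elo estimate with approximate 95% bounds from win/draw/loss counts, using a normal approximation. The counts in the usage example (83 wins, 917 draws, 0 losses) are invented purely to mimic a 91.7% draw rate with no losses; they are not Rebel's data, and `elo_with_margin` is a hypothetical helper.

```python
import math

def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction in (0, 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_margin(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (score, elo, elo_low, elo_high) from W/D/L counts.

    Assumes score +/- margin stays strictly inside (0, 1).
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    margin = z * math.sqrt(var / n)
    return (score,
            elo_from_score(score),
            elo_from_score(score - margin),
            elo_from_score(score + margin))

# Hypothetical counts, NOT taken from the thread.
score, elo, low, high = elo_with_margin(wins=83, draws=917, losses=0)
print(f"score={score:.4f} elo={elo:+.1f} 95% range=[{low:+.1f}, {high:+.1f}]")
```

A high draw rate lowers the per-game variance, so the bounds are tighter than for the same number of games with many decisive results.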

xr_a_y · Post by **xr_a_y** » Sat Aug 06, 2022 10:42 pm

If I'm not mistaken, the list at https://tcec-chess.com/bayeselo.txt contains various 32-thread, node-limited SF versions.
SF seems to scale well with node limits (and thus with TC).

I don't know how this fits with your analysis?

jhellis3 · Post by **jhellis3** » Sun Aug 07, 2022 12:06 am

If you play better moves earlier, you scale worse.

AndrewGrant · Post by **AndrewGrant** » Sun Aug 07, 2022 12:43 am

I think this analysis is bunk because I don't trust the samples from CCRL and CEGT, due to repeat engines like Fat Fritz, as well as other "repeat" engines that people argue are not "repeat" engines because they are clueless. If you want to compare scaling against a pool of opponents, do exactly that. Get the same opponents. Run the same games, same openings, same machines.

dkappe · Post by **dkappe** » Sun Aug 07, 2022 1:59 am

AndrewGrant wrote: ↑Sun Aug 07, 2022 12:43 am I think this analysis is bunk because I don't trust the samples from CCRL and CEGT, due to repeat engines like Fat Fritz, as well as other "repeat" engines that people argue are not "repeat" engines because they are clueless. If you want to compare scaling against a pool of opponents, do exactly that. Get the same opponents. Run the same games, same openings, same machines.

I think you are making an argument that the engines are “repeat” on ethical grounds, not statistical grounds. Maybe Ed can help you out with similarity measures?

Graham Banks · Post by **Graham Banks** » Sun Aug 07, 2022 2:25 am

I'm playing Stockfish 060122 against the engines in the top 12 crosstable that it hasn't played yet.
Hopefully, it will then fall behind Stockfish 15 on the 40/15 list.

All of the games on the top 12 crosstable have been run by me on my 5950x, except for a tiny handful that were run in my Amateur series on my i7-4770k.

AndrewGrant · Post by **AndrewGrant** » Sun Aug 07, 2022 2:33 am

dkappe wrote: ↑Sun Aug 07, 2022 1:59 am
AndrewGrant wrote: ↑Sun Aug 07, 2022 12:43 am I think this analysis is bunk because I don't trust the samples from CCRL and CEGT, due to repeat engines like Fat Fritz, as well as other "repeat" engines that people argue are not "repeat" engines because they are clueless. If you want to compare scaling against a pool of opponents, do exactly that. Get the same opponents. Run the same games, same openings, same machines.
I think you are making an argument that the engines are “repeat” on ethical grounds, not statistical grounds. Maybe Ed can help you out with similarity measures?

I do mean statistical. If you test SF/Komodo against the pool of { Stockfish, Komodo, Houdini, Sugar, Shashchess, Fat Fritz II, Fire, Ethereal, Leela, Berserk, Koivisto }, that pool is heavily skewed towards a Stockfish engine. Which means any result could be a result of a particular ability or inability to play against Stockfish. Competing hypothesis for the results seen.

I don't mean ethical, which is why I posted a list of engines above and a reader can make their own determination how much of the pool is Stockfish.

dkappe · Post by **dkappe** » Sun Aug 07, 2022 2:38 am

AndrewGrant wrote: ↑Sun Aug 07, 2022 2:33 am I do mean statistical. If you test SF/Komodo against the pool of { Stockfish, Komodo, Houdini, Sugar, Shashchess, Fat Fritz II, Fire, Ethereal, Leela, Berserk, Koivisto }, that pool is heavily skewed towards a Stockfish engine. Which means any result could be a result of a particular ability or inability to play against Stockfish. Competing hypothesis for the results seen.

I don't mean ethical, which is why I posted a list of engines above and a reader can make their own determination how much of the pool is Stockfish.

I’d be curious to see your tests on the similarity of SF and FF2. Any pgn’s you could share?

AndrewGrant · Post by **AndrewGrant** » Sun Aug 07, 2022 2:39 am

dkappe wrote: ↑Sun Aug 07, 2022 2:38 am
AndrewGrant wrote: ↑Sun Aug 07, 2022 2:33 am I do mean statistical. If you test SF/Komodo against the pool of { Stockfish, Komodo, Houdini, Sugar, Shashchess, Fat Fritz II, Fire, Ethereal, Leela, Berserk, Koivisto }, that pool is heavily skewed towards a Stockfish engine. Which means any result could be a result of a particular ability or inability to play against Stockfish. Competing hypothesis for the results seen.

I don't mean ethical, which is why I posted a list of engines above and a reader can make their own determination how much of the pool is Stockfish.
I’d be curious to see your tests on the similarity of SF and FF2. Any pgn’s you could share?

Much easier to derive similarity from knowing it's the same code than from looking at PGNs. That can be left as an exercise for the reader as well.

dkappe · Post by **dkappe** » Sun Aug 07, 2022 4:16 am

AndrewGrant wrote: ↑Sun Aug 07, 2022 2:39 am
dkappe wrote: ↑Sun Aug 07, 2022 2:38 am
AndrewGrant wrote: ↑Sun Aug 07, 2022 2:33 am I do mean statistical. If you test SF/Komodo against the pool of { Stockfish, Komodo, Houdini, Sugar, Shashchess, Fat Fritz II, Fire, Ethereal, Leela, Berserk, Koivisto }, that pool is heavily skewed towards a Stockfish engine. Which means any result could be a result of a particular ability or inability to play against Stockfish. Competing hypothesis for the results seen.

I don't mean ethical, which is why I posted a list of engines above and a reader can make their own determination how much of the pool is Stockfish.
I’d be curious to see your tests on the similarity of SF and FF2. Any pgn’s you could share?
Much easier to derive similarity from knowing it's the same code than from looking at PGNs. That can be left as an exercise for the reader as well.

You don’t have any evidence? By that logic, Ethereal would never have improved as its nets improved, since the engine code remained mostly the same.
