The Stockfish ELO problem

Rebel
Posts: 7353
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

The Stockfish ELO problem

Post by Rebel »

Looking at the CCRL and CEGT rating lists, I started to wonder why there is so little progress from version 13 to 15, so I decided to do a scaling comparison between Stockfish and Komodo.

CCRL 40m/15m                 CEGT 40m/20m
Stockfish 15   4CPU   3538   Stockfish 15   1CPU   3592
Stockfish 14   4CPU   3537   Stockfish 14.1 1CPU   3578
Stockfish 13   4CPU   3536   Stockfish 14   1CPU   3575
Stockfish 14.1 4CPU   3522   Stockfish 13   1CPU   3563

CCRL 40m/2m                  CEGT 40m/2m
Stockfish 15   1CPU   3691   Stockfish 14.1 1CPU   3649
Stockfish 14.1 1CPU   3679   Stockfish 14   1CPU   3646
Stockfish 14   1CPU   3650   Stockfish 15   1CPU   3645
Stockfish 13   1CPU   3624   Stockfish 13   1CPU   3600

|-------------------------|----------------|----------------|----------------|-----------------|
|         TC=40/10    ELO | TC=40/10   ELO | TC=40/20   ELO | TC=40/40   ELO | TC=40/80   ELO  | 
|-------------------------|----------------|----------------|----------------|-----------------|
| Engine     1 CPU        |    20 CPU      |    20 CPU      |    20 CPU      |    20 CPU       | 
|-------------------------|----------------|----------------|----------------|-----------------|
| SF15       61.7%    +82 |   58.3%    +58 |   54.8%    +33 |   55.1%    +35 |   53.7%    +26  |
| SF14.1     54.1%    +31 |   53.3%    +23 |   51.8%    +12 |   51.3%     +9 |   50.3%     +2  |
| SF14       48.5%    -10 |   46.3%    -26 |   48.7%     -9 |   47.6%    -16 |   48.7%     -9  |
| SF13       35.7%   -100 |   42.0%    -56 |   44.7%    -37 |   46.0%    -28 |   47.3%    -19  |
|-------------------------|----------------|----------------|----------------|-----------------|
| Dragon 2.5 66.2%   +113 |   58.0%    +56 |   58.7%    +61 |   58.5%    +59 |  Done, Komodo   |
| Dragon 2.0 33.8%   -113 |   42.0%    -56 |   41.3%    -61 |   41.5%    -59 |  scales well    |
|-------------------------|----------------|----------------|----------------|-----------------|
|                         |     Equals     |     Equals     |     Equals     |     Equals      |
|                         |  40m in 3m/20s |  40m in 6m/40s |  40m in 13m    |  40m in 26m     |
|-------------------------|----------------|----------------|----------------|-----------------|
Some remarks:
1. Komodo scales extremely well (+56, +61, +59).
2. SF15 went down from +82 to +26. The last SF run, about 40 moves in 26 minutes, corresponds roughly to CCRL 40/15 or CEGT 40/20 on 2 CPUs.
3. SF13 went up from -100 to -19.
4. The draw rate in the last SF run was 91.7%, but SF15 never lost a game.

Advice for the SF team: work on scaling.
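
For readers who want to check the percentages in the table against the ELO columns, here is a minimal sketch in Python. It assumes the standard logistic Elo model; the function names `elo_from_score` and `win_draw_split_without_losses` are made up for this illustration and are not taken from any rating tool. The numbers in the table came out of a multi-engine run, so they need not match this two-player formula to the point.

```python
import math


def elo_from_score(score):
    """Elo difference implied by a score fraction (0 < score < 1),
    assuming the standard logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def win_draw_split_without_losses(score):
    """Win and draw fractions implied by a score when no game was lost.

    With zero losses, score = wins + draws / 2 and wins + draws = 1,
    so wins = 2 * score - 1.
    """
    if not 0.5 <= score <= 1.0:
        raise ValueError("a score without losses must lie in [0.5, 1]")
    wins = 2.0 * score - 1.0
    return wins, 1.0 - wins


if __name__ == "__main__":
    # SF15 percentages from the 20 CPU columns of the table above.
    for pct in (58.3, 54.8, 55.1, 53.7):
        print(f"{pct:.1f}% -> {elo_from_score(pct / 100.0):+.1f} Elo")

    # SF15 in the longest run: 53.7% with no losses.
    wins, draws = win_draw_split_without_losses(0.537)
    print(f"wins {wins:.1%}, draws {draws:.1%}")
```

The second function shows why a small plus score with no losses still means a very high draw rate: every point above 50% has to come from wins alone.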
90% of coding is debugging, the other 10% is writing bugs.
xr_a_y
Posts: 1872
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: The Stockfish ELO problem

Post by xr_a_y »

If I'm not mistaken, the list at https://tcec-chess.com/bayeselo.txt
includes various 32-thread, node-limited SF entries.
SF seems to scale well with node limits (and thus with TC).

I don't know how this fits with your analysis?
jhellis3
Posts: 548
Joined: Sat Aug 17, 2013 12:36 am

Re: The Stockfish ELO problem

Post by jhellis3 »

If you play better moves earlier, you scale worse ;).
AndrewGrant
Posts: 1960
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: The Stockfish ELO problem

Post by AndrewGrant »

I think this analysis is bunk because I don't trust the samples from CCRL and CEGT, due to repeat engines like Fat Fritz, as well as other "repeat" engines that people argue are not "repeat" engines because they are clueless. If you want to compare scaling against a pool of opponents, do exactly that. Get the same opponents. Run the same games, same openings, same machines.
dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: The Stockfish ELO problem

Post by dkappe »

AndrewGrant wrote: Sun Aug 07, 2022 12:43 am I think this analysis is bunk because I don't trust the samples from CCRL and CEGT, due to repeat engines like Fat Fritz, as well as other "repeat" engines that people argue are not "repeat" engines because they are clueless. If you want to compare scaling against a pool of opponents, do exactly that. Get the same opponents. Run the same games, same openings, same machines.
I think you are making an argument that the engines are “repeat” on ethical grounds, not statistical grounds. Maybe Ed can help you out with similarity measures?
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
Graham Banks
Posts: 44371
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: The Stockfish ELO problem

Post by Graham Banks »

I'm playing Stockfish 060122 against the engines in the top 12 crosstable that it hasn't played yet.
Hopefully, it will then fall behind Stockfish 15 on the 40/15 list.

All of the games on the top 12 crosstable have been run by me on my 5950x, except for a tiny handful that were run in my Amateur series on my i7-4770k.
gbanksnz at gmail.com
AndrewGrant
Posts: 1960
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: The Stockfish ELO problem

Post by AndrewGrant »

dkappe wrote: Sun Aug 07, 2022 1:59 am
AndrewGrant wrote: Sun Aug 07, 2022 12:43 am I think this analysis is bunk because I don't trust the samples from CCRL and CEGT, due to repeat engines like Fat Fritz, as well as other "repeat" engines that people argue are not "repeat" engines because they are clueless. If you want to compare scaling against a pool of opponents, do exactly that. Get the same opponents. Run the same games, same openings, same machines.
I think you are making an argument that the engines are “repeat” on ethical grounds, not statistical grounds. Maybe Ed can help you out with similarity measures?
I do mean statistical. If you test SF/Komodo against the pool of { Stockfish, Komodo, Houdini, Sugar, Shashchess, Fat Fritz II, Fire, Ethereal, Leela, Berserk, Koivisto }, that pool is heavily skewed towards a Stockfish engine. Which means any result could be a result of a particular ability or inability to play against Stockfish. Competing hypothesis for the results seen.

I don't mean ethical, which is why I posted a list of engines above and a reader can make their own determination how much of the pool is Stockfish.
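
The statistical point can be sketched in a few lines of Python. Everything here is hypothetical: the family labels, the per-family scores, and the pool shares are invented purely for illustration and are not measurements of any engine. The sketch only shows that the same engine gets a different pool-level result when the mix of opponents changes.

```python
import math


def elo_from_score(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def pool_score(score_by_family, share_by_family):
    """Average score against a pool, weighting each opponent family
    by its share of the games played."""
    total = sum(share_by_family.values())
    weighted = sum(score_by_family[f] * share_by_family[f]
                   for f in share_by_family)
    return weighted / total


# HYPOTHETICAL inputs, not real results: one engine's score against
# two opponent families.
scores = {"family_a": 0.55, "family_b": 0.65}

# Two hypothetical pools that differ only in composition.
skewed_pool = {"family_a": 0.7, "family_b": 0.3}
balanced_pool = {"family_a": 0.5, "family_b": 0.5}

if __name__ == "__main__":
    for name, pool in (("skewed", skewed_pool), ("balanced", balanced_pool)):
        s = pool_score(scores, pool)
        print(f"{name}: score {s:.3f}, {elo_from_score(s):+.1f} Elo")
```

If an engine does relatively better or worse against one family, the more that family dominates the pool, the more the list rating reflects that one matchup.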
dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: The Stockfish ELO problem

Post by dkappe »

AndrewGrant wrote: Sun Aug 07, 2022 2:33 am I do mean statistical. If you test SF/Komodo against the pool of { Stockfish, Komodo, Houdini, Sugar, Shashchess, Fat Fritz II, Fire, Ethereal, Leela, Berserk, Koivisto }, that pool is heavily skewed towards a Stockfish engine. Which means any result could be a result of a particular ability or inability to play against Stockfish. Competing hypothesis for the results seen.

I don't mean ethical, which is why I posted a list of engines above and a reader can make their own determination how much of the pool is Stockfish.
I’d be curious to see your tests on the similarity of SF and FF2. Any pgn’s you could share?
AndrewGrant
Posts: 1960
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: The Stockfish ELO problem

Post by AndrewGrant »

dkappe wrote: Sun Aug 07, 2022 2:38 am
AndrewGrant wrote: Sun Aug 07, 2022 2:33 am I do mean statistical. If you test SF/Komodo against the pool of { Stockfish, Komodo, Houdini, Sugar, Shashchess, Fat Fritz II, Fire, Ethereal, Leela, Berserk, Koivisto }, that pool is heavily skewed towards a Stockfish engine. Which means any result could be a result of a particular ability or inability to play against Stockfish. Competing hypothesis for the results seen.

I don't mean ethical, which is why I posted a list of engines above and a reader can make their own determination how much of the pool is Stockfish.
I’d be curious to see your tests on the similarity of SF and FF2. Any pgn’s you could share?
It's much easier to derive similarity from knowing it's the same code than from looking at PGNs. That can be left as an exercise for the reader as well.
dkappe
Posts: 1632
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: The Stockfish ELO problem

Post by dkappe »

AndrewGrant wrote: Sun Aug 07, 2022 2:39 am
dkappe wrote: Sun Aug 07, 2022 2:38 am
AndrewGrant wrote: Sun Aug 07, 2022 2:33 am I do mean statistical. If you test SF/Komodo against the pool of { Stockfish, Komodo, Houdini, Sugar, Shashchess, Fat Fritz II, Fire, Ethereal, Leela, Berserk, Koivisto }, that pool is heavily skewed towards a Stockfish engine. Which means any result could be a result of a particular ability or inability to play against Stockfish. Competing hypothesis for the results seen.

I don't mean ethical, which is why I posted a list of engines above and a reader can make their own determination how much of the pool is Stockfish.
I’d be curious to see your tests on the similarity of SF and FF2. Any pgn’s you could share?
Much easier to derive similarity from knowing its the same code, than from looking at PGNs. That can be left as an exercise for the reader as well.
You don’t have any evidence? By that logic, Ethereal would never improve as its nets improved, since the engine code remained mostly the same.