The Stockfish ELO problem

Rebel wrote: ↑Sun Aug 07, 2022 8:21 pm
http://www.computerchess.org.uk/ccrl/40 ... ons_only=1
Meaning that at longer time controls and with more threads Komodo can catch up and overtake you? Oh wait, it already happened!

Sopel wrote: ↑Sun Aug 07, 2022 9:43 pm
You can come up with any result with a flawed enough methodology. This has the same issues as CCRL.

RubiChess wrote: ↑Sun Aug 07, 2022 10:02 pm
The main issue in this rating list seems to be that SF15/4threads wasn't tested, only 14.1. At least, SF15/4CPU is not mentioned in http://www.cegt.net/40_40%20Rating%20Li ... liste.html

But as this list also uses a moves/time control, I want to mention this https://github.com/official-stockfish/S ... ssues/4000 again.

Regards, Andreas
-
Graham Banks
- Posts: 45075
- Joined: Sun Feb 26, 2006 10:52 am
- Location: Auckland, NZ
Re: The Stockfish ELO problem
gbanksnz at gmail.com
-
jkominek
- Posts: 98
- Joined: Tue Sep 04, 2018 5:33 am
- Full name: John Kominek
Re: The Stockfish ELO problem
A question for Ed. I downloaded the games played from your Gambit Rating List competition (mainbase-40-2.pgn, dated Dec 15 2021) and processed it to extract the book moves. I find 92 unique lines. Since you typically play 200 games per encounter, either your book is under-specified and these matches play 8 duplicate pairs, or I've done something wrong and am missing 8 lines. To cross-check my analysis I looked around for the book on your Rebel web site but could not spot it. Do you have a pgn of your Gambit book posted online?

Rebel wrote:
CCRL and CEGT rely on normal openings, TCEC does not. Other examples using unusual openings are the lists of Stefan Pohl and my own GRL. All 3 don't show the CCRL / CEGT pattern and have no problem showing significant Elo progress (example). Unusual openings favor Stockfish search.

By my counting grl-20-cores.pgn contains 91 unique opening lines, none novel to the main file.
-
Rebel
- Posts: 7430
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: The Stockfish ELO problem
jkominek wrote: ↑Wed Aug 24, 2022 8:09 am
A question for Ed. I downloaded the games played from your Gambit Rating List competition (mainbase-40-2.pgn, dated Dec 15 2021) and processed it to extract the book moves. I find 92 unique lines. Since you typically play 200 games per encounter, either your book is under-specified and these matches play 8 duplicate pairs, or I've done something wrong and am missing 8 lines. To cross-check my analysis I looked around for the book on your Rebel web site but could not spot it. Do you have a pgn of your Gambit book posted online?
By my counting grl-20-cores.pgn contains 91 unique opening lines, none novel to the main file.

http://rebel13.nl/gambits-100.pgn

My double-check PGN util says: 0 doubles.
90% of coding is debugging, the other 10% is writing bugs.
-
jkominek
- Posts: 98
- Joined: Tue Sep 04, 2018 5:33 am
- Full name: John Kominek
Re: The Stockfish ELO problem
Thank you very much Ed!
It could be that your PGN utility is configured to look for exact doubles, including Event, Site and Date fields. This would be a problem if you pulled the lines from multiple sources. One example of a duplicate pair is Round 50 and Round 68:
Code: Select all
[Event "YAT"]
[Site "Deventer"]
[Date "2021.04.21"]
[Round "50"]
[White ""]
[Black ""]
[Result "*"]
[BlackElo ""]
[WhiteElo ""]
1. d4 d5 2. c4 c6 3. Nf3 Nf6 4. Nc3 e6 5. Bg5 dxc4 6. e4 b5 7. e5 h6 8.
Bh4 g5 9. Nxg5 hxg5 10. Bxg5 Nbd7 11. exf6 Bb7 12. g3 Qb6 13. Bg2 O-O-O
14. O-O c5 15. d5 b4 16. Rb1 *
[Event "Noomen Twenty Gambits 2016"]
[Site "Netherlands"]
[Date "2021.00.49"]
[Round "68"]
[White "Semi-Slav"]
[Black "Botwinnik variation"]
[Result "*"]
[BlackElo ""]
[WhiteElo ""]
1. d4 d5 2. c4 c6 3. Nf3 Nf6 4. Nc3 e6 5. Bg5 dxc4 6. e4 b5 7. e5 h6 8.
Bh4 g5 9. Nxg5 hxg5 10. Bxg5 Nbd7 11. exf6 Bb7 12. g3 Qb6 13. Bg2 O-O-O
14. O-O c5 15. d5 b4 16. Rb1 *

The duplicated lines, condensed to moves only:

Code: Select all
d4 d5 c4 c6 Nf3 Nf6 Nc3 e6 Bg5 dxc4 e4 b5 e5 h6 Bh4 g5 Nxg5 hxg5 Bxg5 Nbd7 exf6 Bb7 g3 Qb6 Bg2 O-O-O O-O c5 d5 b4 Rb1
d4 d5 c4 e6 Nc3 c6 e4 dxe4 Nxe4 Bb4+ Bd2 Qxd4 Bxb4 Qxe4+
d4 Nf6 Bg5 Ne4 Bf4 c5 f3 Qa5+ c3 Nf6 d5 Qb6 e4 Qxb2 Nd2 Qxc3 Bc7
d4 Nf6 c4 c5 d5 b5 cxb5 a6 f3 e6 e4 exd5 e5 Qe7 Qe2 Ng8 Nc3 Bb7 Nh3 c4
d4 Nf6 c4 c5 d5 e6 Nf3 b5 dxe6 fxe6 cxb5 a6 e3 Be7
e4 c6 d4 d5 f3 dxe4 fxe4 e5 Nf3 exd4 Bc4
e4 d5 exd5 Nf6 d4 Bg4 f3 Bf5 c4 e6 dxe6 Nc6 Be3 Qe7
e4 e5 Nf3 Nc6 Bb5 f5 Nc3 fxe4 Nxe4 d5 Nxe5 dxe4 Nxc6 Qg5 Qe2 Nf6 f4 Qxf4 Nxa7+ Bd7 Bxd7+ Kxd7 Qb5+ Ke6
e4 e5 Nf3 Nc6 Bc4 Nf6 Ng5 d5 exd5 Na5 Bb5+ c6 dxc6 bxc6 Be2 h6 Nh3
e4 e6 d4 d5 Nc3 Bb4 Qg4 Nf6 Qxg7 Rg8 Qh6 c5
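Not from the original post, but a movetext-only duplicate check of this sort is easy to script with standard tools. A sketch with awk, assuming the games in the file are separated by blank lines as in standard PGN:

Code: Select all
# Print each game's movetext as one line, ignoring the [Event]/[Site]/
# [Date] header tags, then report lines that occur more than once.
# RS="" (paragraph mode) makes every blank-line-separated block one
# record; movetext blocks are the ones not starting with '['.
awk 'BEGIN { RS = "" } !/^\[/ { gsub(/\n/, " "); print }' gambits-100.pgn |
  sort | uniq -c | awk '$1 > 1'
# Should list the 10 duplicated lines, assuming each occurs exactly twice
# (101 games = 81 singletons + 10 pairs = 91 unique lines).

The grep count below shows the book file indeed holds 101 games, consistent with 91 unique lines plus 10 duplicates.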
Code: Select all
data/chess/games/Rebel$ grep Event gambits-100.pgn | wc -l
101

-
Rebel
- Posts: 7430
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: The Stockfish ELO problem
You are right, I will look into it, thanks.
90% of coding is debugging, the other 10% is writing bugs.
-
Jouni
- Posts: 3758
- Joined: Wed Mar 08, 2006 8:15 pm
- Full name: Jouni Uski
Re: The Stockfish ELO problem
There is an interesting experiment at https://www.melonimarco.it/en/2021/03/0 ... -of-nodes/ (no date). According to the test, SF reaches practically its max Elo of 3500 after only 10M nodes. 1 second on a modern CPU! After that, only marginal gain.
Jouni
-
jkominek
- Posts: 98
- Joined: Tue Sep 04, 2018 5:33 am
- Full name: John Kominek
Re: The Stockfish ELO problem
Jouni wrote: ↑Fri Sep 02, 2022 3:34 pm
There is an interesting experiment at https://www.melonimarco.it/en/2021/03/0 ... -of-nodes/ (no date). According to the test, SF reaches practically its max Elo of 3500 after only 10M nodes. 1 second on a modern CPU! After that, only marginal gain.

I like his blog post, and have conducted a very similar experiment myself using fixed node counts (not including LC0), updated to Stockfish 15. My measurements reveal the same relationship between the NNUE and HCE curves, with HCE almost reaching NNUE at 256M nodes. I plan to push the node count over a billion to see how much the gap closes under convergence. The notable conclusion is that the evaluation net does not gift Stockfish an appreciably higher asymptote. What it does dramatically accomplish is to assist in finding good moves at a much, much lower node count, hence time, even at a 2:1 nps ratio. The whole nodes (or time) vs. Elo curve is pushed leftward.
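For anyone wanting to run this kind of fixed-node measurement themselves, here is a cutechess-cli sketch (engine path, book file, and node budget are placeholders; Stockfish 15 still exposes the "Use NNUE" UCI option to switch between the net and the classical evaluation):

Code: Select all
# Fixed-node match, NNUE vs. HCE, both sides the same Stockfish 15 binary.
# tc=inf plus nodes=N makes each engine search exactly N nodes per move.
cutechess-cli \
  -engine name=SF15-NNUE cmd=./stockfish15 "option.Use NNUE=true" \
  -engine name=SF15-HCE  cmd=./stockfish15 "option.Use NNUE=false" \
  -each proto=uci tc=inf nodes=10000000 \
  -openings file=book.pgn order=random -repeat -games 2 -rounds 500 \
  -pgnout sf15-nnue-vs-hce-10m.pgn

Repeating the run at several node budgets (1M, 10M, 100M, ...) traces out the two curves discussed above.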
Also worthwhile is his companion blog post: https://www.melonimarco.it/en/2021/10/0 ... ntium-90/
His rating list is notable for using a classical time control of 40 moves/120 min (not seconds), and for anchoring the scale to human-computer matches. Included in his calibration pool are Rebel matches. Ed S. had the rare privilege of playing Anand back in the 90s.
https://www.rebel.nl/anand.htm.
Those were quite the distinctive cartoons, btw. I've long wondered what the background story is behind the Rebel cartoons.
-
Modern Times
- Posts: 3780
- Joined: Thu Jun 07, 2012 11:02 pm
Re: The Stockfish ELO problem
That is based on a Pentium 90. His actual time control is 40 moves/125” or 40/130”, so 40 moves in just over two minutes:
Time for each match has been fixed to 40 moves/120 minutes repeated, calibrated on a Pentium 90 processing power. The processing power has been emulated, after estimation by using benchmarks with real P90 results. Accordingly, on modern PC the effective match time was fixed to 40 moves/125” or 40/130” depending on the PC
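To put a number on it (my arithmetic, not from the blog): 40 moves in 120 minutes is 7,200 seconds per repetition, so an effective 125-130 seconds implies an assumed speed ratio of roughly 7200/127 ≈ 57, i.e. a modern PC is treated as about 55-58 times a Pentium 90.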
-
jkominek
- Posts: 98
- Joined: Tue Sep 04, 2018 5:33 am
- Full name: John Kominek
Re: The Stockfish ELO problem
AndrewGrant wrote: ↑Sun Aug 07, 2022 12:43 am
If you want to compare scaling against a pool of opponents, do exactly that. Get the same opponents. Run the same games, same openings, same machines.

Andrew made the fair suggestion of drawing a comparison from an identical pool of opponents under identical test conditions. Short of conducting an extensive new experiment, we can approximate it by filtering the CCRL opponent pool. Stockfish 15 and Komodo Dragon 3.1 have an intersection set of 14 engines. Restricting to this subset eliminates one source of variability and may help a clearer picture emerge. Earlier versions of Stockfish and Komodo cannot be included in this comparison, as they were paired with different (versions of) engines and so have little overlap.

Here are the ratings I calculate using ordo. To set the scale, each list is single-point anchored to Houdini 6.
Code: Select all
# PLAYER : RATING ERROR POINTS PLAYED (%)
1 Stockfish 15 1cpu : 3621.2 11.6 630.5 932 68
2 SugaR AI 2.50 1cpu : 3619.4 17.7 52.0 104 50
3 KomodoDragon 3.1 1cpu : 3617.5 11.8 584.0 870 67
4 Fat Fritz 2 1cpu : 3595.2 19.3 54.0 116 47
5 Ethereal 13.75 1cpu : 3543.1 28.3 44.0 112 39
6 Revenge 3.0 1cpu : 3533.2 28.7 42.5 112 38
7 SlowChess 2.9 1cpu : 3502.4 32.0 41.0 121 34
8 Koivisto 8.0 1cpu : 3486.0 32.2 38.5 121 32
9 Berserk 9 1cpu : 3476.0 36.0 35.5 116 31
10 RubiChess 20220223 1cpu : 3469.0 34.4 36.0 121 30
11 RofChade 3.0 1cpu : 3449.5 31.1 48.0 175 27
12 Seer 2.5.0 1cpu : 3445.7 35.7 33.0 122 27
13 Arasan 23.4 1cpu : 3432.4 36.2 38.5 150 26
14 Minic 3.24 1cpu : 3422.6 42.2 26.5 108 25
15 Rebel 15.1 1cpu : 3413.7 42.3 25.5 108 24
16 Houdini 6 1cpu : 3327.0 56.6 16.5 104 16
White advantage = 3.46 +/- 4.54
Draw rate (equal opponents) = 94.09 % +/- 1.42

# PLAYER : RATING ERROR POINTS PLAYED (%)
1 Stockfish 15 4cpu : 3598.2 13.6 357.0 550 65
2 KomodoDragon 3.1 4cpu : 3588.9 14.0 325.0 512 63
3 SugaR AI 2.50 4cpu : 3588.1 18.3 31.5 64 49
4 Fat Fritz 2 4cpu : 3577.1 20.1 30.5 64 48
5 Revenge 3.0 4cpu : 3549.5 30.3 28.0 64 44
6 Berserk 9 4cpu : 3532.7 33.4 26.5 64 41
7 Ethereal 13.75 4cpu : 3521.3 34.7 25.5 64 40
8 SlowChess 2.9 4cpu : 3515.6 36.2 25.0 64 39
9 Koivisto 8.0 4cpu : 3488.8 36.6 29.0 82 35
10 RubiChess 20220223 4cpu : 3455.6 35.3 36.0 116 31
11 RofChade 3.0 4cpu : 3455.3 47.1 20.0 64 31
12 Igel 3.1.0 4cpu : 3455.3 45.7 20.0 64 31
13 Arasan 23.4 4cpu : 3453.7 38.2 30.0 96 31
14 Seer 2.5.0 4cpu : 3435.8 49.9 18.5 64 29
15 Houdini 6 4cpu : 3386.0 58.2 15.0 64 23
16 Tucano 10.00 4cpu : 3345.3 64.0 12.5 64 20
White advantage = 6.15 +/- 5.31
Draw rate (equal opponents) = 97.16 % +/- 1.60

As well, the head-to-head listings are included below.
There was a recent TCEC bonus event in which the Top 3 played the starting position against a gauntlet of 40 weaker engines. To keep the tension high, the encounters were played more or less in ascending order. One observation is that it takes a rating of about 3450 (Igel 3.1.4) to lay claim to having "conquered" the standard opening position.
A second observation was that Komodo showed itself more effective at beating up on weaker opponents than either Stockfish or LCZero. But as the going got tougher, Stockfish reasserted its dominance. This pattern is also expressed in the CCRL match-ups, and would seem to be part of the explanation for why the overall separation is small: the sum total is a balance between long-distance performance, somewhat favoring Komodo, and up-close results, somewhat favoring Stockfish.
Estimating ratings of the top floors of the skyscraper is a hazardous endeavor. Not until an engine is surrounded by close cohorts above and below do the ratings lock in.
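An aside for readers checking the numbers (not part of the original post): the Perf column in these listings is the standard logistic Elo-for-score, Perf = -400 * log10(1/p - 1), where p is the score fraction with draws counted as half points. Checking the SlowChess line from the first listing below:

Code: Select all
# Performance Elo from a W/D/L record: 23 wins, 42 draws, 0 losses in
# 65 games (SlowChess 2.9 vs. Stockfish 15, from the listing below).
awk 'BEGIN { p = (23 + 42/2) / 65;
             printf "p = %.3f  perf = %+.1f\n", p, -400*log(1/p-1)/log(10) }'
# -> p = 0.677  perf = +128.5, matching the Perc and Perf columns.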
Code: Select all
1) Stockfish 15 1cpu 3621.2 : 932 (+331,=599,-2) 67.7%
vs. : games ( +, =, -) Draw Perc Perf : Diff SD LOS
SugaR AI 2.50 1cpu : 52 ( 1, 51, 0) 98.1 51.0 +6.7 : +1.8 9.3 57.9
KomodoDragon 3.1 1cpu : 56 ( 5, 51, 0) 91.1 54.5 +31.1 : +3.7 8.3 67.3
Fat Fritz 2 1cpu : 60 ( 5, 55, 0) 91.7 54.2 +29.0 : +26.0 10.4 99.4
Ethereal 13.75 1cpu : 56 ( 14, 42, 0) 75.0 62.5 +88.7 : +78.1 15.0 100.0
Revenge 3.0 1cpu : 56 ( 13, 43, 0) 76.8 61.6 +82.2 : +88.0 15.6 100.0
SlowChess 2.9 1cpu : 65 ( 23, 42, 0) 64.6 67.7 +128.5 : +118.8 17.3 100.0
Koivisto 8.0 1cpu : 65 ( 22, 42, 1) 64.6 66.2 +116.4 : +135.3 17.7 100.0
Berserk 9 1cpu : 60 ( 28, 32, 0) 53.3 73.3 +175.7 : +145.3 19.3 100.0
RubiChess 20220223 1cpu : 65 ( 27, 37, 1) 56.9 70.0 +147.2 : +152.3 18.8 100.0
RofChade 3.0 1cpu : 119 ( 54, 65, 0) 54.6 72.7 +170.1 : +171.7 16.4 100.0
Seer 2.5.0 1cpu : 66 ( 29, 37, 0) 56.1 72.0 +163.8 : +175.6 19.4 100.0
Arasan 23.4 1cpu : 48 ( 21, 27, 0) 56.2 71.9 +163.0 : +188.9 20.1 100.0
Minic 3.24 1cpu : 56 ( 26, 30, 0) 53.6 73.2 +174.7 : +198.7 22.8 100.0
Rebel 15.1 1cpu : 56 ( 28, 28, 0) 50.0 75.0 +190.8 : +207.6 23.0 100.0
Houdini 6 1cpu : 52 ( 35, 17, 0) 32.7 83.7 +283.6 : +294.2 30.6 100.0
3) KomodoDragon 3.1 1cpu 3617.5 : 870 (+306,=556,-8) 67.1%
vs. : games ( +, =, -) Draw Perc Perf : Diff SD LOS
Stockfish 15 1cpu : 56 ( 0, 51, 5) 91.1 45.5 -31.1 : -3.7 8.3 32.7
SugaR AI 2.50 1cpu : 52 ( 0, 51, 1) 98.1 49.0 -6.7 : -1.8 9.4 42.2
Fat Fritz 2 1cpu : 56 ( 5, 49, 2) 87.5 52.7 +18.6 : +22.3 9.9 98.8
Ethereal 13.75 1cpu : 56 ( 10, 46, 0) 82.1 58.9 +62.7 : +74.4 15.4 100.0
Revenge 3.0 1cpu : 56 ( 14, 42, 0) 75.0 62.5 +88.7 : +84.3 15.5 100.0
SlowChess 2.9 1cpu : 56 ( 16, 40, 0) 71.4 64.3 +102.1 : +115.1 17.6 100.0
Koivisto 8.0 1cpu : 56 ( 23, 33, 0) 58.9 70.5 +151.6 : +131.6 17.5 100.0
Berserk 9 1cpu : 56 ( 17, 39, 0) 69.6 65.2 +108.9 : +141.6 19.7 100.0
RubiChess 20220223 1cpu : 56 ( 23, 33, 0) 58.9 70.5 +151.6 : +148.6 18.7 100.0
RofChade 3.0 1cpu : 56 ( 25, 31, 0) 55.4 72.3 +166.8 : +168.0 16.9 100.0
Seer 2.5.0 1cpu : 56 ( 27, 29, 0) 51.8 74.1 +182.7 : +171.9 19.6 100.0
Arasan 23.4 1cpu : 102 ( 52, 50, 0) 49.0 75.5 +195.4 : +185.2 19.2 100.0
Minic 3.24 1cpu : 52 ( 29, 23, 0) 44.2 77.9 +218.7 : +195.0 22.7 100.0
Rebel 15.1 1cpu : 52 ( 29, 23, 0) 44.2 77.9 +218.7 : +203.9 23.0 100.0
Houdini 6 1cpu : 52 ( 36, 16, 0) 30.8 84.6 +296.1 : +290.5 30.9 100.0
Code: Select all
1) Stockfish 15 4cpu 3598.2 : 550 (+165,=384,-1) 64.9%
vs. : games ( +, =, -) Draw Perc Perf : Diff SD LOS
KomodoDragon 3.1 4cpu : 32 ( 2, 30, 0) 93.8 53.1 +21.7 : +9.3 9.7 83.2
SugaR AI 2.50 4cpu : 32 ( 1, 31, 0) 96.9 51.6 +10.9 : +10.2 9.9 84.9
Fat Fritz 2 4cpu : 32 ( 2, 30, 0) 93.8 53.1 +21.7 : +21.1 10.9 97.3
Revenge 3.0 4cpu : 32 ( 4, 28, 0) 87.5 56.2 +43.7 : +48.7 16.2 99.9
Berserk 9 4cpu : 32 ( 5, 26, 1) 81.2 56.2 +43.7 : +65.5 18.1 100.0
Ethereal 13.75 4cpu : 32 ( 8, 24, 0) 75.0 62.5 +88.7 : +76.9 18.8 100.0
SlowChess 2.9 4cpu : 32 ( 7, 25, 0) 78.1 60.9 +77.2 : +82.6 19.6 100.0
Koivisto 8.0 4cpu : 50 ( 16, 34, 0) 68.0 66.0 +115.2 : +109.4 19.8 100.0
RubiChess 20220223 4cpu : 84 ( 31, 53, 0) 63.1 68.5 +134.6 : +142.6 18.7 100.0
Igel 3.1.0 4cpu : 32 ( 11, 21, 0) 65.6 67.2 +124.5 : +142.9 24.5 100.0
RofChade 3.0 4cpu : 32 ( 16, 16, 0) 50.0 75.0 +190.8 : +142.9 25.5 100.0
Arasan 23.4 4cpu : 32 ( 14, 18, 0) 56.2 71.9 +163.0 : +144.5 20.9 100.0
Seer 2.5.0 4cpu : 32 ( 13, 19, 0) 59.4 70.3 +149.8 : +162.5 26.8 100.0
Houdini 6 4cpu : 32 ( 17, 15, 0) 46.9 76.6 +205.6 : +212.2 32.0 100.0
Tucano 10.00 4cpu : 32 ( 18, 14, 0) 43.8 78.1 +221.1 : +252.9 34.8 100.0
2) KomodoDragon 3.1 4cpu 3588.9 : 512 (+140,=370,-2) 63.5%
vs. : games ( +, =, -) Draw Perc Perf : Diff SD LOS
Stockfish 15 4cpu : 32 ( 0, 30, 2) 93.8 46.9 -21.7 : -9.3 9.7 16.8
SugaR AI 2.50 4cpu : 32 ( 0, 32, 0) 100.0 50.0 +0.0 : +0.8 9.6 53.4
Fat Fritz 2 4cpu : 32 ( 1, 31, 0) 96.9 51.6 +10.9 : +11.8 10.8 86.1
Revenge 3.0 4cpu : 32 ( 4, 28, 0) 87.5 56.2 +43.7 : +39.4 16.1 99.3
Berserk 9 4cpu : 32 ( 7, 25, 0) 78.1 60.9 +77.2 : +56.2 18.1 99.9
Ethereal 13.75 4cpu : 32 ( 5, 27, 0) 84.4 57.8 +54.7 : +67.6 19.1 100.0
SlowChess 2.9 4cpu : 32 ( 7, 25, 0) 78.1 60.9 +77.2 : +73.3 19.7 100.0
Koivisto 8.0 4cpu : 32 ( 8, 24, 0) 75.0 62.5 +88.7 : +100.0 20.2 100.0
RubiChess 20220223 4cpu : 32 ( 13, 19, 0) 59.4 70.3 +149.8 : +133.2 19.3 100.0
Igel 3.1.0 4cpu : 32 ( 13, 19, 0) 59.4 70.3 +149.8 : +133.6 24.7 100.0
RofChade 3.0 4cpu : 32 ( 8, 24, 0) 75.0 62.5 +88.7 : +133.6 25.5 100.0
Arasan 23.4 4cpu : 64 ( 22, 42, 0) 65.6 67.2 +124.5 : +135.1 20.1 100.0
Seer 2.5.0 4cpu : 32 ( 14, 18, 0) 56.2 71.9 +163.0 : +153.1 27.1 100.0
Houdini 6 4cpu : 32 ( 17, 15, 0) 46.9 76.6 +205.6 : +202.9 31.9 100.0
Tucano 10.00 4cpu : 32 ( 21, 11, 0) 34.4 82.8 +273.2 : +243.6 35.0 100.0
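For reproducibility: the single-point anchoring described above maps onto ordo's -a/-A options. A sketch of the sort of invocation involved (the PGN file name is hypothetical; the anchor value matches the 1cpu list above):

Code: Select all
# Rate the filtered 1cpu pool, pinning Houdini 6 to 3327 as the single
# anchor; -W estimates the white advantage, -s runs error simulations.
ordo -p ccrl-subset-1cpu.pgn \
     -a 3327 -A "Houdini 6 1cpu" \
     -W -s 1000 \
     -o ratings-1cpu.txt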
-
jkominek
- Posts: 98
- Joined: Tue Sep 04, 2018 5:33 am
- Full name: John Kominek
Re: The Stockfish ELO problem
Modern Times wrote: ↑Sat Sep 03, 2022 1:25 am
That is based on a Pentium 90. His actual time control is 40 moves/125” or 40/130”, so 40 moves in just over two minutes:
Time for each match has been fixed to 40 moves/120 minutes repeated, calibrated on a Pentium 90 processing power. The processing power has been emulated, after estimation by using benchmarks with real P90 results. Accordingly, on modern PC the effective match time was fixed to 40 moves/125" or 40/130" depending on the PC

That is true. I read that part but failed to mention it here. Thank you for pointing it out.
There is not an abundance of openly available data that I am aware of for calibrating against human performance, or at least data gathered under well-controlled circumstances, with players not tempted to hit a quick "I resign" button to start a new game. I recall that the SSDF in the early years put effort into calibrating their list. I recently went searching for their notes on human calibration experiments but could not find them. My memory says it was based on Swedish club players in the 1500-2200 range going up against, for the most part, dedicated boards.