Experimental testruns of Stockfish / Torch

Moderator: Ras

Posts: 2663
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Experimental testruns of Stockfish / Torch

Post by pohl4711 »

I made a re-test run of Stockfish 16. The goal was to measure whether Stockfish scores measurably weaker against weaker opponents than Torch 2 does. That was my prediction, because of the much higher EAS score of Stockfish 16 compared to Torch 2: if you play very aggressively, you will lose a few more points against weaker opponents (for example, when a risky sacrifice goes wrong).

Here are the test results:

Celo-Gap Stockfish 16/Torch 2:

strongest 5 opponents list : 32 Celo
Full 15 opponents list : 20 Celo
weakest 5 opponents list(s): 9 Celo

The effect I mentioned, that Stockfish scores weaker against weaker opponents than Torch does, can be seen very clearly here. That means I underestimated the effect when looking at my full UHO ratinglist, where Torch 2 is 3 Elo ahead of Stockfish 16. In this full UHO ratinglist, Stockfish has played 40000 games and Torch 2 only 24000, so Stockfish played against far more weak engines than Torch 2 did. As a result, the rating of SF 16 in my full ratinglist is lower than in my experiments below.

And we learn that a ratinglist which is not a RoundRobin tournament (where all engines have the same opponents) is very susceptible to distortions (more bad news for CCRL/CEGT), especially when engines with a high EAS score participate. Luckily, my UHO-Top15 ratinglist is a RoundRobin, but my full UHO ratinglist, where all played games and engines are collected, can be affected by this effect, too (see above).
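To make the distortion concrete, here is a rough Python sketch. It treats an engine's pooled rating as a games-weighted average of its Elo against strong and against weak opponents, which is only a crude linear approximation and not how a rating tool actually computes a list. The Elo values are taken from the strongest-5 and weakest-5 lists in this post; the opponent-mix shares (50/50, 80/20, 20/80) and the function names are purely hypothetical.

```python
def pooled_rating(strong_perf, weak_perf, weak_share):
    """Games-weighted average of two ratings (crude linear approximation)."""
    if not 0.0 <= weak_share <= 1.0:
        raise ValueError("weak_share must be between 0 and 1")
    return (1.0 - weak_share) * strong_perf + weak_share * weak_perf


# Elo values from the strongest-5 and weakest-5 lists in this post.
SF16 = {"strong": 3813, "weak": 3783}
TORCH2 = {"strong": 3781, "weak": 3774}


def gap(sf_weak_share, torch_weak_share):
    """Stockfish 16 minus Torch 2, given each engine's share of weak-opponent games."""
    sf = pooled_rating(SF16["strong"], SF16["weak"], sf_weak_share)
    torch = pooled_rating(TORCH2["strong"], TORCH2["weak"], torch_weak_share)
    return sf - torch


# Hypothetical mix 1: both engines meet the same opponents (RoundRobin-like).
print(round(gap(0.5, 0.5), 1))
# Hypothetical mix 2: Stockfish plays mostly weak engines, Torch mostly strong ones.
print(round(gap(0.8, 0.2), 1))
```

With identical opponent mixes the gap stays near the RoundRobin value; once Stockfish gets the larger share of weak opponents, the gap shrinks, which is the direction of the effect described above.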

     Program                    Elo    +    -  Games    Score   Av.Op. Draws

   1 Stockfish 16.1 240224    : 3833    4    4 15000    71.0%   3670   47.9%
   2 Stockfish 16 230630      : 3821    4    4 15000    69.5%   3670   48.0%
   3 Torch 2 popavx2          : 3801    4    4 15000    67.0%   3672   48.1%
   4 Berserk 13 avx2          : 3747    4    4 15000    59.7%   3675   48.9%
   5 KomodoDragon 3.3 avx2    : 3735    4    4 15000    58.0%   3676   49.7%
   6 Ethereal 14.38 avx2      : 3699    4    4 15000    52.9%   3678   49.1%
   7 Obsidian 12.0 avx2       : 3693    4    4 15000    52.0%   3679   50.7%
   8 Caissa 1.18 avx2         : 3675    4    4 15000    49.4%   3680   49.0%
   9 RubiChess 240112 avx2    : 3652    4    4 15000    46.2%   3682   48.7%
  10 PlentyChess 1.0 avx2     : 3630    4    4 15000    43.0%   3683   50.1%
  11 Alexandria 6.1.0 avx2    : 3609    4    4 15000    40.0%   3684   49.5%
  12 Seer 2.8.0 avx2          : 3605    4    4 15000    39.4%   3685   48.9%
  13 CSTal 2.0 avx2           : 3597    4    4 15000    38.4%   3685   51.1%
  14 Uralochka 3.41a avx2     : 3596    4    4 15000    38.2%   3685   48.3%
  15 Rebel 16.3 avx2          : 3595    4    4 15000    38.0%   3685   49.9%
  16 Titan 1.0 avx2           : 3590    4    4 15000    37.3%   3686   49.7%

Games        : 120000 (finished)

White Wins   : 58277 (48.6 %)
Black Wins   : 2651 (2.2 %)
Draws        : 59072 (49.2 %)
5 strongest engines (opponents):

     Program                    Elo    +    -  Games    Score   Av.Op. Draws

   1 Stockfish 16.1 240224    : 3833    6    6  5000    63.7%   3732   49.9%
   2 Stockfish 16 230630      : 3813    6    6  5000    60.5%   3736   49.9%
   3 Torch 2 popavx2          : 3781    6    6  5000    55.4%   3742   49.9%
   4 Berserk 13 avx2          : 3707    6    6  5000    43.2%   3757   50.7%
   5 KomodoDragon 3.3 avx2    : 3698    6    6  5000    41.7%   3759   50.8%
   6 Ethereal 14.38 avx2      : 3659    6    6  5000    35.4%   3767   49.9%

Games        : 15000 (finished)

White Wins   : 7322 (48.8 %)
Black Wins   : 148 (1.0 %)
Draws        : 7530 (50.2 %)
5 weakest engines (opponents for SF / Torch 2):
Stockfish 16:

     Program                   Elo    +    -  Games    Score   Av.Op. Draws

   1 Stockfish 16 230630     : 3783    6    6  5000    75.0%   3590   47.4%
   2 Seer 2.8.0 avx2         : 3600    6    6  5000    46.6%   3627   49.0%
   3 CSTal 2.0 avx2          : 3594    6    6  5000    45.5%   3628   52.6%
   4 Uralochka 3.41a avx2    : 3589    6    6  5000    44.8%   3629   49.4%
   5 Rebel 16.3 avx2         : 3587    6    6  5000    44.5%   3629   50.6%
   6 Titan 1.0 avx2          : 3582    6    6  5000    43.6%   3630   50.7%

Games        : 15000 (finished)

White Wins   : 7216 (48.1 %)
Black Wins   : 290 (1.9 %)
Draws        : 7494 (50.0 %)
Torch 2:

     Program                   Elo    +    -  Games    Score   Av.Op. Draws

   1 Torch 2 popavx2         : 3774    7    7  5000    74.1%   3590   46.9%
   2 Seer 2.8.0 avx2         : 3600    6    6  5000    46.8%   3625   48.6%
   3 CSTal 2.0 avx2          : 3594    6    6  5000    45.8%   3626   52.5%
   4 Uralochka 3.41a avx2    : 3589    6    6  5000    44.9%   3628   49.5%
   5 Rebel 16.3 avx2         : 3587    6    6  5000    44.6%   3628   50.4%
   6 Titan 1.0 avx2          : 3582    6    6  5000    43.9%   3629   50.8%

Games        : 15000 (finished)

White Wins   : 7254 (48.4 %)
Black Wins   : 279 (1.9 %)
Draws        : 7467 (49.8 %)
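
As a cross-check of the weakest-5 gap, the Score and Av.Op. columns can be turned into a rough performance rating with the standard logistic Elo formula. This is a minimal sketch, assuming that formula and a single average opponent; the rating tool behind the lists may work differently, and the scores are rounded to 0.1%, so the result can be off by about 1 Elo. The function names are made up for illustration.

```python
import math


def elo_diff_from_score(score):
    """Elo difference implied by a score fraction under the logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


def performance(avg_opponent, score):
    """Approximate performance rating against a single average opponent."""
    return avg_opponent + elo_diff_from_score(score)


# Score and Av.Op. from the two weakest-5 lists above.
sf16 = performance(3590, 0.750)
torch2 = performance(3590, 0.741)
print(round(sf16), round(torch2), round(sf16 - torch2, 1))
```

The approximation lands within a couple of Elo of the listed 3783 and 3774, and the gap comes out at roughly 8, close to the 9 given in the weakest-5 lists.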
Posts: 901
Joined: Thu Aug 11, 2022 11:30 pm
Full name: Esmeralda Pinto

Re: Experimental testruns of Stockfish / Torch

Post by chessica »

Great, nobody has ever seen this engine, there is no download link. What can I do with this posting? Nothing.
Dann Corbit
Posts: 12743
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Experimental testruns of Stockfish / Torch

Post by Dann Corbit »

Personally, I think it is very exciting that there is a new engine that can challenge Stockfish and LC0.
I guess they are not giving it away because it will become a commercial engine.
I am fine with that, it is like Komodo and other interesting developments.
Competing against the Stockfish and LC0 teams is a truly daunting task.
I salute them for trying.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Posts: 5685
Joined: Wed Sep 05, 2018 2:16 am
Location: Moving
Full name: Jorge Picado

Re: Experimental testruns of Stockfish / Torch

Post by Chessqueen »

Dann Corbit wrote: Tue May 28, 2024 10:24 am
Personally, I think it is very exciting that there is a new engine that can challenge Stockfish and LC0.
I believe that they are NOT giving it away so nobody can confirm their claim of being stronger than Stockfish :roll: :mrgreen:
Posts: 901
Joined: Thu Aug 11, 2022 11:30 pm
Full name: Esmeralda Pinto

Re: Experimental testruns of Stockfish / Torch

Post by chessica »

Among other things, you write at this link:

https://forum.computerschach.de/cgi-bin ... #pid169921


Fortunately, my rating list is not affected by these problems, because my games are played without resignation or draw adjudication by the GUI. Only when 5 pieces remain on the board is the game adjudicated by the tablebases and ended. Very early draws are not a problem for me either, because I test with UHO openings: since White always stands measurably better at the end of the UHO opening line, no engine playing White will go for an early threefold repetition or perpetual check (one of the many advantages of UHO openings...), unless the engine playing White throws away the opening advantage within its first self-calculated moves, and that happens only very rarely.

That is a big sticking point: the engines should all play their games to the end themselves, ideally without EGTBs, and should be used exactly as they are made available. Your lists are not clear.

And Elo evaluations require not only a lot of games but also a lot of engines. Paradoxically, you are doing the opposite.
Dann Corbit
Posts: 12743
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Experimental testruns of Stockfish / Torch

Post by Dann Corbit »

Chessqueen wrote: Wed May 29, 2024 3:05 am
Dann Corbit wrote: Tue May 28, 2024 10:24 am
Personally, I think it is very exciting that there is a new engine that can challenge Stockfish and LC0.
I believe that they are NOT giving it away so nobody can confirm their claim of being stronger than Stockfish :roll: :mrgreen:
Same problem with every private engine. For example Ferret, by Bruce Moreland.
Considering who is working on it, I guess that it is very, very strong.
But I agree that we cannot know the real strength based on a private test.
Albert Silver
Posts: 3026
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: Experimental testruns of Stockfish / Torch

Post by Albert Silver »

Dann Corbit wrote: Wed May 29, 2024 3:04 pm
Chessqueen wrote: Wed May 29, 2024 3:05 am
Dann Corbit wrote: Tue May 28, 2024 10:24 am
Personally, I think it is very exciting that there is a new engine that can challenge Stockfish and LC0.
I believe that they are NOT giving it away so nobody can confirm their claim of being stronger than Stockfish :roll: :mrgreen:
Same problem with every private engine. For example Ferret, by Bruce Moreland.
Considering who is working on it, I guess that it is very, very strong.
But I agree that we cannot know the real strength based on a private test.
True, but I actually tested Ferret, though not using the current methods set out by Vas. It was about Fritz 5 in strength, a bit behind Fritz 6 when I tested (per his own words). To simplify, it was, much like Torch, about head to head with the known no.1 Fritz 5, which is no small feat since Fritz 32-bit was far more ahead of the field than SF is now. But, again, not using the standards of testing we conduct nowadays. This was in the days of the SSDF.
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
Robert Flesher
Posts: 1284
Joined: Tue Aug 18, 2009 3:06 am

Re: Experimental testruns of Stockfish / Torch

Post by Robert Flesher »

Dann Corbit wrote: Tue May 28, 2024 10:24 am Personally, I think it is very exciting that there is a new engine that can challenge Stockfish and LC0.
I guess they are not giving it away because it will become a commercial engine.
I am fine with that, it is like Komodo and other interesting developments.
Competing against the Stockfish and LC0 teams is a truly daunting task.
I salute them for trying.
Hell, I'll buy it right now. Just send me the link :)
Posts: 2663
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Re: Experimental testruns of Stockfish / Torch

Post by pohl4711 »

pohl4711 wrote: Mon May 27, 2024 10:53 am I made a re-test run of Stockfish 16. The goal was to measure whether Stockfish scores measurably weaker against weaker opponents than Torch 2 does. That was my prediction, because of the much higher EAS score of Stockfish 16 compared to Torch 2: if you play very aggressively, you will lose a few more points against weaker opponents (for example, when a risky sacrifice goes wrong).

Here are the test results:

Celo-Gap Stockfish 16/Torch 2:

strongest 5 opponents list : 32 Celo
Full 15 opponents list : 20 Celo
weakest 5 opponents list(s): 9 Celo

Andrew Grant suggested rescoring all lists with my Gamepair-rescoring tool and looking at the Celo gaps. I did this, and the result is quite surprising:

Strongest 5 Ratinglist: 75 Gamepair-Celo
Full 15 opponents Ratinglist : 79 Gamepair-Celo
Weakest 5 opponents Ratinglist: 73 Gamepair-Celo

I was always convinced that Gamepair rescoring is the best way to evaluate UHO engine games, and now I am more convinced than ever. There is no Celo compression or distortion here: all 3 lists give nearly the same result (with some random error). Wow!
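
For readers who have not seen the idea, here is a minimal Python sketch of gamepair rescoring. It assumes the rescoring counts each pair of games (same opening, colors reversed) as one result: more than 1 point out of 2 is a pair win, exactly 1 point is a pair draw, less than 1 point is a pair loss. That rule, the function names, and the sample results are assumptions for illustration and are not taken from the actual Gamepair-rescoring tool.

```python
from collections import Counter


def rescore_pairs(pairs):
    """Turn (game1, game2) scores, seen from one engine, into pair results."""
    outcome = Counter()
    for first, second in pairs:
        total = first + second
        if total > 1.0:
            outcome["win"] += 1
        elif total < 1.0:
            outcome["loss"] += 1
        else:
            outcome["draw"] += 1
    return outcome


def pair_score(outcome):
    """Score fraction over pairs: a pair win counts 1, a pair draw 0.5."""
    n = sum(outcome.values())
    if n == 0:
        raise ValueError("no game pairs given")
    return (outcome["win"] + 0.5 * outcome["draw"]) / n


# Hypothetical results of five game pairs (1 = win, 0.5 = draw, 0 = loss).
pairs = [(1, 0.5), (1, 0), (0.5, 0.5), (1, 0.5), (0.5, 0)]
outcome = rescore_pairs(pairs)
raw_score = sum(a + b for a, b in pairs) / (2 * len(pairs))
print(dict(outcome), pair_score(outcome), raw_score)
```

In this made-up sample the raw score is 55% while the pair score is 60%. Trading a White win for a Black loss in the same opening counts as a pair draw, so the built-in White advantage of UHO openings cancels out within each pair.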
Posts: 901
Joined: Thu Aug 11, 2022 11:30 pm
Full name: Esmeralda Pinto

Re: Experimental testruns of Stockfish / Torch

Post by chessica »

pohl4711 wrote: Thu May 30, 2024 8:27 am
pohl4711 wrote: Mon May 27, 2024 10:53 am I made a re-test run of Stockfish 16. The goal was to measure whether Stockfish scores measurably weaker against weaker opponents than Torch 2 does. That was my prediction, because of the much higher EAS score of Stockfish 16 compared to Torch 2: if you play very aggressively, you will lose a few more points against weaker opponents (for example, when a risky sacrifice goes wrong).

Here are the test results:

Celo-Gap Stockfish 16/Torch 2:

strongest 5 opponents list : 32 Celo
Full 15 opponents list : 20 Celo
weakest 5 opponents list(s): 9 Celo

Andrew Grant suggested rescoring all lists with my Gamepair-rescoring tool and looking at the Celo gaps. I did this, and the result is quite surprising:

Strongest 5 Ratinglist: 75 Gamepair-Celo
Full 15 opponents Ratinglist : 79 Gamepair-Celo
Weakest 5 opponents Ratinglist: 73 Gamepair-Celo

I was always convinced that Gamepair rescoring is the best way to evaluate UHO engine games, and now I am more convinced than ever. There is no Celo compression or distortion here: all 3 lists give nearly the same result (with some random error). Wow!
Aha? Finally got it? I like it much better now.