Stockfish 2.3.1 weaker than 2.2.2?

Joerg Oster
Posts: 691
Joined: Fri Mar 10, 2006 3:29 pm
Location: Germany

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by Joerg Oster » Tue Oct 02, 2012 9:18 pm

Marco,

both gauntlets are finished. Time control was 90+0.5 sec, 512 MB Hash for each engine, no EGTBs. (500 start positions randomly taken with the tool by Volker Annuss from Frank Quisinsky's SWCR Opening Database 4.1)
Here are the outputs of cutechess-cli.

Stockfish 2.3.1 default:

Code: Select all

Rank Name                        ELO   Games   Score   Draws
   1 Stockfish                   138     500     69%     35%
   2 Critter1.6                   28     100     54%     50%
   3 Komodo3                     -28     100     46%     52%
   4 Hiarcs13.2                 -168     100     28%     31%
   5 Rybka4.1                   -241     100     20%     28%
   6 Gaviota0.84                -424     100      8%     14%
Finished match

*
Number of Draws by stalemate:              6
Number of Draws by insufficient material: 31
Number of Draws by 50 moves rule:         46
Number of Draws by 3-fold repetition:     92
Stockfish 2.3.1 50MR:

Code: Select all

Rank Name                        ELO   Games   Score   Draws
   1 Stockfish-50MR              144     500     70%     29%
   2 Critter1.6                   85     100     62%     48%
   3 Komodo3                     -17     100     48%     43%
   4 Hiarcs13.2                 -191     100     25%     30%
   5 Rybka4.1                   -301     100     15%     18%
   6 Gaviota0.84                -636     100      2%      5%
Finished match

*
Number of Draws by stalemate:              3
Number of Draws by insufficient material: 29
Number of Draws by 50 moves rule:         26
Number of Draws by 3-fold repetition:     86
* Draw statistics added by me
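
As a quick consistency check, the per-reason draw counts above can be summed and compared with the "Draws" column of the two tables. This is only a small sketch: the function name and dictionary keys are made up for illustration, and the numbers are copied from the cutechess-cli output above (500 games per gauntlet).

```python
def draw_rate(draws_by_reason, games):
    """Return (total draws, draw rate) for one gauntlet."""
    total = sum(draws_by_reason.values())
    return total, total / games


# Counts copied from the two result listings above.
default_draws = {
    "stalemate": 6,
    "insufficient material": 31,
    "50-move rule": 46,
    "3-fold repetition": 92,
}
modified_draws = {
    "stalemate": 3,
    "insufficient material": 29,
    "50-move rule": 26,
    "3-fold repetition": 86,
}

for name, draws in (("default", default_draws), ("50MR", modified_draws)):
    total, rate = draw_rate(draws, games=500)
    share_50 = draws["50-move rule"] / total
    print(f"{name}: {total} draws, rate {rate:.1%}, 50-move share {share_50:.1%}")
```

The totals come out at 175 and 144 draws, which matches the 35% and 29% draw rates reported for Stockfish and Stockfish-50MR.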

Please note the lower draw rate for the modified version, especially for draws by the 50-move rule.
Overall, it looks like a small win, though it seems to work better against weaker engines. Cutechess gives +6, Ordo gives +7.
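
For readers who want to relate the score percentages to the Elo column, here is a minimal sketch using the standard logistic Elo formula. It is an assumption that cutechess-cli derives its numbers this way, and because the percentages in the tables are rounded, the result only approximates the printed Elo values.

```python
import math


def elo_from_score(score):
    """Elo difference implied by a score fraction (0 < score < 1),
    using the standard logistic model: score = 1 / (1 + 10**(-elo/400))."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


# Rounded scores from the two gauntlets above.
for name, score in (("Stockfish", 0.69), ("Stockfish-50MR", 0.70)):
    print(f"{name}: {elo_from_score(score):+.0f} Elo")
```

The two gauntlet scores differ by only about one percentage point, so the resulting Elo gap is small compared with the error margin one would expect from 500 games.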

All games available on demand. Hope this test was helpful. :D

Joerg
Jörg Oster

gladius
Posts: 538
Joined: Tue Dec 12, 2006 9:10 am

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by gladius » Tue Oct 02, 2012 10:07 pm

Thanks for running the test Joerg! The change looks quite promising. The results against Rybka 4.1 seem a bit off though. Can you post a game or two from that series?

lucasart
Posts: 3041
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by lucasart » Tue Oct 02, 2012 11:16 pm

Joerg Oster wrote:
[...]
Overall, it looks like a small win, though it seems to work better against weaker engines. Cutechess gives +6, Ordo gives +7.
If you have kept the PGN, you can throw that into BayesElo and look at the LOS. Comparing the 2 'elos' given by cutechess-cli in this way can be misleading. Better to mix all your PGNs (even the head-to-head match you did before) and look at a LOS matrix from there.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

Joerg Oster
Posts: 691
Joined: Fri Mar 10, 2006 3:29 pm
Location: Germany

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by Joerg Oster » Wed Oct 03, 2012 9:15 am

Hi Gary,

here are 2 games of Rybka, first a win:
  • [Event "?"]
    [Site "?"]
    [Date "2012.10.01"]
    [Round "8"]
    [White "Rybka4.1"]
    [Black "Stockfish"]
    [Result "1-0"]
    [PlyCount "123"]
    [TimeControl "90+0.5"]

    1. c4 {book} Nf6 {book} 2. Nc3 {book} e6 {book} 3. e4 {book} d5 {book}
    4. e5 {book} Ne4 {book} 5. Nf3 {book} Be7 {book} 6. Qc2 {book} Ng5 {book}
    7. Nxg5 {book} Bxg5 {book} 8. cxd5 {book} exd5 {book} 9. d4 {+0.37/10 5.9s}
    Bxc1 {-0.84/17 2.9s} 10. Rxc1 {+0.45/11 2.8s} O-O {-0.72/17 2.8s}
    11. Bd3 {+0.45/11 5.7s} Nc6 {-0.56/18 8.6s} 12. O-O {+0.38/10 4.9s}
    Nxd4 {0.00/18 2.4s} 13. Bxh7+ {+0.30/10 1.8s} Kh8 {0.00/12 0.008s}
    14. Qd1 {+0.30/9 3.6s} Kxh7 {0.00/19 3.6s} 15. Qxd4 {+0.30/9 3.6s}
    c6 {0.00/18 3.0s} 16. Rfe1 {+0.28/8 1.6s} Rh8 {+0.08/16 3.1s}
    17. Ne2 {+0.18/9 3.8s} Kg8 {+0.12/16 3.5s} 18. Qd2 {+0.25/9 1.9s}
    Bd7 {+0.16/18 7.9s} 19. h3 {+0.28/10 2.6s} Qe7 {+0.16/17 4.1s}
    20. Nd4 {+0.34/9 3.3s} Re8 {+0.24/16 2.2s} 21. b4 {+0.31/9 6.3s}
    Qh4 {+0.12/16 3.3s} 22. Rc3 {+0.44/10 2.0s} Rh5 {+0.08/17 2.3s}
    23. f4 {+0.58/10 3.1s} Qe7 {+0.04/18 2.4s} 24. a3 {+0.62/10 4.4s}
    a5 {+0.08/16 4.3s} 25. bxa5 {+0.73/10 4.0s} c5 {-0.08/15 2.0s}
    26. Nb3 {+0.73/9 2.5s} c4 {0.00/16 2.8s} 27. Nd4 {+0.73/9 2.4s}
    Qc5 {-0.20/18 6.2s} 28. Rb1 {+0.63/10 3.9s} Bc6 {-0.44/19 6.0s}
    29. Rg3 {+0.83/9 4.1s} Kh8 {-0.40/16 5.2s} 30. Kh2 {+0.97/7 1.5s}
    Kg8 {-1.29/15 3.4s} 31. Qc3 {+1.33/7 0.89s} Rh7 {-0.96/14 2.0s}
    32. f5 {+1.35/8 1.4s} Rc8 {-2.02/15 3.1s} 33. e6 {+2.17/8 3.6s}
    Qd6 {-2.74/15 2.0s} 34. exf7+ {+2.43/8 3.4s} Kxf7 {-3.31/15 2.0s}
    35. Ne6 {+2.46/8 0.86s} Kg8 {-4.36/16 2.4s} 36. Nxg7 {+2.66/10 3.2s}
    Kf7 {-5.10/16 3.8s} 37. f6 {+2.64/10 1.2s} Qf4 {-5.61/16 1.6s}
    38. Rb2 {+2.64/10 2.2s} Rf8 {-5.41/15 2.8s} 39. Qe1 {+6.48/7 2.8s}
    Qd6 {-7.43/13 0.85s} 40. Re2 {+8.90/7 1.5s} Bd7 {-10.82/14 0.87s}
    41. Re7+ {+11.52/7 1.3s} Kxf6 {-14.66/16 0.61s} 42. Rxd7 {+11.52/8 1.4s}
    Qxd7 {-20.28/17 1.0s} 43. Qf2+ {+12.13/9 2.0s} Ke7 {-22.74/18 0.91s}
    44. Re3+ {+12.94/9 1.2s} Kd6 {-25.97/17 0.53s} 45. Qxf8+ {+12.94/9 1.2s}
    Kc6 {-92.85/17 1.7s} 46. Re6+ {+12.94/8 1.5s} Qxe6 {-35.95/19 0.85s}
    47. Nxe6 {+13.66/11 0.55s} Kd7 {-94.05/18 1.6s} 48. Nc5+ {+16.79/8 1.2s}
    Kc7 {-94.20/18 1.9s} 49. Qf5 {+16.79/8 1.1s} Kd6 {-98.53/14 0.56s}
    50. Qg6+ {+16.79/6 1.1s} Ke5 {-105.68/14 1.2s} 51. Qxh7 {+16.79/6 1.1s}
    d4 {-105.68/10 0.61s} 52. Qxb7 {+17.87/5 0.97s} d3 {-105.73/9 0.49s}
    53. Qe4+ {+27.37/6 1.6s} Kf6 {-M12/13 0.43s} 54. Qxc4 {+27.75/6 0.81s}
    d2 {-M16/9 0.58s} 55. Qd4+ {+298.41/7 0.62s} Ke7 {-M12/13 0.35s}
    56. Qd7+ {+M13/7 0.67s} Kf6 {-M12/15 0.45s} 57. Qxd2 {+M11/8 1.1s}
    Ke5 {-M10/19 0.45s} 58. a6 {+M9/7 0.99s} Kf6 {-M8/53 0.43s} 59. a7 {+M7/9 0.36s}
    Ke5 {-M6/100 0.044s} 60. a8=Q {+M5/15 0.50s} Kf6 {-M4/100 0.004s}
    61. Qf8+ {+M3/37 0.35s} Ke5 {-M2/100 0.001s}
    62. Qff4# {+M1/50 0.49s, White mates} 1-0
then a loss:
  • [Event "?"]
    [Site "?"]
    [Date "2012.10.01"]
    [Round "56"]
    [White "Rybka4.1"]
    [Black "Stockfish"]
    [Result "0-1"]
    [PlyCount "114"]
    [TimeControl "90+0.5"]

    1. e4 {book} c5 {book} 2. Nc3 {book} d6 {book} 3. f4 {book} g6 {book}
    4. Nf3 {book} Bg7 {book} 5. Bc4 {book} e6 {book} 6. d4 {book} cxd4 {book}
    7. Nxd4 {book} Nf6 {book} 8. Ndb5 {book} d5 {book} 9. exd5 {+0.16/10 6.4s}
    a6 {-0.24/17 3.2s} 10. dxe6 {+0.15/10 7.0s} Qxd1+ {0.00/12 0.001s}
    11. Kxd1 {+0.15/10 3.9s} axb5 {0.00/12 0.009s} 12. exf7+ {+0.01/10 5.0s}
    Ke7 {-0.24/18 4.0s} 13. Re1+ {0.00/9 3.7s} Kf8 {+0.48/17 1.7s}
    14. Nxb5 {-0.05/10 1.3s} Bg4+ {+0.48/12 0.001s} 15. Be2 {0.00/10 1.4s}
    Ra6 {+0.44/18 2.7s} 16. h3 {-0.10/10 2.9s} Bxe2+ {+0.48/20 3.9s}
    17. Kxe2 {-0.10/11 3.4s} Re6+ {+0.44/21 4.0s} 18. Be3 {+0.07/10 3.3s}
    Nd5 {+0.48/23 4.3s} 19. Kf3 {+0.10/11 1.2s} Nxe3 {+0.36/20 2.5s}
    20. Rxe3 {+0.10/11 3.2s} Rxe3+ {+0.48/20 3.5s} 21. Kxe3 {-0.05/12 3.0s}
    Kxf7 {+0.44/22 2.8s} 22. c3 {-0.05/12 6.3s} Rd8 {+0.60/21 2.6s}
    23. Re1 {-0.05/11 1.1s} Na6 {+0.72/19 5.4s} 24. Kf3 {-0.29/12 10s}
    Nc5 {+0.92/19 2.7s} 25. b3 {-0.29/11 2.4s} b6 {+0.92/19 2.4s}
    26. Re3 {-0.13/9 1.9s} Rd2 {+0.92/19 2.8s} 27. Re2 {-0.14/11 2.1s}
    Rd1 {+1.21/18 4.9s} 28. c4 {-0.30/10 2.4s} Ne6 {+1.61/18 2.4s}
    29. Ke4 {-0.24/10 2.2s} Bf8 {+1.45/19 2.9s} 30. g3 {-0.60/10 1.5s}
    Bc5 {+1.33/18 2.2s} 31. Rg2 {-0.74/10 1.8s} Ng7 {+2.30/19 2.3s}
    32. Nc3 {-0.66/9 3.5s} Rd4+ {+3.47/20 5.0s} 33. Ke5 {-1.12/10 2.0s}
    Nf5 {+5.33/20 1.7s} 34. Ne4 {-2.54/11 1.0s} Bf8 {+5.73/21 1.7s}
    35. Ng5+ {-2.60/11 1.0s} Ke8 {+5.45/21 3.4s} 36. Ne6 {-2.72/10 2.7s}
    Bd6+ {+5.73/21 1.5s} 37. Kf6 {-2.72/10 0.027s} Be7+ {+5.61/21 1.6s}
    38. Ke5 {-2.72/10 0.030s} Rd1 {+5.97/21 1.9s} 39. Re2 {-2.72/9 1.8s}
    Nxg3 {+6.18/21 1.3s} 40. Re3 {-3.58/9 1.2s} Nf5 {+6.42/21 1.9s}
    41. Re4 {-4.25/9 0.99s} Bb4 {+7.91/19 1.1s} 42. Kf6 {-4.25/10 1.9s}
    Bc3+ {+8.44/20 2.0s} 43. Re5 {-4.55/11 1.6s} Rd6 {+8.48/21 1.5s}
    44. b4 {-5.85/9 1.6s} Kd7 {+8.56/21 1.9s} 45. Kf7 {-5.85/9 1.5s}
    Nh6+ {+10.44/21 5.2s} 46. Kg7 {-6.99/10 2.5s} Bxe5+ {+12.00/17 1.1s}
    47. Kxh6 {-6.99/10 1.4s} Kxe6 {+10.66/12 0.006s} 48. fxe5 {-6.99/10 1.3s}
    Rd7 {+10.66/12 0.001s} 49. h4 {-7.12/8 0.61s} Kxe5 {+16.48/17 0.88s}
    50. c5 {-7.16/9 0.61s} Kf6 {+M15/20 2.0s} 51. h5 {-M14/12 1.3s}
    g5 {+M13/27 1.5s} 52. c6 {-M16/11 0.83s} Rc7 {+M11/43 1.2s}
    53. b5 {-M10/21 0.85s} g4 {+M9/88 4.1s} 54. a3 {-M8/20 2.1s} g3 {+M7/100 0.017s}
    55. a4 {0.026s} g2 {+M5/100 0.008s} 56. a5 {0.031s} g1=Q {+M3/100 0.008s}
    57. axb6 {-M2/42 0.45s} Qc1# {+M1/100 0.003s, Black mates} 0-1
Too bad, we don't have a game viewer in this forum.

And yes, Rybka's performance looks a bit suspicious. But the games don't look that odd to me. Rybka simply doesn't reach high depths. Maybe the compile is not optimal for the AMD Bulldozer CPU.

Joerg
Jörg Oster

Joerg Oster
Posts: 691
Joined: Fri Mar 10, 2006 3:29 pm
Location: Germany

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by Joerg Oster » Wed Oct 03, 2012 9:46 am

lucasart wrote:If you have kept the PGN, you can throw that into BayesElo and look at the LOS. Comparing the 2 'elos' given by cutechess-cli in this way can be misleading. Better to mix all your PGNs (even the head-to-head match you did before) and look at a LOS matrix from there.
First, bayeselo rating with the 2 gauntlets only:

Code: Select all

Rank Name             Elo    +    - games score oppo. draws 
   1 Critter1.6       165   33   33   200   58%   118   49% 
   2 Stockfish-50MR   124   25   24   500   70%   -47   29% 
   3 Stockfish        111   24   24   500   69%   -47   35% 
   4 Komodo3           98   33   33   200   47%   118   48% 
   5 Hiarcs13.2       -47   36   38   200   26%   118   31% 
   6 Rybka4.1        -128   40   43   200   18%   118   23% 
   7 Gaviota0.84     -323   57   68   200    5%   118   10% 
Then including the head-to-head match:

Code: Select all

Rank Name             Elo    +    - games score oppo. draws 
   1 Critter1.6       165   33   33   200   58%   118   49% 
   2 Stockfish-50MR   119   15   15  1000   60%    35   49% 
   3 Stockfish        117   15   15  1000   60%    36   52% 
   4 Komodo3           98   33   33   200   47%   118   48% 
   5 Hiarcs13.2       -47   36   38   200   26%   118   31% 
   6 Rybka4.1        -128   40   43   200   18%   118   23% 
   7 Gaviota0.84     -324   57   68   200    5%   118   10% 
I hope I did this right.
How do I get a LOS calculation?
Jörg Oster

lucasart
Posts: 3041
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by lucasart » Wed Oct 03, 2012 11:17 am

Joerg Oster wrote:
[...]
I hope I did this right.
How do I get a LOS calculation?
Yes, you did it right. This shows an improvement that is not statistically significant. What you want is the LOS matrix for engines #1 and #2 (Critter = #0):

Code: Select all

readpgn ./games.pgn
elo
mm
exactdist
ratings
los 1 2
You'd probably get a LOS around 60% or so, far below the usual 95% confidence level.
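
To get a feel for that number without running BayesElo, here is a rough back-of-the-envelope sketch. It assumes two things that are not confirmed in this thread: that the "+" and "-" columns are 95% confidence bounds, and that the two rating estimates are independent normals. The second assumption is certainly not exact here (the two Stockfish versions played each other, and BayesElo works from the full covariance), so treat the result as an order-of-magnitude check only. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def los_from_ratings(elo_a, err_a, elo_b, err_b, z95=1.96):
    """Approximate likelihood that A is stronger than B.

    err_a / err_b are taken as 95% half-widths (assumption), and the two
    estimates are treated as independent (assumption, see text above).
    """
    sigma_a = err_a / z95
    sigma_b = err_b / z95
    sigma_diff = math.hypot(sigma_a, sigma_b)
    return normal_cdf((elo_a - elo_b) / sigma_diff)


# Combined list above: Stockfish-50MR 119 +/-15, Stockfish 117 +/-15.
print(f"approx. LOS: {los_from_ratings(119, 15, 117, 15):.0%}")
```

With the ratings from the combined list, this lands in the high 50s percent, in the same region as the estimate above and nowhere near 95%.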

At least it suggests that it's not a regression. So you may want to validate the change if you think it improves the playing style, makes the code prettier, or for whatever other non-statistical consideration :D

Note that it's important to include all the information available, and therefore both the head-to-head match and the second match against other engines. The point is that the two experiments:
A: SF new beats SF old in the head to head match (under *predetermined* conditions)
B: SF new beats SF old in the match against other engines (under *predetermined* conditions)

can be thought of as random variables taking values in {true, false, undecided at 95% confidence}.

You first observed the value of A = undecided, so you then *decided* to go and observe B. So really you need to look at all the info from A and B, rather than B alone. Obviously A and B are quite correlated (if SF new beats SF old in a statistically significant way in experiment A, then it is likely that it will too in experiment B and so on).

Anyway, I can't quite put some mathematics behind it, but intuitively I would think it's best to load A and B into BayesElo and look at LOS from there.

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by mcostalba » Wed Oct 03, 2012 5:15 pm

Joerg Oster wrote:
All games available on demand. Hope this test was helpful. :D
Hi Joerg,

yes it was! Thanks a lot for testing.

This made me realize that I definitely need to set up a gauntlet-type test against different engines; sometimes self-play falls short.

I will set up a tournament and verify your results.

Thanks
Marco
