First, I played Crafty against 5 other opponents, including an older 21.7 version. The version I am testing here is not particularly good yet, representing some significant "removals" from the evaluation, so the results are not particularly interesting from that perspective. Each of the 5 opponents was played from 40 starting positions, 4 rounds per position, alternating colors, for 160 games per opponent and a total of 800 games per match. I am giving 4 consecutive match results, all against the same opponents, all played at a time control of 5+5 (5 minutes on the clock, 5 seconds increment added per move). Once in a while the PGN for the last game was corrupted on our big storage system (a different issue), so some of the matches show 799 rather than 800 games.
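Just to make the bookkeeping explicit, here is a minimal sketch of that match schedule. The loop structure and the play_game() placeholder are illustrative only, not the actual test driver:

Code:
#include <stdio.h>

/* Enumerate the match schedule described above: 5 opponents,
   40 starting positions, 4 rounds per position with colors
   alternating, giving 5 * 40 * 4 = 800 games per match. */
int main(void) {
  const int opponents = 5, positions = 40, rounds = 4;
  int games = 0;
  for (int opp = 0; opp < opponents; opp++)
    for (int pos = 0; pos < positions; pos++)
      for (int rnd = 0; rnd < rounds; rnd++) {
        int crafty_plays_white = (rnd % 2 == 0);  /* alternate colors each round */
        (void) crafty_plays_white;  /* play_game(opp, pos, ...) would go here */
        games++;
      }
  printf("%d games per match\n", games);
  return 0;
}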
I ran these 800-game matches through Remi's BayesElo. Look at the four sets of results, and imagine that in each of those tests crafty-22.2 was a slightly different version with a tweak or two added. Which of the four looks the best? Then realize that the programs are identical in all 4 matches. How would one reliably draw any conclusion from a match containing only 800 games, when the error bar is significant and the variability is even more significant? First the data:
Code:
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   121   42   41   160   68%   -18   17%
   2 Glaurung 1.1 SMP        61   42   41   160   60%   -18   13%
   3 Fruit 2.1               49   41   40   160   59%   -18   15%
   4 opponent-21.7           13   38   38   159   55%   -18   33%
   5 Crafty-22.2            -18   18   18   799   47%     4   19%
   6 Arasan 10.0           -226   42   45   160   23%   -18   18%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    81   42   41   160   63%   -17   16%
   2 opponent-21.7           61   38   38   159   62%   -17   33%
   3 Glaurung 1.1 SMP        46   42   41   160   58%   -17   13%
   4 Fruit 2.1               35   40   40   160   57%   -17   19%
   5 Crafty-22.2            -17   18   18   799   47%     3   19%
   6 Arasan 10.0           -205   42   45   160   26%   -17   16%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   113   43   41   160   66%   -12   12%
   2 opponent-21.7           73   39   38   159   63%   -12   32%
   3 Fruit 2.1               21   41   40   160   54%   -12   15%
   4 Crafty-22.2            -12   18   18   799   48%     2   18%
   5 Glaurung 1.1 SMP       -35   41   41   160   47%   -12   11%
   6 Arasan 10.0           -161   41   43   160   30%   -12   18%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   131   45   42   160   70%   -33   10%
   2 Fruit 2.1               64   41   40   160   63%   -33   19%
   3 Glaurung 1.1 SMP        25   41   40   160   58%   -33   15%
   4 opponent-21.7           13   37   37   160   57%   -33   36%
   5 Crafty-22.2            -33   18   18   800   45%     7   19%
   6 Arasan 10.0           -199   42   44   160   29%   -33   15%
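To see how much an 800-game result can wander on its own, here is a small Monte Carlo sketch of my own (the 19% draw rate is taken from the tables above, and the true strength is assumed to be exactly equal to the opposition, i.e. a true difference of zero Elo):

Code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Simulate many 800-game matches for a program whose true expected
   score is 0.50 (win 40.5%, draw 19%, loss 40.5%) and report the
   spread of the per-match Elo estimates. */
int main(void) {
  const int games = 800, matches = 1000;
  const double draw_rate = 0.19;
  double min_elo = 1e9, max_elo = -1e9;
  srand(12345);
  for (int m = 0; m < matches; m++) {
    double score = 0.0;
    for (int g = 0; g < games; g++) {
      double r = (double) rand() / RAND_MAX;
      if (r < draw_rate)                                  score += 0.5;  /* draw */
      else if (r < draw_rate + (1.0 - draw_rate) / 2.0)   score += 1.0;  /* win  */
      /* else: loss, score += 0 */
    }
    double s = score / games;
    double elo = 400.0 * log10(s / (1.0 - s));  /* logistic Elo model */
    if (elo < min_elo) min_elo = elo;
    if (elo > max_elo) max_elo = elo;
  }
  printf("true difference 0 Elo; per-match estimates range %+.1f to %+.1f\n",
         min_elo, max_elo);
  return 0;
}

On a run like this, single 800-game matches for the very same program land anywhere within a band several tens of Elo wide, the same sort of spread the four tables above show.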
Now does anyone _really_ believe that 800 games are enough? Later I will show some _much_ bigger matches as well, showing the same kind of variability. Here are two quickies of roughly 25,600 games each (5,120 per opponent), just for starters (same time control):
Code:
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   123    8    8  5120   66%     2   15%
   2 Fruit 2.1               38    8    7  5119   55%     2   19%
   3 opponent-21.7           28    7    7  5119   54%     2   34%
   4 Crafty-22.2              2    4    4 25597   50%     0   19%
   5 Glaurung 1.1 SMP         2    8    8  5120   50%     2   14%
   6 Arasan 10.0           -193    8    9  5119   26%     2   15%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   118    8    8  5120   67%   -19   13%
   2 Fruit 2.1               42    8    8  5120   58%   -19   17%
   3 opponent-21.7           32    7    7  5115   58%   -19   36%
   4 Glaurung 1.1 SMP        20    8    8  5120   55%   -19   12%
   5 Crafty-22.2            -19    4    4 25595   47%     4   19%
   6 Arasan 10.0           -193    8    8  5120   28%   -19   16%
There is a 21 Elo difference between the two. The first result says 2 +/- 4, while the second says -19 +/- 4, so the ranges (roughly -2 to +6 versus -23 to -15) don't even overlap. Which points out that this kind of statistic is good for the sample under observation, but not necessarily representative of the total population of potential games, not without playing a _lot_ more games. Some would say that the second match puts Crafty somewhere between -23 and -15. Which is OK. But then what does the first match, which is just as big, say?
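For reference, those +/- values can be approximated from the raw score with the usual logistic Elo model. This is a rough sketch of my own, not BayesElo's actual computation; BayesElo also models the draw rate, which tightens the interval a bit:

Code:
#include <stdio.h>
#include <math.h>

/* Approximate a 95% Elo confidence interval from a match score,
   using the logistic Elo model and a normal approximation. */
static void elo_interval(double s, int n) {
  double elo   = 400.0 * log10(s / (1.0 - s));        /* point estimate */
  double se    = sqrt(s * (1.0 - s) / n);             /* std error of s */
  double deriv = 400.0 / (log(10.0) * s * (1.0 - s)); /* Elo per unit of score */
  printf("score %.2f over %5d games: %+6.1f Elo +/- %.1f\n",
         s, n, elo, 1.96 * se * deriv);
}

int main(void) {
  elo_interval(0.50, 25597);  /* about  +0.0 +/- 4.3 */
  elo_interval(0.47, 25595);  /* about -20.9 +/- 4.3 (BayesElo: -19 +/- 4) */
  elo_interval(0.47,   800);  /* about -20.9 +/- 24 for one 800-game match */
  return 0;
}

The error bar only shrinks with the square root of the number of games, which is why getting it from +/- 24 down to +/- 4 takes tens of thousands of games.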
"things that make you go hmmm......."
