I defined a testing method based on the same principle but with a completely different implementation.
I took all games of at least 40 moves played at ICCF during 2008 between master-level (2400+) correspondence players.
From these games I extracted the EPD of the positions at moves 15, 25 and 35, both with White and with Black to move.
I then discarded all positions that appeared more than once in the resulting EPD file.
I was left with 4830 positions.
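The deduplication step can be sketched as follows (a minimal illustration, not the script I actually used; positions are compared on the first four EPD fields so that the same position reached in two different games counts as a duplicate and is dropped entirely):

```python
from collections import Counter

def unique_positions(epd_lines):
    """Keep only the positions that occur exactly once in the file.

    The comparison key is the first four EPD fields (piece placement,
    side to move, castling rights, en-passant square), so identical
    positions with different trailing opcodes still count as duplicates.
    """
    def key(epd):
        return " ".join(epd.split()[:4])

    counts = Counter(key(line) for line in epd_lines)
    return [line for line in epd_lines if counts[key(line)] == 1]
```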
Testing was done using Polyglot’s “epd-test” function. For each position the engine under test was given three seconds to find the move that the correspondence master had played, so a full run took about four hours per engine (4830 positions × 3 s).
The total number of “found” moves was recorded.
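Conceptually the scoring is just a match count over the position set. A sketch (this is not Polyglot’s actual code; `engine_move` is a stand-in for the real engine invocation at the fixed time limit):

```python
def count_found(epd_records, engine_move):
    """Count how often the engine reproduces the master's move.

    epd_records: list of (epd, master_move) pairs, where master_move is
    the move actually played in the correspondence game.
    engine_move: callable taking an EPD and returning the move the
    engine chose after its allotted think time (a stand-in here for
    the real engine call).
    """
    return sum(1 for epd, master_move in epd_records
               if engine_move(epd) == master_move)
```

In the real test the callable would drive the engine over UCI and return its best move after three seconds; here any function with that shape works.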
Testing was performed on a quad-core PC (Q6600 overclocked to 3.5 GHz) under Windows XP 64-bit.
Using this method, 16 engines with CCRL 40/40 ratings evenly distributed between 2590 and 3320 were tested. Where available, 64-bit versions of the engines were used.
Here is the list of the engines with the number of threads used by each one:
Rybka 3 - T4
Naum 4 - T4
Rybka 2.2n2 - T4
Zappa Mex-II - T4
DeepSjeng WC2008 - T4
Hiarcs 12 - T4
Bright 0.4a - T4
Glaurung 2.2 - T4
Naum 3.1 - T1
Fruit 2.3.1 - T1
Spike 1.2 Turin - T1
ChessTiger 2007.1 - T1
Colossus 2008b - T1
Aristarch 4.50 - T1
SOS 5.1 - T1
Yace 0.99.87 - T1
Here are the results:
Estimated CCRL 40/40 Abs(Error)
3164 3228 64
3101 3152 51
3032 3109 77
3109 3070 39
2905 3029 124
3161 3004 157
2979 2998 19
3007 2994 13
2917 2962 45
2912 2882 30
2869 2849 20
2774 2802 28
2827 2747 80
2646 2699 53
2760 2666 94
2619 2590 29
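The “Estimated” column is obtained by calibrating the raw found-move counts against the known CCRL ratings of these same engines; as I understand it, a low-order (linear) fit is the natural choice. A minimal sketch of such a calibration, with made-up counts since the raw counts are not reproduced here:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical found-move counts for four calibration engines --
# placeholders to show the mechanics, NOT the real test data.
counts = [2100, 2200, 2350, 2500]
ccrl = [2600, 2750, 2950, 3150]
slope, intercept = fit_line(counts, ccrl)

def estimate_elo(found):
    """Map a found-move count onto the CCRL scale via the fitted line."""
    return slope * found + intercept
```

Once the line is fitted on known engines, `estimate_elo` can be back-applied to them (as in the table above) or applied to a new engine’s count.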
So there is a trend: stronger engines tend to agree more often than weaker ones with the moves played by correspondence masters when given only a very short analysis time.
But is this correlation strong enough to be of practical value for real testing tasks?
The median error of the estimation is 58 Elo points: when we apply this rating procedure to the 16 engines, the estimate lands within 58 points of the real Elo for half of them.
But for the other half we can get a considerably worse value.
The test overestimates DeepSjeng by 157 Elo points and underestimates Zappa by 124 points.
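For the record, these error figures can be recomputed directly from the table above:

```python
# (estimated, ccrl) pairs copied from the results table above.
pairs = [
    (3164, 3228), (3101, 3152), (3032, 3109), (3109, 3070),
    (2905, 3029), (3161, 3004), (2979, 2998), (3007, 2994),
    (2917, 2962), (2912, 2882), (2869, 2849), (2774, 2802),
    (2827, 2747), (2646, 2699), (2760, 2666), (2619, 2590),
]

errors = [abs(est - ccrl) for est, ccrl in pairs]
worst = max(errors)                       # 157: the DeepSjeng over-estimation
mean_error = sum(errors) / len(errors)    # about 58 Elo points on average
```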
And this is when we back-apply the formula to the very engines from which it was built. We must always fear that applying it to a “foreign” engine could lead to an even larger error …
So the conclusion is evident: this test has some value (for example, for a quick preliminary rating of a completely unknown engine), but it is nowhere near precise enough for someone who is busy tuning an engine and needs to discriminate between two versions that will probably differ by no more than a few Elo points.
Marc
PS: I also tested quite a few variants of the test (more or less time allowed, larger or smaller numbers of positions, higher-order interpolation formulas, and so on) but was not able to get anything better than the example shown.