Werner wrote:
Werner wrote:
lkaufman wrote:
Would it be easy for you (or someone else) to show what the CEGT 40/20 rating list would look like if we only include games among the top programs (let's say the best two versions each of Houdini, Komodo, Stockfish, Critter, and Rybka)? I'm curious to see whether that would produce a list similar to the actual one or one that is markedly different. There seems to be quite a disparity between the rating lists and direct play results between top engines; this would be a simple way to determine whether this is a real phenomenon or not.
Larry
Hi Larry,
Download cegttotal.zip.
Copy all games with these 10 engines to a new folder.
Delete duplicate games and games against other engines.
Now we have only around 2000 games left in our 40/20 list!
Now run elostat and correct the start rating:
   Program                    Elo   +   -  Games   Score  Av.Op.   Draws
 1 Houdini 3 x64 1CPU      : 3049  21  21    550  58.6 %    2989  48.5 %
 2 Komodo CCT x64          : 3013  22  22    500  53.9 %    2985  48.2 %
 3 Komodo 5.1r2 x64 1CPU   : 3009  21  21    500  53.4 %    2985  51.6 %
 4 Stockfish 4.0 x64 1CPU  : 3002  23  23    451  49.7 %    3004  50.6 %
 5 Critter 1.6 x64 1CPU    : 2985  20  20    569  47.9 %    3000  51.5 %
 6 Houdini 2.0c x64 1CPU   : 2972  33  33    217  51.4 %    2963  49.3 %
 7 Critter 1.4 x64 1CPU    : 2967  31  31    200  48.5 %    2978  58.0 %
 8 Stockfish 3.0 x64 1CPU  : 2965  22  22    450  45.9 %    2994  53.1 %
 9 Deep Rybka 4.1 x64 1CPU : 2925  23  23    501  39.6 %    2998  44.5 %
Well, does that tell us more?
Best wishes
Werner
Here is the same list with the blitz results. Start Elo tuned to give Houdini 3 the same rating as in our blitz list:
   Program                    Elo   +   -  Games   Score  Av.Op.   Draws
 1 Houdini 3.0 x64 1CPU    : 3076  21  20    600  57.4 %    3024  45.8 %
 2 Stockfish 4.0 x64 1CPU  : 3062  21  21    503  53.7 %    3037  50.9 %
 3 Komodo CCT x64 1CPU     : 3051  22  22    501  53.7 %    3025  48.3 %
 4 Komodo 5.1r2 x64 1CPU   : 3045  22  22    500  52.9 %    3025  48.6 %
 5 Critter 1.6 x64 1CPU    : 3036  17  17    804  51.7 %    3024  52.4 %
 6 Houdini 2.0c x64 1CPU   : 3033  35  35    200  54.0 %    3006  48.0 %
 7 Critter 1.4 x64 1CPU    : 2986  50  50    100  51.5 %    2975  47.0 %
 8 Stockfish 3.0 x64 1CPU  : 2976  21  21    500  41.3 %    3037  51.4 %
 9 Rybka 4.0 x64 1CPU      : 2975  17  18    802  41.7 %    3033  47.3 %
10 Rybka 4.1 x64 1CPU      : 2974  41  43    102  41.2 %    3036  60.8 %
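For readers wondering how the Elo, Score, and Av.Op. columns relate: the rows above appear consistent with the usual logistic Elo expectation applied to the score against the average opponent. The sketch below is only a sanity check under that assumption; it is not a description of what elostat does internally, and the function name is made up for illustration.

```python
import math


def elo_from_score(avg_opponent, score_pct):
    """Rating implied by a score percentage against an average opponent rating.

    Uses the standard logistic Elo formula; score_pct must be strictly
    between 0 and 100.
    """
    s = score_pct / 100.0
    return avg_opponent + 400.0 * math.log10(s / (1.0 - s))


# Houdini 3 in the 40/20 subset: 58.6 % against an average of 2989
print(round(elo_from_score(2989, 58.6)))
# Houdini 3.0 in the blitz subset: 57.4 % against an average of 3024
print(round(elo_from_score(3024, 57.4)))
```

Both values land within a point of the listed ratings (3049 and 3076).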
Hi, thanks for the stats.
I do not know to whom I should address my question, and perhaps I should not ask it at all, but still: the CEGT tests show that in blitz Stockfish 4 has gained 86 elo over Stockfish 3, while in rapid (40/20 should count as rapid) it has gained just 37 elo. That is less than half the blitz gain, a considerable difference.
Does that suggest that, under the newly elaborated testing and development framework, Stockfish has suddenly become less scalable? Any ideas whether this might be so, and why?
Clemens' 3 Champs results with a version of Stockfish close to Stockfish 4, compared to Ingo's blitz results, also seem to imply something like this.
Does someone have any additional results at different TCs that would support or reject this hypothesis? (But not the scalability measurements done under the Stockfish framework, with an extremely fast TC only very slightly increased, nowhere near even blitz.) Is it possible that Stockfish no longer scales so well at normal and very long TCs?
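For reference, the two gains quoted above can be recomputed directly from the lists, together with a rough uncertainty. This is a sketch under a loud assumption: the +/- columns are treated as independent error margins and combined in quadrature, which is a simplification, since all ratings in a list come from the same pool of games.

```python
import math

# (Elo, +, -) for the two Stockfish versions, copied from the lists above
blitz = {"Stockfish 4.0": (3062, 21, 21), "Stockfish 3.0": (2976, 21, 21)}
rapid = {"Stockfish 4.0": (3002, 23, 23), "Stockfish 3.0": (2965, 22, 22)}


def gain(ratings, new, old):
    """Return (Elo difference, combined margin) between two list entries."""
    e_new, p_new, m_new = ratings[new]
    e_old, p_old, m_old = ratings[old]
    diff = e_new - e_old
    # Assumption: margins are independent, so they add in quadrature.
    margin = math.hypot(max(p_new, m_new), max(p_old, m_old))
    return diff, margin


blitz_gain, blitz_margin = gain(blitz, "Stockfish 4.0", "Stockfish 3.0")
rapid_gain, rapid_margin = gain(rapid, "Stockfish 4.0", "Stockfish 3.0")
gap = blitz_gain - rapid_gain
gap_margin = math.hypot(blitz_margin, rapid_margin)

print(f"blitz gain: {blitz_gain} +/- {blitz_margin:.0f}")
print(f"40/20 gain: {rapid_gain} +/- {rapid_margin:.0f}")
print(f"difference: {gap} +/- {gap_margin:.0f}")
```

Under this simplification the blitz gain is 86 and the 40/20 gain is 37, so the gap between them is 49 elo with a combined margin of about 44.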
Best, Lyudmil