Laskos wrote: The excellent FGRL rating list (http://www.fastgm.de/index.html) contains two Top 10 rating lists, for 10' + 6'' and 60' + 15'' TC, with identical engines on one core. We can make direct comparisons of engine performance.
1/
10' + 6''
Code:
10' + 6''
Ordo v1.0.9.2: 3000
Engine : Elo Diff Error Points (%) W D L D(%) CFS W/L
------------------------------------------------------------------------------------------------------ ------
1 Stockfish 8 : 3151 0 9 1916.0 70.96 1209 1414 77 52.37 89 15.70
2 Komodo 10.4 : 3143 -8 9 1889.0 69.96 1224 1330 146 49.26 63 8.38
3 Houdini 5.01 : 3141 -10 8 1882.0 69.70 1193 1378 129 51.04 100 9.25
4 Deep Shredder 13 : 3009 -142 8 1390.0 51.48 630 1520 550 56.30 100 1.145
5 Fire 5 : 2983 -168 8 1289.0 47.74 542 1494 664 55.33 100 0.816
6 Fizbo 1.9 : 2957 -194 8 1186.0 43.93 476 1420 804 52.59 100 0.592
7 Gull 3 : 2941 -210 8 1125.0 41.67 399 1452 849 53.78 100 0.470
8 Andscacs 0.89 : 2901 -250 8 975.5 36.13 330 1291 1079 47.81 98 0.306
9 Fritz 15 : 2889 -262 8 930.0 34.44 282 1296 1122 48.00 72 0.251
10 Chiron 4 : 2885 -266 8 917.5 33.98 271 1293 1136 47.89 --- 0.239
White advantage = 40.58 +/- 2.07
Draw rate (equal opponents) = 63.46 % +/- 0.53
2/
60' + 15''
Code:
60' + 15''
Ordo v1.2.6: 3000
Engine : Elo Diff Error Points (%) W D L D(%) CFS W/L
---------------------------------------------------------------------------------------------------- ------
1 Stockfish 8 : 3146 0 12 950.5 70.41 587 727 36 53.85 51 16.31
2 Komodo 10.4 : 3146 0 12 950.0 70.37 615 670 65 49.63 100 9.46
3 Houdini 5.01 : 3119 -27 11 903.5 66.93 516 775 59 57.41 100 8.74
4 Deep Shredder 13 : 3015 -131 11 706.5 52.33 304 805 241 59.63 99 1.261
5 Fire 5 : 2997 -149 10 670.5 49.67 287 767 296 56.81 100 0.970
6 Fizbo 1.9 : 2949 -197 11 577.5 42.78 208 739 403 54.74 83 0.516
7 Gull 3 : 2941 -205 11 562.5 41.67 172 781 397 57.85 97 0.433
8 Andscacs 0.89 : 2926 -220 11 533.0 39.48 176 714 460 52.89 100 0.383
9 Chiron 4 : 2885 -261 11 457.0 33.85 126 662 562 49.04 88 0.224
10 Fritz 15 : 2875 -271 11 439.0 32.52 106 666 578 49.33 --- 0.183
White advantage = 39.23 +/- 2.84
Draw rate (equal opponents) = 66.78 % +/- 0.74
Elo is not an adequate parametrization of the scaling. Ratings at longer time controls are subject to Elo compression due to the increasing draw rate, so a weaker engine might appear to approach a stronger one Elo-wise (to relatively gain strength) merely because of the larger number of draws, without any change in relative strength. A quantity more closely related to relative strength is the Win/Loss ratio of each engine in the list. Here I post the list of the scaling of the engines, measured as the change in their Win/Loss ratios from the blitz TC to the long TC, together with 100*log10 of that value so that the ratings are additive.
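To see the compression effect numerically, here is a minimal sketch in Python, assuming the standard logistic Elo model; the W/L ratio of 16 and the two draw rates are illustrative (loosely matching Stockfish 8 vs. the field in the tables above), not taken from the Ordo output:

Code:
from math import log10

def elo_diff(score):
    # Elo difference implied by an expected score under the standard
    # logistic model: D = 400 * log10(s / (1 - s)).
    return 400.0 * log10(score / (1.0 - score))

# Hold the Win/Loss ratio fixed at 16 and vary only the draw rate:
# the implied Elo gap shrinks even though the relative strength,
# as measured by W/L, is unchanged.
WL_RATIO = 16.0
for draw_rate in (0.52, 0.65):
    win = (1.0 - draw_rate) * WL_RATIO / (1.0 + WL_RATIO)
    score = win + draw_rate / 2.0
    print(f"draw rate {draw_rate:.0%}: score {score:.3f} -> Elo gap {elo_diff(score):+.0f}")

With the same W/L ratio of 16, the implied gap drops from about +157 Elo at a 52% draw rate to about +111 Elo at 65%.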
Scaling to Long Time Control on one core:
Code:
Engine Scaling = (W2*L1)/(W1*L2) 100*log10(Scaling)
------------------------------------------------------------------------------------
1 Andscacs 0.89 : 1.252 9.76
2 Fire 5 : 1.189 7.52
3 Komodo 10.4 : 1.129 5.27
4 Deep Shredder 13 : 1.101 4.18
5 Stockfish 8 : 1.039 1.66
6 Houdini 5.01 : 0.945 -2.46
7 Chiron 4 : 0.937 -2.83
8 Gull 3 : 0.921 -3.57
9 Fizbo 1.9 : 0.872 -5.95
10 Fritz 15 : 0.729 -13.73
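For anyone who wants to reproduce the last two columns, a minimal Python sketch computing the metric directly from the W/L counts in the two Ordo tables above (index 1 = blitz 10' + 6'', index 2 = long 60' + 15''; only a subset of the ten engines is entered):

Code:
from math import log10

# (wins, losses) at the blitz TC and at the long TC, copied from
# the two Ordo tables above; a subset of the engines for brevity.
results = {
    "Andscacs 0.89": ((330, 1079), (176, 460)),
    "Fire 5":        ((542,  664), (287, 296)),
    "Komodo 10.4":   ((1224, 146), (615,  65)),
    "Stockfish 8":   ((1209,  77), (587,  36)),
    "Fritz 15":      ((282, 1122), (106, 578)),
}

for name, ((w1, l1), (w2, l2)) in results.items():
    scaling = (w2 * l1) / (w1 * l2)  # Scaling = (W2*L1)/(W1*L2)
    print(f"{name:<16} {scaling:6.3f} {100.0 * log10(scaling):+7.2f}")

A value above 1 (positive log) means the engine wins relatively more, per loss, at the long TC than at blitz.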
Thanks Kai, but, no matter how enthusiastic many posters in this thread are, I very much suspect that only 50 to 70% of the above data is entirely valid, the rest being due to unaccounted-for factors (what people generally like to call random noise).
Any guess as to how a second comparison, using the same list but with TCs of 60'' and 10' per game, would fare?
My guess is that 50-70% of the conclusions drawn above would be valid, probably still closer to 70%.
As far as I can tell, scaling across different TCs is not linear for almost any engine out there. For example, SF scales worse going from the TC tested at the framework, 60'', to blitz TC, 3-5 minutes; then it reaches its peak performance at about 10 to 30 minutes per game; then its performance drops back again at 1 hour; and then, guess what, it is unexpectedly boosted again at very long TCs, for example the TCEC one.
I guess pretty much the same holds for many other engines around.
There are simply too many unaccounted-for factors.
As I have no time to open another reply here, I would like to ask why Dann considers search to play a bigger role in scaling than eval. For me, it is difficult to draw any definitive conclusions, as search and eval in modern-day engines are simply inseparable, but, if anything, I have absolutely no doubt that evaluation parameters are more likely to scale worse/better than search ones.
Does anyone have precise data showing that null-move pruning, for example, enhances scaling at longer TCs compared to other search routines, or to eval?