I looked at some of the data a little closer:
Using SF8 as the reference point, we see that two engines (Komodo and Fire) which "scale" better with time actually achieve a worse result.
Deep Shredder does indeed fare a bit better, but the result is within the realm of Elo compression, so it is hard to draw any real conclusion there...
Andscacs performs the best in this regard, and the result is good enough that it appears reasonably convincing that Andscacs does scale better with time than SF8 (and very likely most chess engines).
However, two data points is a bit limiting, and single-core results are not of the most practical value these days... Maybe I will run a few matches between SF and Andscacs to gather another data point or two. It is actually quite interesting that Andscacs appears to scale significantly better than the rest of the field.
Scaling of engines from FGRL rating list
Re: Scaling of engines from FGRL rating list
Thanks Kai, but, no matter how enthusiastic many posters in this thread are, I very much suspect that only 50 to 70% of the data above can be entirely valid, the rest being due to unaccounted-for factors (what people generally like to call random noise).

Laskos wrote: The excellent FGRL rating list (http://www.fastgm.de/index.html) contains two Top 10 rating lists, for 10' + 6'' and 60' + 15'' TC, with identical engines on one core. We can make direct comparisons of engine performances.
1/ 10' + 6''

Code: Select all

10' + 6''   Ordo v1.0.9.2: 3000

   # Engine             :   Elo   Diff  Error  Points    (%)     W     D     L   D(%)  CFS     W/L
  ------------------------------------------------------------------------------------------------
   1 Stockfish 8        :  3151     0      9  1916.0  70.96  1209  1414    77  52.37   89   15.70
   2 Komodo 10.4        :  3143    -8      9  1889.0  69.96  1224  1330   146  49.26   63    8.38
   3 Houdini 5.01       :  3141   -10      8  1882.0  69.70  1193  1378   129  51.04  100    9.25
   4 Deep Shredder 13   :  3009  -142      8  1390.0  51.48   630  1520   550  56.30  100   1.145
   5 Fire 5             :  2983  -168      8  1289.0  47.74   542  1494   664  55.33  100   0.816
   6 Fizbo 1.9          :  2957  -194      8  1186.0  43.93   476  1420   804  52.59  100   0.592
   7 Gull 3             :  2941  -210      8  1125.0  41.67   399  1452   849  53.78  100   0.470
   8 Andscacs 0.89      :  2901  -250      8   975.5  36.13   330  1291  1079  47.81   98   0.306
   9 Fritz 15           :  2889  -262      8   930.0  34.44   282  1296  1122  48.00   72   0.251
  10 Chiron 4           :  2885  -266      8   917.5  33.98   271  1293  1136  47.89  ---   0.239

White advantage = 40.58 +/- 2.07
Draw rate (equal opponents) = 63.46 % +/- 0.53

2/ 60' + 15''

Code: Select all

60' + 15''  Ordo v1.2.6: 3000

   # Engine             :   Elo   Diff  Error  Points    (%)     W     D     L   D(%)  CFS     W/L
  ------------------------------------------------------------------------------------------------
   1 Stockfish 8        :  3146     0     12   950.5  70.41   587   727    36  53.85   51   16.31
   2 Komodo 10.4        :  3146     0     12   950.0  70.37   615   670    65  49.63  100    9.46
   3 Houdini 5.01       :  3119   -27     11   903.5  66.93   516   775    59  57.41  100    8.74
   4 Deep Shredder 13   :  3015  -131     11   706.5  52.33   304   805   241  59.63   99   1.261
   5 Fire 5             :  2997  -149     10   670.5  49.67   287   767   296  56.81  100   0.970
   6 Fizbo 1.9          :  2949  -197     11   577.5  42.78   208   739   403  54.74   83   0.516
   7 Gull 3             :  2941  -205     11   562.5  41.67   172   781   397  57.85   97   0.433
   8 Andscacs 0.89      :  2926  -220     11   533.0  39.48   176   714   460  52.89  100   0.383
   9 Chiron 4           :  2885  -261     11   457.0  33.85   126   662   562  49.04   88   0.224
  10 Fritz 15           :  2875  -271     11   439.0  32.52   106   666   578  49.33  ---   0.183

White advantage = 39.23 +/- 2.84
Draw rate (equal opponents) = 66.78 % +/- 0.74

Elo is not an adequate parametrization of the scaling. Rating at longer time controls is subject to Elo compression, due to the increasing draw rate. So a weaker engine might appear to approach a stronger one Elo-wise (relatively gaining strength), but this might be due merely to the increasing number of draws, without the relative strength being affected. More related to relative strength is the Win/Loss ratio for every engine in the list. Here I post the rating list of the scaling of engines in Win/Loss ratios from the blitz TC to the long TC, together with a log10 list so that the ratings are additive.

Scaling to Long Time Control on one core:

Code: Select all

   # Engine             :  Scaling = (W2*L1)/(W1*L2)  100*log10(Scaling)
  ----------------------------------------------------------------------
   1 Andscacs 0.89      :  1.252                        9.76
   2 Fire 5             :  1.189                        7.52
   3 Komodo 10.4        :  1.129                        5.27
   4 Deep Shredder 13   :  1.101                        4.18
   5 Stockfish 8        :  1.039                        1.66
   6 Houdini 5.01       :  0.945                       -2.46
   7 Chiron 4           :  0.937                       -2.83
   8 Gull 3             :  0.921                       -3.57
   9 Fizbo 1.9          :  0.872                       -5.95
  10 Fritz 15           :  0.729                      -13.73
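The Elo-compression point in the quoted post can be illustrated with a small sketch (all game counts hypothetical): hold the win/loss ratio fixed at 4:1 and increase only the share of draws, and the measured Elo gap shrinks even though relative strength by W/L is unchanged.

```python
import math

def elo_from_score(s):
    """Elo difference implied by a score fraction s (logistic model)."""
    return 400 * math.log10(s / (1 - s))

# Same 4:1 win/loss ratio, increasing draw share (hypothetical counts):
for w, d, l in [(40, 10, 10), (20, 55, 5), (8, 82, 2)]:
    n = w + d + l
    s = (w + d / 2) / n
    print(f"W/L = {w // l}:1, draws = {d}/{n}, Elo gap = {elo_from_score(s):.0f}")
```

The printed gap falls from roughly 191 Elo to roughly 23 Elo while the W/L ratio stays constant, which is exactly why the post argues for W/L-based (Wilo-style) comparisons across time controls.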
Any guess how a second comparison, using the same list but with TCs of 60'' and 10' per game, would fare?

My guess is that 50-70% of the above-drawn conclusions would be valid, probably still closer to 70%.

As far as I have any clue, scaling with different TCs is not linear for almost any engine out there. For example, SF scales worse from the TC tested on the framework, 60'', to blitz TC, 3-5 minutes; then it reaches its peak performance at about 10 to 30 minutes per game; then its performance drops back again at 1 hour; and then, guess what, it is unexpectedly boosted again at very long TCs, for example the TCEC one.

I guess pretty much the same is valid for many other engines around.

There are simply too many unaccounted-for factors.

As I have no time to open another reply here, I would like to ask why Dann would consider search to play a bigger role in scaling than eval. For me it is difficult to draw any definitive conclusions, as search and eval in modern-day engines are simply inseparable; but, if anything, I have absolutely no doubt that evaluation parameters are more likely to produce worse/better scaling than search ones.

Does anyone have precise data showing that null-move pruning, for example, would enhance scaling at longer TCs when compared to other search routines, or to eval?
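For readers unfamiliar with the technique being asked about, here is a minimal sketch of null-move pruning grafted onto a plain negamax search. The "position" is a toy subtraction game (take 1-3 objects from a pile; the player facing an empty pile loses), and the evaluation, the R = 2 reduction, and all names are illustrative, not taken from any real engine. Note that this toy game is pure zugzwang, exactly the situation in which the null-move assumption (that passing is never best) is known to misfire, so the sketch shows the mechanism only.

```python
def evaluate(n):
    # Illustrative heuristic: piles with n % 4 == 0 are lost for the mover.
    return -100 if n % 4 == 0 else 100

def search(n, depth, alpha, beta, use_null, can_null=True):
    """Negamax with alpha-beta and optional null-move pruning."""
    if n == 0:
        return -1000                      # side to move has no moves: loss
    if depth == 0:
        return evaluate(n)
    R = 2                                 # null-move depth reduction
    if use_null and can_null and depth > R:
        # Give the opponent a free move; if we still fail high at reduced
        # depth, assume the real search would fail high too and prune.
        score = -search(n, depth - 1 - R, -beta, -beta + 1, use_null, can_null=False)
        if score >= beta:
            return beta
    best = -10000
    for take in (1, 2, 3):                # legal moves: remove 1..3 objects
        if take <= n:
            best = max(best, -search(n - take, depth - 1, -beta, -alpha, use_null))
            alpha = max(alpha, best)
            if alpha >= beta:
                break                     # beta cutoff
    return best

print(search(9, 6, -10000, 10000, True))    # with null-move pruning
print(search(9, 6, -10000, 10000, False))   # plain alpha-beta
```

Whether a trick like this helps more or less at long time controls is, as the post says, an empirical question; the sketch only pins down what is being compared.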
- Full name: Kai Laskos
Re: Scaling of engines from FGRL rating list.
jhellis3 wrote: I looked at some of the data a little closer...

You are comparing head-to-head results, with 300 and 150 games each, respectively. It's better to use the whole set of games, 2700 and 1350 per engine, respectively. It's hard to invoke Elo compression here, as even the initial lists (the opening-post tables) show only a little of it. The standard deviation of the 10' + 6'' ratings is 108 Elo points; of the 60' + 15'' list, 104 Elo points. I computed Ordo Wilo ratings from the PGNs cleaned of draws:
Code: Select all
10min + 6s:
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(next)
1 Stockfish 8 : 3511.4 41.4 1209.0 1286 94.0 100
2 Houdini 5.01 : 3397.5 35.6 1193.0 1322 90.2 62
3 Komodo 10.4 : 3389.9 37.9 1224.0 1370 89.3 100
4 Deep Shredder 13 : 2993.4 25.7 630.0 1180 53.4 100
5 Fire 5 : 2923.5 23.6 542.0 1206 44.9 100
6 Fizbo 1.9 : 2848.2 23.7 476.0 1280 37.2 98
7 Gull 3 : 2813.5 24.3 399.0 1248 32.0 100
8 Andscacs 0.89 : 2731.7 25.4 330.0 1409 23.4 96
9 Fritz 15 : 2701.5 24.8 282.0 1404 20.1 75
10 Chiron 4 : 2689.5 26.6 271.0 1407 19.3 ---
SD=315
60min + 15s:
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(next)
1 Stockfish 8 : 3494.4 64.1 587.0 623 94.2 99
2 Komodo 10.4 : 3401.5 52.0 615.0 680 90.4 61
3 Houdini 5.01 : 3390.4 54.4 516.0 575 89.7 100
4 Deep Shredder 13 : 3011.5 37.2 304.0 545 55.8 96
5 Fire 5 : 2962.5 36.3 287.0 583 49.2 100
6 Fizbo 1.9 : 2816.5 34.4 208.0 611 34.0 78
7 Gull 3 : 2796.8 36.8 172.0 569 30.2 75
8 Andscacs 0.89 : 2779.4 36.2 176.0 636 27.7 100
9 Chiron 4 : 2690.7 38.3 126.0 688 18.3 90
10 Fritz 15 : 2656.3 39.4 106.0 684 15.5 ---
SD=316
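The "cleaned of draws" step can be done with a few lines of text processing; a sketch, assuming standard PGN input with one [Result "..."] tag per game (the splitting regex and sample data are illustrative):

```python
import re

def drop_draws(pgn_text):
    """Return only the decisive games from a PGN string."""
    # Split the text into games at each [Event ...] header line.
    games = re.split(r'(?m)^(?=\[Event )', pgn_text)
    keep = [g for g in games if g.strip() and '[Result "1/2-1/2"]' not in g]
    return "".join(keep)

sample = (
    '[Event "test"]\n[Result "1-0"]\n\n1. e4 e5 2. Nf3 1-0\n\n'
    '[Event "test"]\n[Result "1/2-1/2"]\n\n1. d4 d5 1/2-1/2\n\n'
)
print(drop_draws(sample).count("[Event"))   # prints 1: the draw is filtered out
```

Ordo can then be run on the filtered PGN to produce the Wilo-style ratings above.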
Code: Select all
   # Engine             :  Scaling 400*log10[(W2/L2)/(W1/L1)]   Scaling Ordo Wilo   Scaling Elo
  ------------------------------------------------------------------------------------------------
1 Andscacs 0.89 : 38.9 47.7 25
2 Fire 5 : 29.9 39.0 14
3 Komodo 10.4 : 21.0 11.6 3
4 Deep Shredder 13 : 16.7 18.1 6
5 Stockfish 8 : 6.6 -17.0 -5
6 Houdini 5.01 : -9.7 -7.1 -22
7 Chiron 4 : -10.8 1.2 0
8 Gull 3 : -14.1 -16.7 0
9 Fizbo 1.9 : -23.8 -31.7 -8
10 Fritz 15 : -54.8 -45.2 -14
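The three columns can be reproduced directly from the tables above; taking Andscacs as the example (W/L counts and ratings copied from the lists):

```python
import math

# Andscacs 0.89 vs the field: wins/losses at blitz (10'+6'') and long (60'+15'') TC
w1, l1 = 330, 1079          # blitz
w2, l2 = 176, 460           # long TC

scaling = 400 * math.log10((w2 / l2) / (w1 / l1))
print(round(scaling, 1))            # "Scaling" column: ~38.9

wilo_diff = 2779.4 - 2731.7         # Ordo Wilo rating, long minus blitz
elo_diff = 2926 - 2901              # ordinary Elo rating, long minus blitz
print(round(wilo_diff, 1), elo_diff)   # ~47.7 and 25
```

Note that (W2/L2)/(W1/L1) is the same ratio as the (W2*L1)/(W1*L2) used earlier in the thread; only the multiplier (400 vs 100) differs.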
Re: Scaling of engines from FGRL rating list.
Laskos wrote: You are comparing head-to-head results, with 300 and respectively 150 games each. It's better to have the whole set of games, 2700 and 1350 per each engine respectively.

No it isn't. Checkmate. And before you get mad, please consider that my statement was purposely made to mirror yours (just with simpler language).

Laskos wrote: It's hard to invoke here Elo compression, as even the initial lists (opening post tables) show only a little of it.

Again, no it isn't. I did it quite simply, in fact. A minimum of 17 Elo is clearly at stake for all engines rated lower than SF7, scaling up to 28 by the time of Fire 5, and obviously increasing from there.

Laskos wrote: Standard deviation for 10' + 6'' ratings is 108 Elo points, of the 60' + 15'' list 104 Elo points.

I don't want to be mean, but... why compute a meaningless number? (Hint: this is not an actual question.)

Laskos wrote: Wilo ratings are not supposed to compress at longer time controls, so these ratings should be directly comparable (the standard deviation of the ratings is almost identical here). And the scaling with all the data shows the following for 3 indicators:

I'm not interested in what things are supposed to do. I am not interested in conjecture with confirmation bias. I do have a degree in "the maths". I do understand what one can do with numbers (both good and bad). What I am interested in is the truth. And when empirical data (AKA reality) runs counter to one's claims, it is time to re-evaluate matters.
The reality is that Komodo and Fire have a worse score vs SF8 at LTC than STC despite "scaling better". If that doesn't cause you to at least pause and reconsider what is going on, well.... that is your business. I'm not here to convert anyone or push any particular agenda or conclusion.
Hell, even the data I presented, which does at first glance look favorable for Andscacs, could be interpreted a very different way. Instead of saying Andscacs scales well with increasing time, we might say it actually just scales horribly with decreasing time. The problem is that we only looked at two data points and have no way of knowing for sure without broadening our scope. And it doesn't even have to be one or the other: it could potentially be a combination of both, where Andscacs does scale better with more time, but not nearly as much as it first appears, because it also scales relatively poorly with less time.
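The two-data-point ambiguity can be made concrete with a toy sketch. Only the two measured gaps (-250 and -220 Elo, from the lists earlier in the thread) are real; the two models below are hypothetical, agree exactly at 10' and 60', and tell opposite stories everywhere else.

```python
import math

def gains_at_ltc(t):
    """Model A: flat below 10', then improving with log(time)."""
    return -250 if t <= 10 else -250 + 30 * math.log10(t / 10) / math.log10(6)

def loses_at_stc(t):
    """Model B: improving with log(time) up to 60', then flat."""
    return -220 if t >= 60 else -220 + 30 * math.log10(t / 60) / math.log10(6)

for t in (3, 10, 60, 180):   # minutes per game
    print(t, round(gains_at_ltc(t)), round(loses_at_stc(t)))
```

At the measured points the models coincide; at 3' and 180' they differ by roughly 20 Elo each, and nothing in the two measurements can separate them.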
Re: Scaling of engines from FGRL rating list.
What is this gibberish?
Re: Scaling of engines from FGRL rating list
Is the following guess/interpretation gibberish or plausible?

Since Stockfish 8 seems to be the highest-rated engine in the list, it is closer to a perfect player and on average finds "the best" move more quickly than all the other engines. Hence it cannot scale much better with longer time controls, because there is not as much strength left to gain compared to any other engine.

Thank you, math/logic guys, for answering.
Re: Scaling of engines from FGRL rating list
Isaac wrote: Is the following guess/interpretation gibberish or plausible?

I think it is like this in part, though I prefer to think that Stockfish is not near-perfect, but rather the strongest in the current subset of chess that the engines dominate. Thus at some point it will be beaten by other, more capable engines.
Daniel José - http://www.andscacs.com
Re: Scaling of engines from FGRL rating list.
Elo is determined by the performance against all opponents. Kai has shown the scaling effect of a large group of strong programs.
But if you are only interested in head to head scaling between two programs, then run the games. You have not proven your point using runs of 150-300 games, since the error margin swamps the results. To get you started, I posted direct SF vs K (a development version before K 10.4) here:
http://www.talkchess.com/forum/viewtopi ... ht=#711223
The scaling is quite clear from the results. But I would encourage you to use Komodo 10.4 and whatever version of Stockfish you want, and run similar games at varying time controls: starting at a time control of at least game in 3 minutes + 2 seconds, to reduce the higher draw rates of super-fast games, then repeating the runs at longer time controls spanning a couple of orders of magnitude. Run enough games to lower the error margins enough to establish the rating difference at each time control. Then publish the results.
BTW, stating something is true does not make it true. Even if you are the current President of the United States. If you want to convince people, you have to get the data.
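For anyone running such matches, the error margin mentioned above can be estimated from the W/D/L counts with the usual normal approximation; a sketch (the function name and example counts are illustrative):

```python
import math

def elo_and_error(wins, draws, losses, z=1.96):
    """Elo difference and ~95% margin implied by a head-to-head match."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n                          # score fraction
    elo = 400 * math.log10(s / (1 - s))
    var = (wins + 0.25 * draws) / n - s * s               # per-game score variance
    se = math.sqrt(var / n)                               # standard error of s
    margin = z * se * 400 / (math.log(10) * s * (1 - s))  # propagate to Elo
    return elo, margin

elo, margin = elo_and_error(120, 90, 90)   # a 55% score over 300 games
print(round(elo), round(margin))
```

Even a 55% score over 300 games comes with a margin of roughly +/-33 Elo, which is why 150-300-game head-to-head runs cannot settle small scaling differences.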
Re: Scaling of engines from FGRL rating list.
"You have not proven your point using runs of 150-300 games."

Um... what point exactly is that? I didn't try to prove anything, other than to say that the "evidence" presented in support of the conclusion was insufficient, which it is, and that the conclusion itself is therefore dubious, which it is.

"The scaling is quite clear from the results."

Yes, Komodo clearly scales poorly with less time.

"If you want to convince people, you have to get the data."

Exactly my point. It seems we agree.