Laskos wrote: The excellent FGRL rating list (http://www.fastgm.de/index.html) contains two Top 10 rating lists, for 10' + 6'' and 60' + 15'' TC, with identical engines on one core. We can make direct comparisons of engine performance.
1/
10' + 6''
Code:
10' + 6''
Ordo v1.0.9.2: 3000
Engine : Elo Diff Error Points (%) W D L D(%) CFS W/L
------------------------------------------------------------------------------------------------------ ------
1 Stockfish 8 : 3151 0 9 1916.0 70.96 1209 1414 77 52.37 89 15.70
2 Komodo 10.4 : 3143 -8 9 1889.0 69.96 1224 1330 146 49.26 63 8.38
3 Houdini 5.01 : 3141 -10 8 1882.0 69.70 1193 1378 129 51.04 100 9.25
4 Deep Shredder 13 : 3009 -142 8 1390.0 51.48 630 1520 550 56.30 100 1.145
5 Fire 5 : 2983 -168 8 1289.0 47.74 542 1494 664 55.33 100 0.816
6 Fizbo 1.9 : 2957 -194 8 1186.0 43.93 476 1420 804 52.59 100 0.592
7 Gull 3 : 2941 -210 8 1125.0 41.67 399 1452 849 53.78 100 0.470
8 Andscacs 0.89 : 2901 -250 8 975.5 36.13 330 1291 1079 47.81 98 0.306
9 Fritz 15 : 2889 -262 8 930.0 34.44 282 1296 1122 48.00 72 0.251
10 Chiron 4 : 2885 -266 8 917.5 33.98 271 1293 1136 47.89 --- 0.239
White advantage = 40.58 +/- 2.07
Draw rate (equal opponents) = 63.46 % +/- 0.53
2/
60' + 15''
Code:
60' + 15''
Ordo v1.2.6: 3000
Engine : Elo Diff Error Points (%) W D L D(%) CFS W/L
---------------------------------------------------------------------------------------------------- ------
1 Stockfish 8 : 3146 0 12 950.5 70.41 587 727 36 53.85 51 16.31
2 Komodo 10.4 : 3146 0 12 950.0 70.37 615 670 65 49.63 100 9.46
3 Houdini 5.01 : 3119 -27 11 903.5 66.93 516 775 59 57.41 100 8.74
4 Deep Shredder 13 : 3015 -131 11 706.5 52.33 304 805 241 59.63 99 1.261
5 Fire 5 : 2997 -149 10 670.5 49.67 287 767 296 56.81 100 0.970
6 Fizbo 1.9 : 2949 -197 11 577.5 42.78 208 739 403 54.74 83 0.516
7 Gull 3 : 2941 -205 11 562.5 41.67 172 781 397 57.85 97 0.433
8 Andscacs 0.89 : 2926 -220 11 533.0 39.48 176 714 460 52.89 100 0.383
9 Chiron 4 : 2885 -261 11 457.0 33.85 126 662 562 49.04 88 0.224
10 Fritz 15 : 2875 -271 11 439.0 32.52 106 666 578 49.33 --- 0.183
White advantage = 39.23 +/- 2.84
Draw rate (equal opponents) = 66.78 % +/- 0.74
Elo is not an adequate parametrization of the scaling. Ratings at longer time controls are subject to Elo compression due to the increasing draw rate, so a weaker engine might appear to approach a stronger one Elo-wise (to relatively gain strength) merely because of the larger number of draws, without any change in relative strength. A quantity more closely related to relative strength is the Win/Loss ratio of each engine in the list. Here I post the list of the scaling of the engines, measured as the change in their Win/Loss ratios from the blitz TC to the long TC, together with 100*log10 of that value so that the ratings are additive.
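To see the compression effect numerically, here is a minimal sketch in Python, assuming the standard logistic Elo model; the W/L ratio of 16 and the two draw rates are illustrative (loosely matching Stockfish 8 vs. the field in the tables above), not taken from the Ordo output:

Code:
from math import log10

def elo_diff(score):
    # Elo difference implied by an expected score under the standard
    # logistic model: D = 400 * log10(s / (1 - s)).
    return 400.0 * log10(score / (1.0 - score))

# Hold the Win/Loss ratio fixed at 16 and vary only the draw rate:
# the implied Elo gap shrinks even though the relative strength,
# as measured by W/L, is unchanged.
WL_RATIO = 16.0
for draw_rate in (0.52, 0.65):
    win = (1.0 - draw_rate) * WL_RATIO / (1.0 + WL_RATIO)
    score = win + draw_rate / 2.0
    print(f"draw rate {draw_rate:.0%}: score {score:.3f} -> Elo gap {elo_diff(score):+.0f}")

With the same W/L ratio of 16, the implied gap drops from about +157 Elo at a 52% draw rate to about +111 Elo at 65%.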
Scaling to Long Time Control on one core:
Code:
Engine Scaling = (W2*L1)/(W1*L2) 100*log10(Scaling)
------------------------------------------------------------------------------------
1 Andscacs 0.89 : 1.252 9.76
2 Fire 5 : 1.189 7.52
3 Komodo 10.4 : 1.129 5.27
4 Deep Shredder 13 : 1.101 4.18
5 Stockfish 8 : 1.039 1.66
6 Houdini 5.01 : 0.945 -2.46
7 Chiron 4 : 0.937 -2.83
8 Gull 3 : 0.921 -3.57
9 Fizbo 1.9 : 0.872 -5.95
10 Fritz 15 : 0.729 -13.73
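For anyone who wants to reproduce the last two columns, a minimal Python sketch computing the metric directly from the W/L counts in the two Ordo tables above (index 1 = blitz 10' + 6'', index 2 = long 60' + 15''; only a subset of the ten engines is entered):

Code:
from math import log10

# (wins, losses) at the blitz TC and at the long TC, copied from
# the two Ordo tables above; a subset of the engines for brevity.
results = {
    "Andscacs 0.89": ((330, 1079), (176, 460)),
    "Fire 5":        ((542,  664), (287, 296)),
    "Komodo 10.4":   ((1224, 146), (615,  65)),
    "Stockfish 8":   ((1209,  77), (587,  36)),
    "Fritz 15":      ((282, 1122), (106, 578)),
}

for name, ((w1, l1), (w2, l2)) in results.items():
    scaling = (w2 * l1) / (w1 * l2)  # Scaling = (W2*L1)/(W1*L2)
    print(f"{name:<16} {scaling:6.3f} {100.0 * log10(scaling):+7.2f}")

A value above 1 (positive log) means the engine wins relatively more, per loss, at the long TC than at blitz.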
Thanks Kai, but, no matter how enthusiastic many posters in this thread are, I very much suspect that only 50 to 70% of the above data is entirely valid, the rest being due to unaccounted-for factors (what people generally like to call random noise).
Any guess as to how a second comparison, using the same list but with TCs of 60'' and 10' per game, would fare?
My guess is that 50-70% of the conclusions drawn above would be valid, probably still closer to 70%.
As far as I can tell, scaling across different TCs is not linear for almost any engine out there. For example, SF scales worse going from the TC tested at the framework, 60'', to blitz TC, 3-5 minutes; then it reaches its peak performance at about 10 to 30 minutes per game; then its performance drops back again at 1 hour; and then, guess what, it is unexpectedly boosted again at very long TCs, for example the TCEC one.
I guess pretty much the same holds for many other engines around.
There are simply too many unaccounted-for factors.
As I have no time to open another reply here, I would like to ask why Dann considers search to play a bigger role in scaling than eval. For me, it is difficult to draw any definitive conclusions, as search and eval in modern-day engines are simply inseparable, but, if anything, I have absolutely no doubt that evaluation parameters are more likely to scale worse/better than search ones.
Does anyone have precise data showing that null-move pruning, for example, enhances scaling at longer TCs compared to other search routines, or to eval?