Made In Heaven class Time Control Comparison

Vinvin · Post by **Vinvin** » Sun Dec 22, 2013 7:17 pm

Modern Times wrote:Brilliant work Aser, the graph is very enlightening.

Question is, at what point does Komodo level off...

+1

Milos · Post by **Milos** » Sun Dec 22, 2013 8:05 pm

Aser Huerga wrote:At Shaun Brewer's suggestion, I decided to run my games at different time controls to see how the top engines' strength changes as time increases. Here are the results:

Five i7-3930K CPUs 4.25 GHz
1 core for all engines
Ponder off
1024 Hash
3-4-5 EGTBs (when available) on SSDs

3'+1" Time Control

   # PLAYER          : RATING  ERROR   POINTS  PLAYED    (%)
   1 Houdini 4       :   32.1   13.3    340.0     600   56.7%
   2 Komodo TCEC     :  -14.4   13.4    282.0     600   47.0%
   3 Stockfish DD    :  -17.6   13.8    278.0     600   46.3%


9'+3" Time Control

   # PLAYER          : RATING  ERROR   POINTS  PLAYED    (%)
   1 Stockfish DD    :   16.6   14.1    320.5     600   53.4%
   2 Houdini 4       :    8.1   13.5    310.0     600   51.7%
   3 Komodo TCEC     :  -24.7   13.4    269.5     600   44.9%

27'+9" Time Control

   # PLAYER          : RATING  ERROR   POINTS  PLAYED    (%)
   1 Stockfish DD    :   22.6   14.1    328.0     600   54.7%
   2 Houdini 4       :    5.7   13.4    307.0     600   51.2%
   3 Komodo TCEC     :  -28.3   13.4    265.0     600   44.2%

54'+18" Time Control

   # PLAYER          : RATING  ERROR   POINTS  PLAYED    (%)
   1 Stockfish DD    :   11.4   14.2    314.0     600   52.3%
   2 Houdini 4       :    0.4   13.7    300.5     600   50.1%
   3 Komodo TCEC     :  -11.8   13.6    285.5     600   47.6%

90'+30" Time Control

   # PLAYER          : RATING  ERROR   POINTS  PLAYED    (%)
   1 Stockfish DD    :   10.5   12.9    313.0     600   52.2%
   2 Komodo TCEC     :    0.8   13.2    301.0     600   50.2%
   3 Houdini 4       :  -11.3   13.1    286.0     600   47.7%

I want to thank Adam Hair for his help in presenting the graph and results.

( All the games can be downloaded here: TTC_All_Games )

One small problem, the 2SD Elo margins are 30Elo so all the results you presented are pretty much meaningless.

Laskos · Post by **Laskos** » Sun Dec 22, 2013 8:16 pm

Milos wrote:
One small problem, the 2SD Elo margins are 30Elo so all the results you presented are pretty much meaningless.

2SD are here ~20 Elo points, 1SD 10 Elo points and 84% confidence on one tail. Besides that, the points can be grouped 2 by 2, with 7 Elo points 1 SD. So the curves are fairly relevant.
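The figures quoted above can be roughly checked with a short calculation. This is only a sketch under stated assumptions: the draw rate of 50% is a guess (the thread does not give it), the score is taken as about 50%, and the function name `elo_se` is made up for illustration. It is not how EloStat, BayesElo or Ordo compute their errors.

```python
import math
from statistics import NormalDist

def elo_se(games, score=0.5, draw_rate=0.5):
    """Approximate 1-SD error, in Elo, of a rating based on `games` games.

    `draw_rate` is an ASSUMPTION (not given in the thread); each game
    result is treated as an independent draw from {0, 0.5, 1}.
    """
    win_rate = score - draw_rate / 2
    # Per-game variance: E[X^2] - (E[X])^2
    variance = win_rate * 1.0 + draw_rate * 0.25 - score ** 2
    se_score = math.sqrt(variance / games)
    # Slope of the logistic Elo curve, d(Elo)/d(score), at this score
    slope = 400 / (math.log(10) * score * (1 - score))
    return slope * se_score

one_sd_600 = elo_se(600)     # one time control, 600 games per engine
one_sd_1200 = elo_se(1200)   # two time controls grouped together
one_tail = NormalDist().cdf(1.0)  # one-tailed confidence at 1 SD

print(f"1SD at 600 games:  {one_sd_600:.1f} Elo")
print(f"1SD at 1200 games: {one_sd_1200:.1f} Elo")
print(f"One-tailed confidence at 1SD: {one_tail:.1%}")
```

With these assumptions the 1SD comes out near 10 Elo for 600 games and near 7 Elo for 1200 games, and 1SD corresponds to about 84% on one tail. A lower assumed draw rate would give larger errors.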

michiguel · Post by **michiguel** » Sun Dec 22, 2013 8:57 pm

Laskos wrote:
Milos wrote:
One small problem, the 2SD Elo margins are 30Elo so all the results you presented are pretty much meaningless.
2SD are here ~20 Elo points, 1SD 10 Elo points and 84% confidence on one tail. Besides that, the points can be grouped 2 by 2, with 7 Elo points 1 SD. So the curves are fairly relevant.

It is a bit more straightforward to analyze if the ratings are computed not "against the average" (those errors are misleading, particularly for a low number of participants) but against Houdini as the reference.

For instance
ordo -q -p TCC31.pgn -a0 -A"Houdini 4" -W -s10000 -F90

(quiet, TCC31.pgn as input, center to 0, reference is Houdini, calculate white advantage, calculate errors with 10k simulations, confidence 90%)

Will give

   # PLAYER          : RATING  ERROR   POINTS  PLAYED    (%)
   1 Houdini 4       :    0.0   ----    340.0     600   56.7%
   2 Komodo TCEC     :  -46.5   19.5    282.0     600   47.0%
   3 Stockfish DD    :  -49.7   19.5    278.0     600   46.3%

   # PLAYER          : RATING  ERROR   POINTS  PLAYED    (%)
   1 Stockfish DD    :    8.5   19.6    320.5     600   53.4%
   2 Houdini 4       :    0.0   ----    310.0     600   51.7%
   3 Komodo TCEC     :  -32.8   19.6    269.5     600   44.9%

   # PLAYER          : RATING  ERROR   POINTS  PLAYED    (%)
   1 Stockfish DD    :   17.0   19.6    328.0     600   54.7%
   2 Houdini 4       :    0.0   ----    307.0     600   51.2%
   3 Komodo TCEC     :  -33.9   19.7    265.0     600   44.2%

   # PLAYER          : RATING  ERROR   POINTS  PLAYED    (%)
   1 Stockfish DD    :   11.0   19.7    314.0     600   52.3%
   2 Houdini 4       :    0.0   ----    300.5     600   50.1%
   3 Komodo TCEC     :  -12.2   19.6    285.5     600   47.6%

   # PLAYER          : RATING  ERROR   POINTS  PLAYED    (%)
   1 Stockfish DD    :   21.7   19.4    313.0     600   52.2%
   2 Komodo TCEC     :   12.1   19.8    301.0     600   50.2%
   3 Houdini 4       :    0.0   ----    286.0     600   47.7%

There is no doubt that SF crosses Houdini comparing both extremes with 90% confidence. But you are right Kai, you can combine the intermediate TC data and the confidence will increase. The trend is not meaningless.

Miguel
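The idea of combining the time controls can be sketched as an inverse-variance weighted average of the Stockfish DD minus Houdini 4 differences from the four tables at 9'+3" and longer. The assumptions here are mine, not the poster's: the Ordo errors are treated as two-sided 90% intervals of a normal distribution, the four runs are treated as independent, and the helper name `pool` is hypothetical.

```python
import math
from statistics import NormalDist

# (Stockfish DD rating relative to Houdini 4, Ordo 90% error),
# copied from the tables for 9'+3", 27'+9", 54'+18" and 90'+30".
diffs = [(8.5, 19.6), (17.0, 19.6), (11.0, 19.7), (21.7, 19.4)]

# ASSUMPTION: a two-sided 90% interval spans about 1.645 SD (normal errors).
z90 = NormalDist().inv_cdf(0.95)

def pool(diffs, z=z90):
    """Inverse-variance weighted mean and its SD for independent estimates."""
    weights = [(z / err) ** 2 for _, err in diffs]  # 1 / sd^2 for each run
    mean = sum(w * d for w, (d, _) in zip(weights, diffs)) / sum(weights)
    sd = 1 / math.sqrt(sum(weights))
    return mean, sd

mean, sd = pool(diffs)
# One-sided confidence that the pooled difference is above zero
confidence = NormalDist().cdf(mean / sd)

print(f"Pooled SF - Houdini: {mean:.1f} Elo, 1SD {sd:.1f} Elo")
print(f"Confidence that SF is ahead: {confidence:.1%}")
```

Under these assumptions the pooled error is roughly half that of a single time control, which is why the combined confidence ends up well above the 90% of any single table.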

Milos · Post by **Milos** » Sun Dec 22, 2013 8:59 pm

Laskos wrote:
Milos wrote:
One small problem, the 2SD Elo margins are 30Elo so all the results you presented are pretty much meaningless.
2SD are here ~20 Elo points, 1SD 10 Elo points and 84% confidence on one tail. Besides that, the points can be grouped 2 by 2, with 7 Elo points 1 SD. So the curves are fairly relevant.

Well, you are wrong as usual. Even if you group it 2 by 2, 1SD is 10Elo, and quoting smaller numbers doesn't help since the comparison is of 3 engines, not 2, so 2SD is simply 30Elo, or if it suits you, 28Elo.

Laskos · Post by **Laskos** » Sun Dec 22, 2013 9:36 pm

Milos wrote:
Laskos wrote:
Milos wrote:
One small problem, the 2SD Elo margins are 30Elo so all the results you presented are pretty much meaningless.
2SD are here ~20 Elo points, 1SD 10 Elo points and 84% confidence on one tail. Besides that, the points can be grouped 2 by 2, with 7 Elo points 1 SD. So the curves are fairly relevant.
Well, you are wrong as usual. Even if you group it 2 by 2, 1SD is 10Elo, and quoting smaller numbers doesn't help since the comparison is of 3 engines, not 2, so 2SD is simply 30Elo, or if it suits you, 28Elo.

As is so usual with you, you cover your crass mistake with misleading statements. You were talking about the errors in the engines' ratings as being 30 Elo points 2SD against the average. EloStat simply gives my 20 points 2SD against the average, and BayesElo 15 points 2SD. Miguel fixed Houdini's rating and got 20 points, almost 2SD, in simulations, which can be used directly for comparison. So, Milos, shut up when you are wrong (most of the time), and don't mislead people here with your incorrect statements resulting in "plot is meaningless".

ouachita · Post by **ouachita** » Sun Dec 22, 2013 10:39 pm

The specific testing methods and data points can and will be debated, but the basic trend lines should make sense to anyone who has been paying even casual attention to recent match and testing results.

Milos · Post by **Milos** » Mon Dec 23, 2013 1:39 am

Laskos wrote:Miguel fixed Houdini's rating and got 20 points, almost 2SD, in simulations, which can be used directly for comparison. So, Milos, shut up when you are wrong (most of the time), and don't mislead people here with your incorrect statements resulting in "plot is meaningless".

Ordo is crap as usual and 90% confidence rate is nowhere near 2SD. Maybe you should check basic statistics before you post stupid stuff.

And if anyone is misleading, it is a couple of you Komodo fans here...

michiguel · Post by **michiguel** » Mon Dec 23, 2013 2:23 am

Milos wrote:
Laskos wrote:Miguel fixed Houdini's rating and got 20 points, almost 2SD, in simulations, which can be used directly for comparison. So, Milos, shut up when you are wrong (most of the time), and don't mislead people here with your incorrect statements resulting in "plot is meaningless".
Ordo is crap as usual and 90% confidence rate is nowhere near 2SD. Maybe you should check basic statistics before you post stupid stuff.

90% confidence is 90% confidence, regardless of how many SD or even what type of distribution we talk about. The point is, 90% is not meaningless.
Again, this is just considering the extreme. But there are three other conditions above 9'+3'' in which SF performs better than Houdini (2400 games total). That will easily take it above 95% confidence.
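For readers who want to relate a confidence level to a number of standard deviations, here is a minimal sketch. It assumes normally distributed errors and a two-sided interval, which is an assumption on my part and exactly the kind of distributional choice the statement above says is not required; the function names are illustrative only.

```python
from statistics import NormalDist

N = NormalDist()  # standard normal distribution

def sd_multiple(confidence):
    """SDs on each side of the mean covered by a two-sided interval."""
    return N.inv_cdf(0.5 + confidence / 2)

def two_sided_confidence(k):
    """Two-sided confidence covered by +/- k standard deviations."""
    return 2 * N.cdf(k) - 1

# Ordo error for Komodo/Stockfish vs. Houdini at 3'+1", reported at 90%
err90 = 19.5
one_sd = err90 / sd_multiple(0.90)  # ASSUMES normal errors

print(f"90% two-sided is +/- {sd_multiple(0.90):.3f} SD")
print(f"+/- 2 SD is {two_sided_confidence(2):.2%} two-sided")
print(f"Implied 1SD: {one_sd:.1f} Elo, 2SD: {2 * one_sd:.1f} Elo")
```

Under the normal assumption a 90% interval is about 1.645 SD wide on each side, so a 19.5 Elo error at 90% corresponds to roughly 12 Elo for 1SD.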

"Ordo is crap" = Why do you keep trolling every time you see Ordo in a post? The fact is that it is irrelevant how you calculate this. You can do it with BayesELO, calculate LOS, and get similar results (LOS, not necessarily the errors in BE, which could be misleading if you do not know what options you choose). If you remove the draws included in BE the numbers are identical to Ordo.

Milos wrote:And if anyone is misleading, it is a couple of you Komodo fans here...

We are talking about SF. The fact is the statement that SF scales better than Houdini is not a joke, which makes us believe that what we saw in TCEC was not a fluke.

Miguel

PCM72 · Post by **PCM72** » Fri Dec 27, 2013 8:08 pm

Hi.
What book/testset has been used?
How important is the quality of neutral starting positions, in your opinion?
