Another attempt at comparing Evals ELO-wise

Posted: Mon May 22, 2017 1:06 pm
by Laskos
I had a hunch to test top engines not at a low fixed node count ("nodes" are counted differently from engine to engine), nor at a low fixed depth (which is again engine-relative), but at a fixed, very short time per move. The problem is that in Cutechess-Cli on Windows, engines overstep such a short allotted time by several milliseconds (for example, with ST=0.001 in Cutechess-Cli). I inferred the overstep in ms from how each engine behaves when the time control is doubled. Komodo 10.4 clearly sees the doubling already at 0.002 s vs 0.001 s, while Stockfish sees it only at 0.008 s vs 0.004 s or higher. Thus the latency of Komodo is about 0.000 s, and that of Stockfish about 0.004 s.

Here is the table of latencies in milliseconds (Windows 8.1 and Cutechess-Cli):

Komodo 10.4:      0
Stockfish 8:      4 
Houdini 5:        4 
Deep Shredder 13: 0
Andscacs 0.91:    0
Fruit 2.1:       12
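
One way to read the latency figures above, as a rough sketch rather than the method actually used to build the table: if an engine oversteps by a fixed latency, its real thinking time is the nominal time plus that latency, so doubling a very short nominal time changes the real time by much less than a factor of two. The function name and the fixed-latency model are assumptions made for illustration.

```python
def real_time_ratio(nominal_ms, latency_ms):
    """Ratio of real thinking times when the nominal time per move is doubled.

    Assumes (hypothetically) that real time = nominal time + a fixed latency.
    """
    return (2 * nominal_ms + latency_ms) / (nominal_ms + latency_ms)


# Latencies in ms, taken from the table above.
for engine, latency in [("Komodo 10.4", 0), ("Stockfish 8", 4)]:
    for nominal in (1, 4):
        ratio = real_time_ratio(nominal, latency)
        print(f"{engine}: {nominal} ms -> {2 * nominal} ms nominal, real time x{ratio:.2f}")
```

Under this model Komodo gets a full doubling even at 1 ms, while Stockfish at 1 ms nominal gets only 1.2 times more real time, which would explain why it shows the doubling only once the nominal time approaches its latency.
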
Doubling at these short times is worth about 250 ELO points, so with a time control of 0.005 s/move I can infer ratings adjusted for the effective time used. Unadjusted (games at 5 ms/move):

   # PLAYER               : TIME   RATING  ERROR    POINTS  PLAYED     (%)   CFS(next)

   1 Stockfish 8          : 9 ms   1664.8   13.5    1470.0    2000    73.5     100    
   2 Houdini 5            : 9 ms   1607.5   12.3    1310.5    2000    65.5     100    
   3 Deep Shredder 13     : 5 ms   1517.0   12.5    1040.0    2000    52.0      70    
   4 Komodo 10.4          : 5 ms   1512.0   12.2    1025.0    2000    51.3     100    
   5 Andscacs 0.91        : 5 ms   1446.3   12.7     827.0    2000    41.4     100    
   6 Fruit 2.1            :17 ms   1252.5   15.2     327.5    2000    16.4     ---    
Adjusted for time used:

EVAL RATING:

   # PLAYER                 : RATING  

   1 Deep Shredder 13       :  687    
   2 Komodo 10.4            :  682
   3 Stockfish 8            :  625      
   4 Andscacs 0.91          :  616  
   5 Houdini 5              :  567   
   6 Fruit 2.1              :    0     
It seems Deep Shredder and Komodo have the best eval among the top engines, while Houdini has the weakest. Also, the progress since Fruit 2.1's basic eval is remarkable. Andscacs seems on par with Stockfish, and only search is hampering it. The same goes for Shredder.
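
The adjustment described in the post can be sketched as follows. This is a reconstruction under stated assumptions, not the exact calculation behind the table: it assumes a constant 250 ELO per doubling, an effective time equal to the 5 ms nominal time plus the measured latency, and Fruit 2.1 as the zero point. The function names are made up for the example. It reproduces the posted ordering, but the numbers come out close to, not identical with, the posted values, since the exact formula is not given.

```python
import math

ELO_PER_DOUBLING = 250.0   # "about 250 ELO points" per doubling, from the post
NOMINAL_MS = 5.0           # games were played at 5 ms/move

# (unadjusted rating, latency in ms), copied from the two tables in the post.
RESULTS = {
    "Stockfish 8": (1664.8, 4),
    "Houdini 5": (1607.5, 4),
    "Deep Shredder 13": (1517.0, 0),
    "Komodo 10.4": (1512.0, 0),
    "Andscacs 0.91": (1446.3, 0),
    "Fruit 2.1": (1252.5, 12),
}


def adjusted_rating(rating, latency_ms):
    """Subtract the ELO an engine gained by thinking longer than the nominal time."""
    effective_ms = NOMINAL_MS + latency_ms
    return rating - ELO_PER_DOUBLING * math.log2(effective_ms / NOMINAL_MS)


def eval_ratings(results, anchor="Fruit 2.1"):
    """Return {engine: adjusted rating relative to the anchor}, best first."""
    adjusted = {name: adjusted_rating(r, lat) for name, (r, lat) in results.items()}
    base = adjusted[anchor]
    ranked = sorted(adjusted.items(), key=lambda item: item[1], reverse=True)
    return {name: round(value - base) for name, value in ranked}


for name, rating in eval_ratings(RESULTS).items():
    print(f"{name:<18}: {rating:4d}")
```
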

Re: Another attempt at comparing Evals ELO-wise

Posted: Mon May 22, 2017 3:03 pm
by cdani
Nice test. Thanks!

I use QueryPerformanceCounter to obtain the current time. This gives very fine-grained time. I also adapt how often I query the time to the thinking time, so I query more often when the allotted time is much lower. This allows very exact control of the time. So I don't use another thread for this, as many other engines do.
Laskos wrote:Andscacs seems on par with Stockfish, and only search is hampering it.
Andscacs eval is more complete than the Stockfish one, but is clearly less well tuned. So it's compensating for precision with quantity. It should also be tuned for long time control. I think I can grow it clearly more, and overcoming Stockfish's relatively simple eval is not very complicated.

About search, well, it is not easy at all :-)

Also, Andscacs is probably losing 30-50 elo due to speed alone, as it was written as my first serious engine, so many things in it are less than optimally written, even though I have rewritten most parts of it several times.

Houdini's eval is surprising. Maybe it is too simple, who knows.

Re: Another attempt at comparing Evals ELO-wise

Posted: Mon May 22, 2017 6:00 pm
by sandermvdb
cdani wrote:Nice test. Thanks!

I use QueryPerformanceCounter to obtain the current time. This gives very fine-grained time. I also adapt how often I query the time to the thinking time, so I query more often when the allotted time is much lower. This allows very exact control of the time. So I don't use another thread for this, as many other engines do.
Maybe a stupid question, but why is it important to have fine-grained time? Is this the time reported in the UCI output (time) and used in some way by the GUI (or Cutechess-Cli)?

Re: Another attempt at comparing Evals ELO-wise

Posted: Mon May 22, 2017 6:53 pm
by cdani
sandermvdb wrote: Maybe a stupid question, but why is it important to have fine-grained time? Is this the time reported in the UCI output (time) and used in some way by the GUI (or Cutechess-Cli)?
I tried other ways, but this one allowed running at faster time controls than the others without losing on time. As you can see, this system should have no overhead.

Of course it does not work on Linux, for which I used chrono::steady_clock.
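
The single-threaded scheme described above (read a high-resolution clock from inside the search, and read it more often when the allotted time is short) can be sketched roughly as below. This is not Andscacs code: it is a Python illustration in which `time.perf_counter` stands in for QueryPerformanceCounter or chrono::steady_clock, and the function names, the assumed node rate, and the number of checks per move are all hypothetical.

```python
import time


def check_interval(allotted_s, nodes_per_second=1_000_000,
                   checks_per_move=100, min_nodes=1, max_nodes=4096):
    """Nodes to search between clock reads; smaller when the allotted time is short.

    nodes_per_second and checks_per_move are illustrative guesses, not measured values.
    """
    nodes = int(allotted_s * nodes_per_second / checks_per_move)
    return max(min_nodes, min(max_nodes, nodes))


def search_with_deadline(allotted_s, visit_node, clock=time.perf_counter):
    """Call visit_node() repeatedly until the allotted time is used up.

    The clock is polled from the searching thread itself, so no timer thread is needed.
    Returns (nodes searched, elapsed seconds).
    """
    start = clock()
    interval = check_interval(allotted_s)
    nodes = 0
    while True:
        visit_node()
        nodes += 1
        # Only read the clock every `interval` nodes to keep the overhead low.
        if nodes % interval == 0 and clock() - start >= allotted_s:
            return nodes, clock() - start


nodes, elapsed = search_with_deadline(0.005, lambda: None)
print(f"searched {nodes} dummy nodes in {elapsed * 1000:.3f} ms")
```

The trade-off is between clock-read overhead and overstep: a smaller interval stops closer to the deadline but spends more time reading the clock.
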

Re: Another attempt at comparing Evals ELO-wise

Posted: Mon May 22, 2017 9:54 pm
by Laskos
SzG wrote:Please! Elo!
Well, I understand that Arpad Elo is a Hungarian, but "ELO" has long since transcended the proper name "Elo" as a unit of relative strength. Also, ELO as used in computer chess is not FIDE Elo and is not what Arpad Elo did (he used a normal distribution, for one). I use ELO because it suits me better when others are reading my posts quickly, as happens on forums.

Re: Another attempt at comparing Evals ELO-wise

Posted: Mon May 22, 2017 10:31 pm
by Sven
Laskos wrote:
SzG wrote:Please! Elo!
Well, I understand that Arpad Elo is a Hungarian, but "ELO" has long since transcended the proper name "Elo" as a unit of relative strength.
But everyone writes Watt, Kelvin or Newton and not "WATT", "KELVIN" or "NEWTON". So why "ELO"? Many people using it wrongly doesn't make it right ...
Laskos wrote:Also, ELO as used in computer chess is not FIDE Elo and is not what Arpad Elo did (he used a normal distribution, for one).
That could be a reason to use a different name than Elo but not to write "ELO" instead of Elo.

Re: Another attempt at comparing Evals ELO-wise

Posted: Tue May 23, 2017 12:00 am
by Laskos
Sven Schüle wrote:
Laskos wrote:
SzG wrote:Please! Elo!
Well, I understand that Arpad Elo is a Hungarian, but "ELO" has long since transcended the proper name "Elo" as a unit of relative strength.
But everyone writes Watt, Kelvin or Newton and not "WATT", "KELVIN" or "NEWTON". So why "ELO"? Many people using it wrongly doesn't make it right ...
Laskos wrote:Also, ELO as used in computer chess is not FIDE Elo and is not what Arpad Elo did (he used a normal distribution, for one).
That could be a reason to use a different name than Elo but not to write "ELO" instead of Elo.
An important issue worth mentioning.

Re: Another attempt at comparing Evals ELO-wise

Posted: Tue May 23, 2017 12:24 am
by Laskos
cdani wrote:Nice test. Thanks!

I use QueryPerformanceCounter to obtain the current time. This gives very fine-grained time. I also adapt how often I query the time to the thinking time, so I query more often when the allotted time is much lower. This allows very exact control of the time. So I don't use another thread for this, as many other engines do.
Laskos wrote:Andscacs seems on par with Stockfish, and only search is hampering it.
Andscacs eval is more complete than the Stockfish one, but is clearly less well tuned. So it's compensating for precision with quantity. It should also be tuned for long time control. I think I can grow it clearly more, and overcoming Stockfish's relatively simple eval is not very complicated.

About search, well, it is not easy at all :-)

Also, Andscacs is probably losing 30-50 elo due to speed alone, as it was written as my first serious engine, so many things in it are less than optimally written, even though I have rewritten most parts of it several times.

Houdini's eval is surprising. Maybe it is too simple, who knows.
What you say might be important for longer analysis, where often LTC and eval are more important. Do you have any idea whether better eval could mean better scaling with time ELO-wise? This list is strangely similar in certain aspects to the scaling of engines I derived from the FGRL rating list.

Re: Another attempt at comparing Evals ELO-wise

Posted: Tue May 23, 2017 1:13 am
by lkaufman
Laskos wrote:
cdani wrote:Nice test. Thanks!

I use QueryPerformanceCounter to obtain the current time. This gives very fine-grained time. I also adapt how often I query the time to the thinking time, so I query more often when the allotted time is much lower. This allows very exact control of the time. So I don't use another thread for this, as many other engines do.
Laskos wrote:Andscacs seems on par with Stockfish, and only search is hampering it.
Andscacs eval is more complete than the Stockfish one, but is clearly less well tuned. So it's compensating for precision with quantity. It should also be tuned for long time control. I think I can grow it clearly more, and overcoming Stockfish's relatively simple eval is not very complicated.

About search, well, it is not easy at all :-)

Also, Andscacs is probably losing 30-50 elo due to speed alone, as it was written as my first serious engine, so many things in it are less than optimally written, even though I have rewritten most parts of it several times.

Houdini's eval is surprising. Maybe it is too simple, who knows.
What you say might be important for longer analysis, where often LTC and eval are more important. Do you have any idea whether better eval could mean better scaling with time ELO-wise? This list is strangely similar in certain aspects to the scaling of engines I derived from the FGRL rating list.

To me it is obvious that better eval correlates with better scaling, although it is not a perfect correlation. Tactics become less important with more time, while errors in eval don't generally go away with more time, although perhaps there is a difference between static and dynamic eval features in this respect. Better eval usually takes more time, but the slowdown is probably fairly constant so the elo loss dissipates with increased depth while the elo gain from better eval may remain fairly constant or perhaps even grow. This could be tested by, for example, making a version of Stockfish with the basic material values distorted, perhaps by reducing all of them by a constant like 50 "SF" points, while giving the distorted version double time at various time limits. Or perhaps cut the total of pawn structure in half with double time.
I'm a bit unclear on why you say that super-fast play measures eval. Is it so fast that search differences like LMR mostly vanish? What is the average search depth you get at these levels?

Re: Another attempt at comparing Evals ELO-wise

Posted: Tue May 23, 2017 1:56 am
by elpapa
Laskos wrote:
SzG wrote:Please! Elo!
Well, I understand that Arpad Elo is a Hungarian
Let's just be glad his last name wasn't Oberknezsevics.