Another attempt at comparing Evals ELO-wise


Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Another attempt at comparing Evals ELO-wise

Post by Lyudmil Tsvetkov »

Thanks, Kai.

I am fully certain the SF eval is better than Shredder's.

As Mark rightly noted elsewhere, it should theoretically be impossible to separate eval from search in any meaningful way.

I guess the only way to reliably measure the eval prowess of engines is to let them play at depth=1, picking the single best move statically, without any search.
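
Something like the following, a minimal sketch in Python with the python-chess library and a toy material-only eval standing in for a real engine's evaluation (real evals are of course far richer): score every legal move statically and pick the best, with no search at all.

Code: Select all

import chess

# Toy material values standing in for a real evaluation (an assumption, not any engine's actual terms).
PIECE_VALUES = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
                chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

def static_eval(board):
    """Material balance from the side to move's point of view."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == board.turn else -value
    return score

def pick_move_statically(board):
    """Try each legal move, statically evaluate the result, keep the best."""
    best_move, best_score = None, None
    for move in board.legal_moves:
        board.push(move)
        score = -static_eval(board)  # eval is from the opponent's view after the move
        board.pop()
        if best_score is None or score > best_score:
            best_move, best_score = move, score
    return best_move

print(pick_move_statically(chess.Board()))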

As to what Daniel says, good eval and good tuning are closely synonymous; sometimes fewer well-tuned terms can outperform a bigger bunch of loosely tuned ones, though usually there is a correlation between the two.

My supposition is that better eval and better search go hand in hand, so SF, Komodo and Houdini (yes, this one too: not for nothing did Robert Houdart mention that most of the improvements in Houdini 5 had to do with eval) should have the best evals out there.

Of course, minor idiosyncrasies will always persist.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Another attempt at comparing Evals ELO-wise

Post by cdani »

lkaufman wrote:
Laskos wrote: What you say might be important for longer analysis, where often LTC and eval matter more. Do you have any idea whether a better eval could mean better scaling with time, Elo-wise? This list is strangely similar in certain aspects to the scaling of engines I derived from the FGRL rating list.

To me it is obvious that better eval correlates with better scaling, although it is not a perfect correlation. Tactics become less important with more time, while errors in eval don't generally go away with more time, although perhaps there is a difference between static and dynamic eval features in this respect. Better eval usually takes more time to compute, but that slowdown is probably fairly constant, so the Elo loss dissipates with increased depth, while the Elo gain from better eval may remain fairly constant or perhaps even grow.
Nicely explained.
lkaufman wrote: I'm a bit unclear on why you say that super-fast play measures eval. Is it so fast that search differences like LMR mostly vanish? What is the average search depth you get at these levels?
Kai, if you think it makes any sense, I can do a version of Stockfish and another of Andscacs that just picks the move with the best eval.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Another attempt at comparing Evals ELO-wise

Post by cdani »

cdani wrote: Kai, if you think it makes any sense, I can do a version of Stockfish and another of Andscacs that just picks the move with the best eval.
And maybe someone can do a version of Giraffe that does the same :-)
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Oops...

Post by Lyudmil Tsvetkov »

I got interested and, having some free time, played some quick-and-dirty matches in the Fritz GUI between SF 8 and Robbolito 0.085g3.

I do not know how reliable the Fritz GUI depth-limiting function is, but, believe it or not, I got the following extremely surprising results:

depth 0 (possibly this is what Fritz defines as static evaluation)
Robbolito - SF +0 -30 =0
depth 1
Robbolito - SF +15 -11 =4
depth 2
Robbolito - SF +22 -3 =5

As presumed, the SF static eval completely overwhelms Robbolito's.
Surprisingly, SF performs much worse at depth 1, and even worse at depth 2 (who would have thought?).

I guess there are simply a bunch of factors we do not take into account with these simplistic measurements (for example, SF will prune a lot at depth 1, and probably even more at depth 2; at higher depths that might be compensated by the search, but not at these extremely small depths).
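
For what it's worth, such a fixed-depth match can also be run outside any GUI. Here is a rough sketch using python-chess to drive two UCI engines (the binary paths are placeholders; and even here, "depth 1" still means different amounts of work for different engines):

Code: Select all

import chess
import chess.engine

def play_fixed_depth_game(white_path, black_path, depth):
    """One game in which both engines search to the same nominal depth."""
    white = chess.engine.SimpleEngine.popen_uci(white_path)
    black = chess.engine.SimpleEngine.popen_uci(black_path)
    board = chess.Board()
    try:
        while not board.is_game_over():
            engine = white if board.turn == chess.WHITE else black
            result = engine.play(board, chess.engine.Limit(depth=depth))
            board.push(result.move)
    finally:
        white.quit()
        black.quit()
    return board.result()  # "1-0", "0-1" or "1/2-1/2"

# Placeholder paths; point them at your own binaries.
print(play_fixed_depth_game("./stockfish", "./robbolito", depth=1))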

Anyone surprised, or just me?
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Oops...

Post by Lyudmil Tsvetkov »

Some more results:

depth 5
Robbolito - SF +19 -0 =1
depth 10
Robbolito - SF +9 -5 =6

Obviously, fixed depth hurts SF enormously (and presumably other engines too), and it should in no way be used to measure anything related to eval.

With higher (but again fixed) depth, SF starts to improve somewhat, but even at depth 10 (who would have thought?) it is much inferior to basic Robbolito.

So, for me: forget fixed depth.

What I also observed is that play at depths up to 5 is completely absurd, and not very strong at depth 10 either.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Another attempt at comparing Evals ELO-wise

Post by cdani »

cdani wrote:
cdani wrote: Kai, if you think it makes any sense, I can do a version of Stockfish and another of Andscacs that just picks the move with the best eval.
And maybe someone can do a version of Giraffe that does the same :-)
If anyone does a compile like this, it should take the evaluation returned by the quiescence function, not the plain static eval, to avoid nonsense sacrifices that temporarily raise the evaluation.
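
A textbook quiescence sketch of that idea (not any particular engine's actual code), reusing the toy static_eval from the sketch earlier in the thread: stand pat on the static eval, then search captures only, so the picker does not fall for sacrifices that win material only temporarily.

Code: Select all

import chess

def qsearch(board, alpha, beta):
    """Textbook quiescence: stand pat on the static eval, then captures only."""
    stand_pat = static_eval(board)  # the toy eval from the earlier sketch (an assumption)
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)
    for move in board.legal_moves:
        if not board.is_capture(move):
            continue
        board.push(move)
        score = -qsearch(board, -beta, -alpha)
        board.pop()
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha

def pick_move_quiesced(board):
    """Pick the move with the best quiescence value instead of the raw static eval."""
    best_move, best_score = None, None
    for move in board.legal_moves:
        board.push(move)
        score = -qsearch(board, -100000, 100000)
        board.pop()
        if best_score is None or score > best_score:
            best_move, best_score = move, score
    return best_move
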
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Oops...

Post by Lyudmil Tsvetkov »

Another major surprise:

Komodo 10.1 - SF 8
depth 10 +8 -3 =9
Komodo - SF
depth 0 +0 -19 =1

If we have to go by these results, SF has a vastly superior eval to any engine around.
Komodo is superior at depth 10.

But I am already starting to doubt these results, as I see a match option for defining depth -1 and -2. What does depth -2 mean in the Fritz GUI?
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Oops...

Post by Lyudmil Tsvetkov »

Last results, to satisfy Daniel and the Andscacs fans:

Andscacs 0.86 - SF 8
depth 0
+0 -20 =0
Komodo - SF
depth 20
+5 -2 =13

So, Andscacs seems a long shot for best evaluator (if that is any measure to go by), while with more depth SF starts coming closer to Komodo, though Komodo is still on top.

Obviously, SF needs more depth to do well.
Or no depth at all.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Oops...

Post by Lyudmil Tsvetkov »

One more result:

Komodo 10.1 - Andscacs 0.86
depth 0
+20 -0 =0

So, Komodo beats Andscacs with a dry score at this depth, but manages just a single draw against SF. I guess I might be doing something wrong, but who knows?

I also tried a short match between SF and Komodo at depth 30.
SF reaches depth 30 about 5 times faster than Komodo, meaning it spends about 5 times less time.

Of course, no one wants one engine to play with a 5 times shorter TC.
I guess the same is true of a fixed number of nodes: engines go to different depths to reach those nodes, and they evaluate differently, more or less heavily. So some engines will be greatly favoured by a nodes-based measure of strength, while others will only suffer.

So I guess strength measurements in terms of fixed depth and fixed nodes are fully meaningless, and reliable tests should only be done at a time control.
Depth 0 is also fine for me for measuring eval, but who knows what else goes on inside engines, and where we ourselves go wrong?

First conclusion: engines are too complicated for anything certain to be concluded. That one I have reached. Maybe wiser people will draw more specific ones.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Another attempt at comparing Evals ELO-wise

Post by Laskos »

lkaufman wrote: I'm a bit unclear on why you say that super-fast play measures eval. Is it so fast that search differences like LMR mostly vanish? What is the average search depth you get at these levels?
To me it is obvious that getting rid of much of the search should give a better picture of the eval. But it has to be done objectively: fixed depth=1 is objected to as being defined differently for different engines, and fixed nodes likewise. What remains is the "timenodes" of Stockfish, but I observed that the node speed of engines varies greatly over the first thousands of nodes, being very different from the speed we see in the GUI, and I cannot measure it. The last option is a very small fixed time, which is objective and is hardly objected to (not even by Lyudmil).

Also, I had a hybrid engine, Sungorus eval with Andscacs search, and Sungorus itself. Sungorus is a UCI engine of about 1600 Elo. The only way to make them similar in strength was to use a 0.005s time control; other comparisons were incompatible.
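
In python-chess terms, the three per-move limits discussed here look like this (the engine path is a placeholder); only the wall-clock limit is defined the same way for every engine:

Code: Select all

import chess
import chess.engine

by_depth = chess.engine.Limit(depth=1)     # "depth 1" means different things per engine
by_nodes = chess.engine.Limit(nodes=1000)  # same node budget, very different depths/effort
by_time  = chess.engine.Limit(time=0.005)  # 5 ms per move: objective, if honored

engine = chess.engine.SimpleEngine.popen_uci("./stockfish")  # placeholder path
board = chess.Board()
print(engine.play(board, by_time).move)
engine.quit()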

The usual depth achieved by the top engines was about 5; Fruit got to 2 or so. I would have used a much shorter time control, but it's unsupported both by the engines (over- or under-stepping the time) and by the GUI.