Another attempt at comparing Evals ELO-wise

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Another attempt at comparing Evals ELO-wise

Post by Laskos »

cdani wrote:
Kai, if you think it makes any sense, I can do a version of Stockfish and another of Andscacs that just plays by picking the move that has the best eval.
Yes, sure, I am very interested. I would have gotten rid of more of the search myself if I had known how. Maybe Andscacs could even come out on top of Stockfish :D

Post the link, thanks!
Werewolf
Posts: 1796
Joined: Thu Sep 18, 2008 10:24 pm

Re: Another attempt at comparing Evals ELO-wise

Post by Werewolf »

Could you test Giraffe?! It'd be interesting to see how good that evaluation function is....
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Oops...

Post by cdani »

Lyudmil Tsvetkov wrote:one more result:

Komodo 10.1 - Andscacs 0.86
depth 0
+20 -0 =0

so, Komodo beats Andscacs with a dry score at this depth, but it manages just a single draw against SF. I guess I might be doing something wrong, but who knows?

also tried a bit a match between SF and Komodo at depth 30.
SF reaches depth 30 about 5 times faster than Komodo, meaning it spends only about a fifth of the time.

of course, no one will want one engine to play with a 5 times shorter TC.
I guess the same is true of a fixed number of nodes: engines reach different depths on the same node budget, and they evaluate differently, more or less heavily. So, some engines will be greatly favoured by a fixed-nodes measure of strength, while others will only suffer.

so I guess strength measurements in terms of fixed depth and fixed nodes are fully meaningless, and reliable tests should be done only at a TC.
depth 0 is also fine for me as a way to measure eval, but who knows what else goes on inside engines and where we ourselves go wrong?

first conclusion: engines are too complicated for anything certain to be concluded. That is the one I have reached. Maybe wiser people will draw more specific ones.
Such a depth 0 search is probably defined very differently in each engine, so the results are not very comparable.

I think a clearly better way is the one I proposed: to modify the engine so we are sure it does just a quiescence search. I will do it later and post such versions here. Maybe other people want to do it for different engines.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Oops...

Post by Lyudmil Tsvetkov »

another oops and a major correction: I just noticed that the very buggy Fritz GUI treats depth 0, depth -1 and depth -2 all as depth 1, and, more importantly, the goddam cheating SF treats depth 0 as depth 5 or 6, while all the other engines do the right thing?!

also, looking at some games, Fritz will not allow a quiescence search, so all captures are treated by the engines like quiet moves, with static eval only.

I do not know who is more to blame: sleepy me, cheating SF or the Fritz GUI, but one way or another, the results reported above were mostly biased. (unless the engine window also reports falsely, which I would doubt, given it is close to impossible to win with a perfect score)

redoing some tests, this time with depth set to 1 (SF also understands this one), it seems the strongest eval engine is Komodo, followed by Andscacs. (I do not have the latest Houdini)

both Komodo and Andscacs win convincingly against SF, and Komodo and Andscacs fought hard, with Komodo coming out on top after 100 games: +47 -27 =26 (and I used very short, unbiased openings)

so, if that is any measure of eval prowess, quite probably Komodo would be on top, followed by Andscacs (I have not tested any other engines)

of course, this pretty much means nothing, as evals are tuned in concert with search and are very much inseparable from it, and apart from that, not even a quiescence search is included (SF might have been doing one, hence the bigger depth)

at any rate, that is a far better measure than testing with fixed depth or a fixed number of nodes, for at least the conditions would be equal for all.
fixed depth and fixed nodes, on the other hand, greatly distort the result, with some engines enjoying much more time than others.

I would always test with a fully-fledged engine at a TC, though, and not tear the untearable apart.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Oops...

Post by Lyudmil Tsvetkov »

ouch, I do not think I will touch cheating SF any time soon :)
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Oops...

Post by Lyudmil Tsvetkov »

cdani wrote:
Lyudmil Tsvetkov wrote:one more result:

Komodo 10.1 - Andscacs 0.86
depth 0
+20 -0 =0

so, Komodo beats Andscacs with a dry score at this depth, but it manages just a single draw against SF. I guess I might be doing something wrong, but who knows?

also tried a bit a match between SF and Komodo at depth 30.
SF reaches depth 30 about 5 times faster than Komodo, meaning it spends only about a fifth of the time.

of course, no one will want one engine to play with a 5 times shorter TC.
I guess the same is true of a fixed number of nodes: engines reach different depths on the same node budget, and they evaluate differently, more or less heavily. So, some engines will be greatly favoured by a fixed-nodes measure of strength, while others will only suffer.

so I guess strength measurements in terms of fixed depth and fixed nodes are fully meaningless, and reliable tests should be done only at a TC.
depth 0 is also fine for me as a way to measure eval, but who knows what else goes on inside engines and where we ourselves go wrong?

first conclusion: engines are too complicated for anything certain to be concluded. That is the one I have reached. Maybe wiser people will draw more specific ones.
Such a depth 0 search is probably defined very differently in each engine, so the results are not very comparable.

I think a clearly better way is the one I proposed: to modify the engine so we are sure it does just a quiescence search. I will do it later and post such versions here. Maybe other people want to do it for different engines.
right, Daniel, I/Fritz/SF mixed things up, but I only noticed that at a much later stage. (who would suppose SF is cheating?)

in any case, I do not suppose Andscacs' reputation has suffered a lot...
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Oops...

Post by cdani »

Lyudmil Tsvetkov wrote: in any case, I do not suppose Andscacs' reputation has suffered a lot...
No problem about it :-) I make it for fun, and sure, it has a lot of weak points. I try to improve them and that's all.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Another attempt at comparing Evals ELO-wise

Post by Lyudmil Tsvetkov »

cdani wrote:
cdani wrote:
cdani wrote: Kai, if you think it makes any sense, I can do a version of Stockfish and another of Andscacs that just plays by picking the move that has the best eval.
And maybe someone can do a version of Giraffe that does the same :-)
If anyone does a compile like this, they should take the evaluation returned by the quiescence function, not the plain static eval, to avoid nonsense sacrifices that temporarily raise the evaluation (with static eval alone, a capture like Qxb7 can look like a free pawn even when a rook on b8 simply recaptures next move; quiescence resolves the capture sequence first).
I should admit I simply do not understand quite a lot of things going on here.

does a 1-ply search with quiescence included guarantee the objectivity of the test? what if quiescence definitions and techniques differ among engines (and they do)? in that case, including the quiescence search will still be biased (maybe some engines do part of the work of the static eval in quiescence)

and without quiescence, would not that be prejudicial to engines with better quiescence routines?

one thing that completely makes no sense is that SF has a worse eval than Andscacs and Houdini 3. I have always presumed the Komodo eval is better tuned, as a lot of SF term values simply make no sense, but the SF eval worse than that of engines more than 300 Elo weaker? that simply makes no sense. I understand searching deeper provides some additional advantages, like more frequently reaching positions with a sufficiently big eval to point at a clear winner, but, as a matter of fact, more than 3/4 of all possible nodes should be more or less pure evaluation positions, with no clear advantage to either side, where a precise eval is all-important. SF, as well as the weaker engines, should reach them equally often. How would SF perform so much better, if its eval is so much worse? it simply does not make sense to me.

The SF eval is extensively tuned on the framework, much more so than those of the weaker engines. Why would it be so much worse tuned?

I guess the answer might lie in the fact that every engine achieves its relative optimum of eval tuning only within the particular framework of its eval+search. SF tunes its eval at deeper nodes, so it does make sense that it is worse tuned only for shallower nodes. But that does not necessarily mean its eval per se is worse; rather that, because it is tuned at higher depths, it is not optimally tuned for shallower depths.

it is true the static eval should remain the same for all nodes, but that does not necessarily mean tuning it with different search parameters should lead to identical results. I guess that, if SF wanted to tune specifically for 1-ply searches, they could run a couple of billion games per day and achieve a couple of hundred Elo of increase within days or months. SF would become one of the best evaluators. But then, that eval would certainly perform worse with the SF search activated. Why so? Does a better static eval not also lead to better game play overall?

There is something I clearly do not understand here.

Having watched so many games of different engines, I would still support my claim that the SF eval is better than that of a lot of weaker engines that would otherwise perform better at 1-ply searches. Maybe, until we understand all the intricacies of engines, we should concentrate instead on just comparing engine strength at a regular time control.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Another attempt at comparing Evals ELO-wise

Post by cdani »

So here it is:

www.andscacs.com/varis/test_quiesce.zip

Contains:

Stockfish_quiesce.exe. It always plays by searching all the root moves, calling quiesce with an open window, and taking just the best move.

Andscacs_quiesce.exe. The same for Andscacs.

The modified Stockfish source is included. I have changed only search.cpp; the changes are marked with //170523. If anyone finds an error, just tell me and I will do a new version.
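
In case anyone wants the idea without downloading the zip, the root loop is roughly of this shape. This is only an illustrative sketch, not the actual //170523 diff: the names follow Stockfish's style, but the real qsearch in search.cpp takes more parameters than the assumed helper shown here.

Code: Select all

 // Illustrative sketch only, not the actual //170523 change.
 // Assumes Stockfish-style types and a hypothetical helper with the
 // signature: Value qsearch(Position& pos, Value alpha, Value beta).
 Move pick_best_by_qsearch(Position& pos) {

     Move  bestMove  = MOVE_NONE;
     Value bestValue = -VALUE_INFINITE;

     for (const auto& m : MoveList<LEGAL>(pos))
     {
         StateInfo st;
         pos.do_move(m, st);

         // Open window: every root move gets a full quiescence score,
         // with no alpha-beta cutoffs between root moves.
         Value v = -qsearch(pos, -VALUE_INFINITE, VALUE_INFINITE);

         pos.undo_move(m);

         if (v > bestValue)
         {
             bestValue = v;
             bestMove  = m;
         }
     }
     return bestMove;
 }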

I also included test.bat, the cutechess-cli bat file that I used to run the test that generated the included pgn file.
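
The invocation was roughly of this shape (a sketch, not the exact contents of test.bat: the book and output file names are placeholders, while depth=1 and the 2 x 1526 = 3052 games match the test described below):

Code: Select all

 cutechess-cli ^
   -engine cmd=Stockfish_quiesce.exe name=SF-quiesce ^
   -engine cmd=Andscacs_quiesce.exe name=Andscacs-quiesce ^
   -each proto=uci tc=inf depth=1 ^
   -rounds 1526 -games 2 -repeat ^
   -openings file=book.pgn format=pgn order=random ^
   -pgnout games.pgn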

The search was at fixed depth 1. The result was:

Code: Select all

 # PLAYER                        : RATING  ERROR   POINTS  PLAYED    (%)
 1 Stockfish 230517 64 POPCNT    : 2900.7    5.0   1907.0    3052   62.5%
 2 Andscacs 0.91028              : 2811.3    5.0   1145.0    3052   37.5%
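
As a sanity check on the table, a 62.5% score corresponds to 400*log10(0.625/0.375) ≈ 89 Elo under the usual logistic model, matching the 2900.7 - 2811.3 = 89.4 rating gap above.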
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Another attempt at comparing Evals ELO-wise

Post by Laskos »

cdani wrote:So here it is:

www.andscacs.com/varis/test_quiesce.zip

Contains:

Stockfish_quiesce.exe. It always plays by searching all the root moves, calling quiesce with an open window, and taking just the best move.

Andscacs_quiesce.exe. The same for Andscacs.

The modified Stockfish source is included. I have changed only search.cpp; the changes are marked with //170523. If anyone finds an error, just tell me and I will do a new version.

I also included test.bat, the cutechess-cli bat file that I used to run the test that generated the included pgn file.

The search was at fixed depth 1. The result was:

Code: Select all

 # PLAYER                        : RATING  ERROR   POINTS  PLAYED    (%)
 1 Stockfish 230517 64 POPCNT    : 2900.7    5.0   1907.0    3052   62.5%
 2 Andscacs 0.91028              : 2811.3    5.0   1145.0    3052   37.5%
Thanks, I will play with it a bit.