Stable and/or accurate eval

smirobth
Posts: 2307
Joined: Wed Mar 08, 2006 8:41 pm
Location: Brownsville Texas USA

Re: Stable and/or accurate eval

Post by smirobth »

AGove wrote: At first glance the following reinforces the general impression that Rybka slightly understates while Shredder greatly overstates an advantage. Having said that, Robin Smith claims "Black is close to being lost, and perhaps is already completely lost" - so perhaps in this case Shredder is actually right and Rybka not. That's the problem with single examples.

[d]1r2nrk1/6pp/1b1p1pP1/3RnP2/7P/2B5/1PN5/1K3B1R w - - 0 1
Good points. I can add that, from right around this position or shortly thereafter, I ran some engine-engine tournaments. This can sometimes be a good way of telling which engine's eval is closest to the truth in a particular position. In this position, if my memory serves, the results were around 75% for White (about 50% wins, 50% draws). So I think Shredder's +1.72 eval is a bit high. Even though with perfect play the position may in fact be won for White, this is far from certain, and I think a 75% engine-engine tournament result would be better described by an eval closer to +1.0.
- Robin Smith
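
One way to make that 75% → +1.0 reading concrete is to treat the evaluation as the log-odds of the expected score, the same curve shape used in Elo tables. Here is a minimal Python sketch, assuming a logistic relation with a scale constant picked by hand so that 75% lands near +1.0; neither the model nor the constant comes from any particular engine:

[code]
import math

def score_to_eval(score, scale=210.0):
    """Map a match score in [0, 1] to a pawn-scale eval via an inverse logistic.

    Assumes score = 1 / (1 + 10 ** (-eval_cp / scale)); scale=210 is chosen
    here so that a 75% score maps to roughly +1.0, per the post above.
    """
    score = min(max(score, 1e-6), 1.0 - 1e-6)  # avoid infinities at 0% / 100%
    eval_cp = -scale * math.log10(1.0 / score - 1.0)
    return eval_cp / 100.0  # centipawns -> pawns

print(score_to_eval(0.75))  # ~ +1.00
print(score_to_eval(0.55))  # ~ +0.18
[/code]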
smirobth
Posts: 2307
Joined: Wed Mar 08, 2006 8:41 pm
Location: Brownsville Texas USA

Re: Stable and/or accurate eval

Post by smirobth »

AGove wrote:
There is no accurate evaluation.
The reason is that the positions HAVE no accurate evaluation.

Only mate scores can be accurate.
Everything far from mate is unclear; therefore an engine cannot tell you WHICH value is right.
Tablebase draws can be accurately evaluated as 0.00, but otherwise I'm inclined to agree with what you say. When one engine says +0.32 and another +0.68, how can it be proven which is right? In fact Lev Alburt's book Test and Improve Your Chess suggested a way of evaluating positions based on the results obtained when the positions are played out - which of course isn't how engines analyse at all, but could be a way of checking and objectifying their numerical evaluations.
I make a similar suggestion, using engine-engine tournaments to play out positions, in my book Modern Chess Analysis. There are a number of pitfalls to look out for when using engine-engine tournaments to evaluate a position (or make a move choice), but sometimes it is the best option.

Also one needs to know the purpose of the evaluation. Is it only to know the objective perfect-play truth about a position? In that case a 0.00 evaluation for a tablebase draw is appropriate. But what if the purpose is to know what the practical chances are for each side with imperfect play, for example play between two humans, or between programs that do not have access to tablebases? In such cases a 0.00 evaluation may not be correct, even when it is correct from the perfect-play standpoint.
- Robin Smith
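
As an illustration of the play-out method described above, here is a minimal sketch using the python-chess library. The engine paths, game count and per-move time are placeholders, and a real experiment would need far more games and more care with adjudication; it sketches the technique, not Robin Smith's actual procedure:

[code]
# Minimal sketch of playing a position out engine vs. engine with the
# python-chess library. Paths and limits are placeholders.
import chess
import chess.engine

def play_out(fen, engine_a_path, engine_b_path, games=20, move_time=1.0):
    """Play the position out repeatedly and return White's score fraction."""
    white_points = 0.0
    with chess.engine.SimpleEngine.popen_uci(engine_a_path) as a, \
         chess.engine.SimpleEngine.popen_uci(engine_b_path) as b:
        for game in range(games):
            board = chess.Board(fen)
            # Alternate which engine plays White to reduce engine bias.
            white, black = (a, b) if game % 2 == 0 else (b, a)
            while not board.is_game_over(claim_draw=True):
                engine = white if board.turn == chess.WHITE else black
                result = engine.play(board, chess.engine.Limit(time=move_time))
                board.push(result.move)
            outcome = board.result(claim_draw=True)  # "1-0", "0-1", "1/2-1/2"
            white_points += {"1-0": 1.0, "0-1": 0.0, "1/2-1/2": 0.5}[outcome]
    return white_points / games

# e.g. play_out("1r2nrk1/6pp/1b1p1pP1/3RnP2/7P/2B5/1PN5/1K3B1R w - - 0 1",
#               "/path/to/engine_a", "/path/to/engine_b")
[/code]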
maxchgr

Re: Stable and/or accurate eval

Post by maxchgr »

You like to use at least 4 engines? Do you have a quad? I have a dual core, so I restrict myself to two engines at a time.
smirobth
Posts: 2307
Joined: Wed Mar 08, 2006 8:41 pm
Location: Brownsville Texas USA

Re: Stable and/or accurate eval

Post by smirobth »

maxchgr wrote: You like to use at least 4 engines? Do you have a quad? I have a dual core, so I restrict myself to two engines at a time.
Yes, a quad-core Opteron. But even if I did not have a quad I would frequently use more than two engines. When I only had a dual I would often use two engines for a while, and then switch to two different ones. And for people who have only a single core, I still think they should usually use more than one engine for analysis, IMO.
- Robin Smith
AGove

Re: Stable and/or accurate eval

Post by AGove »

Also one needs to know the purpose of the evaluation. Is it only to know the objective perfect-play truth about a position? In that case a 0.00 evaluation for a tablebase draw is appropriate. But what if the purpose is to know what the practical chances are for each side with imperfect play, for example play between two humans, or between programs that do not have access to tablebases? In such cases a 0.00 evaluation may not be correct, even when it is correct from the perfect-play standpoint.
Remember that we're talking about numerical evaluations here. We can't yet expect chess engines to give meaningful result probabilities after a few tens of seconds of analysis, which is the time in which we do reasonably expect them to give meaningful centipawn evaluations. The discussion should now be about how to calibrate those evaluations and standardise them. By convention 0.00 (=) is a draw and +3.00 (+-) is a win. Very well then, but what is +0.68? Three pawns is not checkmate, and 0.68 of a pawn is not to be found on a chessboard (and if one were, no doubt it could only move 0.68 squares forward). From our discussions, it seems that two well-known engines, the best in the world in their time, are not giving "accurate", or let us say useful and comparable, evaluations. The general impression is that Rybka slightly understates while Shredder greatly overstates an advantage. So how should they, and all other engines, evaluate non-decisive positions?
smirobth
Posts: 2307
Joined: Wed Mar 08, 2006 8:41 pm
Location: Brownsville Texas USA

Re: Stable and/or accurate eval

Post by smirobth »

AGove wrote:
Also one needs to know the purpose of the evaluation. Is it only to know the objective perfect-play truth about a position? In that case a 0.00 evaluation for a tablebase draw is appropriate. But what if the purpose is to know what the practical chances are for each side with imperfect play, for example play between two humans, or between programs that do not have access to tablebases? In such cases a 0.00 evaluation may not be correct, even when it is correct from the perfect-play standpoint.
Remember that we're talking about numerical evaluations here. We can't yet expect chess engines to give meaningful result probabilities after a few tens of seconds of analysis, which is the time in which we do reasonably expect them to give meaningful centipawn evaluations. The discussion should now be about how to calibrate those evaluations and standardise them. By convention 0.00 (=) is a draw and +3.00 (+-) is a win. Very well then, but what is +0.68? Three pawns is not checkmate, and 0.68 of a pawn is not to be found on a chessboard (and if one were, no doubt it could only move 0.68 squares forward). From our discussions, it seems that two well-known engines, the best in the world in their time, are not giving "accurate", or let us say useful and comparable, evaluations. The general impression is that Rybka slightly understates while Shredder greatly overstates an advantage. So how should they, and all other engines, evaluate non-decisive positions?
There is no reason that centipawn (numeric) evaluations could not be correlated with a perfectly meaningful estimated statistical evaluation, which is all I was talking about. For example, having the first move is often considered to be about a 1/3 pawn advantage, and is known to give about a 55% score for White. A full (actual) pawn advantage, all else being equal, is perhaps about 75%. So I would think your hypothetical case of a 0.68 eval should fall somewhere in between these two, perhaps around 65%. But all of this is mostly arbitrary and a matter of taste. If program "X" were to one day multiply every evaluation it produces by a factor of 10 it would still play in exactly the same way. The numbers themselves make no difference to the programs, only to the humans trying to decipher their significance in terms of winning chances.

I think all this is a bit of a distraction from the most important thing, which is to just understand the idiosyncrasies of the various engines you are using. The hypothetical 10x engine is just as good as the 1x engine for analysis purposes, as long as you know about this peculiar characteristic. This is why I often still use Shredder for analysis, even though its evals must be reduced if you want to compare them to most other engines.
- Robin Smith
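
Interestingly, the two rules of thumb above do not quite fit a single logistic curve, which rather supports the "matter of taste" point. A small Python sketch under the same assumed logistic model as before (the curve shape and both scale constants are illustrative assumptions, not taken from any engine): a scale of about 210 centipawns reproduces "one pawn ≈ 75%", a scale of about 380 reproduces "first move ≈ 55%", and +0.68 lands at roughly 60-68% depending on which you pick:

[code]
def eval_to_score(pawns, scale):
    """Expected score for White from a pawn-scale eval, via a logistic curve."""
    return 1.0 / (1.0 + 10.0 ** (-100.0 * pawns / scale))

for scale in (210.0, 380.0):
    print(f"scale={scale:.0f}cp: "
          f"+0.33 -> {eval_to_score(0.33, scale):.0%}, "
          f"+0.68 -> {eval_to_score(0.68, scale):.0%}, "
          f"+1.00 -> {eval_to_score(1.00, scale):.0%}")

# scale=210cp: +0.33 -> 59%, +0.68 -> 68%, +1.00 -> 75%
# scale=380cp: +0.33 -> 55%, +0.68 -> 60%, +1.00 -> 65%
[/code]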
bedouin

Re: Stable and/or accurate eval

Post by bedouin »

maxchgr wrote: I think that generally I trust Rybka's evaluation most, since it is the strongest and has an extremely stable eval. I think adding another strong engine that can cover Rybka's flaws, like Fritz or HIARCS for understanding king attacks, makes for a complete package for evaluating positions with confidence.

I was wondering if anybody else has noticed that Spike seems to always completely disagree with most other strong engines, or is that just me?
I presume that it's postal chess you use it for? What GUI do you use? Do you leave your machine(s) running for days on end, or how do you decide which moves are best to play? As my analysis is mostly checking my oversights in casual online games, any engine like Crafty or better will do (even on the Chess Tactics Server they use Crafty 19.19 to show the answer), but I like the tactics recommended by Junior.
AGove

Re: Stable and/or accurate eval

Post by AGove »

There is no reason that centipawn (numeric) evaluations could not be correlated with a perfectly meaningful estimated statistical evaluation...
Yes, there is no reason, although I haven't heard of it being done systematically.
For example, having the first move is often considered to be about a 1/3 pawn advantage, and is known to give about a 55% score for White...
Yes, that sounds like a reasonable start. However, we needn't guess about this; we can study it properly, statistically, and then calibrate our engines accordingly.
But all of this is mostly arbitrary and a matter of taste.
What? You had said that centipawn evaluations could be correlated with meaningful score probabilities (which you called estimated statistical evaluations). Now you appear to be saying that the correlation is mostly arbitrary and a matter of taste. That's not what a correlation is.
If program "X" were to one day multiply every evaluation it produces by a factor of 10 it would still play in exactly the same way. The numbers themselves make no difference to the programs, only to the humans trying to decipher their significance in terms of winning chances.
But we were hoping that the program itself could tell us about the winning chances.
I think all this is a bit of a distraction from the most important thing, which is to just understand the idiosyncrasies of the various engines you are using.
The thread is about the accuracy (and stability) of evaluations. So idiosyncrasies in evaluation are very much in focus here.
I often still use Shredder for analysis, even though its evals must be reduced if you want to compare them to most other engines.
So if you don't want to compare them to most other engines, you'll be treating Shredder's evaluations as absolutes? Remember that centipawn evaluations are initially grounded in simple material imbalances.
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Stable and/or accurate eval

Post by jwes »

From a programming perspective, the only thing that matters is that better positions have higher evaluations. The actual values of the evaluations are largely irrelevant. A way to compare evaluations among engines is to have each evaluate a number of positions and find the positions that some engines rank very differently. Then try to figure out which evaluation is better.
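
A minimal Python sketch of that comparison, again using the python-chess library; the engine names, paths and test positions are placeholders:

[code]
# Score a set of positions with several engines and surface the positions
# where their evaluations diverge most. Paths and FENs are placeholders.
import chess
import chess.engine

ENGINES = {"engine_a": "/path/to/engine_a", "engine_b": "/path/to/engine_b"}
FENS = ["1r2nrk1/6pp/1b1p1pP1/3RnP2/7P/2B5/1PN5/1K3B1R w - - 0 1"]

def eval_positions(engines, fens, seconds=5.0):
    """Return {fen: {engine_name: eval in pawns, from White's side}}."""
    evals = {fen: {} for fen in fens}
    for name, path in engines.items():
        with chess.engine.SimpleEngine.popen_uci(path) as engine:
            for fen in fens:
                info = engine.analyse(chess.Board(fen),
                                      chess.engine.Limit(time=seconds))
                cp = info["score"].white().score(mate_score=10000)
                evals[fen][name] = cp / 100.0
    return evals

def spread(per_engine):
    return max(per_engine.values()) - min(per_engine.values())

evals = eval_positions(ENGINES, FENS)
# Positions with the largest disagreement come first.
for fen in sorted(evals, key=lambda f: spread(evals[f]), reverse=True):
    print(f"{spread(evals[fen]):.2f} spread  {fen}  {evals[fen]}")
[/code]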
smirobth
Posts: 2307
Joined: Wed Mar 08, 2006 8:41 pm
Location: Brownsville Texas USA

Re: Stable and/or accurate eval

Post by smirobth »

AGove wrote:
There is no reason that centipawn (numeric) evaluations could not be correlated with a perfectly meaningful estimated statistical evaluation...
Yes, there is no reason, although I haven't heard of it being done systematically.
For example, having the first move is often considered to be about a 1/3 pawn advantage, and is known to give about a 55% score for White...
Yes, that sounds like a reasonable start. However, we needn't guess about this; we can study it properly, statistically, and then calibrate our engines accordingly.
But all of this is mostly arbitrary and a matter of taste.
What? You had said that centipawn evaluations could be correlated with meaningful score probabilities (which you called estimated statistical evaluations). Now you appear to be saying that the correlation is mostly arbitrary and a matter of taste. That's not what a correlation is.
If program "X" were to one day multiply every evaluation it produces by a factor of 10 it would still play in exactly the same way. The numbers themselves make no difference to the programs, only to the humans trying to decipher their significance in terms of winning chances.
But we were hoping that the program itself could tell us about the winning chances.
I think all this is a bit of a distraction from the most important thing, which is to just understand the idiosyncrasies of the various engines you are using.
The thread is about the accuracy (and stability) of evaluations. So idiosyncrasies in evaluation are very much in focus here.
I often still use Shredder for analysis, even though its evals must be reduced if you want to compare them to most other engines.
So if you don't want to compare them to most other engines, you'll be treating Shredder's evaluations as absolutes? Remember that centipawn evaluations are initially grounded in simple material imbalances.
There is no such thing as absolute accuracy; every engine does things differently. Just because Shredder's evals are more optimistic most of the time compared to other engines does not mean they are less accurate. I never think of "evaluations as absolutes".

Remember that material imbalances affect other things. Someone who has an extra bishop typically gets 300 centipawns for that. But perhaps an engine gives an additional bonus for increased mobility, if its mobility term is based on the number of legal moves. And perhaps another bonus for the two bishops. And perhaps other bonuses or penalties based on whether the pawn structure is open or closed. You might penalize a bishop in a late endgame if there are few pawns and one of them is a rook pawn of the wrong color. Many things go into an evaluation, and every engine does this somewhat differently.

If you want to correlate centipawns to winning percentages this can of course be done, but the result will be different for each engine. There is no single correlation that can be done and be equally valid for every engine. Engines all evaluate these things differently; there is no right answer, no single correct "absolute accuracy". Winning statistics cannot ever be absolute either, even apart from evaluations, since they will depend on who is playing. Perfect play would always lead to 100%, 50% or 0% depending on the position, but imperfect-play percentages will always depend on who is playing, the time control, etc. The concept of "absolute accuracy" in the context of chess engines is meaningless.

But one can say that most of the time Shredder's evaluations tend to be very optimistic, and Rybka's tend to be slightly pessimistic ... when compared with most other engines.
- Robin Smith
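
The per-engine correlation described above can be sketched in a few lines of Python. The logistic model, scipy's curve_fit, and the (eval, result) arrays below are all illustrative assumptions; real calibration would harvest thousands of positions from the engine's own games:

[code]
# Fit one engine's centipawn-to-score curve from (eval, result) pairs.
# The data arrays are placeholders for positions harvested from real games.
import numpy as np
from scipy.optimize import curve_fit

def expected_score(cp, scale):
    """Logistic model: expected score for White given a centipawn eval."""
    return 1.0 / (1.0 + 10.0 ** (-cp / scale))

# The engine's eval of a position (centipawns, White's view) and the
# eventual game result from White's side (1, 0.5 or 0).
evals_cp = np.array([-220.0, -120.0, -60.0, -30.0, 0.0,
                     25.0, 60.0, 100.0, 150.0, 300.0])
results = np.array([0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0])

(scale,), _ = curve_fit(expected_score, evals_cp, results, p0=[300.0])
print(f"fitted scale: {scale:.0f} centipawns")

# Fitted on its own games, an optimistic engine (Shredder-like) comes out
# with a larger scale than a pessimistic one (Rybka-like), so the same
# centipawn value maps to a smaller winning percentage.
[/code]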