Evaluation discontinuity

Moderator: Ras

-
- Posts: 778
- Joined: Sat Jul 01, 2006 7:11 am

Re: Evaluation discontinuity

Kempelen wrote:
Hello,
I really don't fully understand why evaluation discontinuity is a problem. Given two positions, an engine will always choose the best one, be it endgame or middlegame (supposing both are correctly calculated). Can someone give me an example of why two scaled scores (endgame and middlegame) are better than only one?
Thanks.

Ideally, if you make the best move in any position, the evaluation should not change. If you make a less-than-best move, your evaluation should go down. When transitioning from the middlegame to the endgame, the best move can cause large changes in the evaluation even though the game-theoretic value of the position has not changed. When this happens, comparing middlegame values to endgame values can clearly give wrong results.
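To make the "two scaled scores" idea from the question concrete: the standard technique is tapered evaluation, which keeps a middlegame score and an endgame score for every position and blends them by game phase, so no single move flips the evaluation from one view to the other. A minimal sketch in C; the phase weighting (minor = 1, rook = 2, queen = 4) is a common convention, not any particular engine's code:

```c
/* Tapered evaluation: keep a middlegame and an endgame score and blend
 * them by remaining material ("phase"), so the evaluation slides
 * smoothly between the two views instead of jumping when the endgame
 * begins. Weights are a common convention, not from any real engine. */

#define PHASE_MAX 24  /* 8 minors*1 + 4 rooks*2 + 2 queens*4 = 24 */

int tapered_eval(int mg_score, int eg_score, int phase)
{
    if (phase > PHASE_MAX) phase = PHASE_MAX;  /* clamp after promotions */
    /* phase == PHASE_MAX: pure middlegame; phase == 0: pure endgame */
    return (mg_score * phase + eg_score * (PHASE_MAX - phase)) / PHASE_MAX;
}
```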
-
- Posts: 908
- Joined: Mon Jan 15, 2007 11:23 am
- Location: Warsza
Re: Evaluation discontinuity
As for explaining the pitfalls of discontinuity, I have a funny example. A strange kind of discontinuity may arise because of special-case code for evaluating endgames. Let's say that you implement a piece of knowledge that the KBP vs K ending is drawn when the pawn is a rook pawn, the bishop does not control the promotion square, and the king of the weaker side controls it. You code it perfectly, yet the engine still goes into this endgame, because you haven't coded a set of rules for KPK endgames with a rook pawn, and it expects to get rid of the bishop.
Pawel Koziol
http://www.pkoziol.cal24.pl/rodent/rodent.htm
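A minimal sketch of what a consistent pair of rules might look like, in C. The exact draw conditions are simplified, and Position, the sentinel value, and the helper predicates are hypothetical stand-ins, not code from any real engine:

```c
/* Sketch of the coverage trap described above: the KBP vs K draw rule
 * alone is not enough; the KP vs K subgame it can collapse into needs a
 * matching rule, or the engine sheds the bishop to reach a falsely
 * "won" score. All names here are assumed. */

typedef struct Position Position;      /* the engine's position type */
#define DRAW_SCORE      0
#define NO_RECOGNITION  (-32001)       /* sentinel: recognizer declined */

int is_rook_pawn(const Position *p);          /* pawn on the a- or h-file */
int wrong_bishop(const Position *p);          /* bishop can't control the promotion square */
int defender_holds_corner(const Position *p); /* weak king controls the corner */

int recognize_kbpk(const Position *pos)
{
    if (is_rook_pawn(pos) && wrong_bishop(pos) && defender_holds_corner(pos))
        return DRAW_SCORE;
    return NO_RECOGNITION;
}

int recognize_kpk(const Position *pos)
{
    /* Without this twin rule, KP vs K with a rook pawn scores as a win
     * and the search happily gives up the bishop to "convert" into it. */
    if (is_rook_pawn(pos) && defender_holds_corner(pos))
        return DRAW_SCORE;
    return NO_RECOGNITION;
}
```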
-
- Posts: 344
- Joined: Wed Sep 23, 2009 5:56 pm
- Location: Germany
Re: Evaluation discontinuity
PK wrote:
As for explaining the pitfalls of discontinuity, I have a funny example. A strange kind of discontinuity may arise because of special-case code for evaluating endgames. Let's say that you implement a piece of knowledge that the KBP vs K ending is drawn when the pawn is a rook pawn, the bishop does not control the promotion square, and the king of the weaker side controls it. You code it perfectly, yet the engine still goes into this endgame, because you haven't coded a set of rules for KPK endgames with a rook pawn, and it expects to get rid of the bishop.

A similar case occurred to me. I wrote evaluation functions for several won endgames like KQK, KRK, KBNK, etc. Because I knew this from other engines, I put in high scores like +80 or so. But then the engine liked to sacrifice material, for example in positions like KQRKN: it sacrificed the queen for the knight and was happy, when it could have won much faster. It also liked to promote to a bishop in KNPK when it didn't see a mate or a way to get rid of the knight. Since then I have disabled these +80 scores because the discontinuity annoyed me, although it probably didn't weaken the engine.
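One way to keep such "known win" scores continuous with the rest of the evaluation, and essentially what the Heinz excerpt below recommends, is to base them on the material balance plus a bonus instead of a fixed constant, so KQRK still scores above KQK and shedding material still costs something. A hypothetical sketch with illustrative values:

```c
/* Score a recognized won endgame as material plus a modest bonus rather
 * than a fixed constant. With a fixed score, KQK and KQRK look equally
 * good, so giving the queen for a knight in KQR vs KN seems free; with
 * material as the base, extra material still counts. */

typedef struct Position Position;           /* the engine's position type */
int material_balance(const Position *pos);  /* assumed: normal material score */

#define KNOWN_WIN_BONUS 300  /* centipawns; illustrative, tune to taste */

int won_endgame_score(const Position *pos)
{
    return material_balance(pos) + KNOWN_WIN_BONUS;
}
```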
-
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: Evaluation discontinuity
If you haven't read it before, you might be interested in Heinz et al.'s paper about interior-node recognizers in DarkThought:
in section 2, they wrote:
We abandoned the scheme of interior-node score bounds in 1996 when it became clear that disjoint scoring ranges for recognizers and the static evaluation tend to introduce frequent scoring inconsistencies during iteratively deepened searches of high depths. The inconsistencies are caused by incomplete recognizer coverage of positions that represent subgames of other positions for which recognizers exist. Pawn (under-)promotions often lead to such subgames, especially if any recognized positions contain more than one Pawn per side. This kind of recognizer incompleteness is hard to avoid in practice unless the implementation restricts recognizers to solely trivial cases. Hence, in our opinion Slate's theoretically sound scheme of interior-node score bounds with disjoint ranges for recognizer results and static evaluation scores carries only little practical value for modern high-speed chess programs.
They found that they needed to carefully tune their scoring heuristics so that they produced scores consistent with non-database evaluations of those endgames. If they didn't do this, they had problems with incomplete recognizers (which bailed out on certain "tricky" positions so they could have a cheaper 1-bit GTV encoding of the rest of the database) and with subgames that were not represented by a database and reverted to normal evaluation scores. If the scores are not in a consistent range, the engine would, for example, make dumb moves to get into a subgame that it thinks is better because of the inconsistent scores.

in section 3.2, they wrote:
In view of the hardly avoidable incompleteness of the recognizer coverage (see Section 2) we suggest to employ recognizer scores that are compatible with static evaluation scores. Hence, we base our recognizer scores mainly on the material balances of the according positions if neither draws nor mates get recognized. Compatibility with static evaluation scores has the further advantage of enabling the implementation to replace expensive evaluations by cheap recognitions. Simply skip the calls of the static evaluation function in case of successful recognitions and reuse the already computed recognizer scores as static evaluation scores.
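That last suggestion amounts to a very small piece of glue code. A sketch, assuming a probe_recognizers() dispatcher that returns either a compatible score or a "no recognition" sentinel (all names hypothetical):

```c
/* Because compatible recognizer scores live in the same range as static
 * evaluation scores, a recognizer hit can simply replace the (more
 * expensive) evaluation call. probe_recognizers() and evaluate() are
 * assumed names for the recognizer dispatch and the full evaluation. */

typedef struct Position Position;   /* the engine's position type */
#define NO_RECOGNITION (-32001)     /* sentinel: no recognizer matched */

int probe_recognizers(const Position *pos);
int evaluate(const Position *pos);

int static_eval(const Position *pos)
{
    int score = probe_recognizers(pos);
    if (score != NO_RECOGNITION)
        return score;               /* reuse recognizer score, skip eval */
    return evaluate(pos);
}
```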
[Edit: replaced the URL with a better one. It's part of this set of pages about DarkThought; there's some other neat stuff in there too.]
-
- Posts: 778
- Joined: Sat Jul 01, 2006 7:11 am
Re: Evaluation discontinuity
Similar situations often occur with EGTBs. The program will sacrifice material to arrive in a won tablebase position. It is not wrong, but it is certainly ugly.
-
- Posts: 778
- Joined: Sat Jul 01, 2006 7:11 am
Re: Evaluation discontinuity
The opposite problem also occurs: e.g., a human would sacrifice material to trade into a won endgame, while a program does not value the endgame position highly enough.