engine evaluation and chess informant symbols

Don · Post by **Don** » Wed Jan 02, 2013 5:04 pm

gladius wrote:
Don wrote: The evaluation of this program is reasonable. I put a Morra Gambit in and the score is not ridiculous. A classic Morra position should return a score close to zero despite White being a pawn down, and should probably be considered approximately equal. I get that White is down about 0.1, which means the evaluation is good enough to see a lot of compensation.
Well, reasonable might be an overstatement. It's just piece-square tables and mobility. But it gets the job done against most humans, even in JavaScript.

It's reasonable by some definition of reasonable.

But we have to do the best with what we have. Is there any pawn structure evaluation in this program?

casaschi · Post by **casaschi** » Wed Jan 02, 2013 5:12 pm

Don wrote: I don't believe there will be a problem with these as they are all annotated by strong players. Not all are grandmasters but that is not going to make a huge difference.

I don't think the difference relates to the strength of the annotator. However, I believe there's a difference between annotating a few games in a hurry for a blog that nobody reads and annotating games for the next Chess Informant. I'd rather use the notes of an IM for the latter than the ones of a GM for the former.

Don · Post by **Don** » Wed Jan 02, 2013 5:13 pm

casaschi wrote:
Don wrote: I don't believe there will be a problem with these as they are all annotated by strong players. Not all are grandmasters but that is not going to make a huge difference.
I don't think the difference relates to the strength of the annotator. However, I believe there's a difference between annotating a few games in a hurry for a blog that nobody reads and annotating games for the next Chess Informant. I'd rather use the notes of an IM for the latter than the ones of a GM for the former.

I'll make a tool that can be extended with whatever data source we decide to use. If you can produce some well annotated games I will be happy to use them.

gladius · Post by **gladius** » Wed Jan 02, 2013 5:19 pm

Don wrote: I don't know if it's possible to feed the positions I will extract to the JavaScript program, but that will not be necessary. I can do a few tests of common positions to build a conversion factor.

You can give FENs to the javascript engine with InitializeFromFen("..."). Github project is here: https://github.com/glinscott/Garbochess-JS.

Or you can use the web-page, which has a FEN input (just paste into FEN box, and press enter), here: http://forwardcoding.com/projects/ajaxchess/chess.html.

Don · Post by **Don** » Wed Jan 02, 2013 7:25 pm

casaschi wrote:
Don wrote: I don't believe there will be a problem with these as they are all annotated by strong players. Not all are grandmasters but that is not going to make a huge difference.
I don't think the difference relates to the strength of the annotator. However, I believe there's a difference between annotating a few games in a hurry for a blog that nobody reads and annotating games for the next Chess Informant. I'd rather use the notes of an IM for the latter than the ones of a GM for the former.

OK, I have preliminary results. Some files I could not parse, and I also had trouble with Houdini because it violates the UCI standard in one particular area. I tried RobboLito instead and discovered that it has the same violation, so I fixed my tool to accommodate these non-standard programs.

The numbers are not pretty - we really need a lot more data. I did not try to parse the comments but those would give us a lot more positions to add to the sample.

I basically took samples and I am displaying the MEDIAN sample point.

Here is the result from Komodo:

   -+   -167.8     36 samples
   +-    146.5     60 samples
    =     14.5     57 samples
  -/+    -68.0     39 samples
  +/-     88.0     61 samples
  +/=     42.0     46 samples
  =/+     -9.8     42 samples

Now the result from Houdini 3:

   -+   -197.5     36 samples
   +-    145.5     61 samples
    =      8.0     57 samples
  -/+    -45.5     39 samples
  +/-     77.0     61 samples
  +/=     36.8     46 samples
  =/+    -14.5     42 samples

And Stockfish:

   -+   -255.0     36 samples
   +-    211.5     61 samples
    =     22.0     57 samples
  -/+    -96.5     39 samples
  +/-    113.0     61 samples
  +/=     65.0     46 samples
  =/+    -14.0     42 samples

I'll give a little more detail in the next post.

Don · Post by **Don** » Wed Jan 02, 2013 7:40 pm

Some more information on how I did this. I ran to depth 10 and averaged the final scores for depth 9 and 10. With UCI I have to convert the score to the white point of view to get the right interpretation for the glyphs.
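A minimal sketch of that step, not the actual tool: it assumes the engine output is available as a list of UCI `info` lines and that the position is given as a FEN string. The function names are hypothetical, and mate scores (`score mate N`) are simply ignored here.

```python
import re

def side_to_move(fen: str) -> str:
    """Return 'w' or 'b' from the second field of a FEN string."""
    return fen.split()[1]

def parse_cp_by_depth(info_lines):
    """Map depth -> last 'score cp' value reported at that depth.

    UCI reports the score from the side to move's point of view.
    '\\bdepth' does not match inside 'seldepth', so only the main
    depth field is picked up.
    """
    pattern = re.compile(r"\bdepth (\d+)\b.*?\bscore cp (-?\d+)")
    scores = {}
    for line in info_lines:
        m = pattern.search(line)
        if m:
            scores[int(m.group(1))] = int(m.group(2))
    return scores

def white_pov_average(fen, info_lines, depths=(9, 10)):
    """Average the final scores at the given depths, from White's view."""
    scores = parse_cp_by_depth(info_lines)
    avg = sum(scores[d] for d in depths) / len(depths)
    # Flip the sign when Black is to move so that positive always
    # means "good for White", matching glyphs such as +/= and =/+.
    return avg if side_to_move(fen) == "w" else -avg
```

Averaging two consecutive depths presumably damps the odd/even oscillation that shallow searches often show, though the post does not state the reason.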

I used this translation table to convert from the glyph value such as $10 to the appropriate string:

10 "="
11 "="
12 "="
14 "+/="
15 "=/+"
16 "+/-"
17 "-/+"
18 "+-"
19 "-+"
20 "ww"
21 "bw"

ww and bw mean the position is crushing, but apparently there were none of these glyphs in the PGN files I used. There was no obvious translation either.
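The table above, together with the median-per-glyph summary from the earlier post, could be sketched roughly as follows. This is an illustration under assumptions, not the tool used for the results: the function names are hypothetical, and the regex picks up every `$N` token in the text it is given, including any inside comments or variations.

```python
import re
from collections import defaultdict
from statistics import median

# Translation table from the post: PGN numeric glyph -> Informant string.
NAG_TO_GLYPH = {
    10: "=", 11: "=", 12: "=",
    14: "+/=", 15: "=/+",
    16: "+/-", 17: "-/+",
    18: "+-", 19: "-+",
    20: "ww", 21: "bw",
}

def glyphs_in_movetext(movetext):
    """Return the glyph string for every known $N annotation, in order."""
    codes = (int(n) for n in re.findall(r"\$(\d+)", movetext))
    return [NAG_TO_GLYPH[c] for c in codes if c in NAG_TO_GLYPH]

def median_by_glyph(samples):
    """samples: iterable of (glyph, white_pov_score) pairs.

    Returns {glyph: (median score, number of samples)}.
    """
    grouped = defaultdict(list)
    for glyph, score in samples:
        grouped[glyph].append(score)
    return {g: (median(v), len(v)) for g, v in grouped.items()}
```

Glyphs outside the table, such as move-quality marks like `$1`, are skipped rather than reported.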

We see a pretty huge bias for one color in this data, which I intuitively expected, but it would be nice to have a lot more data to be sure. Annotators apparently are much more willing to give Black the benefit of the doubt. For example, Black only needs a 55 advantage to be considered kicking butt, but White must have an 83 advantage. To have a slight advantage Black only has to equalize! This is probably psychological: Black has to (in some sense) outplay White to get equality, and perhaps the annotators are inclined to overrate Black's position because of this. Just a theory.

Don · Post by **Don** » Wed Jan 02, 2013 7:52 pm

If you average the black and white disparity you would get a table like this:

= 0
+/= 16.15
+/- 63.0
-+ 157.15

The question is where to put the threshold as these are median values, not thresholds.

I could do a best fit to see which thresholds would produce the highest number of correct annotations according to the data I have, but I don't intend to go any further until I get a lot more data.
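One possible shape for such a best fit, as a sketch only: each boundary between two neighbouring symbols is chosen independently as the cut that classifies the most samples of those two symbols correctly. That pairwise simplification is an assumption of this sketch (a joint fit over all boundaries could give different cuts, and independent cuts are not guaranteed to come out in increasing order). All names are hypothetical.

```python
# Symbols ordered from best for Black to best for White.
ORDER = ["-+", "-/+", "=/+", "=", "+/=", "+/-", "+-"]

def best_cut(lower, upper):
    """Cut c maximizing (#lower scores < c) + (#upper scores >= c).

    Candidates are the midpoints between neighbouring distinct scores,
    plus one value below and one above the whole range.
    """
    values = sorted(set(lower) | set(upper))
    candidates = (
        [values[0] - 1]
        + [(a + b) / 2 for a, b in zip(values, values[1:])]
        + [values[-1] + 1]
    )

    def correct(c):
        return sum(s < c for s in lower) + sum(s >= c for s in upper)

    # On ties, max() keeps the first (lowest) candidate.
    return max(candidates, key=correct)

def fit_thresholds(samples_by_glyph):
    """Return six cuts, one between each adjacent pair in ORDER.

    samples_by_glyph maps each glyph to a list of White-POV scores.
    """
    return [
        best_cut(samples_by_glyph[lo], samples_by_glyph[hi])
        for lo, hi in zip(ORDER, ORDER[1:])
    ]
```

With only a few dozen samples per symbol the fitted cuts would likely move a lot from one data set to another, which is consistent with waiting for more data first.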

Don

gbtami · Post by **gbtami** » Wed Jan 02, 2013 10:32 pm

Another good source is:
http://www.endgame.nl/

casaschi · Post by **casaschi** » Thu Jan 03, 2013 12:18 pm

I'm also doing some experiments in the same direction as Don. From the page Don linked I found a good file with about 100 games with Chess Informant annotations, good quality and a reasonable number of positions assessed: about 700, more or less 100 for each of the 7 symbols. Still not statistically significant, but good enough to practice.

I just started, but already have some comments looking at the evaluation results as they are produced:

1) What amount of time should be used for the position eval? I'm using a 1-minute evaluation with garbochess. That's probably needed for a good stable result, but at the same time it will tune the assessment on a level that most users will never use... most users of garbochess with my application will let the engine run for a much shorter time.

2) The engine does not really get some of the positions... or the annotator was wrong in the assessment. How to spot and exclude those positions, since they clearly should not be used here? I mean, if I try to decide the appropriate eval thresholds for the +/= sign and garbochess thinks that a given position is actually better for Black, that position should clearly not be used to tune the +/= sign. At the same time, too much of this manual selection of positions will likely lead the results anywhere I want the results to be...

3) I'm even wondering if all this actually makes sense. In other words, does it make sense to use the engine score as an absolute position evaluation? Looking at some positions, the answer is not so obvious... For example, garbochess seems to have a significant premium for the bishop pair. So an opening position where Black has the bishop pair and doubled pawns is assessed by Chess Informant as slightly better for White, while garbochess scores the position as -0.7 pawns. Could it be that the engine eval cannot actually be assumed to be an objective evaluation of the position, but only a tool that drives engines to make the best move selection for them? Translating evals into the Chess Informant symbols somehow assumes that positions with similar engine evals have similar winning chances... but that might not be true at all, and a +0.2 in position A might mean something completely different than +0.2 in a very different position B.

Ralf Müller · Post by **Ralf Müller** » Thu Jan 03, 2013 1:31 pm

Hm... This fits very well with the values given by WGM Natalia Pogonina:

Equality: =, from 0 to 0.26

Small advantage for White: +/=, over 0.27 and up to 0.7

Serious advantage for White: +/-, over 0.7

Decisive advantage for White: +-, over 1.5

I don't know whether these values are her own estimates or values derived from a specific program, but they look very good.
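As a rough illustration, the ranges quoted above could be turned into a lookup like the one below. Two things are assumptions of this sketch and not part of the quoted values: the thresholds are mirrored for Black (the list only gives White's side), and scores between 0.26 and 0.27 are treated as equal. The function name is hypothetical, and scores are in pawns from White's point of view.

```python
def informant_symbol(score_pawns: float) -> str:
    """Map a White-POV score in pawns to an Informant symbol."""
    magnitude = abs(score_pawns)
    if magnitude > 1.5:
        level = 3          # decisive advantage
    elif magnitude > 0.7:
        level = 2          # serious advantage
    elif magnitude >= 0.27:
        level = 1          # small advantage
    else:
        level = 0          # equality
    white = ["=", "+/=", "+/-", "+-"]
    black = ["=", "=/+", "-/+", "-+"]  # mirrored symbols: an assumption
    return (white if score_pawns >= 0 else black)[level]
```

For instance, the -0.7 score mentioned earlier in the thread maps to `=/+` under these ranges, since 0.7 is the upper end of the small-advantage band.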
