engine evaluation and chess informant symbols

Discussion of chess software programming and technical issues.

Moderator: Ras

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: engine evaluation and chess informant symbols

Post by Don »

casaschi wrote:I'm also doing some experiments in the same direction as Don. From the page Don linked I found a good file with about 100 games with chess informant annotations, good quality and a reasonable number of positions assessed, about 700, more or less 100 for each of the 7 symbols. Still not statistically significant, but good enough to practice with.

I just started, but already have some comments looking at the evaluation results as they are produced:

1) What amount of time should be used for the position eval? I'm using a 1-minute evaluation with garbochess; that's probably needed for a good stable result, but at the same time it will tune the assessment at a level that most users will never use... most users of garbochess with my application will let the engine run for a much shorter time.
I ran to 10 ply and averaged the last 2 iterations, 9 and 10. I don't think this matters that much though. You could run it to 2 and 3 ply and the results would come out very similar. The deeper searches will generally have the same scores as the shallow searches, but of course the primary difference is the occasional position where the deeper search discovers something important. In those few cases the deeper search will be more accurate.

2) The engine does not really get some of the positions... or the annotator was wrong in the assessment. How to spot and exclude those positions, since they clearly should not be used here? I mean, if I try to decide the appropriate eval thresholds for the +/= sign and garbochess thinks that a given position is actually better for Black, that position should clearly not be used to tune the +/= sign. At the same time, too much of this manual selection of positions will likely let me steer the results anywhere I want them to be...
This is why I used the median of a larger sample. There will clearly be some mis-evaluations by both human and computer, but the median values will be very stable and reliable. To get the median I sorted by score and took the value right in the middle. If there is an even number of entries you average the 2 scores in the middle.
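A minimal sketch of the median procedure described here: sort the scores and take the middle value, averaging the two middle values when the count is even. The centipawn scores in the example are hypothetical, not data from the thread.

```python
def median(scores):
    """Return the median of a list of (hypothetical) centipawn scores."""
    s = sorted(scores)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return s[mid]
    # Even count: average the two middle values.
    return (s[mid - 1] + s[mid]) / 2

print(median([40, 10, 90]))       # odd count -> 40
print(median([40, 10, 90, 20]))   # even count -> (20 + 40) / 2 = 30.0
```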

3) I'm even wondering if all this actually makes sense. In other words, does it make sense to use the engine score as an absolute position evaluation? Looking at some positions, the answer is not so obvious... for example, garbochess seems to put a significant premium on the bishop pair. So an opening position where Black has the bishop pair and doubled pawns is assessed by chess informant as slightly better for White, while garbochess scores the position as -0.7 pawns. Could it be that the engine eval cannot actually be assumed to be an objective evaluation of the position, but only a tool that drives the engine to make the best move selection for it? Translating evals into chess informant symbols somehow assumes that positions with similar engine evals have similar winning chances... but that might not be true at all, and a +0.2 in position A might mean something completely different than a +0.2 in a very different position B.
I strongly suspect that a computer will be far more reliable than a human at doing this. Humans are wrong a lot and the computer will also be wrong from time to time. But don't expect to get this perfect - it is not possible.
If the computer has some unreasonable bias it probably just needs an adjustment, but I would not worry too much about this. Please note that every human player has some biases of his own.

Some programs have a serious odd/even scoring effect, especially at shallow depths. So I would suggest you average the scores of the previous 2 iteration completions to cancel out this effect.
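The averaging suggested here can be sketched in a few lines. This is an illustration of the idea, not code from any engine mentioned in the thread; the scores are hypothetical per-ply iteration results.

```python
def smoothed_score(iteration_scores):
    """Average the last two completed iteration scores to cancel out
    odd/even-ply swings; fall back to the last score if only one exists.
    iteration_scores: centipawn scores, one per completed search depth."""
    if len(iteration_scores) >= 2:
        return (iteration_scores[-1] + iteration_scores[-2]) / 2
    return iteration_scores[-1]

# An odd/even oscillating engine: raw final score 22, smoothed (34+22)/2 = 28.0
print(smoothed_score([30, 18, 34, 22]))  # -> 28.0
```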

It's ok to say "+/= according to garbochess", and a bare "+/=" implicitly means the same thing: the assessment is from the point of view of a biased judge. There is no such thing as an objective measure here. People enjoy having feedback from the computer and understand that it's not perfect.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
casaschi
Posts: 164
Joined: Wed Dec 23, 2009 1:57 pm

Re: engine evaluation and chess informant symbols

Post by casaschi »

Don wrote:It's ok to say "+/= according to garbochess", and a bare "+/=" implicitly means the same thing: the assessment is from the point of view of a biased judge. There is no such thing as an objective measure here. People enjoy having feedback from the computer and understand that it's not perfect.
The more I think about it, the more I'm concerned about the extent to which the engine evaluation can be used to compare very different positions, or whether it's just designed to drive the engine's move choice in a given position.
Again, the question is: does it make sense to associate an evaluation symbol (as some sort of winning probability estimate) with any position based only on the engine's numeric evaluation? In other words, are we sure that +0.5 should always correspond to the same winning chances in ANY position? Or is the engine eval only useful for comparing deltas between similar positions, so that when comparing scores for very different positions those values might drift?
Ralf Müller
Posts: 127
Joined: Sat Dec 29, 2012 12:07 am

Re: engine evaluation and chess informant symbols

Post by Ralf Müller »

I think the best example is the 0.00 evaluation.
For the engine it doesn't matter whether there are two kings on the board or a complicated draw in 20 moves. But the more equal position is, in my opinion, clearly the two kings. These are all different approaches to eval... (result/chances)
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: engine evaluation and chess informant symbols

Post by Don »

casaschi wrote:
Don wrote:It's ok to say "+/= according to garbochess", and a bare "+/=" implicitly means the same thing: the assessment is from the point of view of a biased judge. There is no such thing as an objective measure here. People enjoy having feedback from the computer and understand that it's not perfect.
The more I think about it, the more I'm concerned about the extent to which the engine evaluation can be used to compare very different positions, or whether it's just designed to drive the engine's move choice in a given position.
Again, the question is: does it make sense to associate an evaluation symbol (as some sort of winning probability estimate) with any position based only on the engine's numeric evaluation? In other words, are we sure that +0.5 should always correspond to the same winning chances in ANY position? Or is the engine eval only useful for comparing deltas between similar positions, so that when comparing scores for very different positions those values might drift?
You are over-thinking this, but the answer is that +0.5 should correspond to the same winning chances in any position. The extent to which it does or does not has to do with the quality of the evaluation function. In Komodo we try to make it so. For any given program, if it does not, that's an indication the evaluation function is broken, because it will make incorrect decisions based on this disparity. We accept a certain amount of that because we know that evaluation is a black art and it's far from perfect. It's not a major problem in most positions; the whole point of the evaluation is to compare 2 positions and determine which is better. So there is nothing wrong with expecting this behavior to be reasonably correct.
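One common way to make "same score, same winning chances" concrete is a logistic curve from centipawns to expected game score (win probability plus half the draw probability). This is a hypothetical illustration, not the actual model of Komodo, garbochess, or anything else in this thread, and the 400-point scale constant is an assumption:

```python
import math

def expected_score(centipawns, scale=400.0):
    """Map a centipawn eval to an expected score in [0, 1].
    scale is an assumed tuning constant, not an engine-specific value."""
    return 1.0 / (1.0 + 10.0 ** (-centipawns / scale))

print(expected_score(0))    # dead-even eval -> 0.5
print(expected_score(50))   # a +0.5-pawn eval -> somewhat above 0.5
```

Under a mapping like this, a fixed eval such as +0.5 is asserted to mean the same winning chances everywhere, which is exactly the property being debated.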

Everything I just said here also applies to humans. Don't think they have no bias in this regard.

The simplest thing for you to do here is to not display these glyphs and simply display the garbochess score. If you want to be obsessive-compulsive about this, you should realize that it doesn't fix anything: the score will be right sometimes and wrong sometimes, but on average it will be a pretty good indication of who stands better and by how much.

I get the feeling that there is nothing you could do which would make you happy here. Welcome to computer chess! I feel the same way about Komodo!
casaschi
Posts: 164
Joined: Wed Dec 23, 2009 1:57 pm

Re: engine evaluation and chess informant symbols

Post by casaschi »

Did some testing with a file of about 100 games from chess informant, containing about 750 of the annotation symbols: -+ -/+ =/+ = +/= +/- +-
I ran garbochess for about one minute per position and took the evaluation at the end. Then I removed the obvious "disagreements" between the annotator and garbochess (endings can be a problem: the annotator can see a +- while the engine does not see the win yet, but only a +0.5 advantage, probably enough to drive the engine to a win though).
Finally I assumed symmetry for white/black symbols and evaluations, so for instance I merged the eval values for +- and -+ (after changing the sign of the latter, of course).

Even then, the results are all over the place and inconclusive for the aim of tuning the eval thresholds between the symbols. There is a major overlap between the evaluation values for +/- and +/=, and between those for +/= and =. Also, the distributions for +/- and +/= have multiple peaks and hardly resemble anything "normal".

I also tried a "best fit", but because of the overlap there is no clear winner: out of about 500 scores (excluding obvious disagreements and also obvious matches, such as a forced mate that will be scored as +- no matter what the threshold selection is), a large set of very different thresholds produces a number of matches between 250 and 270.
It is surprising, and somewhat discouraging, that the "best fit" has such a low rate of matches. This sort of goes back to the question of whether it makes any sense to translate an engine eval into one of the seven annotation symbols. I'm more and more skeptical about the value of this exercise; on the other hand, the alternative of showing the user the actual engine eval instead is even worse (what does the general user know about the meaning of a +0.3 from garbochess?).

The only useful info I could confirm (as suggested by others in this thread) is that my threshold values 0.35/1.35/3.95 are generally too high; especially the last one is shown by the data to be lower (not so much the others, because of the wide range of best matches, but if 3.95 is lowered then you need to re-space the other thresholds).
I changed the values in my working page to 0.25/0.75/1.75. I'll leave those on for testing for a while and see if anything odd comes up.
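The kind of brute-force "best fit" described above can be sketched as follows. This is a guess at the procedure, not casaschi's actual code: three thresholds (mirrored for Black, as in the symmetry assumption above) translate a pawn-unit score into one of the seven symbols, and every ordered threshold triple from a candidate list is scored by how often it matches the annotator.

```python
from itertools import product

def symbol(score, t1, t2, t3):
    """Translate a pawn-unit score into an informant symbol, assuming
    white/black symmetry around zero."""
    if score <= -t3: return '-+'
    if score <= -t2: return '-/+'
    if score <= -t1: return '=/+'
    if score <  t1:  return '='
    if score <  t2:  return '+/='
    if score <  t3:  return '+/-'
    return '+-'

def best_thresholds(data, candidates):
    """data: list of (engine_score, annotator_symbol) pairs.
    Tries every increasing triple of candidate thresholds and returns
    the triple with the most matches, plus its match count."""
    best, best_hits = None, -1
    for t1, t2, t3 in product(candidates, repeat=3):
        if not (t1 < t2 < t3):
            continue
        hits = sum(symbol(s, t1, t2, t3) == tag for s, tag in data)
        if hits > best_hits:
            best, best_hits = (t1, t2, t3), hits
    return best, best_hits

# Tiny made-up sample, just to show the shape of the search:
data = [(0.1, '='), (0.5, '+/='), (1.0, '+/-'), (2.5, '+-'), (-0.5, '=/+')]
print(best_thresholds(data, [0.25, 0.75, 2.0]))
```

On real data the flat 250-270 match plateau reported above would show up as many triples with nearly identical hit counts.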
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: engine evaluation and chess informant symbols

Post by Don »

casaschi wrote:Did some testing with a file of about 100 games from chess informant, containing about 750 of the annotation symbols: -+ -/+ =/+ = +/= +/- +-
I ran garbochess for about one minute per position and took the evaluation at the end. Then I removed the obvious "disagreements" between the annotator and garbochess (endings can be a problem: the annotator can see a +- while the engine does not see the win yet, but only a +0.5 advantage, probably enough to drive the engine to a win though).
Finally I assumed symmetry for white/black symbols and evaluations, so for instance I merged the eval values for +- and -+ (after changing the sign of the latter, of course).

Even then, the results are all over the place and inconclusive for the aim of tuning the eval thresholds between the symbols. There is a major overlap between the evaluation values for +/- and +/=, and between those for +/= and =. Also, the distributions for +/- and +/= have multiple peaks and hardly resemble anything "normal".
That would also be true of human annotators. A human annotates as he sees fit, and there would likely be wild disagreement between any two of them. I am convinced that a program with a good evaluation would be far more consistent and reliable than strong players would be.

You need to realize that this is a black art, as they say. An annotation symbol is just someone's opinion and it's not precise. If you can get past that, you will see there is no problem.

I also tried a "best fit", but because of the overlap there is no clear winner: out of about 500 scores (excluding obvious disagreements and also obvious matches, such as a forced mate that will be scored as +- no matter what the threshold selection is), a large set of very different thresholds produces a number of matches between 250 and 270.
It is surprising, and somewhat discouraging, that the "best fit" has such a low rate of matches. This sort of goes back to the question of whether it makes any sense to translate an engine eval into one of the seven annotation symbols. I'm more and more skeptical about the value of this exercise; on the other hand, the alternative of showing the user the actual engine eval instead is even worse (what does the general user know about the meaning of a +0.3 from garbochess?).
It's possible that garbochess has such bad evaluation that it will indeed be problematic. If it is missing a lot of concepts such as sophisticated pawn structure it will be like a 1200 player annotating games, but perhaps tactically better than a 1200 player. It would be good to duplicate your experiment with a stronger program just to get a sense of what the issues are. What we should do is create a database of fen positions with some information about the annotation that we can pass around. Can you send me the pgn file you are using?

The only useful info I could confirm (as suggested by others in this thread) is that my threshold values 0.35/1.35/3.95 are generally too high; especially the last one is shown by the data to be lower (not so much the others, because of the wide range of best matches, but if 3.95 is lowered then you need to re-space the other thresholds).
I changed the values in my working page to 0.25/0.75/1.75. I'll leave those on for testing for a while and see if anything odd comes up.
casaschi
Posts: 164
Joined: Wed Dec 23, 2009 1:57 pm

Re: engine evaluation and chess informant symbols

Post by casaschi »

Don wrote:Can you send me the pgn file you are using?
I used the positions with an annotation symbol from this file:
http://www.angelfire.com/games3/smartbr ... ormant.zip
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: engine evaluation and chess informant symbols

Post by Don »

casaschi wrote:
Don wrote:Can you send me the pgn file you are using?
I used the positions with an annotation symbol from this file:
http://www.angelfire.com/games3/smartbr ... ormant.zip
I just get a 404 message from my browser with that address.
casaschi
Posts: 164
Joined: Wed Dec 23, 2009 1:57 pm

Re: engine evaluation and chess informant symbols

Post by casaschi »

Don wrote:
casaschi wrote:
Don wrote:Can you send me the pgn file you are using?
I used the positions with an annotation symbol from this file:
http://www.angelfire.com/games3/smartbr ... ormant.zip
I just get a 404 message from my browser with that address.
Maybe they don't allow links from other pages... try going to
http://www.angelfire.com/games3/smartbridge/
and look for the file labeled "D00 Opening"

If this fails, let me know an email address and I'll send the files to you.
casaschi
Posts: 164
Joined: Wed Dec 23, 2009 1:57 pm

Re: engine evaluation and chess informant symbols

Post by casaschi »

...one more thing on the subject of translating engine evals into chess informant notations: what would be the criteria for assigning the ? and ?? signs?

The first idea would be to look for moves that cause an evaluation drop against the player that just moved; for example, with Black to move the engine eval is -1, and after Black's move the engine eval goes to -0.3: that would be an evaluation drop of 0.7 against Black.

Any suggestions other than this for assigning the ? and ?? symbols?

If that model is used, I'm also wondering what would be sensible eval-drop thresholds for ? and ??.

This could possibly be added to the quantitative analysis, correlating engine eval drops with the marks assigned by trustworthy annotators.

Another approach would be to link those values to the thresholds for -+, -/+, =/+ and so on, so that a ? would more or less correspond to making the evaluation score worse by one step, and a ?? to making it worse by two steps or more... or something similar...
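The eval-drop idea above can be sketched as a small function. The 0.5 and 1.5 pawn thresholds are placeholders to be tuned against annotated games, not values anyone in this thread has validated:

```python
def move_mark(eval_before, eval_after, mistake=0.5, blunder=1.5):
    """Assign ? or ?? from the eval drop across one move.
    Both evals are in pawn units from the mover's point of view;
    the mistake/blunder thresholds are assumptions, not tuned values."""
    drop = eval_before - eval_after
    if drop >= blunder:
        return '??'
    if drop >= mistake:
        return '?'
    return ''

# The example from the post: Black to move at -1 (i.e. +1 for Black),
# and after the move the eval is -0.3 (+0.3 for Black): a 0.7 drop.
print(move_mark(1.0, 0.3))  # -> ?
```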