I ran to 10 ply and averaged the last 2 iterations, 9 and 10. I don't think this matters that much though. You could run it to 2 and 3 ply and the results would come out very similar. The deeper searches will generally have the same scores as the shallow searches, but of course the primary difference is the occasional position where the deeper search discovers something important. In those few cases the deeper search will be more accurate.

casaschi wrote:
I'm doing some experiments also in the same direction as Don. From the page Don linked I found a good file with about 100 games with Chess Informant annotations, good quality and a reasonable number of positions assessed, about 700, more or less 100 for each of the 7 symbols. Still not statistically significant, but good enough to practice with.
I just started, but I already have some comments from looking at the evaluation results as they are produced:
1) What amount of time should be used for the position eval? I'm using a 1-minute evaluation with garbochess, which is probably needed for a good, stable result, but at the same time it will tune the assessment at a level that most users will never use... most users of garbochess with my application will let the engine run for a much shorter time.
This is why I used the median of a larger sample. There will clearly be some mis-evaluation by both human and computer, but the median values will be very stable and reliable. To get the median I sorted by score and used the value right in the middle. If there is an even number of entries you average the 2 scores in the middle.
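In code, that median computation looks roughly like this (just a sketch; the function name and the example scores are only illustrative):

```python
def median_score(scores):
    """Median of a list of engine scores (in pawns).

    Sort the scores and take the value right in the middle; with an
    even number of entries, average the two middle values.
    """
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Illustrative scores for a set of positions sharing one informant symbol
print(median_score([0.15, 0.22, 0.31, 0.18, 0.90]))  # middle value: 0.22
print(median_score([0.15, 0.22, 0.31, 0.18]))        # average of the two middle values, 0.18 and 0.22
```

The median is what keeps the occasional badly mis-evaluated position (by either the human annotator or the engine) from dragging the threshold around, which is the point being made above.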
2) The engine does not really get some of the positions... or the annotator was wrong in the assessment. How do I spot and exclude those positions, since they clearly should not be used here? I mean, if I try to decide the appropriate eval thresholds for the +/= sign and garbochess thinks that a given position is actually better for Black, that position should clearly not be used to tune the +/= sign. At the same time, too much of this manual selection of positions will likely let me steer the results anywhere I want them to be...
I strongly suspect that a computer will be far more reliable than a human at doing this. Humans are wrong a lot and the computer will also be wrong from time to time. But don't expect to get this perfect - it is not possible.
3) I'm even wondering if all this actually makes sense. In other words, does it make sense to use the engine score as an absolute position evaluation? Looking at some positions, the answer is not so obvious... for example, garbochess seems to put a significant premium on the bishop pair. So an opening position where Black has the bishop pair and doubled pawns is assessed by Chess Informant as slightly better for White, while garbochess scores the position as -0.7 pawns. Could it be that the engine eval cannot be taken as an objective evaluation of the position, but only as a tool that drives the engine to select the best move for itself? Translating evals into the Chess Informant symbols somehow assumes that positions with similar engine evals have similar winning chances... but that might not be true at all, and a +0.2 in position A might mean something completely different than a +0.2 in a very different position B.
If the computer has some unreasonable bias it probably just needs an adjustment, but I would not worry too much about this. Please note that every human player has some biases of his own.
Some programs have a serious odd/even scoring effect, especially at shallow depths. So I would suggest you average the scores of the previous 2 iteration completions to cancel out this effect.
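For example, something like this (a rough sketch; it assumes you already collect the score the engine reports at each completed iteration depth):

```python
def smoothed_score(scores_by_depth):
    """Average the scores of the last two completed iterations
    to cancel out odd/even parity effects.

    scores_by_depth: dict mapping search depth -> score in pawns,
    e.g. {8: 0.31, 9: 0.18, 10: 0.27}.
    """
    depths = sorted(scores_by_depth)
    if len(depths) == 1:
        return scores_by_depth[depths[0]]
    last, prev = depths[-1], depths[-2]
    return (scores_by_depth[last] + scores_by_depth[prev]) / 2

print(smoothed_score({9: 0.18, 10: 0.27}))  # (0.18 + 0.27) / 2
```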
It's ok to say "+/= according to garbochess"; saying just "+/=" implicitly means the same thing, namely that this is from the point of view of a biased judge. There is no such thing as an objective measure here. People enjoy having feedback from the computer and understand that it's not perfect.
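To illustrate the kind of mapping being discussed, here is a small sketch; the threshold values are invented for illustration only, since finding the real ones is the whole point of the calibration exercise:

```python
# Hypothetical thresholds (in pawns, from White's point of view).
# These numbers are placeholders; the calibration against annotated
# positions is what would produce the real ones.
INFORMANT_THRESHOLDS = [
    (1.50, "+-"),   # White is winning
    (0.70, "+/-"),  # White is clearly better
    (0.25, "+/="),  # White is slightly better
    (-0.25, "="),   # roughly equal
    (-0.70, "=/+"), # Black is slightly better
    (-1.50, "-/+"), # Black is clearly better
]

def informant_symbol(score):
    """Map an engine score to a Chess Informant symbol."""
    for threshold, symbol in INFORMANT_THRESHOLDS:
        if score >= threshold:
            return symbol
    return "-+"  # Black is winning

print(informant_symbol(0.30))   # -> "+/="
print(informant_symbol(-0.10))  # -> "="
```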
