The problem with almost all test positions is that they require both an evaluation-specific feature and a tree search. You want the former to be critical, and the latter to be irrelevant, for this kind of testing. Pushing a pawn to "cement" a black pawn on a weak square is something an evaluation could find on its own. Or, if it is not that clever, it could depend on a short tree search to see that the pawn is sorta-weak: if the opponent can push it and trade it quickly, the weakness goes away, while if we play to prevent its movement, the weakness stays.

MattieShoes wrote:
That's kind of what I was getting at. They went through a lot of work to make their eval tuning work, and the paper details some of the pitfalls, such as culling positions where the chosen move's score is wildly different from the best move's score, and how deeper searches yield better results. The functions they were using to measure the quality of an eval could be used just as easily to rank the quality of different engines.
They also point out that the tuning helped, but that the most "tuned" versions underperformed. I'm guessing the eval was getting the right answers for the wrong reasons, so even with care you're likely to get outliers whose strength is not well represented by their score.
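A rough sketch of the culling step mentioned in the quote might look like the following. The threshold, field names, and sample data are all invented for illustration; the paper's actual filtering criteria may differ.

```python
# Hypothetical sketch: discard training positions where the score of the move
# actually played deviates too far from the score of the engine's best move.

CULL_THRESHOLD = 50  # centipawns; illustrative value only

def cull_positions(positions, threshold=CULL_THRESHOLD):
    """Keep only positions whose played-move score is close to the best-move score."""
    return [p for p in positions
            if abs(p["played_score"] - p["best_score"]) <= threshold]

# Example: the middle position (played move 120 cp worse than best) is dropped.
sample = [
    {"fen": "...", "played_score": 30,  "best_score": 35},
    {"fen": "...", "played_score": -90, "best_score": 30},
    {"fen": "...", "played_score": 10,  "best_score": 10},
]
kept = cull_positions(sample)
print(len(kept))  # 2
```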
If you want to tune your evaluation, you need to tune against positions where the correct evaluation is known and no search is required. Then you can _really_ tune your eval to adjust its score to match what is known to be correct. But test positions rarely allow that: they require search and evaluation together, and search can often compensate for mis-evaluations, which makes evaluation tuning irrelevant in such positions.
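As an illustration of tuning against positions with known-correct scores (no search involved), here is a minimal sketch: a toy linear evaluation whose weights are fit by gradient descent on the squared error against the known scores. The feature vectors, target scores, and learning rate are all invented; a real evaluation has far more terms.

```python
def evaluate(weights, features):
    """Toy linear evaluation: a weighted sum of position features."""
    return sum(w * f for w, f in zip(weights, features))

def tune(weights, data, lr=0.01, epochs=200):
    """Plain per-position gradient descent on squared error vs. known score."""
    w = list(weights)
    for _ in range(epochs):
        for features, target in data:
            err = evaluate(w, features) - target
            for i, f in enumerate(features):
                w[i] -= lr * err * f
    return w

# Invented data: (feature vector, known-correct score in centipawns).
data = [
    ([1.0, 0.0, 2.0], 150.0),
    ([0.0, 1.0, 1.0], -40.0),
    ([2.0, 1.0, 0.0],  80.0),
]
w0 = [0.0, 0.0, 0.0]
w1 = tune(w0, data)

def mse(w):
    return sum((evaluate(w, f) - t) ** 2 for f, t in data) / len(data)

print(mse(w1) < mse(w0))  # True: tuned weights fit the known scores better
```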
I gave up on this type of testing years ago. Even playing fast games can be very misleading about a change. You can make a program very aggressive with passed pawn pushes, and a shallow-searching opponent will get into trouble, so the aggressive pawn pushing looks good. But in longer games, all it does is advance the pawn to where it is easier to capture, and it can look much worse...
That's why I test with fast games and verify with slow games. And occasionally I even retest with slow games when fast games look bad, just to be sure that a change which looks good intuitively, but shows up as bad in fast games, is also bad in longer games.
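The fast-then-slow scheme could be sketched roughly like this. Here play_match() is just a placeholder standing in for a real engine-vs-engine match runner, and its numbers are invented; the point is only the two-stage accept/reject flow.

```python
# Sketch of two-stage testing: screen a change with many fast games,
# then confirm the verdict with fewer slow games before accepting it.

def play_match(games, time_control):
    """Placeholder: return the new version's score out of `games` games.
    Hard-coded for illustration; a real harness launches actual games."""
    # Pretend the change scores 52% at fast TC but only 48% at slow TC.
    return games * (0.52 if time_control == "fast" else 0.48)

def evaluate_change(fast_games=1000, slow_games=200, threshold=0.5):
    fast_score = play_match(fast_games, "fast") / fast_games
    if fast_score <= threshold:
        return "reject (fast games)"
    # Fast games look good -- verify at a slower time control.
    slow_score = play_match(slow_games, "slow") / slow_games
    if slow_score <= threshold:
        return "reject (slow games disagreed)"
    return "accept"

print(evaluate_change())  # reject (slow games disagreed)
```

The flow mirrors the point above: a change that looks good at fast time controls is not trusted until slow games agree.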