bob wrote: A large number of games, using a large number of starting positions (playing each twice with alternating colors to eliminate bias from unbalanced positions), and with a significant number of different opponents is the way to go. You need more than one or two opponents, as I have made changes that helped against A but hurt against B, C and D. If you only play A, you can reach the wrong conclusion and make a tuning mistake that will hurt in a real tournament.

zamar wrote: Ouch. I'm currently testing an automated tuning system, where the basic idea is to test whether a randomly modified engine can beat the original one. But if what you say is true, then I'm doomed to fail, because the changes might make the engine play worse against all the other opponents. Is that kind of behaviour common? I guess you must have tested many variables for your paper.

Anthony Cozzie worked on this idea using simulated annealing several years ago. What we got, after I modified Crafty to make each evaluation term individually changeable via a command, was a lot of noise, caused by the inherent randomness chess engines show when time is used as the search-limiting control.
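For concreteness, the kind of loop being described looks roughly like the sketch below: randomly perturb one eval term, play a match against the unmodified engine, and accept the change with the usual simulated-annealing rule. This is only a hypothetical illustration, not the harness Cozzie or I actually used; play_match() is a stub (here it just returns a noisy 50% result), and the term names and step sizes are made up.

Code:

import math
import random

def play_match(candidate_terms, baseline_terms, games=200):
    # Placeholder: would play a real match and return the candidate's
    # score fraction; here it just returns 0.5 plus match-level noise.
    return 0.5 + random.gauss(0.0, 0.5 / math.sqrt(games))

def anneal(terms, steps=100, start_temp=0.05):
    baseline = dict(terms)
    current = dict(terms)
    current_score = 0.5
    for i in range(steps):
        temp = start_temp * (1.0 - i / steps)
        candidate = dict(current)
        # Perturb one randomly chosen evaluation term by a small amount.
        term = random.choice(list(candidate))
        candidate[term] += random.choice((-1, 1)) * random.randint(1, 5)
        score = play_match(candidate, baseline)
        # Always accept improvements; accept small regressions with a
        # temperature-dependent probability.
        if score >= current_score or random.random() < math.exp((score - current_score) / temp):
            current, current_score = candidate, score
    return current

print(anneal({"pawn_structure": 10, "king_safety": 25, "mobility": 8}))

The problem is exactly the one noted above: with realistic match sizes, the noise term in play_match() swamps the tiny real differences the loop is trying to detect.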
I currently do _some_ eval tuning by running successive matches (32K games, one hour per match) with different values, to see which direction the term needs to move for improvement...
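The shape of that "probe both directions" idea is sketched below. Everything here is hypothetical: run_match() fakes a noisy match against a made-up optimum just so the sketch runs standalone, which also shows how easily noise can point the "improvement" the wrong way at small sample sizes.

Code:

import random

def run_match(term_value, games=1000, true_best=50.0):
    # Fake match: expected score is highest when term_value equals true_best,
    # but every game adds noise, much as real time-limited games do.
    expected = 0.5 - 0.0005 * abs(term_value - true_best)
    wins = sum(random.random() < expected for _ in range(games))
    return wins / games

def probe_direction(current, step=10, games=1000):
    # Try current-step, current and current+step; report which scored best.
    scores = {v: run_match(v, games) for v in (current - step, current, current + step)}
    best = max(scores, key=scores.get)
    return best, scores

print(probe_direction(current=30))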
Until I started this about 2 years ago, I had no idea how hard it would be to actually determine whether a change is good or bad, unless it is a _major_ change...
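A rough back-of-the-envelope calculation shows why: the 2-sigma error bar on a match score shrinks only with the square root of the number of games. The sketch below assumes a per-game standard deviation of 0.5 (all decisive games; draws make the bar somewhat smaller), and converts the score margin to Elo with the standard logistic formula.

Code:

import math

def elo_from_score(score):
    # Standard logistic conversion from score fraction to Elo difference.
    return -400.0 * math.log10(1.0 / score - 1.0)

def error_bar_elo(games, sigma_per_game=0.5):
    # 2-sigma (roughly 95%) margin on the score of an even match.
    margin = 2.0 * sigma_per_game / math.sqrt(games)
    return elo_from_score(0.5 + margin)

for n in (1000, 8000, 32000):
    print(n, round(error_bar_elo(n), 1))

Under those assumptions the margin is still around 22 Elo after 1,000 games and only gets down to roughly 4 Elo after 32,000, which is why anything short of a _major_ change disappears into the noise.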
