rjgibert wrote:
I think you're overlooking something rather obvious. You say that to make sure, you test with slow games, but this is not possible. The number of changes you have made to Crafty is legion, and most of them you tested on much slower hardware. You might do some retesting with your faster hardware from time to time, but you can't repeat all the tests you have made for all of your changes. In short, yesteryear's slow-game testing is effectively fast-game testing today, so whether you like it or not, you must do some type of extrapolating, whether you are aware of it or not. The best you can do is settle on some rational, coherent way of doing something like I suggest, rather than pretend your slow-game testing makes what you do all that different. The fact that hardware is continually improving at least as fast as the software screws your scheme up. This is just one of many fundamental problems of testing in computer chess.

If you don't like scaling with 2 fast time-control games, maybe you would prefer testing with 3 fast time controls to see whether the scaling curves upward or downward. The thing to do is to measure the general reliability of any such procedure to see just how useful it is.

I think I have explained my testing pretty carefully, albeit in many different posts. I deal with two issues differently:
(1) eval changes. I test and tune using fast games, and then confirm that things are still better with slower games. I often make several different changes, testing each with fast time controls to make sure each one produces an improvement. Then I check the final product with a slower match to make sure no regressive change has slipped in, such as the one I have previously mentioned dealing with the 1985 ACM tournament and the 1986 WCCC event.
(2) search changes. Here I am more cautious and check at several time controls. I start with fast games. If I am unsure what to expect from the idea, and fast time controls show worse results or no improvement at all, I may stop there. If my intuition suggests the idea ought to work, I may well go to longer and longer time controls to see how it looks.
For the last 2-3 years, I have been using the same cluster hardware. You are correct that what I can do with one CPU there at long time controls is more like blitz on an 8-16 core box. So I obviously can't test with next year's search depths in mind, and I am sure I take a hit here and there as a result. But I feel much more comfortable erring on the side I am on, which is to base my final decision on longish games with today's hardware, rather than basing it on short games and trying to measure the slope of the Elo change from very fast to fast games. I prefer to include slow games today, knowing that 2-3 years from now I will probably want to retest some of the things I am doing...
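To be clear about what I mean by "measuring the slope": something along the lines of the sketch below, where you convert match scores at two fast time controls to Elo and extrapolate the trend in log(time) out to a slower control. The time controls and scores here are made-up numbers purely for illustration, not actual test data, and this is the procedure I am skeptical of, not one I am recommending.

```c
#include <math.h>
#include <stdio.h>

/* Convert a match score (fraction of points won) into an Elo difference. */
static double score_to_elo(double score) {
  return -400.0 * log10(1.0 / score - 1.0);
}

int main(void) {
  /* Hypothetical numbers: seconds per game at three time controls, and the
     scores a change produced at the two fast ones. */
  double t_fast = 10.0, t_blitz = 60.0, t_slow = 600.0;
  double score_fast = 0.530, score_blitz = 0.520;

  double elo_fast  = score_to_elo(score_fast);
  double elo_blitz = score_to_elo(score_blitz);

  /* The "slope" idea: assume the Elo trend in log(time) from very fast to
     fast games continues out to the slow time control. */
  double slope    = (elo_blitz - elo_fast) / (log(t_blitz) - log(t_fast));
  double elo_slow = elo_blitz + slope * (log(t_slow) - log(t_blitz));

  printf("fast: %+.1f Elo, blitz: %+.1f Elo, extrapolated slow: %+.1f Elo\n",
         elo_fast, elo_blitz, elo_slow);
  return 0;
}
```

My objection is that nothing guarantees the trend is anywhere near linear, which is exactly why I want real games at the slow control rather than the extrapolated number.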
If you want to claim that testing is impossible where some changes are depth-sensitive, you have a point. However, again, which would you trust more: intuition based on bullet vs. blitz, or actual results based on bullet, blitz _and_ standard time controls, even if you are limited by today's hardware?
I find eval changes pretty easy to evaluate. Most behave as expected and produce the same + or - regardless of the time control, which makes testing them quite fast and simple. Search changes are much different. I'd offer the "history pruning" idea from the first version of Fruit that used it as an example. The "history counter" is useless. Completely useless. That is based on millions of games of testing, playing Fruit against other programs while adjusting the history threshold everywhere from 0 to 100. If you turn LMR off entirely, it hurts, maybe 40 Elo. But if you leave it on and just try to optimize the history pruning threshold, there is no correlation between the setting and Elo. Or maybe you gain 1 or 2 Elo, which takes beyond 100K games to measure, and even then there is statistical uncertainty left.

I've been working on LMR for a couple of months now, and to date I have found _nothing_ that provably helps Crafty play better. I thought I had last week, but longer time controls said "no". Relying on very fast and fast time controls can be quite misleading, based on the results I have been getting when dealing with search-related changes.
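For anyone not familiar with the idea, here is a rough sketch of the kind of reduction rule being discussed. This is simplified C, not Fruit's or Crafty's actual code; the names, the move-count cutoff, and the threshold value are just placeholders for the knobs being tuned.

```c
/* Simplified sketch of history-gated late move reductions (LMR).
   Not Fruit's or Crafty's actual code; numbers are placeholders.
   The history[] counters are assumed to be maintained elsewhere in the
   search (e.g. bumped toward 100 when a quiet move causes a cutoff). */

#define FULL_DEPTH_MOVES   4    /* never reduce the first few moves at a node */
#define HISTORY_THRESHOLD 60    /* the 0..100 knob being tuned in the tests   */

static int history[64][64];     /* per from/to-square success rate, 0..100    */

/* Returns the number of plies to reduce the search for this move. */
int lmr_reduction(int move_number, int from, int to,
                  int depth, int in_check, int is_capture) {
  if (in_check || is_capture)           /* tactical moves: never reduced       */
    return 0;
  if (move_number < FULL_DEPTH_MOVES)   /* early moves: searched at full depth */
    return 0;
  if (depth < 3)                        /* too close to the leaves             */
    return 0;
  if (history[from][to] >= HISTORY_THRESHOLD)
    return 0;                           /* "good" history keeps the full depth */
  return 1;                             /* late, quiet, poor history: reduce   */
}
```

The point in dispute is only that last test: in my results, removing the reduction itself costs real Elo, while the particular value of the history threshold makes no measurable difference.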