gingell wrote:
> I've been doing my testing along the following lines: Make a change, compile with and without that change, and have the new and the old versions play maybe 100 games against each other. I understand that the error bars are still pretty wide with this number of games, but I've found it's a good enough starting point to see whether a change should be pursued further.
> ... other stuff
> Thanks for any comment.

100 games is plenty if your improvement is worth several hundred ELO. At that scale, 100 games will reveal the improvement with fair certainty.
But the bad news is that most changes are worth less than 10 ELO, unless your program is pretty raw, in which case most changes are still going to be worth far less than 100 ELO. That means 100 games is almost worthless.
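To see why 100 games is almost worthless, here is a rough sketch (helper names are my own, not from any library) that converts a match score into an ELO estimate with a ~95% error bar. Draws are counted as half a win; treating the variance this crudely slightly overstates the uncertainty, but the order of magnitude is right:

```python
import math

def elo_from_score(p):
    """Convert a score fraction (0 < p < 1) to an ELO difference."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_interval(wins, losses, draws):
    """ELO estimate with a rough ~95% (2-sigma) interval from match results."""
    n = wins + losses + draws
    p = (wins + 0.5 * draws) / n           # score fraction, draws = half a point
    sigma = math.sqrt(p * (1.0 - p) / n)   # std error of the score fraction
    lo, hi = p - 2.0 * sigma, p + 2.0 * sigma
    return elo_from_score(lo), elo_from_score(p), elo_from_score(hi)

# A 55% score out of 100 games: the estimate is about +35 ELO, but the
# 2-sigma interval runs from roughly -35 to over +100 ELO.
print(elo_interval(45, 35, 20))
```

With only 100 games, the interval spans well over 100 ELO, so a 55% result is statistically indistinguishable from no improvement at all.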
The REALLY bad news is that measuring typical changes with certainty takes many thousands of games, usually something like 50 thousand. If you decide what to keep and what to throw away from samples much smaller than, say, 50,000 games, you will find that you are effectively making random changes to your program. Of course this does not apply if some of your changes are worth 40 or 50 ELO; in that case you might get away with just a few thousand games.
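You can estimate the required sample size directly. This sketch (again my own helper names, assuming worst-case variance and no draw adjustment) finds how many games it takes before a 2-sigma error bar on the score is smaller than the score shift a given ELO difference produces:

```python
import math

def score_from_elo(d):
    """Expected score fraction for an ELO advantage of d."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def games_needed(elo_diff, z=2.0):
    """Games needed so a z-sigma error bar on the score is smaller than
    the shift produced by elo_diff (worst-case sigma = 0.5/sqrt(n))."""
    shift = score_from_elo(elo_diff) - 0.5
    # need z * 0.5 / sqrt(n) < shift  =>  n > (0.5 * z / shift)^2
    return math.ceil((0.5 * z / shift) ** 2)

print(games_needed(10))  # roughly 5,000 games just to resolve 10 ELO
print(games_needed(2))   # well over 100,000 games for a 2-ELO change
```

Note that this is the bare minimum to tell the change apart from zero at one error bar; to *confirm* a gain with real confidence you want the error bar well under the effect size, which is how you land in the tens of thousands of games for typical small improvements.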
More bad news is that you cannot get a meaningful sample unless you have enormous CPU power at your disposal or you test fast. I am forced to test fast, even though I don't believe in fast testing.
I tend to test using fixed-depth searches, and occasionally super-fast Fischer time controls when I want to check out major versions. The slowest fixed-depth search I can get away with is 9 ply. My tester lets me play 18 games per minute on my quad at 9-ply searches.
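The arithmetic behind that throughput requirement is simple enough to do back-of-the-envelope:

```python
# How long a 50,000-game run takes at 18 games/minute (the quad,
# 9-ply figure quoted above).
games, per_minute = 50_000, 18
hours = games / per_minute / 60
print(f"{hours:.0f} hours")  # about 46 hours
```

So even at 18 games per minute, a statistically meaningful run ties up the machine for roughly two days, which is why slower time controls are simply out of reach without a cluster.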
Sometimes I want to find out something really fast. I can play many 3-ply games per second, and I sometimes use that to quick-test a concept, keeping in mind that the result is not to be trusted. It's just data I take with a grain of salt.