hgm wrote:I don't see how this observation in any way can be interpreted as a drawback of the fixed-nr-of-nodes testing method. IMO you would get exactly the same variability if you would compare testing at 20 sec per move vs 20.01 sec per move.
This observation is simply based on the central limit theorem of statistics. The population of possible games two programs can play is huge. Taking a _small_ sample of those games can end up choosing a sub-set of games that make the new version look better, worse, or the same compared to the old version. And with so many possible "small sample sizes" available, picking just one is just a random choice.
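The size of this sampling effect is easy to sketch numerically. Below is a minimal Monte Carlo sketch, assuming two exactly equal-strength engines; the win/draw/loss probabilities are assumptions chosen only for illustration, not measured from any real engine:

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def play_match(n_games=80, p_win=0.35, p_loss=0.35):
    """Simulate one 80-game match between two equal engines.
    Each game scores +1 (win), -1 (loss), or 0 (draw); the
    probabilities are assumptions for the sketch."""
    diff = 0
    for _ in range(n_games):
        r = random.random()
        if r < p_win:
            diff += 1
        elif r < p_win + p_loss:
            diff -= 1
    return diff

# repeat the 80-game match many times and look at the spread
results = [play_match() for _ in range(10000)]
mean = sum(results) / len(results)
sd = (sum((d - mean) ** 2 for d in results) / len(results)) ** 0.5
print(f"mean diff: {mean:+.2f}, standard deviation: {sd:.2f}")
```

With these assumptions the standard deviation of the 80-game score difference comes out near 7.5, so swings of -10 to +10 between identical engines are routine, which is the same order of spread as the real data below.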
If you used fixed nodes and make _any_ change to the search or evaluation, then even though you search the same number of nodes per move in the new test, the trees are differently shaped due to the changes you made. Again you take a tiny sample from a very large population and you can't reliably draw conclusions.
Here are some samples from Crafty test data, using the 40 silver positions, two games per position, alternating colors. Program identities are not provided, to avoid starting a tangential discussion:
Code:
run:  W/  D/  L  (W-L)
1: 30/ 16/ 34 ( -4)
2: 31/ 19/ 30 (  1)
3: 31/ 18/ 31 (  0)
4: 24/ 21/ 35 (-11)
5: 35/ 16/ 29 (  6)
Those 5 tests were run consecutively, same time control for each 80 game test, same starting positions, same everything. Both programs using a time limit per move. No pondering. No SMP search. No opening book. No tablebases. No other users on the machines being used whatsoever. All conditions repeated for each run as near to identical as possible.
Now which one of those "tells the truth"? You could pick two of them and conclude your changes are better, pick two others and conclude your changes are worse, or pick one and conclude there was no change at all.
That's why I said 80 games is _not_ enough. I have 4 programs I use to test, one of which is Crafty. Any two of them show this behavior when played against each other.
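To put rough error bars on results like those above: here is a sketch that converts a W/D/L record into an approximate 95% confidence interval in Elo, using the normal approximation to the per-game score. The `elo_ci` helper is hypothetical, written for this post, not taken from any engine's test suite:

```python
import math

def elo_ci(wins, draws, losses, z=1.96):
    """Approximate 95% confidence interval in Elo for one match,
    via the normal approximation to the per-game score."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # per-game variance of the score (win=1, draw=0.5, loss=0)
    var = (wins * (1 - score) ** 2 + draws * (0.5 - score) ** 2
           + losses * (0 - score) ** 2) / n
    se = math.sqrt(var / n)

    def to_elo(p):
        p = min(max(p, 1e-6), 1 - 1e-6)  # clamp away from 0 and 1
        return -400 * math.log10(1 / p - 1)

    return to_elo(score), to_elo(score - z * se), to_elo(score + z * se)

# run 4 from the table above: 24 wins, 21 draws, 35 losses
elo, lo, hi = elo_ci(24, 21, 35)
print(f"Elo {elo:+.0f}, 95% CI [{lo:+.0f}, {hi:+.0f}]")
```

For run 4 the interval spans on the order of -115 to +15 Elo: an 80-game match cannot even tell you the sign of a modest strength difference, which is the whole point.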
hgm wrote:But if you repeat two tests of the same engines at 20M nodes/move, you will exactly reproduce the result. Testing the same version twice at 20 sec/move will give you different, and most likely completely independent, results.
Yes, but what's the point of that test? If you make one tiny change to anything, the results change, because you now have a different sample of games: parts of the trees will expand or contract.
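That reproducibility argument can be sketched with a toy model, where a hash stands in for a deterministic fixed-node game: the same position and the same program always give the same result, but any change to the program reshuffles every outcome. The version strings and the hash-to-result mapping are made up purely for illustration:

```python
import hashlib

def game_result(position, engine_version):
    """Toy stand-in for a fixed-node game: fully deterministic in its
    inputs, but any change to the engine perturbs every outcome."""
    h = hashlib.sha256(f"{position}:{engine_version}".encode()).digest()
    return h[0] % 3 - 1  # map to -1 loss, 0 draw, +1 win (toy mapping)

positions = range(80)
run1 = [game_result(p, "v1.0") for p in positions]
run2 = [game_result(p, "v1.0") for p in positions]           # rerun, unchanged
run3 = [game_result(p, "v1.0-eval-tweak") for p in positions]  # tiny change

print(run1 == run2)          # True: fixed nodes reproduce exactly
print(sum(run1), sum(run3))  # the tweak gives an independent sample
```

Rerunning the unchanged version reproduces the match exactly, so the repeat tells you nothing new; the moment you touch the program you are drawing a fresh sample from the game population, with all the variance shown above.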
To recap:
search a fixed number of nodes and record the results. Make any change to the program you want. Search the same fixed number of nodes again, and the new results are useless to compare against the first run.
Is that hard to understand given the data I provided above?
Remember, I've been able to run hundreds of thousands of games in analyzing this effect, something probably no one else here could do in their lifetime.
hgm wrote:If you make very small changes to the program, e.g. in the evaluation of an end-game situation that rarely occurs, comparing the two versions at a fixed nr of nodes would eliminate a large amount of variance from the difference, as in most games there would be no difference at all. That testing with 0.05% more nodes would change the results of both the old and the new version is completely irrelevant, as even in that case they would likely be the same. (Unless the program change itself changed the result, which is exactly what you want to measure.)
You should first try the test before stating whether or not the results will be irrelevant. I can tell you with 100% certainty your thoughts are absolutely wrong here, based on a huge volume of data.