It struck me, however, that if the chain is long enough, the comparison becomes significant. E.g. if each next version scored half a standard deviation (SD, of a single match) better than the previous one, and I did that 16 times, the accumulated empirical difference would be 16 x SD/2 = 8 SD, while the accumulated statistical error in that difference would be sqrt(16) x SD = 4 SD (for a sum of independently measured quantities the variances add, not the standard deviations). So the accumulated difference between the version with the 16th fix and the original one is now two times its statistical error, which is significant.
I never realized that before, and it actually seems pretty funny!
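To check the arithmetic above, here is a small sketch. It assumes each step's measured difference is independent and normally distributed around the true gain with the same error SD. The function names chain_significance and simulate_chain are just made up for illustration, and the Monte Carlo part is only a sanity check, not a proper match simulation.

```python
import math
import random

def chain_significance(steps, gain_per_step, sd=1.0):
    """Accumulated gain, accumulated error and their ratio for a chain of
    independently measured steps (assumption: same gain and error each step)."""
    total_gain = steps * gain_per_step
    total_error = math.sqrt(steps) * sd  # variances add: steps * sd**2
    return total_gain, total_error, total_gain / total_error

def simulate_chain(steps, gain_per_step, sd=1.0, trials=20000, seed=1):
    """Fraction of simulated chains whose summed measured difference is
    positive (assumption: Gaussian noise on every step)."""
    rng = random.Random(seed)
    positive = 0
    for _ in range(trials):
        total = sum(rng.gauss(gain_per_step, sd) for _ in range(steps))
        if total > 0:
            positive += 1
    return positive / trials

gain, error, ratio = chain_significance(16, 0.5)
print(gain, error, ratio)  # 8.0 4.0 2.0
```

Calling simulate_chain(16, 0.5) should give a fraction close to what a 2-sigma one-sided result suggests (roughly 0.98), while simulate_chain(1, 0.5) for a single step stays much nearer to a coin flip.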
