Fritzlein wrote: Let's take the heat off of the two particular positions by rotating the experiment ninety degrees. Instead of having two positions played 201 times each, let's have 201 positions played twice each. Then we measure whether the winner of the first play of each position is likely to be the winner of the second play of each position. If the playouts are uncorrelated, then there should be as many pairs of playouts with different winners as there are pairs of playouts with the same winner. If the two repeat playouts are at all correlated, then the winners will be the same more often than they are different. (If it is truly random, then the winners might even be different more often than they are the same, in which case the correlation will look negative, but one must suppose that is a fluke.)
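For anyone who wants to try the rotated experiment without tying up an engine, here is a minimal sketch of it as a coin-flip simulation. It assumes each playout is an independent draw with a fixed per-position chance for white and that there are no draws; the function name `replay_experiment` and the 55% figure are placeholders for illustration, not anything measured.

```python
import random

def replay_experiment(win_probs, rng):
    """Play every position twice and count how often the winner repeats.

    win_probs: assumed chance that white wins each position (hypothetical input).
    Returns (same, different) counts over all positions.
    """
    same = 0
    different = 0
    for p in win_probs:
        first_white_wins = rng.random() < p   # first playout of this position
        second_white_wins = rng.random() < p  # independent repeat playout
        if first_white_wins == second_white_wins:
            same += 1
        else:
            different += 1
    return same, different

rng = random.Random(0)            # fixed seed so the run is repeatable
win_probs = [0.55] * 201          # placeholder: 201 positions, each 55% for white
same, different = replay_experiment(win_probs, rng)
print("same winner:", same, "different winner:", different)
```

With only 201 pairs the counts bounce around quite a bit from seed to seed, so a single run like this says little by itself.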
Now if your 201 positions are a representative sample from your opening book, you might suppose that white has a 55% chance to win, ignoring draws for the moment. The chance that the results are the same is 0.55*0.55 + 0.45*0.45 = 50.5%, and the chance they are different is 0.55*0.45 + 0.45*0.55 = 49.5%. Aha! There is positive correlation since more are the same than different.
Whoops, I biffed that explanation. Having more the same than different doesn't necessarily mean they are correlated. If every position is 55% for white then we can have two test runs that are perfectly uncorrelated even though there will be more repeat winners than different winners. My bad. Where correlation actually creeps in is like in my first example where one position is 90% for white and the next is 10% for white in a sample of positions that are overall 55% for white. Or, more realistically, where one position is 65% for white and the next is 45% for white.
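To put numbers on that distinction, here is a rough sketch that computes the exact chance of a repeat winner for a set of positions and compares it with the baseline you would get if every position had the same overall win rate. It assumes no draws and that the two playouts of a position are independent given that position's win chance; the function names are made up for the example.

```python
def prob_same_winner(win_probs):
    """Average chance that two playouts of the same position have the same winner."""
    return sum(p * p + (1 - p) * (1 - p) for p in win_probs) / len(win_probs)

def uniform_baseline(win_probs):
    """Chance of a repeat winner if every position had the overall average win rate."""
    mean = sum(win_probs) / len(win_probs)
    return mean * mean + (1 - mean) * (1 - mean)

uniform = [0.55, 0.55]   # every position 55% for white
mixed = [0.65, 0.45]     # same 55% average, but the positions differ

for name, probs in (("uniform", uniform), ("mixed", mixed)):
    same = prob_same_winner(probs)
    base = uniform_baseline(probs)
    print(f"{name}: same={same:.3f} baseline={base:.3f} excess={same - base:.3f}")
# uniform: same=0.505 baseline=0.505 excess=0.000
# mixed: same=0.525 baseline=0.505 excess=0.020
```

The excess over the baseline, not the raw "more same than different" count, is what signals correlation. Under these assumptions it works out to twice the variance of the per-position win chances.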
Getting back to Crafty vs. Fruit, let me assume that Fruit's average score across all positions is 60%. Now if every position is 60% for Fruit in repeated playouts, then there will be no correlation in repeated trials on the same position set. If, however, some positions are 75% for Fruit when replayed, others are 45% for Fruit when replayed, etc., then there will be correlation in repeated runs over the same position set (as hgm was saying). To eliminate that correlation we need to use each position only once (as hgm was not saying but I am).
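As a rough illustration of how much correlation that implies, the sketch below computes the correlation coefficient between two repeat playouts of a randomly chosen position. It assumes no draws, independent playouts once the position is fixed, and (purely for the sake of the example) an even split between 75% and 45% positions so that the average comes out at 60%. The function name is hypothetical.

```python
def repeat_correlation(win_probs):
    """Correlation between two playouts of one position drawn at random from win_probs.

    With independent playouts given the position, the covariance of the two
    results equals the variance of the per-position win chances, and each
    single result has variance mean * (1 - mean).
    """
    n = len(win_probs)
    mean = sum(win_probs) / n
    covariance = sum((p - mean) ** 2 for p in win_probs) / n
    single_game_variance = mean * (1 - mean)  # requires 0 < mean < 1
    return covariance / single_game_variance

flat = [0.60, 0.60]     # every position 60% for Fruit
spread = [0.75, 0.45]   # assumed even split, still 60% on average

print(f"flat:   {repeat_correlation(flat):.4f}")
print(f"spread: {repeat_correlation(spread):.4f}")
```

The flat set gives zero and the spread set gives a positive value, and nothing in the calculation cares why a given position sits at 45% rather than 60%.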
Correlation is correlation, regardless of whether the exact reason for a particular position being merely 45% for Fruit is that white simply stands worse there, or that Fruit doesn't understand opposite-side castling so well, or that the clock doesn't jitter enough and the expected score would revert to 60% if we had enough randomness. I guess I need to back off my confident assertion that clock jitter isn't enough to "randomize" the outcome. What I should claim instead is that clock jitter isn't sufficient to make a series of playouts from one position behave like a series of playouts from a series of different positions chosen at random and used only once each. The effect of correlation on the statistical significance of Bob's original tests is potentially huge regardless of the source of the correlation, but I am more confident that correlation is present between repeat playouts of the same positions than I am confident that I know what causes it.