bob wrote: However, would it be possible for either (a) you _do_ follow a specific discussion and post comments related to it or (b) if you choose to not follow context, then also choose to not make comments that have nothing to do with the discussion?
You would have to be more specific. That you perhaps cannot see the relevance of my remarks does not necessarily mean that they have nothing to do with the discussion. Insofar as the discussion has to do with anything in the first place, that is...
Someone asked me to run the test. It _did_ change the results. I ran 4 runs to see if the variability was lower, higher, or the same.
Did it? What change are you talking about? 2 Elo on Crafty's rating, which had an error bar of +/- 18 in the first place? You call that a change? Your statement "It _did_ change the results" is actually very questionable. What is 'it' here? Did that (insignificant) change in the results occur because you included games between the other engines, or simply because you redid the test (or some games of the test)? The Crafty rating would very likely have changed much more than 2 points (namely 9 points = 1 SD) if you had redone the Crafty games without playing the opponents against each other. And it would likely have changed in another way if you had only redone the games of the opponents against each other. The conclusion that including games between opponents changed the results is therefore not justified.
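(For readers keeping track of the numbers: the +/- 18 and 9-point figures are 2-SD and 1-SD error bars on a rating estimated from game scores. Below is a minimal C sketch of the usual back-of-the-envelope calculation; the game count and draw rate in it are assumed placeholders, not the actual figures from these runs.)

#include <stdio.h>
#include <math.h>

int main(void)
{
    double games    = 25000.0;  /* assumed game count (placeholder)    */
    double drawrate = 0.30;     /* assumed draw fraction (placeholder) */
    /* per-game variance of the score (win=1, draw=0.5, loss=0),
       taking wins and losses as roughly equally likely */
    double var_game = 0.25 * (1.0 - drawrate);
    double sd_score = sqrt(var_game / games);
    double sd_elo   = sd_score * 700.0;  /* ~7 Elo per 1% near a 50% score */

    printf("1 SD is about %.2f%% of score, roughly %.1f Elo\n",
           100.0 * sd_score, sd_elo);
    return 0;
}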
And if someone has done that, then almost all past testing is flawed, since I am using the same unmodified source that was used in the CCRL, SSDF, etc. testing.
A good tester would recognize such an engine as variable, and delete its results from their tests. This is why you have to keep track of variance and correlations within the results of each engine.
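(A minimal sketch of what "keep track of variance and correlations" could look like in practice; this is illustrative code, not taken from any tester or engine mentioned here. It assumes one engine's results are stored in played order as 1 / 0.5 / 0 and reports the sample variance and the lag-1 autocorrelation; a clearly positive autocorrelation is exactly the kind of time-wise dependence discussed below.)

#include <stdio.h>

static void scan_results(const double *results, int n)
{
    double mean = 0.0, var = 0.0, cov1 = 0.0;
    int i;

    for (i = 0; i < n; i++)
        mean += results[i];
    mean /= n;

    for (i = 0; i < n; i++)
        var += (results[i] - mean) * (results[i] - mean);
    var /= n - 1;

    /* lag-1 autocovariance: consecutive games, in played order */
    for (i = 0; i + 1 < n; i++)
        cov1 += (results[i] - mean) * (results[i + 1] - mean);
    cov1 /= n - 1;

    printf("mean %.3f  variance %.4f  lag-1 autocorrelation %.3f\n",
           mean, var, var > 0.0 ? cov1 / var : 0.0);
}

int main(void)
{
    /* toy data: an engine that scores well early in the run and poorly later */
    double demo[12] = { 1, 1, 0.5, 1, 1, 1, 0, 0, 0.5, 0, 0, 0 };
    scan_results(demo, 12);
    return 0;
}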
And btw, that would _not_ corrupt the test and make your "dependency" condition show up. A program would simply play better about 1/2 the time. So even that statistical suggestion would be wrong. A binary decision based on even/odd months is not going to produce any sort of dependency in the data.
It most certainly will. The data will be highly correlated time-wise, and the stochastic processes producing them will be highly dependent in the mathematical sense. That the causal relationship is in fact that both depend on the month of the year is something the math does not care about.
Would produce variation if your sample interval matched up, but not dependency.
No, because the TOD is always synced to NTP sources. You have suggested a way to break the test. If anyone thought that was _really_ the issue, the only flaw with that reasoning is that _all_ the programs would have to be doing that, and in a way that would get in sync with my testing. Month is too coarse when you run a test in a day, except right at the boundary. However, it is certainly easy enough to verify that this doesn't happen, if someone wants to. The source for all the engines _is_ available. But I suspect no one would consider stomping through source to see if _that_ is actually happening. Because if it was, it would show up in other testing as well.
You might not be able to see it at all by looking at the source, just as you will in general not be able to debug a program by only looking at its source. The behavior I described might be the result of a well-hidden bug. The only reliable way to find bugs is by looking at how they manifest themselves.
This is not far-fetched. It actually happened to me. At some point I noted that Joker, searching to a given depth from the opening position, would sometimes produce a different score at the same search depth, even when 'random' was switched off. It turned out that the local array in the evaluation routine that held the backward-most Pawn of each file was only initialized up to the g-file. The byte for the h-file coincided with a memory address that was used to hold the high-order byte of a clock variable, read to know if the search should be aborted. During weeks when this byte was large and positive, the backward-most Pawn was found without error. When it was negative, the negative value stuck like an off-board ghost Pawn, and caused horrendous misjudgements in the Pawn evaluation, leading to disastrous Pawn moves that lost the game.
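(Not Joker's actual code, but a contrived C sketch of that class of bug: the h-file entry of the backward-pawn array is never initialized, so whatever value happens to occupy that memory leaks into the Pawn evaluation. Here the stale value is planted explicitly so the effect is reproducible.)

#include <stdio.h>

#define NFILES 8

static int eval_pawns(const int *backward_rank)
{
    int file, penalty = 0;
    for (file = 0; file < NFILES; file++)
        if (backward_rank[file] < 0)   /* negative value = "ghost" backward Pawn */
            penalty += 50;             /* misjudgement piles up here */
    return penalty;
}

int main(void)
{
    int backward[NFILES];
    int file;

    /* the bug: the loop stops at the g-file (index 6); index 7 is never set */
    for (file = 0; file < NFILES - 1; file++)
        backward[file] = 0;

    /* simulate the stale memory the h-file slot happened to overlap:
       sometimes harmless, sometimes negative (e.g. the high byte of a clock) */
    backward[NFILES - 1] = -123;

    printf("Pawn penalty with stale h-file slot: %d\n",
           eval_pawns(backward));
    return 0;
}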
Even if that were true (and it does not necessarily have to be true, as the problem might be in one of the engines, and other people might not use that engine), the main point is that others do not care. They do not make 25,000-game runs, so their errors are always dominated by sampling statistics. Systematic errors that are 6 sigma in a 25,000-game run are only 0.6 sigma in a 250-game run, and of no consequence. You could not see them, and they would not affect your testing accuracy.
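(The 6-sigma versus 0.6-sigma figures are just the 1/sqrt(N) scaling of the statistical error bar; a tiny C sketch of that arithmetic, using the game counts from the paragraph above.)

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* a fixed systematic error, measured in units of the statistical
       error bar, shrinks by sqrt(N_small / N_large) as the run shrinks */
    double n_large = 25000.0, n_small = 250.0;
    double sigmas_large = 6.0;
    double sigmas_small = sigmas_large * sqrt(n_small / n_large);

    printf("%.0f games: %.1f sigma  ->  %.0f games: %.1f sigma\n",
           n_large, sigmas_large, n_small, sigmas_small);
    return 0;
}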
What on earth does that mean? So you are _now_ going to tell me that you would get more accurate results with 250 games than with 25,000?

Simply because the SD is now much larger?

Hint: that is _not_ a good definition of accuracy...
Are you incapable of drawing any correct inferences whatsoever? An elephant does not care about the weight of a dog on its back, while that same dog would crush a spider. You conclude from that that elephants are too small to be crushed????
And if you had been reading, you would have noted that I have attempted to do this. No books, no learning, no pondering, no SMP search. A clean directory with just the engine executables, which is zapped after each game to avoid hidden files that could contain anything from persistent hash to evaluation weight changes. All we are left with is a pool of engines, using the same time control repeatedly, playing the same positions repeatedly, playing on the same pool of identical processors repeatedly. No, the timing is not exact. But since that is how everyone is testing, it would seem that is something that is going to have to be dealt with as is. Yes, a program can on occasion fail low just because it gets to search a few more nodes due to timing jitter. Yes, that might burn many seconds of CPU time to resolve. Yes, that will change the time per move for the rest of the game. Yes, that may change the outcome.
Yes, all very nice. But it did not reduce the variance to the desired (and designed!) level. And it is results that count. An expensive-looking, spotless car is no good if it refuses to start. Even if the engine is brand new, still under factory warranty, the gas tank is filled up, and the battery is charged, you would still have to walk...
However, if I am designing a telescope that is going to be mounted on the surface of planet Earth, I am not going to waste much time planning to use a perfect vacuum where there is no atmospheric turbulence, because that is not going to be possible: only a very few have enough resources to shoot something like Hubble into orbit. And it is immaterial anyway, because we have already decided this will be surface-mounted.
A very educational example, as people are now building huge Earth-based telescopes which correct for air turbulence with adaptive optics, and which exceed the limits of their mirror size by being grouped into interferometer arrays, as one does for radio telescopes. If you cannot remove a noise source, you will have to learn to live with it, and outsmart it...