hgm wrote:bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
This is still absolute bullshit. Karl stated that the results would be farther from the
truth when you used fewer positions. But they would have been closer to each other, as they used the same small set of positions. Karl's temark that being closer to the truth necessary implies that they were closer to each other was even plain wrong, as my counter-example shows.
OK. Here you go. First a direct quote from karl:
============================================================
can lead to different moves and even different game outcomes. However, we
are doing _almost_ the same thing in each repetition, so although the
results of the 64 repetitions are not perfectly correlated, they are highly
correlated, and far from mathematically independent.
When we do the calculation of the standard deviation, we will not be
understating it by a full factor of 8 as we did in the case of Trials C & D,
but we will still be understating it by almost that much, enough to explain
away the supposed mathematical impossibility. Note that I am specifically
not assuming that whatever changed between Trials E & F gave a systematic
disadvantage to Crafty. I am allowing that the change had a random effect
that sometimes helped and sometimes hurt. My assumption is merely that the
random effect didn't apply to each playout independently, but rather
affected each block of 64 playouts in coordinated fashion.
============================================================
Now, based on that, either (a) "bullshit" is simply the first idea you get whenever you read a post here or (b) you wouldn't recognize bullshit if you stepped in it.
He said _exactly_ what I said he said. Notice the "enough to explain away..." This quote followed the first one I posted from him last week when we started this discussion.
And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation between the positions that are not absolutely equal, or which are not equal with respect to the two opponents.
This is also wrong. Unbalanced positions are bad no matter if you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
Again, don't buy it at all. If a position is so unbalanced, the two outcomes will be perfectly correlated and cancel out. A single game per position gives twice as many games, hopefully twice as many that are not too unbalanced.
And unfortunately, that is simply a matter of fact when choosing a significant number of positions... But the results are not "corrupted". I've had (and posted here) way too many of these same kinds of results, using these same positions. I am currently running another 4 sets with the new approach, this time making sure that I can save the PGN. 8 runs with consistent results will be a huge change from what I was getting with about the same number of games before, but using 100 times fewer positions.
Well, I already told you nearly a year ago that the error caused by the low number of positions was starting to dominate the error in your result for the number of games you play. Put that is only the error between your results and the true strength difference, which does not show up in the difference between runs with identical programs, as all these runs used the same small set of positions, and thus suffer equally when these positions are not representative.
The data of your first runs was corrupted. If you would have saved the PGN, you could have known that from the internal variation. (And you would have known how many positions you need to use.) Now you can only now that by repeating the run. But you are not going to do that. So in the end you will know noting, except how to determine a square root by playing Chess games.
Yeah, yeah. got it. "stampee foot. Data corrupted. Stampee foot. bad testing. etc..."
Hopefully we will get this fixed, in spite of your foot music... then you can move on to some other topic where you are absolutely correct, except when you aren't...