hgm wrote:
bob wrote: They are _NOT_ correlated.  It is pure idiocy to even consider that possibility since I have explained _exactly_ how I run test matches.
Good! So if x_ij is the result of game i in mini-match j, you claim that the correlation between x_ij and x_km, which is defined as cov(x_ij,x_km)/sqrt(var(x_ij)*var(x_km)), equals 0. (Cov stands for covariance.) Thus you claim that
cov(x_ij,x_km) = 0,
as the variances of the individual game results are all finite and limited to 1 (as the results are limited to +1, -1). Now covariance is defined as
cov(x_ij,x_km) = E(x_ij*x_km) - E(x_ij)*E(x_km),
with E(x) the expectation value of quantity x.
Now for the result R_j (R for short) of a mini-match, R_j = SUM_i x_ij, we have
E(R) = SUM_i E(x_ij),
and
var(R) = E(R*R) - E(R)*E(R) = E((SUM_i x_ij)*(SUM_i' x_i'j)) - (SUM_i E(x_ij))*(SUM_i' E(x_i'j))
= SUM_ii' { E(x_ij*x_i'j) - E(x_ij)*E(x_i'j) }
= SUM_i var(x_ij) + SUM_(i != i') cov(x_ij,x_i'j),
where the last line comes from grouping the double sum over i and i' in terms with i=i' in the first sum, and i != i' in the second sum. Now you claim that cov(x_ij,x_i'j) = 0 for every i != i', so we have:
var(R) = SUM_i var(x_ij).
Now 0 <= var(x_ij) <= 1 for every i, so we have
0 <= var(R) <= M,
where M is the number of games in a mini-match.
For 'expectation value', you can also read 'average over all mini-matches j', and the derivation also holds for that.
So according to your statement about the correlation of the game results, the variance of the mini-match results is below M (i.e. the standard deviation is below sqrt(M)).
This contradicts your claim that on average (over j) you see a variance of R much larger than M. So are you willing now to drop that claim about the variance? Or are you maintaining both claims, and don't bat an eye when adding M terms, each no larger than 1, produces a result larger than M?
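The bound can be checked numerically. The sketch below is only an illustration, not the test setup discussed in this thread: it assumes every game is an independent draw from fixed win/draw/loss probabilities (the values 0.35/0.30/0.35 and the function names are made up for the example), scores games as +1/0/-1, and compares the variance of the mini-match totals with M.

```python
import random
from statistics import pvariance


def exact_game_variance(p_win, p_draw):
    """Variance of one game scored +1 / 0 / -1 (assumed scoring)."""
    p_loss = 1.0 - p_win - p_draw
    mean = p_win - p_loss
    # E(x*x) = p_win + p_loss, because a draw contributes 0.
    return (p_win + p_loss) - mean * mean


def simulate_minimatch_variance(num_matches, games_per_match,
                                p_win, p_draw, seed=0):
    """Empirical variance of R_j = SUM_i x_ij for independent games."""
    rng = random.Random(seed)
    totals = []
    for _ in range(num_matches):
        total = 0
        for _ in range(games_per_match):
            u = rng.random()
            if u < p_win:
                total += 1          # win
            elif u >= p_win + p_draw:
                total -= 1          # loss (a draw adds nothing)
        totals.append(total)
    return pvariance(totals)


M = 80  # games per mini-match
predicted = M * exact_game_variance(0.35, 0.30)   # SUM_i var(x_ij)
observed = simulate_minimatch_variance(2000, M, 0.35, 0.30)
print(predicted, observed, M)
```

With independent games the observed variance should land close to the predicted sum of per-game variances, and in any case below M. A data set whose var(R) sits well above M cannot come from this kind of model.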
 
 
Like Uri says, the correlations and covariances are just quantities derived from your dataset, and what you told us (or led us to believe...) is that var(R) > M. As we _know_ the variance of the individual game results to be bounded by 1, we thus know, by your own claim, that the covariances in your data set are nonzero.
All you do is argue that there cannot be any _causal_ relationship between the games. But that has no relevance for the fact that the _correlations_ apparently are there, in your dataset. (Unless you are just bullshitting us.) And you claim that this is typical behavior, not a one-time fluke because of an extremely unlucky start that goes away after long-enough data collection and averaging. You claim that you keep _consistently_ seeing this over-representation of rare deviations in your data. So the burden is on you now, to explain how this correlation could maintain itself in the absence of a causal relationship.
To anyone with even the slightest insight into statistics, your claims are of course as idiotic as someone claiming that new measurements with his new expensive machine have proved that the fraction of oxygen in air is not 20%, but 110%. He will be met with ridicule, even by persons that don't know how his equipment works. And quite justly so, as uncritically publishing such obviously faulty data is simply bad science...
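To make that step concrete, the identity var(R) = SUM_i var(x_ij) + SUM_(i != i') cov(x_ij,x_i'j) can be turned around to read off how much covariance a given var(R) implies. The sketch below is an illustration only: the function names are hypothetical, and the value var(R) = 200 is a made-up number, not a figure from the posted data.

```python
def implied_covariance_sum(var_match, game_variances):
    """Total of cov(x_i, x_i') over all ordered pairs i != i',
    from var(R) = SUM var(x_i) + SUM cov(x_i, x_i')."""
    return var_match - sum(game_variances)


def implied_mean_covariance(var_match, game_variances):
    """Average covariance per ordered pair of distinct games."""
    m = len(game_variances)
    return implied_covariance_sum(var_match, game_variances) / (m * (m - 1))


M = 80
# Use the upper bound var(x_i) = 1 for every game, which gives the
# smallest covariance total compatible with the assumed var(R).
upper_bound_variances = [1.0] * M
hypothetical_var_match = 200.0   # made-up value, for illustration only

cov_sum = implied_covariance_sum(hypothetical_var_match, upper_bound_variances)
mean_cov = implied_mean_covariance(hypothetical_var_match, upper_bound_variances)
print(cov_sum, mean_cov)
```

Any var(R) above M leaves a strictly positive covariance total, even when every per-game variance is set to its maximum of 1.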
 
I have no idea where you are heading with this.  Let me recap _my_ claim, and bypass all the other noise that is introduced...
I ran a 5000 game test between Crafty and one of the opponents.  I think it was glaurung but I am not sure.  At the time control I used, the average result of the 80-game matches was -1 (I am talking about the most recent set of data since I still have that handy).  I then reported the results of the individual 80-game matches, as well as the overall average.  Some of those results were _way_ off from what was expected.
In the original data where I posted 4 strings of +/=/- results, I did the same thing, and just grabbed the first 4 results from the 64 matches that were actually played.  I reported those results.
1.  No, the results are not correlated in the usual sense of the word, because I know every game is an independent test.
2.  No, I don't care whether a correlation test suggests correlation or not.  The results are random enough  that a single 80-game sample (1/64th of the total results) could suggest anything.
3.  The data simply was whatever I had available at the time of the post.  I don't keep 95% of the results we produce.  Keeping up with what means what is not worth it.  I keep summaries (as in my most recent posting giving the results of all 64 matches, and then averaging them in groups) of the results for versions we keep, until they become superseded by a newer version.  I always have the results for the "current code", and for all previous versions (20.0, 20.1, ...) as well, but not for rejected versions or versions that were later replaced with a better one.  21.6a vs 21.6b, for example, where 21.6a results are discarded once we are sure 21.6b is better.
I have made no claims about correlation or covariance.  I have said that the results are far more random than I expected, and based on lots of results posted here, they are far more random than anyone else expects either, since hardly anyone reports on more than 100-game matches, and 100-game matches are nowhere near enough to smooth out the randomness I see in the set of programs I am testing.
That's all I have said.  There is nothing false or misleading in that.  So the nonsense about "is it correlated or is that kind of randomness impossible" is 100% pointless.  It is what it is.  Nothing more, nothing less.  You can accept it, or ignore it, I really don't care.  But it is what it is.  And all the shuffling around, statistical analysis, theoretical discussions, opinions, theories, guesses, don't mean a thing.  Because it is what it is...
It is time to move on.  If those original 4 matches were a 10 sigma event, it is what it is.  Nothing more, nothing less.
What else is left to say here?  I am interested in reducing the standard error to something low enough that I can consider it zero.  1 in 20 is not that error rate.  If you want to accept that or higher, fine.  If your program is primitive enough that it doesn't exhibit non-deterministic play, fine.  But every program I have personally tested and use in my tests certainly does do this.  So for _most_ of us, the number of games needed to predict progress with high confidence is quite large.  And all the arguing in the world is not going to change that.
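As a rough sketch of how the game count scales with the standard error one is willing to accept: the snippet below assumes independent games with a fixed per-game standard deviation, so the standard error of the average score per game is sd/sqrt(N). The function names, the z value and the example numbers are assumptions for illustration, not measurements from my tests.

```python
import math


def standard_error(per_game_sd, num_games):
    """Standard error of the mean score per game over num_games
    independent games."""
    return per_game_sd / math.sqrt(num_games)


def games_needed(per_game_sd, target_diff, z=2.0):
    """Smallest N for which z standard errors fit inside target_diff,
    i.e. z * per_game_sd / sqrt(N) <= target_diff."""
    return math.ceil((z * per_game_sd / target_diff) ** 2)


# Example with assumed numbers: per-game sd of 1.0 (the upper bound for
# +1/0/-1 scoring) and a score difference of 0.25 per game at z = 2.
print(games_needed(1.0, 0.25, z=2.0))
print(standard_error(1.0, 100))
```

Halving the difference to be resolved quadruples the number of games, which is why the totals get large quickly.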