An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

nczempin wrote:
Rein Halbersma wrote: ....

The chance of that happening is about 3.9e-05, or odds of about 1 in 25,000. Most people would consider the coin biased after such a result.
But those were the results he got after how many flips? It is a mistake to select the one noticeable event and claim it is particularly remarkable.

It would be remarkable if this were the only experiment. But you remember it as standing out only because of the other experiments where this didn't happen.


Flipping a coin and getting 70 heads, or getting 70 heads in succession? Are you both talking about the same thing? I am fairly sure that the probability of getting 70 heads in total instead of the expected 50 is not remarkable, and thus I would also not bat an eye.
Well, I am pretty sure he did not play 40 billion hands of blackjack. Excessive repetition shortens the odds, of course, but one should always calculate by how much. A wise man knows when he is being cheated; only a dimwit would foolhardily play on indefinitely with a deck stripped of all aces, the latter tucked away nicely in the sleeves of the dealer. That is what happens to you when you don't bat an eye on witnessing incredibly improbable events.

You are wrong about the coin flips. The quoted probability of 1 in 25,000 is for the total number of heads reaching 70 or more, regardless of their distribution amongst the 100. (The standard deviation of the number of heads is 0.5*sqrt(100) = 5, so witnessing 70 is a 4-sigma deviation.) You should raise an eyebrow if this occurred after trying, say, only 100 series of 100 coin flips. It is very possible someone just switched the coins on you.

For 70 heads in a row the probability with a fair coin would be 1/2^70 ~ 8.5e-22. But a run of 70 can start at any of 31 places in a stretch of 100, so the probability of (at least) 70 consecutive heads in a sequence of 100 flips is at most about 2.6e-20. Not in the lifetime of the universe...
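
A minimal Python sketch to sanity-check both figures, assuming a fair coin (the run figure is a union bound, so it slightly overcounts overlapping runs):

[code]
# Exact binomial tail for >=70 heads in 100 fair flips, plus a union
# bound on the probability of a run of >=70 heads somewhere in 100 flips.
from math import comb

N = 100
p_tail = sum(comb(N, k) for k in range(70, N + 1)) / 2**N
print(f"P(>=70 heads in 100 flips) ~ {p_tail:.3g}")  # ~3.9e-05, 1 in ~25,000

p_run = 31 / 2**70              # 31 possible starting points, 1/2^70 each
print(f"P(run of >=70 in 100) <~ {p_run:.3g}")       # ~2.6e-20
[/code]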
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:8 million PGN files is simply unmanageable.
It is possible to put them into 8 1-million-game files, or 80 100,000-game files, ... you could even store them in a more efficient format than PGN.

You're a CS prof, surely you can devise a solution?

So I guess it's not that you were unable to store them, but that you saw no need for storing them.

Which is perfectly acceptable.


Also note that for statistical analysis, you don't need the actual games at all, only the results.
I can store 8 billion games. The cluster has a 4-terabyte filesystem. That's not the issue. The issue is keeping track of what is what, and being able to use the games in some way that justifies storing them.

I'm not sure about your silly "surely you are able to devise a solution". You simply assumed something incorrect about what the problem is. I don't see any useful information I could obtain from 8 million games, which comprise 80-game matches against each of 4-6 opponents, for hundreds of different Crafty versions. I just keep the "summary" data for any match where we choose to keep the changes, and then consolidate results when version N+1 supersedes version N...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Rein Halbersma wrote:
bob wrote:And I have seen many "impossible events" over years of play. I should get a blackjack every 21 hands roughly. I have played for 5+ hours with a witness, at about 100 hands per hour, with absolutely zero blackjacks.
That's a pretty rare event indeed, with a probability of 2.5e-11, or odds of about 1 in 40 billion! Of course, that assumes every hand is independent of the previous ones, but I guess they play with >6 decks and reload the decks frequently?
Nope. Two decks, about 65-70% penetration. It just happens. During that long dry run I also watched the player at first base get four blackjacks in a row. Which is not that rare, but rare enough that I cannot recall ever doing it myself.
bob wrote: The list goes on and on. If I flipped a coin 100 times and got 70 heads, I personally would not bat an eye.
The chance of that happening is about 3.9e-05, or odds of about 1 in 25,000. Most people would consider the coin biased after such a result.
"most". I'm more pessimistic having done this for so long. I can't count the number of advantage players that believe they are being cheated. I have seen people post that Norm Wattenberger's CVBJ program is "biased" (some say it gives you better cards than expected, some say worse). All because they assume those "extremely rare events" will never happen. I've lived thru the incredible negative variance streaks too many times to count, as well as the incredible win streaks that also come along.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
Rein Halbersma wrote:
bob wrote:And I have seen many "impossible events" over years of play. I should get a blackjack every 21 hands roughly. I have played for 5+ hours with a witness, at about 100 hands per hour, with absolutely zero blackjacks.
That's a pretty rare event indeed, with a probability of 2.5e-11, or odds of about 1 in 40 billion! Of course, that assumes every hand is independent of the previous ones, but I guess they play with >6 decks and reload the decks frequently?
bob wrote: The list goes on and on. If I flipped a coin 100 times and got 70 heads, I personally would not bat an eye.
The chance of that happening is about 3.9e-05, or odds of about 1 in 25,000. Most people would consider the coin biased after such a result.
But those were the results he got after how many flips? It is a mistake to select the one noticeable event and claim it is particularly remarkable.

It would be remarkable if this were the only experiment. But you remember it as standing out only because of the other experiments where this didn't happen.


Flipping a coin and getting 70 heads, or getting 70 heads in succession? Are you both talking about the same thing? I am fairly sure that the probability of getting 70 heads in total instead of the expected 50 is not remarkable, and thus I would also not bat an eye.
70 heads in a row is a one-in-2^70 event. Very rare. But if you flip a coin 10 billion times, you get almost 10 billion chances for such a run to start, and the probability of seeing one somewhere goes up accordingly.

Flipping 70 heads in a row is _far_ less likely than flipping 70 heads out of 100. Interestingly, any one specific sequence of 100 flips, including one with exactly 50 heads and 50 tails, is just as unlikely as any other. But do enough trials and the mean will average out toward 50-50.
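
A minimal sketch of that distinction, assuming a fair coin: every specific 100-flip sequence is equally improbable; "exactly 50 heads in any order" only looks common because it aggregates a huge number of such sequences:

[code]
# Compare: one specific sequence vs. "exactly k heads in any order".
from math import comb

N = 100
p_sequence = 1 / 2**N                # any ONE fully specified sequence
p_50_heads = comb(N, 50) / 2**N      # exactly 50 heads, any order: ~8%
p_70_heads = comb(N, 70) / 2**N      # exactly 70 heads, any order: ~2.3e-5
print(p_sequence, p_50_heads, p_70_heads)
[/code]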
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

hgm wrote:
nczempin wrote:
Rein Halbersma wrote: ....

The chance of that happening is about 3.9e-05, or odds of about 1 in 25,000. Most people would consider the coin biased after such a result.
But those were the results he got after how many flips? It is a mistake to select the one noticeable event and claim it is particularly remarkable.

It would be remarkable if this were the only experiment. But you remember it as standing out only because of the other experiments where this didn't happen.


Flipping a coin and getting 70 heads, or getting 70 heads in succession? Are you both talking about the same thing? I am fairly sure that the probability of getting 70 heads in total instead of the expected 50 is not remarkable, and thus I would also not bat an eye.
Well, I am pretty sure he did not play 40 billion hands of blackjack. Excessive repetition shortens the odds, of course, but one should always calculate by how much. A wise man knows when he is being cheated; only a dimwit would foolhardily play on indefinitely with a deck stripped of all aces, the latter tucked away nicely in the sleeves of the dealer. That is what happens to you when you don't bat an eye on witnessing incredibly improbable events.

You are wrong about the coin flips. The quoted probability of 1 in 25,000 is for the total number of heads reaching 70 or more, regardless of their distribution amongst the 100. (The standard deviation of the number of heads is 0.5*sqrt(100) = 5, so witnessing 70 is a 4-sigma deviation.) You should raise an eyebrow if this occurred after trying, say, only 100 series of 100 coin flips. It is very possible someone just switched the coins on you.

For 70 heads in a row the probability with a fair coin would be 1/2^70 ~ 8.5e-22. But a run of 70 can start at any of 31 places in a stretch of 100, so the probability of (at least) 70 consecutive heads in a sequence of 100 flips is at most about 2.6e-20. Not in the lifetime of the universe...
Okay, I guess I am leaning too far to the other side compared to other people, shrugging off such events. By default I tend not to draw any conclusions, while most others seem to jump to them.

It'll be hard to nail me down to a definite conclusion on anything.

Go ahead, give it a try :-)

Okay, perhaps the sun thing I would more or less agree on :-)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Uri Blass wrote:
bob wrote:
hgm wrote: So explain us then how the results of games X and Y become correlated...
You set 'em up, I'll keep knocking the softballs out of the park. The answer is simple...

They are _NOT_ correlated. It is pure idiocy to even consider that possibility, since I have explained _exactly_ how I run test matches. With nothing carried through hash tables or through any sort of learning, there is no way there is any true correlation between games. It just can't happen.

Of course, feel free to postulate some scenario where it could happen, and I can tell you whether it is possible in my test setup or not. But I certainly see no way, and consider this an irrelevant discussion given what I am doing.
This is simply not correct.

That result x is correlated with result y is a mathematical fact about the data.

My point is this: in this context, "correlated" means the games are somehow connected, that the result of one is somehow related to the result of another. That is impossible.

However, you can certainly produce purely random samples that would suggest correlation, when there is none.

So yes, you can test for correlation and the test can conclude "yes". But if you perform one test on planet Earth and a second test 200 million light-years away, at the same time, you can be absolutely certain the two events actually are independent, whether the test says otherwise or not.

I would not care if the test showed the games to be perfectly correlated, because I know they are not. Otherwise every tournament or match ever played would be no good.


You can say that it was because of luck, but it is a mathematical fact, not a claim that variable x is dependent on variable y.

Note that two variables will always show some sample correlation even if they are independent, but usually the correlation will be small and you will be able to accept the conjecture that they are independent.

In this case, the event that happened can happen only in one case out of a million, assuming the variables are independent and assuming H.G. Muller's calculations are correct (I did not care to compute the exact number).

It means that if you test H0 (they are independent) against H1 (they are dependent), you will accept H1 unless you use an alpha smaller than 1/1,000,000.

You can explain it by luck, but it is logical to suspect that something is wrong in the conditions of the experiment.

Uri
I have explained how I test. Feel free to suggest any possible scenario where two games somehow influence each other. Each game is played on a different machine. There are no shared files, no shared memory, no shared anything. So how do they affect each other?

They don't. In fact, to be stronger, they _can't_.

Unless Intel Xeons somehow communicate through some subspace medium I am not aware of.

Actually, I just thought of one source of apparent correlation. Program A plays program B from different positions, but program A plays in every game. If I play A against a weak opponent, there will certainly appear to be high correlation, because A is going to win most of the games, which suggests the outcome of one game influences another simply because most outcomes are similar.

But that's a silly correlation effect.
Last edited by bob on Wed Sep 19, 2007 5:47 pm, edited 1 time in total.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
"most". I'm more pessimistic having done this for so long. I can't count the number of advantage players that believe they are being cheated. I have seen people post that Norm Wattenberger's CVBJ program is "biased" (some say it gives you better cards than expected, some say worse). All because they assume those "extremely rare events" will never happen. I've lived thru the incredible negative variance streaks too many times to count, as well as the incredible win streaks that also come along.
Well, I just had my first royal flush the other day, and I haven't played that many hands yet. I just shrugged it off...

Perhaps if I had been able to pay someone off with it I would have been more excited...
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: My point is this. In this context, correlated means that the games are somehow connected, that the result of one is somehow related to the result of another. That is impossible.
No, it doesn't mean that.

It would only mean that the results are correlated. I already gave an example besides my extreme illustration:

If one starting position comes from the Sicilian Dragon and the other from the King's Indian, it is not completely out of the question that engines that score well in one position will also do so in the other, while for other engines it will be the opposite (I mean they will do relatively better in other positions).

(Feel free to include better examples, please don't latch onto this particular example.)

So if you analyse those positions statistically (and not by qualitative reasoning like the kind I used to come up with the example), it is possible that some positions within the test suite will show such a correlation. I am not saying it will be the case for sure, just that it is possible.

And if such a correlation can be discovered, one of the positions can be eliminated from the suite, because it does not add significant information relative to the effort it takes to include it in the test. Perhaps 40 positions is the sweet spot for you; but if it were 4000 positions, selected with just the same care but without analysis of the kind I am describing, surely you would have to find a way to reduce the number of positions to make them more manageable? Or is 40 some number passed down from heaven that is guaranteed to be exactly the number you need?

Put another way, suppose you had to start with just the starting position, or, without loss of generality, any single position from the set of 40. Then you would argue that you need more positions to get more variety: for example, one that is "sharp" in nature and one that is "quiet" in nature. And then you would break it down further. And at some level you would find that the next position no longer adds positive marginal utility, so to speak.
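
One way to make this pruning idea concrete, as a minimal sketch: run a pool of engines through each candidate position and correlate the per-position scores. All the data and the 0.9 threshold here are hypothetical:

[code]
# If two starting positions give highly correlated scores across a pool
# of engines, the second position adds little information to the suite.
from statistics import correlation  # Python 3.10+

# per-engine scores from position A and position B (hypothetical numbers)
scores_a = [0.61, 0.55, 0.48, 0.70, 0.52, 0.44]
scores_b = [0.63, 0.53, 0.50, 0.68, 0.49, 0.47]

r = correlation(scores_a, scores_b)
print(f"correlation between positions: {r:.2f}")
if r > 0.9:  # an arbitrary redundancy threshold
    print("nearly redundant -- one position could be dropped")
[/code]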
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:They are _NOT_ correlated. It is pure idiocy to even consider that possibility since I have explained _exactly_ how I run test matches.
Good! So if x_ij is the result of game i in mini-match j, you claim that the correlation between x_ij and x_km, which is defined as cov(x_ij,x_km)/sqrt(var(x_ij)*var(x_km)), equals 0. (Cov stands for covariance.) Thus you claim that

cov(x_ij,x_km) = 0,

as the variances of the individual game results are all finite and bounded by 1 (since the results are limited to +1, 0, -1). Now covariance is defined as

cov(x_ij,x_km) = E(x_ij*x_km) - E(x_ij)*E(x_km),

with E(x) the expectation value of quantity x.

Now for the result R_j (R for short) of a mini-match, R_j = SUM_i x_ij, we have

E(R) = SUM_i E(x_ij),

and
var(R) = E(R*R) - E(R)*E(R) = E((SUM_i x_ij)*(SUM_i' x_i'j)) - (SUM_i E(x_ij))*(SUM_i' E(x_i'j))
= SUM_ii' { E(x_ij*x_i'j) - E(x_ij)*E(x_i'j) }
= SUM_i var(x_ij) + SUM_(i != i') cov(x_ij,x_i'j),

where the last line comes from splitting the double sum over i and i' into the terms with i = i' (the first sum) and those with i != i' (the second sum). Now you claim that cov(x_ij,x_i'j) = 0 for every i != i', so we have:

var(R) = SUM_i var(x_ij).

Now 0 <= var(x_ij) <= 1 for every i, so we have

0 <= var(R) <= M,

where M is the number of games in a minimatch.

For 'expectation value', you can also read 'average over all mini-matches j', and the derivation also holds for that.

So according to your statement about the correlation of the game results, the variance of the mini-match results is at most M (i.e. the standard deviation is at most sqrt(M)).

This contradicts your claim that on average (over j) you see a variance of R much larger than M. So are you willing now to drop that claim about the variance? Or are you maintaining both claims, and don't bat an eye when adding M terms each smaller than 1 produces a result larger than M? :roll:

Like Uri says, the correlations and covariances are just quantities derived from your dataset, and what you told us (or led us to believe...) is that var(R) > M. As we _know_ the variance of the individual game results to be bounded by 1, we thus know, by your own claim, that the covariances in your data set are nonzero.

All you do is argue that there cannot be any _causal_ relationship between the games. But that has no relevance to the fact that the _correlations_ apparently are there, in your dataset. (Unless you are just bullshitting us.) And you claim that this is typical behavior, not a one-time fluke from an extremely unlucky start that goes away after long enough data collection and averaging. You claim that you keep _consistently_ seeing this over-representation of rare deviations in your data. So the burden is on you now to explain how this correlation could maintain itself in the absence of a causal relationship.

To anyone with even the slightest insight into statistics, your claims are of course as idiotic as a claim that measurements with an expensive new machine have proved the fraction of oxygen in air to be not 20% but 110%. Such a person would be met with ridicule, even by people who do not know how his equipment works. And quite justly so, as uncritically publishing such obviously faulty data is simply bad science...
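
A minimal simulation sketch of the variance bound derived above, assuming fully independent games with a fixed (hypothetical) win/draw/loss distribution:

[code]
# For M independent games per mini-match, var(R) = SUM var(x_i) <= M.
# A measured var(R) well above M would imply nonzero covariances.
import random

M, MATCHES = 80, 10_000
P_WIN, P_DRAW = 0.45, 0.30            # hypothetical score distribution

def game():                           # one independent game: +1, 0 or -1
    u = random.random()
    return 1 if u < P_WIN else (0 if u < P_WIN + P_DRAW else -1)

results = [sum(game() for _ in range(M)) for _ in range(MATCHES)]
mean = sum(results) / MATCHES
var_R = sum((r - mean) ** 2 for r in results) / MATCHES
print(f"var(R) = {var_R:.1f}, bound M = {M}")   # ~53 here, well below 80
[/code]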
I have no idea where you are heading with this. Let me recap _my_ claim, and bypass all the other noise that is introduced...

I ran a 5000-game test between Crafty and one of the opponents. I think it was Glaurung, but I am not sure. At the time control I used, the average result of the 80-game matches was -1 (I am talking about the most recent set of data, since I still have that handy). I then reported the results of the individual 80-game matches, as well as the overall average. Some of those results were _way_ off from what was expected.

In the original data, where I posted 4 strings of +/=/- results, I did the same thing and just grabbed the first 4 results from the 64 matches that were actually played. I reported those results.

1. No, the results are not correlated in the usual sense of the word, because I know every game is an independent test.

2. No, I don't care whether a correlation test suggests correlation or not. The results are random enough that a single 80-game sample (1/64th of the total) could suggest anything.

3. The data was simply whatever I had available at the time of the post. I don't keep 95% of the results we produce; keeping track of what means what is not worth it. I keep summaries (as in my most recent posting, giving the results of all 64 matches and then averaging them in groups) for the versions we keep, until they are superseded by a newer version. I always have the results for the "current code" and for all previous versions (20.0, 20.1, ...) as well, but not for rejected versions or versions later replaced with a better one: 21.6a vs 21.6b, for example, where the 21.6a results are discarded once we are sure 21.6b is better.

I have made no claims about correlation or covariance. I have said that the results are far more random than I expected, and, based on lots of results posted here, far more random than anyone else expects either, since hardly anyone reports on 100-game matches, and 100-game matches are nowhere near enough to smooth out the randomness I see in the set of programs I am testing.

That's all I have said. There is nothing false or misleading in it. So the argument about "is it correlated, or is that kind of randomness impossible" is 100% pointless. It is what it is, nothing more, nothing less. You can accept it or ignore it; I really don't care. But it is what it is. And all the shuffling around, statistical analysis, theoretical discussions, opinions, theories, and guesses don't mean a thing. Because it is what it is...

It is time to move on. If those original 4 matches were a 10-sigma event, it is what it is. Nothing more, nothing less.

What else is left to say here? I am interested in reducing the standard error to something low enough that I can treat it as zero. 1 in 20 is not that error rate. If you want to accept that or higher, fine. If your program is primitive enough that it doesn't exhibit non-deterministic play, fine. But every program I have personally tested and use in my tests certainly does. So for _most_ of us, the number of games needed to detect progress with high confidence is quite large. And all the arguing in the world is not going to change that.
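
To put a rough number on "quite large", a minimal sketch under an assumed win/draw/loss split (all parameters hypothetical; near a 50% score, an error of ds in the average score corresponds to roughly 695*ds Elo):

[code]
# The standard error of a match score shrinks only as 1/sqrt(games).
from math import sqrt

P_WIN, P_DRAW = 0.34, 0.32            # hypothetical, roughly even engines
mu  = P_WIN + 0.5 * P_DRAW            # expected score per game (0.50 here)
var = P_WIN + 0.25 * P_DRAW - mu**2   # variance of a 1 / 0.5 / 0 game score

for n in (100, 1_000, 10_000, 100_000):
    se = sqrt(var / n)                # standard error of the mean score
    print(f"{n:>7} games: +/- {se:.4f} score  (~{695 * se:.1f} Elo)")
[/code]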
Uri Blass
Posts: 10903
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

If you play against a weak opponent and win all the games, the results are not correlated.

The extreme example of correlation would be if in one match Crafty won all games against an opponent, and in another match Crafty lost all games against the same opponent.

Winning often against an opponent is not correlation if it happens in all matches.


H.G.Muller explained some of the mathematics of covariance in another post, and showed how he reached the conclusion that there is a correlation even without knowing all the data.


Uri