An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: An objective test process for the rest of us?

Post by Gerd Isenberg »

hgm wrote:Well, computers are designed to be deterministic machines, and if you ignore time, there wouldn't be any variability at all, ever. And many simple engines don't even read the clock, and those that do hardly act on it. That is just as much empirical fact as Bob's variability for his engine. In the games between uMax and Eden that Nicolai was running in the other thread, all white games (4 so far) are the same, and all black games (3 so far) are the same, move for move up to the very end.

But that is not really an issue, as it is all trivially understood from how the engines are programmed to function, and the magnitude of the timing noise. I did the calculation in an earlier post, and it matched the observations on uMax and Eden quite well. And, considering the different time management of Crafty, it is understandable that this engine is nearly two orders of magnitude more sensitive to timing fluctuations. So all of that is "old hat".
If all that is "old hat", I don't get the hype and the hurt vanity, whether the matches were random or worst-case selections! Bob's four posted results 1..4 look perfectly OK; I don't understand what you want to prove. The waste of resources in playing so many games for statistical relevance?
Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

Gerd Isenberg wrote:
hgm wrote:
bob wrote:What is that based on? Reading tea leaves?
No, on mathematical calculation / proof. Something that, by now, I am beginning to fear is an utterly alien concept to you.
I have been running these kinds of tests for _many_ years. Many is >= 30. So I have some clue about how to run game tests where things don't get biased by the testing itself. Each game is played separately. Games are played on different systems. I monitor the load carefully while a game is in progress to make sure that nothing unusual happens (it rarely does, but it is not a zero-probability event).

There is absolutely no way that game X influences the result of game Y. It is physically impossible. So we can get off that bandwagon before it leaves the station.
So explain to us then how the results of games X and Y become correlated...

For if they are not correlated, the standard deviation of the mini-match results would be limited to sqrt(80) ~ 9, and even ~7 if you have draws, so results >= 30 are 4-sigma fluctuations, and should not occur more often than once every 15,000 mini-matches, or so. Not "many times".

The only way you can have a probability distribution of the mini-match results other than a normal one with SD = ~0.8*sqrt(80) = ~7 is to have correlation between the 80 games. That is a hard mathematical fact, as certain as that 1+1=2.
So you either should drop your claim that extreme deviations occur much more frequently than such a normal distribution would predict, or you would have to admit that there is correlation between games. (Or you could of course start claiming that all mathematics that has been done since Euclid is just nonsense...)
You sound like a theoretical weisenheimer to me ;-)
I have no clue about statistics, but somehow Bob's results sound more plausible to me and your deterministic ones suspect.

Assuming programs terminate their search prematurely, before they finish an iteration - based on checking the time every N nodes - I would have expected such random results from a set of balanced-position matches between roughly equally strong opponents. Keeping the hash table between searches amplifies very minor changes. IMHO a simple application of chaos theory.

There is also the multithreaded OS environment: context-switch granularity, the shifting phase between polling the time every N nodes and the clock counters, processor-thread affinity, page and cache issues, big (chaotic) processor heuristics like the TLB and BTB, other running processes/threads (even if sleeping most of the time), etc. The number of instructions executed per unit of time may vary a lot per process/thread. Searches may vary by N, or by up to some +-1E5 nodes per search (+-0.1 sec), leaving different hash footprints.

In quiet positions with a lot of equally good moves, even very minor move-ordering changes caused by slightly different hash footprints may now and then result in a different move at the root, especially in selective programs rather than pure alpha-beta. Once is enough to get a different game with another outcome, from either side. The "complexity" of the evaluation - the number of terms and their "noise" - may amplify that non-determinism as well, as may subtle "constructive" bugs in the eval due to initialization issues or aggressive compiler optimization ;-)

Weaker programs with unbalanced knowledge and mutually exclusive holes may somehow be more likely to play those positions deterministically.

Running multiple programs on multiple cores seems to be another amplifier. If one thread takes advantage of some lucky resource mappings, other processes are likely to be more unlucky.

Better suited to playing more significant matches with fewer games would be two single-core computers, each running a single thread under a DOS-like (real-time) OS. And/or both programs could terminate on a number of nodes estimated from their target time and NPS - a persistent NPS[matkey], "learned" beforehand in different runs over all kinds of material constellations.
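For what it's worth, here is a minimal sketch of the time-polling pattern Gerd describes (hypothetical names, single-threaded, not taken from any real engine). The exact node count at which the deadline is crossed shifts with every bit of OS timing jitter, and with it the hash footprint carried into the next search:

```c
#include <stdio.h>
#include <time.h>

#define CHECK_INTERVAL 10000            /* poll the clock every N nodes */

static long    nodes;                   /* nodes searched so far */
static clock_t deadline;                /* time at which to stop */
static int     stop;

static void search_node(void)
{
    nodes++;
    /* the clock is only read every CHECK_INTERVAL nodes, so the abort
       point is quantized to the polling grid and drifts with load */
    if (nodes % CHECK_INTERVAL == 0 && clock() >= deadline)
        stop = 1;
}

int main(void)
{
    deadline = clock() + CLOCKS_PER_SEC / 10;    /* 0.1 s target time */
    while (!stop)
        search_node();
    printf("stopped after %ld nodes\n", nodes);  /* varies run to run */
    return 0;
}
```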
Gerd,

You do not understand H.G.Muller's point.

His claim is that the data Bob gave earlier, from the first 4 matches against the same opponent, is something you would expect to be very rare even if the result of every game were decided by pure luck and the matches used random positions, so that the positions of one match do not repeat in another match.


He did not claim that it is impossible for the programs to be non-deterministic in the same position.

The main question is whether the variance of the 80-game match results is bigger than the sum of the variances of the individual games.

In theory they should be equal, but the results made me suspect that this is not the case, and that maybe Bob believes the games are independent when for some reason they are not.

If you see a coin fall on heads 70 times out of the first 100 tries, then it is logical to suspect that the coin is simply not fair (of course it is possible that the coin is fair and you just had bad luck).

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:What is that based on? Reading tea leaves?
No, on mathematical calculation / proof. Something that, by now, I am beginning to fear is an utterly alien concept to you.
I have been running these kinds of tests for _many_ years. Many is >= 30. So I have some clue about how to run game tests where things don't get biased by the testing itself. Each game is played separately. Games are played on different systems. I monitor the load carefully while a game is in progress to make sure that nothing unusual happens (it rarely does, but it is not a zero-probability event).

There is absolutely no way that game X influences the result of game Y. It is physically impossible. So we can get off that bandwagon before it leaves the station.
So explain to us then how the results of games X and Y become correlated...
You set 'em up, I'll keep knocking the softballs out of the park. The answer is simple...

They are _NOT_ correlated. It is pure idiocy to even consider that possibility since I have explained _exactly_ how I run test matches. With nothing carried through hash tables or through any sort of learning, there is no way there can be any true correlation between games. It just can't happen.

Of course, feel free to postulate some scenario where it could happen and I can tell you whether it is possible in my test setup or not. But I certainly see no way, and consider this an irrelevant discussion based on what I am doing.



For if they are not correlated, the standard deviation of the mini-match results would be limited to sqrt(80) ~ 9, and even ~7 if you have draws, so results >= 30 are 4-sigma fluctuations, and should not occur more often than once every 15,000 mini-matches, or so. Not "many times".
So what? I've presented the data. So either the data is made up, or there is some bizarre form of correlation at work? There is significant randomness in the results. I don't see any way to get rid of it. So I am working around the randomness by playing a significant number of games.

You can continue to dance around the statistical pole if you want, but the data is what it is, and one thing it isn't is correlated. Unless all chess programs play correlated games in tournaments, when learning is disabled. Yes there could be correlation with learning allowed. But it isn't in my tests.

The only way you can have a probability distribution of the mini-match results other than a normal one with SD = ~0.8*sqrt(80) = ~7 is to have correlation between the 80 games. That is a hard mathematical fact, as certain as that 1+1=2.
So you either should drop your claim that extreme deviations occur much more frequently than such a normal distribution would predict, or you would have to admit that there is correlation between games. (Or you could of course start claiming that all mathematics that has been done since Euclid is just nonsense...)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Uri Blass wrote:
Gerd Isenberg wrote:
hgm wrote:
bob wrote:What is that based on? Reading tea leaves?
No, on mathematical calculation / proof. Something that, by now, I am beginning to fear is an utterly alien concept to you.
I have been running these kinds of tests for _many_ years. Many is >= 30. So I have some clue about how to run game tests where things don't get biased by the testing itself. Each game is played separately. Games are played on different systems. I monitor the load carefully while a game is in progress to make sure that nothing unusual happens (it rarely does, but it is not a zero-probability event).

There is absolutely no way that game X influences the result of game Y. It is physically impossible. So we can get off that bandwagon before it leaves the station.
So explain to us then how the results of games X and Y become correlated...

For if they are not correlated, the standard deviation of the mini-match results would be limited to sqrt(80) ~ 9, and even ~7 if you have draws, so results >= 30 are 4-sigma fluctuations, and should not occur more often than once every 15,000 mini-matches, or so. Not "many times".

The only way you can have a probability distribution of the mini-match results other than a normal one with SD = ~0.8*sqrt(80) = ~7 is to have correlation between the 80 games. That is a hard mathematical fact, as certain as that 1+1=2.
So you either should drop your claim that extreme deviations occur much more frequently than such a normal distribution would predict, or you would have to admit that there is correlation between games. (Or you could of course start claiming that all mathematics that has been done since Euclid is just nonsense...)
You sound like a theoretical weisenheimer to me ;-)
I have no clue about statistics, but somehow Bob's results sound more plausible to me and your deterministic ones suspect.

Assuming programs terminate their search prematurely, before they finish an iteration - based on checking the time every N nodes - I would have expected such random results from a set of balanced-position matches between roughly equally strong opponents. Keeping the hash table between searches amplifies very minor changes. IMHO a simple application of chaos theory.

There is also the multithreaded OS environment: context-switch granularity, the shifting phase between polling the time every N nodes and the clock counters, processor-thread affinity, page and cache issues, big (chaotic) processor heuristics like the TLB and BTB, other running processes/threads (even if sleeping most of the time), etc. The number of instructions executed per unit of time may vary a lot per process/thread. Searches may vary by N, or by up to some +-1E5 nodes per search (+-0.1 sec), leaving different hash footprints.

In quiet positions with a lot of equally good moves, even very minor move-ordering changes caused by slightly different hash footprints may now and then result in a different move at the root, especially in selective programs rather than pure alpha-beta. Once is enough to get a different game with another outcome, from either side. The "complexity" of the evaluation - the number of terms and their "noise" - may amplify that non-determinism as well, as may subtle "constructive" bugs in the eval due to initialization issues or aggressive compiler optimization ;-)

Weaker programs with unbalanced knowledge and mutually exclusive holes may somehow be more likely to play those positions deterministically.

Running multiple programs on multiple cores seems to be another amplifier. If one thread takes advantage of some lucky resource mappings, other processes are likely to be more unlucky.

Better suited to playing more significant matches with fewer games would be two single-core computers, each running a single thread under a DOS-like (real-time) OS. And/or both programs could terminate on a number of nodes estimated from their target time and NPS - a persistent NPS[matkey], "learned" beforehand in different runs over all kinds of material constellations.
Gerd,

You do not understand H.G.Muller's point.

His claim is that the data Bob gave earlier, from the first 4 matches against the same opponent, is something you would expect to be very rare even if the result of every game were decided by pure luck and the matches used random positions, so that the positions of one match do not repeat in another match.


He did not claim that it is impossible for the programs to be non-deterministic in the same position.

The main question is whether the variance of the 80-game match results is bigger than the sum of the variances of the individual games.

In theory they should be equal, but the results made me suspect that this is not the case, and that maybe Bob believes the games are independent when for some reason they are not.

If you see a coin fall on heads 70 times out of the first 100 tries, then it is logical to suspect that the coin is simply not fair (of course it is possible that the coin is fair and you just had bad luck).

Uri
Sorry, but I play blackjack frequently, and well. And I play "with advantage". And I have seen many "impossible events" over years of play. I should get a blackjack roughly every 21 hands. I have played for 5+ hours with a witness, at about 100 hands per hour, with absolutely zero blackjacks. The list goes on and on. If I flipped a coin 100 times and got 70 heads, I personally would not bat an eye. If I flipped it 10,000 times and got 7,000 heads I would be more concerned. Short samples of long trials can produce anything. If you flip a coin 1,000,000 times, you will find all sorts of "patterns" if you look hard enough. A long string of heads, or a long string of heads/tails alternating, etc. But the main effect will be normally distributed around the 50-50 mean.
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: An objective test process for the rest of us?

Post by jwes »

bob wrote:
jwes wrote:
bob wrote:
And suppose the first 100 games end up 80-20, and the second 100 (which you choose not to play) end up 20-80? Then what?
What they are saying is that the variances you are quoting are much higher than you would get from a purely random process. E.g. if the probabilities of program A against Crafty are 40% wins, 30% draws, and 30% losses, and you wrote a program that randomly generated sequences of 100 trials with the above probabilities, you would not see anywhere near the differences between these sequences that you have been getting. This would strongly suggest problems with the experimental design.
As I have mentioned repeatedly, you can see what causes the variability by running this test:
As we have mentioned repeatedly, the variability you are quoting is too high to be due to randomness. I believe this can be a result of the events in a trial not being independent, e.g. an engine being stronger or weaker than usual for all the games in a set. Do you keep track of the NPS or total nodes analyzed for each engine? Another idea is to put a large quantity of data into SYSTAT or SPSS and look for unexpected correlations. Ask your statistics person for ways to analyze your data for statistical anomalies.
bob wrote:If you claim that is a fault of the setup, then feel free to suggest a solution. But the solution has to involve not modifying all the programs to search in a way that is different from how they normally work.
One idea is to set the other engines (but not crafty) to search to a fixed depth. This should reduce variability and make those engines play at a more consistent strength level.
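For illustration, here is a rough sketch of the check jwes describes: generate many independent 100-game sequences with fixed win/draw/loss probabilities (the 40%/30%/30% figures are just the example from the post) and look at how much the match scores actually spread:

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define GAMES  100
#define TRIALS 10000

int main(void)
{
    double sum = 0.0, sumsq = 0.0;
    srand(12345);
    for (int t = 0; t < TRIALS; t++) {
        double score = 0.0;
        for (int g = 0; g < GAMES; g++) {
            double u = (double)rand() / RAND_MAX;
            if (u < 0.40)      score += 1.0;     /* win  */
            else if (u < 0.70) score += 0.5;     /* draw */
            /* else loss: +0.0 */
        }
        sum   += score;
        sumsq += score * score;
    }
    double mean = sum / TRIALS;
    double sd   = sqrt(sumsq / TRIALS - mean * mean);
    /* with independent games the spread comes out near
       sqrt(GAMES * per-game variance), i.e. only ~4 points around 55 */
    printf("mean score %.2f, standard deviation %.2f\n", mean, sd);
    return 0;
}
```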
Rein Halbersma
Posts: 751
Joined: Tue May 22, 2007 11:13 am

Re: An objective test process for the rest of us?

Post by Rein Halbersma »

bob wrote:And I have seen many "impossible events" over years of play. I should get a blackjack every 21 hands roughly. I have played for 5+ hours with a witness, at about 100 hands per hour, with absolutely zero blackjacks.
That's a pretty rare event indeed, with a probability of 2.5e-11, or odds of about 1 to 40 billion! Of course, that's assuming that every hand is independent of the previous ones, but I guess they play with >6 decks and reload the decks frequently?
bob wrote: The list goes on and on. If I flipped a coin 100 times and got 70 heads, I personally would not bat an eye.
The chance of that happening is about 3.9e-05, or odds of about 1 to 25,000. Most people would consider the coin biased after such a result.
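For reference, a small sketch that reproduces both of these numbers, assuming independent hands/flips and taking the blackjack probability as 1/21 per hand (Bob's own figure):

```c
#include <stdio.h>
#include <math.h>

/* log of the binomial coefficient via lgamma(), to avoid overflow */
static double log_choose(int n, int k)
{
    return lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0);
}

int main(void)
{
    /* P(no blackjack in 500 hands) = (1 - 1/21)^500 ~ 2.5e-11 */
    double p_no_bj = pow(1.0 - 1.0 / 21.0, 500);
    printf("P(0 blackjacks in 500 hands) = %.1e\n", p_no_bj);

    /* P(at least 70 heads in 100 fair flips)
       = sum_{k=70..100} C(100,k) / 2^100 ~ 3.9e-05 */
    double p_heads = 0.0;
    for (int k = 70; k <= 100; k++)
        p_heads += exp(log_choose(100, k) - 100.0 * log(2.0));
    printf("P(>=70 heads in 100 flips)   = %.1e\n", p_heads);
    return 0;
}
```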
Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

bob wrote:
hgm wrote: So explain to us then how the results of games X and Y become correlated...
You set 'em up, I'll keep knocking the softballs out of the park. The answer is simple...

They are _NOT_ correlated. It is pure idiocy to even consider that possibility since I have explained _exactly_ how I run test matches. With nothing carried through hash tables or through any sort of learning, there is no way there can be any true correlation between games. It just can't happen.

Of course, feel free to postulate some scenario where it could happen and I can tell you whether it is possible in my test setup or not. But I certainly see no way, and consider this an irrelevant discussion based on what I am doing.
This is simply not correct.

Saying that result x is correlated with result y is a statement of mathematical fact about the data.

You can say that it happened because of luck, but it is a mathematical fact, and not a claim that variable x is dependent on variable y.

Note that you can always expect some correlation between 2 variables even if they are independent, but usually you will get a small correlation and you will be able to accept the hypothesis that they are independent.

In this case the event that happened can happen only about one time in a million, assuming that the variables are independent and assuming that H.G.Muller's calculations are correct (I did not bother to calculate the exact number).

It means that if you test H0 (the games are independent) against
H1 (they are dependent), you will accept H1 unless you use an alpha smaller than 1/1,000,000.
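For illustration, a minimal sketch of that significance calculation, using the normal approximation with the ~7-point standard deviation H.G.Muller derives for an 80-game match; the 30-point deviation is illustrative, not taken from Bob's actual data:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double deviation = 30.0;   /* observed swing in match points (illustrative) */
    double sd        = 7.0;    /* SD of an 80-game result under independence    */
    double z         = deviation / sd;
    /* two-sided p-value from the normal tail: p = erfc(z / sqrt(2)) */
    double p = erfc(z / sqrt(2.0));
    printf("z = %.2f, p = %.1e\n", z, p);
    /* z ~ 4.3 gives p ~ 2e-5, far below any usual alpha, so H0
       (independent games) would be rejected */
    return 0;
}
```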

You can explain it by luck, but it is logical to suspect that something is wrong in the conditions of the experiment.

Uri
User avatar
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:They are _NOT_ correlated. It is pure idiocy to even consider that possibility since I have explained _exactly_ how I run test matches.
Good! So if x_ij is the result of game i in mini-match j, you claim that the correlation between x_ij and x_km, which is defined as cov(x_ij,x_km)/sqrt(var(x_ij)*var(x_km)), equals 0. (Cov stands for covariance.) Thus you claim that

cov(x_ij,x_km) = 0,

as the variances of the individual game results are all finite and bounded by 1 (since the results lie between -1 and +1). Now covariance is defined as

cov(x_ij,x_km) = E(x_ij*x_km) - E(x_ij)*E(x_km),

with E(x) the expectation value of quantity x.

Now for the result R_j (R for short) of a mini-match, R_j = SUM_i x_ij, we have

E(R) = SUM_i E(x_ij),

and
var(R) = E(R*R) - E(R)*E(R) = E((SUM_i x_ij)*(SUM_i' x_i'j)) - (SUM_i E(x_ij))*(SUM_i' E(x_i'j))
= SUM_ii' { E(x_ij*x_i'j) - E(x_ij)*E(x_i'j) }
= SUM_i var(x_ij) + SUM_(i != i') cov(x_ij,x_i'j),

where the last line comes from grouping the double sum over i and i' in terms with i=i' in the first sum, and i != i' in the second sum. Now you claim that cov(x_ij,x_i'j) = 0 for every i != i', so we have:

var(R) = SUM_i var(x_ij).

Now 0 <= var(x_ij) <= 1 for every i, so we have

0 <= var(R) <= M,

where M is the number of games in a minimatch.

For 'expectation value', you can also read 'average over all mini-matches j', and the derivation also holds for that.

So according to your statement about the correlation of the game results, the variance of the mini-match results is below M (i.e. the standard deviation is below sqrt(M)).

This is in contradiction to your claim that on average (over j) you see a variance of R much larger than M. So are you willing now to drop that claim about the variance? Or are you maintaining both claims, and don't bat an eye when adding M terms each smaller than 1 produces a result larger than M? :roll:
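As a numerical sanity check of this bound (illustrative only, assuming equally likely win/draw/loss results just to have concrete numbers):

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define M       80          /* games per mini-match   */
#define MATCHES 100000      /* simulated mini-matches */

int main(void)
{
    double sum = 0.0, sumsq = 0.0;
    srand(1);
    for (int j = 0; j < MATCHES; j++) {
        int R = 0;
        for (int i = 0; i < M; i++)
            R += rand() % 3 - 1;    /* independent result in {-1, 0, +1} */
        sum   += R;
        sumsq += (double)R * R;
    }
    double mean = sum / MATCHES;
    double var  = sumsq / MATCHES - mean * mean;
    /* with independent games var(R) stays below M = 80, SD(R) below ~9 */
    printf("var(R) = %.1f (bound M = %d), SD(R) = %.1f\n", var, M, sqrt(var));
    return 0;
}
```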

Like Uri says, the correlations and covariances are just quantities derived from your dataset, and what you told us (or led us to believe...) is that var(R) > M. As we _know_ the variance of the individual game results to be bounded by 1, we thus know, by your own claim, that the covariances in your data set are nonzero.

All you do is argue that there cannot be any _causal_ relationship between the games. But that has no relevance for the fact that the _correlations_ apparently are there, in your dataset. (Unless you are just bullshitting us.) And you claim that this is typical behavior, not a one-time fluke due to an extremely unlucky start that goes away after long enough data collection and averaging. You claim that you keep _consistently_ seeing this over-representation of rare deviations in your data. So the burden is on you now to explain how this correlation could maintain itself in the absence of a causal relationship.

To someone with even the slightest insight into statistics, your claims are of course as idiotic as someone claiming that measurements with his expensive new machine have proved that the fraction of oxygen in air is not 20%, but 110%. He will be met with ridicule, even by people who don't know how his equipment works. And quite justly so, as uncritically publishing such obviously faulty data is simply bad science...
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:8 million PGN files is simply unmanageable.
It is possible to put them into 8 1-million-game files, or 80 100,000-game files, ... you could even store them in a more efficient format than PGN.

You're a CS prof, surely you can devise a solution?

So I guess it's not that you were unable to store them, but that you saw no need for storing them.

Which is perfectly acceptable.


Also note that for statistical analysis, you don't need the actual games at all, only the results.
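For illustration, a minimal sketch of result-only storage: one character per game ('w', 'd', 'l') in a hypothetical results.txt, which keeps even 8 million games down to a few megabytes and is all the statistics need:

```c
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("results.txt", "r");     /* hypothetical file name */
    long w = 0, d = 0, l = 0;
    int  c;
    if (!f) { perror("results.txt"); return 1; }
    while ((c = getc(f)) != EOF) {
        if (c == 'w')      w++;              /* win  */
        else if (c == 'd') d++;              /* draw */
        else if (c == 'l') l++;              /* loss */
    }
    fclose(f);
    printf("wins %ld, draws %ld, losses %ld, score %.1f\n",
           w, d, l, w + 0.5 * d);
    return 0;
}
```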
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

Rein Halbersma wrote:
bob wrote:And I have seen many "impossible events" over years of play. I should get a blackjack every 21 hands roughly. I have played for 5+ hours with a witness, at about 100 hands per hour, with absolutely zero blackjacks.
That's a pretty rare event indeed, with a probability of 2.5e-11, or odds of about 1 to 40 billion! Of course, that's assuming that every hand is independent of the previous ones, but I guess they play with >6 decks and reload the decks frequently?
bob wrote: The list goes on and on. If I flipped a coin 100 times and got 70 heads, I personally would not bat an eye.
The chance of that happening is about 3.9e-05, or odds of about 1 to 25,000. Most people would consider the coin biased after such a result.
But those were the results he got after how many flips? It is a mistake to select the one noticeable event and claim it is particularly remarkable.

It would be remarkable if this were the only experiment. But you remember it as standing out only because of all the other experiments where this didn't happen.


Flipping a coin 100 times and getting 70 heads in total, or getting 70 heads in succession? Are you both talking about the same thing? I am fairly sure that getting 70 heads in total instead of the expected 50 is not remarkable, and thus I would also not bat an eye.