hgm wrote:
bob wrote:
hgm wrote:
And this is the point that you don't seem to get. It doesn't matter how random the influences are. Random is good. The numbers quoted by BayesElo (which, in your case, you could also calculate by hand in 15 seconds) assume totally independent random results, and are an upper limit to the difference that you could typically have between repetitions of the same experiment.
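For concreteness, here is a minimal sketch of the error bar that assumption produces (this is not BayesElo's own code, and the 210/80/110 result below is an invented example, not data from any of the runs being argued about):

```python
# Minimal sketch (assumed numbers, not BayesElo's code): the standard error of a
# match score when every game is an independent sample from one fixed
# win/draw/loss distribution, i.e. the assumption described above.
import math

def score_and_error(wins, draws, losses):
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n              # match score as a fraction
    # per-game variance of the score around its mean
    var = (wins * (1 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    return score, math.sqrt(var / n)              # standard error of the mean

score, se = score_and_error(wins=210, draws=80, losses=110)   # invented 400-game run
print(f"score {score:.3f} +/- {se:.3f} (1 sigma); ~95% band is +/- {2 * se:.3f}")
```

With 400 games the 1-sigma band on the score fraction comes out around two percentage points, which is the "calculate by hand in 15 seconds" number being referred to.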
Apparently not. I just gave two runs that violated this completely. So it is _far_ from absolute. And again, I will say that this apparently doesn't apply to computer games, otherwise I am getting _way_ too many impossible results.
Well, what can I say. If you get impossible results, they must be wrong results. And getting many wrong results is a good operational definition for incompetence...
Aha. Opinions rather than facts now. What is "impossible"? _Any_ result is "possible". And people do win the lottery every week. Whether this is that kind of unusual event or not, we will see as more tests roll out.
If you think that all of mathematics can be dismissed because you lack the imagination to pinpoint your source of errors, and fix it, think again. Mathematics is not subject to experimental testing or verification. No matter how often and in how many different ways you calculate for us that 25 x 25 equals 626, it would only serve to show that you don't know how to multiply. The correct answer would still remain 625.
And I have not made that kind of mistake. Again, _any_ result is possible; you are claiming otherwise, which shows a complete misunderstanding of this particular type of statistical result. There are no "correct" or "incorrect" results possible in this kind of testing, since anything from all wins to all losses, and every combination of those two results in between, is possible.
That's the flaw in this kind of thinking: "the results must match the theory because the results must match the theory." I gave you the data, as it happened. I will provide more over time. I still have not seen you give one possible way in which I can run each game independently and configure things so that the games are somehow dependent. I could show you a single position where the four games (same two opponents) produce anything from +4 to -4, but there seems to be no point; after all, how could the two programs vary that much in a single position, right?
One day, if you ever get a chance to run some big tests, you might end up re-thinking your rather rigid ideas about how things ought to be, as opposed to how they really are...
bob wrote:
hgm wrote:
Your results violate that upper bound, and hence the assumptions cannot be satisfied. If you cannot pinpoint the source of the dependence between your games, and eliminate it, your large-number-of-games testing will be completely useless. If I had designed a laser-ruler to measure distances with sub-nanometer precision, and the typical difference between readings on the same steel needle were more than a millimeter, I would try to repair my laser-ruler. Not complain about the variability of the length of needles...
Right. And I have explained exactly why there is no dependency of one game on another.
So your explanation is wrong. Mathematical facts cannot be denied. Apparently the environment is not as controlled as you think.
The game results are correlated. (Mathematical fact.) Explain to me how they can get correlated without a causal effect acting between them, or show us what the causal effect is.
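One way to see the logic, as a toy model rather than a claim about the actual runs: if every pair of games shares the same correlation rho, the run-to-run spread of the match score grows beyond what independence allows, so spread in excess of the independent prediction implies rho > 0. The per-game standard deviation of 0.45 below is an assumed round number:

```python
# Toy model: with pairwise correlation rho between game results, the variance of
# the mean score is (sigma^2 / n) * (1 + (n - 1) * rho). Observing more
# run-to-run spread than the rho = 0 line predicts therefore implies the games
# are not independent. sigma_game = 0.45 is an assumed value, not measured data.
import math

n = 400            # games per run
sigma_game = 0.45  # assumed per-game standard deviation of the score
for rho in (0.0, 0.01, 0.05):
    se = sigma_game * math.sqrt((1 + (n - 1) * rho) / n)
    print(f"rho = {rho:.2f}: expected run-to-run std dev of the score = {se:.3f}")
```

Even a small rho inflates the spread substantially at n = 400, which is why the argument turns on whether any correlating mechanism exists at all.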
That is exactly my point. When there is absolutely no difference between runs a, b, c, d, ... etc. except for the timing randomness I have already identified, then that is the cause of the randomness. I have already verified that and explained how. Searching to fixed node counts produces the same results each and every time, down to the exact number of wins, losses and draws. Now the question would be: how could this test somehow bias the time allocation to favor one program more than another? I have played matches at 1+1, 2+2, 3+3, 4+4 and 5+5 to verify that the results are not that different. So how do I bias time to make the results favor one over another in a correlated way, when they share the _same_ processor, same memory, etc.?
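To put a number on "not that different" under the independence assumption, here is the kind of quick check one could run; every game count below is an invented placeholder, not a result from the 1+1 through 5+5 matches:

```python
# Rough consistency check between two repetitions of the same match, assuming
# independent games: is the score difference large compared to the combined
# standard error? All game counts below are invented placeholders.
import math

def score_and_error(wins, draws, losses):
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    return s, math.sqrt(var / n)

s1, se1 = score_and_error(220, 70, 110)   # hypothetical run A
s2, se2 = score_and_error(180, 75, 145)   # hypothetical run B
z = (s1 - s2) / math.sqrt(se1 ** 2 + se2 ** 2)
print(f"scores differ by {s1 - s2:.3f}, about {abs(z):.1f} standard errors apart")
```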
Time is an issue. Time is the only issue. And _everybody_ has exactly the _same_ timing issues I have, only more so, since most machines are not nearly as "bare-bones" as our cluster nodes with respect to extra system processes running and stealing cycles.
So, now that I have once again identified the _only_ thing that introduces any variability into the search, how could we _possibly_ manipulate the timing to produce dependent results? You suggest an idea, and I will develop a test to see whether it is possible on our cluster. But first I need an idea _how_...
Each is run individually, in a very controlled environment. The same copy of an engine is used each and every time. Whatever they do internally I don't care about, since that is a variable everyone will have to deal with. I'm tired of this nonsense about games somehow being dependent, when it just can't happen. There are other explanations, if you would just look. One is the inherent randomness and streakiness of chess games.
What incredible bullshit. A freshman's course in statistics might cure you of such delusions. There is no way you can violate a Gaussian distribution whose width is calculated from independent results. Do the freshman math, or ask a competent statistician if you can't...
So in a set of 400 games from the same position, where the final result is dead even, I can't see 25 consecutive wins or losses? Or I can, but rather infrequently? You seem to say "impossible", which is wrong, because it happened.
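For scale, here is a small Monte Carlo sketch of that question, with invented win/draw/loss probabilities for an evenly matched pair. Analytically, a streak of 25 identical decisive results somewhere in 400 such games has probability on the order of 10^-9 with these toy numbers, so the simulation below will almost certainly report zero:

```python
# Monte Carlo sketch of the question above: in 400 games between roughly even
# opponents (win/draw/loss probabilities below are invented), how often does a
# streak of 25 identical decisive results show up anywhere in a run?
import random

def longest_streak(results):
    best = run = 0
    prev = None
    for r in results:
        if r != 'D' and r == prev:
            run += 1                  # streak of identical decisive results grows
        elif r != 'D':
            run = 1                   # a new decisive result starts a streak of 1
        else:
            run = 0                   # a draw breaks any streak
        prev = r
        best = max(best, run)
    return best

random.seed(1)
trials, hits = 10000, 0
for _ in range(trials):
    games = random.choices('WDL', weights=[0.35, 0.30, 0.35], k=400)
    if longest_streak(games) >= 25:
        hits += 1
print(f"streaks of 25+ seen in {hits} of {trials} simulated 400-game runs")
```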
One is the timing issue that can never be eliminated in normal chess play. That does not produce direct dependencies. If running consecutive games somehow biases that computer to favor one opponent consistently, there is nothing that can be done, because there is no logical way to measure and fix such a thing. It is just more pure randomness thrown in...
If the computer favors one opponent consistently over several games, there will be a causal effect that causes it. And of course that can be fixed,
as there is no logical need for such a causal link to exist. And even if you don't know the underlying physical mechanism of such causation, you can eliminate it if you know its behavior from observation. E.g., if games played close together in time tend to give correlated results, you can interleave the two runs that you want to compare, alternating games from one run and the other, so that both will be affected the same way by the mysterious effect.
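A sketch of that interleaving; the run contents below are placeholders, and only the ordering matters:

```python
# Sketch of the interleaving described above: rather than playing all of run A
# and then all of run B, alternate games from the two runs so that any slow
# drift in the test machine hits both runs about equally. The run labels are
# placeholders, not anyone's actual setup.
run_a = [("versionA_vs_opponent", i) for i in range(1, 201)]   # hypothetical run A
run_b = [("versionB_vs_opponent", i) for i in range(1, 201)]   # hypothetical run B

schedule = []
for game_a, game_b in zip(run_a, run_b):
    schedule += [game_a, game_b]      # A, B, A, B, ... instead of A...A, B...B

print(schedule[:6])
```

The point of the design is that any effect varying slowly over the course of the session contributes nearly equally to both runs, so it cancels in the comparison.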
Rambling leads nowhere. How can a processor favor one opponent consistently over several games? Please explain that in the context of the current Linux kernel and the straightforward O(1) scheduler it uses. It is easy enough to measure this if you want to believe it is a problem. But here's the flaw: no pondering, no SMP. So in a single game, only one process is ready to run at any instant in time. So how can you favor A, when about half the time A does not even want to execute? That would be a tough idea to sell to _anybody_ with any O/S experience of any kind.
How else could it favor one side, besides scheduling? Giving one process more cache? Hardware won't preferentially do that under any circumstance. TLB entries? Ditto. There is a minor issue of page placement in memory, where bad placements can cause some cache aliasing that can alter performance by a relatively small amount, and it is usually completely random. But in the last test I ran, that wasn't possible. Each node was warm-started before each game, so that everything was as close as possible to the same starting condition (probably never exact, since the boot process starts several temporary processes and they will suffer from timing randomness, which will affect the free memory list order in unpredictable ways). But to bias a single opponent repeatedly would not be possible as it currently works.
So, what is left? No I/O is going on, so no way to fudge there. Neither process/program is using enough memory to cause problems. They are set as equal as possible with respect to hash sizes: I'm using between 96M (crafty) and 128M (for those that support that size) on nodes that have 4 gigs (olympus) or 12 gigs (ferrum); the results being discussed were all from olympus.
I doubt you could put together a more consistent test platform if you had unlimited funds.
There are in fact zillions of such tricks. Consult an experimental physicist; they can tell you everything about the proper ways to do data collection in noisy environments.
Totally different animal there.