bob wrote:
hgm wrote:
bob wrote: ..., otherwise I am getting _way_ too many impossible results.
Well, what can I say. If you get impossible results, they must be wrong results. And getting many wrong results is a good operational definition for incompetence...
Aha. Opinions rather than facts now.
???
Seems a fact to me that you complained about getting impossible results. These were not my words.
What is "impossible"? _any_ result is "possible". And people do win the lottery every week. Whether this is that kind of unusual event or not we will see as more tests roll out.
Impossible in the practical sense is a relative notion. In principle, it is possible that, when you step out of your office window on the fifth floor, the motion of the air molecules below you undergoes a random fluctuation such that you remain floating in the air. Do not try it! You can be certain to drop like a stone. If you see people floating in the air without any support, you can be certain that they have a way of supporting themselves, and that you merely do not see it. Likewise here.
And I have not made that kind of mistake. Again, _any_ result is possible; you are claiming otherwise, which shows a complete misunderstanding of this particular type of statistical result. There are no "correct" or "incorrect" results possible in this kind of testing, since any possible result, from all wins to all losses to any combination of the two, is possible.
Not in any useful and pragmatic sense. Some results are too unlikely to be acceptable. It all relates to Bayesian likelihood. There are alternative explanations for the observation. You can never exclude all of them with absolute certainty; some of them you might not even be aware of. One of your students might want to pull a prank on you, and substitute one of the engine executables or the result file. There might be ways to doctor the results we cannot even imagine. One can never exclude everything to a level of one in a billion. And if an explanation remains with a prior probability of more than one in a billion, it will outcompete the explanation that everything was as it should have been and the result is only due to a one-in-a-billion fluke. Because the Bayesian likelihood of this null hypothesis being correct has just been reduced to the one-in-a-billion level by the occurrence of the result, while the posterior probability of the alternative hypotheses is limited only by the smallness of their priors.
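A toy calculation makes the bookkeeping concrete (a minimal sketch; every number below is an illustrative assumption, not taken from any real test):

```python
# Toy Bayesian odds calculation; all numbers are illustrative assumptions.

p_data_given_null = 1e-9   # the observed result as a genuine fluke
prior_null = 1.0 - 1e-6    # prior: everything was in order

prior_alt = 1e-6           # prior: something was wrong (prank, swapped
                           # executable, doctored result file, ...)
p_data_given_alt = 0.5     # under the alternative, the result is unsurprising

# Posterior odds of "something was wrong" against "genuine fluke":
odds = (prior_alt * p_data_given_alt) / (prior_null * p_data_given_null)
print(f"posterior odds, alternative : null = {odds:.0f} : 1")  # ~500 : 1
```

Even with an alternative as unlikely as one in a million beforehand, the odds after the observation favor it by hundreds to one over the honest-fluke explanation.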
That's the flaw in this kind of thinking: "the results must match the theory because the results must match the theory".
Learn about probability theory, and you might be able to grasp it. It is not that difficult. Every serious scientist in the world uses it.
I gave you the data, as it happened. I will provide more over time. I still have not seen you give one possible way where I can run each game independently and configure things so that the games are somehow dependent.
And none will be forthcoming as long as you don't enable me to do a post-mortem on your data set. Do you really expect that you can tell people just a single large number, and then ask them to explain how you got that number? That does not seem to indicate much realism on your part.
I could show you a single position where the four games (same two opponents) produce results ranging from +4 to -4, but there seems to be no point; after all, how could the two programs vary that much in a single position, right?
Are remarks like these meant to illustrate how mathematically inept you are, or what? Between approximately equal opponents the probability of a 0-4 or a 4-0 result is ~2.5% (assuming a 20% draw rate), and your ability to show them relies on selecting them from a set of tries much larger than 40. This remark displays the same lack of understanding as the one about winning the lottery. Of course people win the lottery: the product of the probability that a given individual wins and the number of participants is of order one.
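The arithmetic behind that figure, as a minimal sketch under the stated assumptions (equal opponents and a 20% draw rate, so each decisive result has probability 0.4 per game; the number of sets is hypothetical):

```python
# Sanity check: 4-game mini-match sweeps between equal opponents,
# assuming a 20% draw rate, i.e. P(win) = P(loss) = 0.4 per game.
p_win = 0.4

p_sweep = p_win ** 4                      # a 4-0 sweep in one given direction
print(f"P(4-0) = {p_sweep:.4f}")          # 0.0256, i.e. ~2.5%

# Select from many more than 40 tries and such sweeps become routine:
n_sets = 1000                             # hypothetical number of 4-game sets
expected = n_sets * 2 * p_sweep           # either direction counts
print(f"expected sweeps in {n_sets} sets: {expected:.0f}")  # ~51
```

The point is the product: a per-set probability of a few percent times hundreds of sets yields dozens of such "impossible" sweeps, exactly as with the lottery.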
One day, if you ever get a chance to run some big tests, you might end up re-thinking your rather rigid ideas about how things ought to be, as opposed to how they really are...
Fat chance, captain... For I have been an experimental physicist all my life, and I know how to collect data, and how to recognize collected data as crap when there was a fault in the data-collection equipment. When things are really as they should not be, it shows that you goofed. Mathematics is indeed a rigid endeavor, and results that violate mathematical bounds are more suspect than any other.
But the main difference between us is that if I would have faulty equipment, I would analyze the erroneous data in order to trace the error and repair the equipment. Your abilities seem to stop at trying it again in the same way, in the hope it will work now...
bob wrote:
That is exactly my point. When there is absolutely no difference between runs a, b, c, d, etc., except for the timing randomness I have already identified, then that is the cause of the randomness.
I was not asking for the cause of the randomness, but for the cause of the correlation between results of games in the same run that your data contains.
I have already verified that and explained how. Searching to fixed node counts produces the same results each and every time, to the exact number of wins, losses and draws. Now the question would be, how could this test somehow bias the time allocation to favor one program more than another? I have played matches of 1+1, 2+2, 3+3, 4+4 and 5+5 to verify that the results are not that different. So how do I bias time to make the results favor one over another in a correlated way, when they share the _same_ processor, same memory, etc.
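A stand-in sketch of the distinction being described (a hypothetical toy "search", not any real engine): a node-count cutoff is a deterministic stopping rule, while a wall-clock cutoff depends on scheduling jitter.

```python
# Toy "search" illustrating why a node-count cutoff reproduces exactly
# while a wall-clock cutoff does not. Not a real engine.
import time

def search(stop):
    nodes, score = 0, 0
    while not stop(nodes):
        nodes += 1
        score = (score * 31 + nodes) % 1000   # deterministic "evaluation"
    return nodes, score

# Fixed node count: bit-identical result on every run, on any machine.
print(search(lambda n: n >= 100_000))

# Fixed time: where the search stops depends on OS scheduling jitter,
# so repeated runs terminate at different node counts (different trees).
t0 = time.time()
print(search(lambda n: time.time() - t0 > 0.05))
```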
Time is an issue. Time is the only issue. And _everybody_ has exactly the _same_ timing issues I have, only more so, as most machines are not nearly as "bare-bones" as our cluster nodes with respect to extra system processes running and stealing cycles.
So, now that I have once again identified the _only_ thing that introduces any variability into the search, how could we _possibly_ manipulate the timing to produce dependent results? You suggest an idea, I will develop a test to see if it is possible or not on our cluster. But first I need an idea _how_...
It does not seem possible for timing jitter to systematically favor one engine over the other for an extended stretch of games. So it is not an acceptable explanation of the observed correlation. There must thus be another cause for it, implying also that other testers might not be plagued by it at all.
How exactly did you guard against one of the engines in your test playing stronger on even months, and weaker in odd months?
So in a set of 400 games from the same position, where the final result is dead even, I can't see 25 consecutive wins or losses? Or I can, but rather infrequently? You seem to say "impossible". Which is wrong, because it happened.
The probability of such an observation in a single 400-game run is 1 in 12 million (assuming a 20% draw rate). So if you have tried a few million 400-game runs, it would not be very remarkable if you observed that.
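That figure can be checked with a simple union bound (a sketch under the stated assumptions; the bound slightly overcounts overlapping streaks but pins down the order of magnitude):

```python
# Union-bound check of the "1 in 12 million" figure: a streak of 25
# consecutive wins (or losses) somewhere in a 400-game run between
# equal opponents, assuming a 20% draw rate.
p = 0.4                            # P(win) = P(loss) per game
streak, games = 25, 400

starts = games - streak + 1        # possible starting points of the streak
p_obs = 2 * starts * p ** streak   # x2: all-wins or all-losses
print(f"P ~ {p_obs:.1e}, i.e. about 1 in {1 / p_obs:,.0f}")
# -> about 1 in 12 million
```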
If it occurred in just a few hundred 400-game runs, you can be pretty sure something is going on that you don't understand, as the unlikelihood of this being a genuine chance event starts to rival the prior unlikelihood of the alternative hypotheses, which is the best you could ever guarantee, no matter how hard you tried. The likelihood of the alternatives would probably not yet exceed the probability of a statistical fluke by so much that you could safely bet your life on it, though.
Rambling leads nowhere. How can a processor favor one opponent consistently over several games?
That is for you to figure out. They are your processors, after all. When doing a post-mortem to find the cause of death, one usually cuts open the body to look inside. You still expect us to see it from looking only at the outside (= overall score). Not possible...
...
I doubt you could put together a more consistent test platform if you had unlimited funds.
Well, so maybe it is the engines that have a strength that varies on the time-scale of your runs. Who knows? This line of arguing is not productive. You should look at the data, to extract the nature of the correlation (time-wise, core-wise, engine-wise). Staring at the dead body and speculating about what could not have caused its death is no use.
Totally different animal there.
Wrong! Whether your data is noisy or not is never decided by the factory specs and manufacturer guarantee of your equipment. The data itself is the ultimate proof. If two measurements of the same quantity produce different readings, the difference is by definition noise. What you show us here is low-quality, noisy data. If the test setup producing it promised you low-noise, good-quality data, ask for your money back.