bob wrote: ..., otherwise I am getting _way_ too many impossible results.

hgm wrote: Well, what can I say. If you get impossible results, they must be wrong results. And getting many wrong results is a good operational definition for incompetence...

bob wrote: Aha. Opinions rather than facts now.

hgm wrote: ???

Here is a simple experiment for you to run. Go back to my original post at the top of this thread. If you are using or can use Firefox, then click "edit", then "find", type "impossible" in the box, and tell me where that word appears anywhere in my post...
Seems a fact to me that you complained about getting impossible results. These were not my words.
bob wrote: Been using it for years in other things. Including my mentioned blackjack card counting exploits. And these streaks are not unusual at all. How many consecutive blackjacks do you believe would be "impossible" in real play? I've gotten seven in a row. I can explain how/why quite easily. I'll bet you will say that is "impossible". I have lost 31 consecutive hands in a game where I have a slight advantage. "Impossible," you say. And I have not played millions of hands either. I sat at a table at the Tropicana 3 years ago and played for one hour with 3 witnesses looking on. I played 85 hands, and excluding 3 pushes that were spaced out, I lost every last one. And not intentionally, as it was _my_ money.

hgm wrote: Impossible in the practical sense is a relative notion. In principle, it is possible that, when you step out of your office window on the fifth floor, the motion of the air molecules below you experiences a random fluctuation so that you remain floating in the air. Do not try it! You can be certain to drop like a stone. If you see people floating in the air, without any support, you can be certain that they have a way of supporting themselves, and that you merely do not see it. Likewise here.

bob wrote: What is "impossible"? _Any_ result is "possible". And people do win the lottery every week. Whether this is that kind of unusual event or not, we will see as more tests roll out.

hgm wrote: Not in any useful and pragmatic sense. Some results are too unlikely to be acceptable. It all relates to Bayesian likelihood. There are alternative explanations for the observation. You can never exclude all of them with absolute certainty; some of them you might not even be aware of. One of your students might want to pull a prank on you, and substitute one of the engine executables or the result file. There might be ways to doctor the results we cannot even imagine. One can never exclude everything to a level of one in a billion. And if an explanation remains with a prior probability of more than one in a billion, it will outcompete the explanation that everything was as it should have been and the result is only due to a one-in-a-billion fluke. Because the Bayesian likelihood of this null hypothesis being correct has just been reduced to the one-in-a-billion level by the occurrence of the result, while the smallness of some of the alternative hypotheses relies solely on the smallness of their prior probability.

bob wrote: And I have not made that kind of mistake. Again, _any_ result is possible; you are claiming otherwise, which shows a complete misunderstanding of this particular type of statistical result. There are no "correct" or "incorrect" results possible in this kind of testing, since any possible result, from all wins, to all losses, to any possible combination of those two, is possible.

hgm wrote: Learn about probability theory, and you might be able to grasp it. It is not that difficult. Every serious scientist in the world uses it.

That's the flaw in this kind of thinking: "the results must match the theory because the results must match the theory".
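To make the Bayesian point concrete, here is a toy posterior-odds calculation; every number in it is an illustrative assumption, not a figure from this thread:

[code]
# Toy posterior-odds calculation; all numbers are illustrative
# assumptions, not figures from this thread.
p_result_given_null = 1e-9   # the observed fluke, if the setup was fine
p_result_given_alt  = 0.5    # under the alternative (prank, swapped binary)
prior_null          = 1 - 1e-6
prior_alt           = 1e-6   # even a very improbable alternative...

odds = (prior_alt * p_result_given_alt) / (prior_null * p_result_given_null)
print(f"posterior odds, alternative : null = {odds:,.0f} : 1")   # ~500 : 1
[/code]

Even granting the prank hypothesis only a one-in-a-million prior, the one-in-a-billion fluke loses to it by a factor of roughly 500, which is the argument being made above.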
Some of us can do that, yes. I have had to chase many operating system bugs over my career that were presented in just that way. Quite often one cannot produce a large log, any more than we can for SMP searches. Sometimes you are left with sitting down and asking yourself "How can this possibly happen?", then looking at the code and data structures and listing the possible ways the symptoms could occur. Then, by carefully analyzing the code, that list is pared down, hopefully until there is just one entry left that can't be excluded. Same approach for parallel search bugs. It is very uncommon to have enough information to find such bugs outright. Perhaps you don't have experience in either of those areas and therefore have not had to debug what is very difficult to pin down. But some of us do it regularly. And yes, it _is_ possible.

hgm wrote: And none will be forthcoming as long as you don't enable me to do a post-mortem on your data set. Do you really expect that you can tell people just a single large number, and then ask them to explain how you got that number? That does not seem to indicate much realism on your part.

I gave you the data as it happened, and I will provide more over time. I still have not seen you give one possible way in which I can run each game independently and yet configure things so that the games are somehow dependent.
I didn't ask you to tell me "how I ended up with that number...". I clearly asked: "Can you give me any possible (and by possible, I mean within the context of the hardware, operating system, and configuration I have previously clearly laid out) way I could cause dependencies between games?" That has nothing to do with my specific data. It is simply a question that leads to the conclusion that this does not seem possible without doing something so egregious that it would be both intentional and grossly obvious.
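For what it's worth, correlation between games in a run is something one can test for rather than argue about. A minimal sketch of a Wald-Wolfowitz runs test on a match's win/loss sequence (draws dropped; 'runs_test' is a hypothetical helper, not anyone's actual tool):

[code]
# Minimal sketch of a Wald-Wolfowitz runs test applied to a match's
# win/loss sequence (draws dropped). Hypothetical helper, not an
# actual tool from this thread.
import math

def runs_test(seq):
    """seq: list of +1 (engine A won) / -1 (engine A lost)."""
    n1, n2 = seq.count(+1), seq.count(-1)
    n = n1 + n2
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    mean = 2.0 * n1 * n2 / n + 1          # expected runs if independent
    var = 2.0 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    return runs, mean, z                   # |z| >> 2 suggests correlation

# example: a suspiciously streaky 20-game stretch
print(runs_test([+1] * 10 + [-1] * 10))   # 2 runs where ~11 are expected
[/code]

Far fewer runs than expected means results cluster into streaks, i.e. dependence; a count near the expectation is consistent with independent games.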
Or one can continually shout "can't be, can't be..." and hope that shouting it enough will make it so. But since you want to take the physics road: the last time I saw some odd thing reported, rather than everyone shouting "can't be, can't be" and stamping their feet, they attempted to repeat the result. So perhaps you are not quite the physicist you think you are. Have you tried any duplicate tests to see what kind of variability _you_ get in such testing? Didn't think so.

hgm wrote: Are remarks like these meant to illustrate how mathematically inept you are, or what? Between approximately equal opponents the probability of a 0-4 or a 4-0 result is ~2.5% (assuming a 20% draw rate), and your ability to show them relies on selecting them from a set of tries much larger than 40. This remark displays the same lack of understanding as the one about winning the lottery. Of course people win the lottery. The product of the probability that a given individual wins and the number of participants is of order one.

I could show you a single position where the four games (two same opponents) produce everything from +4 to -4, but there seems to be no point; after all, how could the two programs vary that much in a single position, right?

hgm wrote: Fat chance, captain... For I have been an experimental physicist all my life, and I know how to collect data, and how to recognize collected data as crap when there was a fault in the data-collection equipment. When things are really as they should not be, it shows that you goofed. Mathematics is indeed a rigid endeavor, and results that violate mathematical bounds are more suspect than any other.

One day, if you ever get a chance to run some big tests, you might end up re-thinking your rather rigid ideas about how things ought to be, as opposed to how they really are...
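As a check on the ~2.5% figure: with a 20% draw rate between equal opponents, a given side wins each game with probability 0.4, so a 4-0 sweep has probability 0.4^4 = 0.0256. A short sketch enumerating all 4-game outcomes under those assumed probabilities:

[code]
# Distribution of the net score over 4 games between equal opponents,
# assuming a 20% draw rate (P(win) = P(loss) = 0.4), as in the remark above.
from itertools import product

probs = {+1: 0.4, -1: 0.4, 0: 0.2}   # win / loss / draw for one engine
net = {}
for games in product(probs, repeat=4):
    p = 1.0
    for g in games:
        p *= probs[g]
    s = sum(games)
    net[s] = net.get(s, 0.0) + p

for score in sorted(net):
    print(f"net {score:+d}: {net[score]:.4f}")
# net +4 (a 4-0 sweep) comes out at 0.4**4 = 0.0256, i.e. ~2.5%
[/code]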
But the main difference between us is that if I had faulty equipment, I would analyze the erroneous data in order to trace the error and repair the equipment. Your abilities seem to stop at trying it again the same way, in the hope it will work now...
And I asked you to postulate a single reasonable explanation of how I could finagle things on the cluster to make that happen, without knowing which program I am supposed to be favoring, to somehow produce dependent results.

hgm wrote: I was not asking for the cause of the randomness, but for the cause of the correlation between results of games in the same run that your data contains.

bob wrote: That is exactly my point. When there is absolutely no difference between runs a, b, c, d, ... etc. except for the timing randomness I have already identified, then that is the cause of the randomness.
hgm wrote: It does not seem possible for timing jitter to systematically favor one engine over the other for an extended stretch of games. So it is not an acceptable explanation of the observed correlation. There must thus be another cause, implying also that other testers might not be plagued by this at all.

I have already verified that and explained how. Searching to fixed node counts produces the same results each and every time, down to the exact number of wins, losses, and draws. Now the question would be: how could this test somehow bias the time allocation to favor one program over another? I have played matches at 1+1, 2+2, 3+3, 4+4, and 5+5 to verify that the results are not that different. So how do I bias time to make the results favor one side in a correlated way, when the programs share the _same_ processor, same memory, etc.?
Time is an issue. Time is the only issue. And _everybody_ has exactly the same timing issues I have, only more so, as most machines are not nearly as "bare-bones" as our cluster nodes with respect to extra system processes running and stealing cycles.
So, now that I have once again identified the _only_ thing that introduces any variability into the search, how could we _possibly_ manipulate the timing to produce dependent results? You suggest an idea, and I will develop a test to see whether it is possible on our cluster. But first I need an idea of _how_...
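One way to probe this claim is to simulate timing jitter that is independent from game to game and look for correlation in the simulated results; a sketch with made-up noise parameters, not cluster measurements:

[code]
# Simulation: timing jitter that is independent per game adds noise to
# each result but produces no correlation between games. The noise
# magnitude (sigma = 0.05) is a made-up illustration, not a measurement.
import random

random.seed(1)
N = 25000
results = []
for _ in range(N):
    p = min(max(0.5 + random.gauss(0, 0.05), 0.0), 1.0)  # jitter this game only
    results.append(1 if random.random() < p else -1)

# lag-1 autocorrelation of the result sequence stays ~0
mean = sum(results) / N
var = sum((r - mean) ** 2 for r in results) / N
cov = sum((a - mean) * (b - mean) for a, b in zip(results, results[1:])) / (N - 1)
print(f"lag-1 autocorrelation: {cov / var:.4f}")
[/code]

Independent per-game noise widens the spread of individual results but leaves neighboring games uncorrelated, which is exactly the distinction being argued here.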
"might not be plagued" is the crux of the issue. Anyone can attempt to verify my results by just playing matches. Then we would know. I know that I have gotten the same results on two different and not connected systems. I have gotten the same results using xboard and my referee. The only common elements are:
(1) 40 starting positions used over and over
(2) same 6 program executables, used over and over
(3) same operating system kernel and configuration used over and over
Everything else has varied, from the hardware to the referee program. The programs can't do anything to make results dependent, because after each game all files are removed, and the next game is played with no way to communicate anything, other than perhaps uninitialized hash, which none of the programs I am using suffer from. The operating system could somehow "decide" that if A beats B in game 1, then the next time they play it will bias the game toward either A or B to make the second result depend on the first. That is a bit of a stretch. And if it did that randomly, the results would not be dependent. I can't come up with any even remotely plausible way this could happen. I write that off quicker than you write off the randomness of the results.
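For concreteness, the per-game isolation described above would look something like the following hypothetical sketch (not the actual referee; 'referee' is a stand-in name for the real match driver):

[code]
# Hypothetical sketch (not the actual referee) of the isolation described:
# each game runs from a fresh scratch directory that is deleted afterwards,
# so nothing can persist from one game to the next. 'referee' is a
# stand-in name for the real match driver.
import shutil, subprocess, tempfile

def play_one_game(engine_a, engine_b, position):
    workdir = tempfile.mkdtemp(prefix="game_")
    try:
        out = subprocess.run(
            ["referee", engine_a, engine_b, "--fen", position],
            cwd=workdir, capture_output=True, text=True, check=True)
        return out.stdout.strip()      # e.g. "1-0", "0-1", "1/2-1/2"
    finally:
        shutil.rmtree(workdir)         # remove all files between games
[/code]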
Why would I? These matches last under a day for 25,000 games. Of course they could play stronger on even hours and weaker on odd hours. And if so, _you_ have exactly the same problem.
How exactly did you guard against one of the engines in your test playing stronger on even months, and weaker in odd months?
And now you may well have hit on the crux of the problem. Do you test against Fruit? Or Glaurung 1 or 2? I use both. The rating lists use both. If that is a problem in my testing, it is a problem in _all_ testing. Do you get that important point?

hgm wrote: The probability of such an observation in a single 400-game run is 1 in 12 million (assuming a 20% draw rate). So if you had tried a few million 400-game runs, it would not be very remarkable if you observed it.

So in a set of 400 games from the same position, where the final result is dead even, I can't see 25 consecutive wins or losses? Or I can, but rather infrequently? You seem to say "impossible". Which is wrong, because it happened.
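The 1-in-12-million figure can be checked with a small dynamic program that tracks the current streak of decisive results and accumulates the probability that it ever reaches 25 within 400 games, under the stated assumptions (win and loss probability 0.4 each, draw 0.2):

[code]
# Probability of at least one run of >= 25 consecutive wins (or 25
# consecutive losses) somewhere in a 400-game match between equal
# opponents. Assumptions from the post: P(win) = P(loss) = 0.4,
# P(draw) = 0.2.
from collections import defaultdict

def p_long_streak(n_games=400, k=25, p_win=0.4, p_loss=0.4, p_draw=0.2):
    # state: (sign, run_length); sign is +1 during a win streak,
    # -1 during a loss streak; a draw resets the streak.
    dist = {(0, 0): 1.0}
    absorbed = 0.0            # probability mass where a streak >= k occurred
    for _ in range(n_games):
        nxt = defaultdict(float)
        for (sign, run), p in dist.items():
            for outcome, q in ((+1, p_win), (-1, p_loss), (0, p_draw)):
                if outcome == 0:
                    nxt[(0, 0)] += p * q
                else:
                    new_run = run + 1 if outcome == sign else 1
                    if new_run >= k:
                        absorbed += p * q   # streak reached: absorb
                    else:
                        nxt[(outcome, new_run)] += p * q
        dist = nxt
    return absorbed

p = p_long_streak()
print(f"P = {p:.3g}, i.e. about 1 in {1/p:,.0f}")   # ~1 in 12 million
[/code]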
hgm wrote: If it occurred in just a few hundred 400-game runs, you can be pretty sure something is going on that you don't understand, as the unlikelihood that this was a genuine chance event starts to rival the prior unlikelihood of alternative hypotheses, which is the best you could guarantee no matter how hard you tried. The likelihood of the alternatives would probably not yet exceed the probability of a statistical fluke by so much that you could safely bet your life on it, though.

hgm wrote: That is for you to figure out. They are your processors, after all. When doing a post-mortem to find the cause of death, one usually cuts open the body to look inside. You still expect us to see it from only looking at the outside (= the overall score). Not possible...

bob wrote: Rambling leads nowhere. How can a processor favor one opponent consistently over several games?

hgm wrote: Well, so maybe it is the engines whose strength varies on the time-scale of your runs. Who knows? This line of arguing is not productive. You should look at the data, to extract the nature of the correlation (time-wise, core-wise, engine-wise). Staring at the dead body and speculating about what could not have caused its death is no use....
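And the arithmetic behind the "few hundred runs" remark: the chance of seeing a 1-in-12-million event at least once in R independent runs is 1 - (1 - p)^R, which stays tiny for any realistic R:

[code]
# Chance of ever seeing a 1-in-12-million fluke across R independent
# 400-game runs: 1 - (1 - p)^R. 'p' is taken from the streak calculation.
p = 8.4e-8
for runs in (400, 10_000, 1_000_000):
    print(f"{runs:>9} runs: {1 - (1 - p) ** runs:.2e}")
# a few hundred runs leave the chance at ~3e-5: still so unlikely that a
# fault in the setup becomes the better bet
[/code]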
I doubt you could put together a more consistent test platform if you had unlimited funds.
[quote]
Wrong! Whether your data is noisy or not is never decided by the factory specs and manufacturer's guarantee of your equipment. The data itself is the ultimate proof. If two measurements of the same quantity produce different readings, the difference is by definition noise. What you show us here is low-quality, noisy data. If the test setup producing it promised you low-noise, good-quality data, ask for your money back.
[/quote]

Totally different animal there.
Would you like to see the measured clock frequency for all 260 CPUs on olympus and 560 cores on Ferrum? Want to see valgrind output with cache statistics for each? We've run that kind of stuff. The nodes run a typical cluster system called "Rocks" that simply blasts the O/S image/files to each node on a cold start, so that we do not have to take a week upgrading each node independently. They actually _are_ identical in every way that matters. But even that is irrelevant, as using different machines should not cause problems, or else most rating lists would be worthless, since they do this all the time.