Gerd,
this discussion is
not about randomness in playing games. That point was already settled in a previous thread, by my theoretical prediction of how large these effects would be in engines with various time-management strategies, and by the later tests of Eden and uMax that followed. These were in total agreement with the predictions, and as such that problem can be considered 100% understood. Engines with simple time management and little memory, like Eden and uMax, are about 97% deterministic, and repeat most games. This was of course well known to us; it is so obvious that you cannot miss it, and very annoying when you want to run gauntlets, especially since many of the opponents at this level do not support the GUI commands to start from positions other than the opening.
What the current discussion is about, is this:
Nicolai asked me how much Eden 0.0.14 would have to score in his 26-game RR before it could be considered better than Eden 0.0.13 (which had scored 9 out of 26). Here it turned out that I (applying standard statistical analysis) and Bob have a fundamental difference:
bob wrote:
hgm wrote:
Standard error on 26 games is 2 pts, so for a difference it is 2.8 pts. For 95% confidence this is about twice, or 5.5 pts. (Or was that 97.5%, because this is a one-sided test? I would have to calculate that to be sure.) So an engine equally strong as Eden 0.0.12 would make 15 points in this gauntlet only once in 20 times. That means Cefap and those above it are significantly stronger than Eden 0.0.12, and you could add Zotron to that for Eden 0.0.13.
If you want to be 95% (97.5%) sure that Eden 0.0.14 is better than 0.0.13, it would have to make at least 14 points out of 26, on the first try. For 84% confidence you would only have to be 1 sigma better, i.e. 3 points. I guess I would be happy with that, if it were achieved on the first try.
The main trap is that you are going to keep trying and trying with marginal improvements until you find one that passes. That is cheating: out of 7 tries at the 84% test, you would expect one engine that is merely equal to pass. So after a failed test you really should raise your standards.
the "standard error" might be 2 points. The variance in such a match is _far_ higher. Just play 26 game matches several times. You will _not_ get just a 2 game variance. I've already posted results from several programs including fruit, glaurung, arasan, gnuchess and crafty. So talking about standard error is really meaningless here, as is the +/- elostat output. It just is not applicable based on a _huge_ number of small matches...
Bob claims here that the spread of 26-game match results is much larger than the 2 points I calculated (from 0.4*sqrt(26)), with the consequence that you would err far more often than in 5% of the cases if you accepted the score threshold I calculated (15 out of 26).
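For concreteness, here is a minimal sketch of that calculation in Python. The per-game standard deviation of 0.4 points and the 9 out of 26 for Eden 0.0.13 are the numbers from above; the 2-sigma factor and the resulting threshold are the same rough rounding I used, so treat the output as approximate (it lands at the 14-15 points mentioned earlier).

[code]
# Minimal sketch of the standard-error argument above, assuming a per-game
# standard deviation of about 0.4 points (the figure behind 0.4*sqrt(26)).
import math

N = 26                                  # games in the gauntlet
sigma_game = 0.4                        # assumed std. dev. of one game result (0, 0.5 or 1)
old_score = 9                           # Eden 0.0.13's score in the same gauntlet

se_match = sigma_game * math.sqrt(N)    # ~2.0 points for a single 26-game result
se_diff = math.sqrt(2) * se_match       # ~2.9 points for the difference of two such results

# roughly 2 sigma, i.e. about 97.5% one-sided confidence
threshold = old_score + 2 * se_diff

print(f"standard error of one match  : {se_match:.2f} points")
print(f"standard error of difference : {se_diff:.2f} points")
print(f"score needed out of {N}      : about {threshold:.1f}")
[/code]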
As this is theoretically impossible if the games in a match are independent (i.e. totally random), we were skeptical of this claim. But Bob persistently kept claiming that this is experimental fact, that he sees it happen all the time, and that he can show us experimental data to prove his extraordinary claim. He then showed us the by now famous 80-game match results, where one of the traces contains a deviation of 29 points from the average.
Now if the games within the mini-match are independent, such a result _cannot_ occur more than once every 15,000 times, and there was a second rather unlikely (though not so extreme) value next to it, together making it something that should not occur more than once in a million times. Well, if once every 15,000 times I would accept a change as better because of such a fluke, that would have no impact on my confidence calculation at all, as such 1-in-15,000 events are all included in the 5% error probability we were willing to risk. So either Bob is showing us very untypical data to "prove" his point, thereby suggesting that it is typical (which would be rather unethical, and therefore unlikely), or most of his data really has such large deviations, in which case there _must_ be something wrong with his measurement setup, as such extreme fluctuations are only possible if there is a large and significant correlation between the results of games within a mini-match (which is also extremely unlikely, as they are intended to be independent).
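To make that last point concrete, under the independence assumption you can simply simulate 80-game matches and count how often a given deviation from the expected score occurs. The win/draw/loss probabilities below are illustrative assumptions (a 50% expected score with a 30% draw rate), not Bob's actual data; the exact tail frequencies depend on those numbers, but with anything like them a deviation of 20-30 points essentially never happens.

[code]
# Monte Carlo sketch: how often do independent 80-game matches deviate by a
# given number of points from their expected score?  The win/draw
# probabilities are illustrative assumptions, not Bob's data.
import random

N_GAMES = 80
N_MATCHES = 100_000
P_WIN, P_DRAW = 0.35, 0.30              # assumed; expected score 0.5 per game

def match_score():
    score = 0.0
    for _ in range(N_GAMES):
        r = random.random()
        score += 1.0 if r < P_WIN else (0.5 if r < P_WIN + P_DRAW else 0.0)
    return score

mean = N_GAMES * (P_WIN + 0.5 * P_DRAW)
deviations = [abs(match_score() - mean) for _ in range(N_MATCHES)]

for d in (5, 10, 15, 20, 29):
    count = sum(dev >= d for dev in deviations)
    # a count of 0 just means "rarer than about 1 in N_MATCHES"
    print(f"|deviation| >= {d:2d} points: {count} out of {N_MATCHES}")
[/code]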
So I know something stinks here, and I ask Bob if this is perhaps a hypothetical case. For some reason Bob didn't like this, and saw fit to respond with rude ad hominems. Now that it turns out these 4 traces were indeed _very_ atypical, you can draw your own conclusions...