Dann Corbit wrote: Fri Jul 03, 2020 11:52 pm
I am addressing the question pieces at a time, and the draws are an important facet. I was just making the point that you don't believe in LOS either, because you do not trust the LOS number.
I mean, if LOS says that the change is almost certain to be an improvement, why wouldn't you use it?
Clearly, you believe that randomness is involved and therefore the 8 wins are not enough data to make a sound decision.
Of course randomness is involved in any match between non-deterministic engines. That does not mean no statistically sound conclusion can be drawn.
If all I know is 8-0-2 (selected without bias) and I can only accept or reject the patch that just changes some parameter values, I would accept it. It is more likely than not an improvement and I don't care whether that parameter is set to 5 or to 6.
However, where did that 8-0-2 come from? If someone was running a test and interrupted it after 10 games BECAUSE he saw 8-0-2, then the result is basically meaningless. Likewise, if someone has been running a variety of tests and one 10-game match just happens to give 8-0-2, I would decide to run my own tests. It is easy to cheat with statistics inadvertently.
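To make the "interrupted it BECAUSE he saw 8-0-2" problem concrete, here is a rough simulation sketch in Python. Everything in it is my own assumption for illustration: the two engines are exactly equal, only decided games are simulated (draws are left out for simplicity), the watcher gives up after at most 100 decided games, and the function names are made up. It compares how often "A won 8 decided games in a row" gets reported under a plan fixed in advance versus a plan of stopping as soon as the score looks good.

```python
import random

def fixed_run(rng, decided=8):
    """Play exactly `decided` decided games; True if A wins all of them."""
    return all(rng.random() < 0.5 for _ in range(decided))

def peeking_run(rng, max_decided=100, streak=8):
    """Watch the match and stop as soon as A has won `streak` decided games in a row."""
    current = 0
    for _ in range(max_decided):
        if rng.random() < 0.5:      # A wins this decided game
            current += 1
            if current == streak:
                return True         # stop here and report the "impressive" run
        else:                       # A loses, the streak starts over
            current = 0
    return False

def false_positive_rate(run, trials=20000, seed=1):
    """Fraction of simulated matches between EQUAL engines that report a streak."""
    rng = random.Random(seed)
    return sum(run(rng) for _ in range(trials)) / trials

fixed = false_positive_rate(fixed_run)
peeking = false_positive_rate(peeking_run)
print(f"plan fixed upfront (8 decided games): {fixed:.4f}")
print(f"stop when it looks good:              {peeking:.4f}")
```

The first rate should sit near 1/256, while the second should come out far larger, even though both runs end up reporting the same-looking streak.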
And yet (I believe) the 8-0-2 LOS is more sound than an 8-0-1000 LOS because the first set of measurements is a better indicator that A is probably stronger than B than the second set. We want to know a binary answer "A is stronger, true or false". The draws give evidence that we did not have before we collected all the draws.
If the hypothesis is that A and B are equally strong and you run a match until there are 8 decided games, then the number of draws is no indicator of whether the hypothesis should be rejected or not. Only the W/L ratio is. I have explained this already.
Each decided game ends in a win or a loss, each with probability 0.5 under that hypothesis. The probability of 8 wins out of 8 decided games is therefore (1/2)^8 = 1/256. The number of games played does not influence that. You cannot speak of "error margins" with respect to the calculated probability 1/256. (I am ignoring white/black advantage to keep things simple.) Of course a 1/256 event can happen just by chance: it will happen if you repeat the experiment often enough, or it can appear because you first ran the match and only after seeing the outcome decided what pattern to look for. Doing statistics right is very tricky. But the number of draws makes no difference. (In particular if you decide upfront to throw out the draws! Obviously it is entirely sound to decide to run an experiment that throws out the draws and to try to draw a statistical conclusion based just on the W/L ratio.)
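For anyone who wants to check the arithmetic, the calculation can be sketched in a few lines of Python. This is only an illustration under the same simplifications as above (equal engines, white/black advantage ignored); the function names are made up for this post. The draw count is accepted as an argument and then deliberately not used, which is the whole point.

```python
from fractions import Fraction
from math import comb

def prob_at_least_wins(wins, losses, p_win=Fraction(1, 2)):
    """Chance of at least `wins` wins among wins+losses decided games
    when each decided game is won with probability p_win."""
    n = wins + losses
    return sum(comb(n, k) * p_win**k * (1 - p_win)**(n - k)
               for k in range(wins, n + 1))

def one_sided_p_value(wins, draws, losses):
    """Draws are thrown out on purpose: only the W/L split enters."""
    return prob_at_least_wins(wins, losses)

print(one_sided_p_value(8, 2, 0))     # 1/256
print(one_sided_p_value(8, 1000, 0))  # 1/256
```

Exact fractions are used instead of floats so that the result is literally 1/256 for both 8-0-2 and 8-0-1000.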