Rémi Coulom wrote:
Don wrote:
Rémi Coulom wrote:
That paper is very probably not the best thing to read. I'll try to find that older thread if none of its participants gives us a link to it.
Even if you don't understand deep theory, there are simple ways to test any stopping method you design: just simulate it. For each Elo point of difference (1, 2, 3...) you can run many simulations of your early stopping method and measure how often it makes the wrong decision.
Rémi
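Such a simulation is easy to set up. Here is a minimal sketch under my own illustrative assumptions: a logistic Elo-to-win-probability model, draws ignored, and a toy "stop when the net win count leaves a fixed band" rule (not the method discussed further down). It estimates how often the rule picks the weaker side for a given Elo difference:

```python
import random

def win_prob(elo_diff):
    # Logistic Elo model: probability that the candidate beats the baseline.
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def wrong_decision_rate(elo_diff, margin=20, max_games=10000, trials=2000):
    """Estimate how often a toy stopping rule picks the weaker side.

    The rule stops as soon as |wins - losses| reaches `margin` and
    declares whichever side is ahead the winner.  Draws are ignored."""
    p = win_prob(elo_diff)
    wrong = 0
    for _ in range(trials):
        net = 0  # wins minus losses for the candidate
        for _ in range(max_games):
            net += 1 if random.random() < p else -1
            if abs(net) >= margin:
                break
        # The candidate really is stronger whenever elo_diff > 0.
        if (net > 0) != (elo_diff > 0):
            wrong += 1
    return wrong / trials

random.seed(1)
print(wrong_decision_rate(5))   # tiny 5 Elo edge: often decided wrongly
print(wrong_decision_rate(50))  # 50 Elo edge: almost always decided correctly
```

Running this over a range of Elo differences shows the basic trade-off: small edges need either a much wider band (more games) or a tolerance for frequent wrong decisions.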
I did exactly that and came up with something pretty useful. One finding that should be pretty obvious when you think about it, but is surprising if you don't, is illustrated by the following thought experiment:
Assume that 50% of your experiments are regressions and 50% are improvements, and that in total this is a zero-sum game (the total Elo, summing the regressions and improvements, comes out to zero). Also assume that you can generate these experiments at any rate of speed desired (in other words, no setup time between experiments). What is the ideal number of games per match to run to determine whether to keep a change or not before moving on to the next experiment? The answer is non-intuitive, at least to me.
One?
Rémi
Yes! It was one of those things that are not immediately intuitive but once you think about it, yes, it should be obvious.
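The claim can be checked with a small exact calculation rather than a simulation. Assuming the zero-sum pool from the thought experiment (every experiment is +d or -d Elo with equal probability, here d = 5 as an arbitrary choice), a logistic Elo model with no draws, and a "keep the change if it wins a strict majority of its n-game match" rule, the expected Elo gained per game played is largest at n = 1:

```python
import math

def win_prob(elo_diff):
    # Logistic Elo model, draws ignored.
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def p_keep(p, n):
    # Probability of winning a strict majority of n games (n odd, so no ties).
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def elo_gain_per_game(n, d=5.0):
    # Each experiment is +d or -d Elo with equal probability (zero-sum pool);
    # we keep the change when it wins the majority of its n-game match.
    p = win_prob(d)
    expected_gain = 0.5 * d * p_keep(p, n) - 0.5 * d * p_keep(1 - p, n)
    return expected_gain / n

for n in (1, 3, 5, 9, 99):
    print(n, elo_gain_per_game(n))  # gain per game shrinks as n grows
```

Intuitively, longer matches do filter better, but the filtering quality grows only like the square root of n while the cost grows linearly, so under these assumptions the fastest progress per game comes from one-game matches.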
But when the number of regressions is more than 50%, which is probably the case for most of us, the number of games required to resolve them goes up substantially. It's almost depressing: if, for example, only 10% of your changes are actual improvements and they are all difficult to measure (small improvements or regressions), then unless you are very careful you are going to be overwhelmed by noise and accept more regressions than improvements. Even if 1 out of 3 experiments is a tiny improvement, you have to play tens of thousands of games.
The conclusion is that if you are at the point where improvements are pretty small, and so are the regressions, the percentage of good vs. bad experiments is a huge factor in the equation. If improvements are few and far between, then 100,000 games is probably not enough. The only shortcut to testing is to be willing to throw out changes that very well may be improvements. In other words, to avoid micro-regressions you must throw out anything that is not a clear improvement, unless you are willing to put in the testing resources.
So what we did was design a "Wald-like" system based on the simulation I did. The idea is to spend more resources only when you need them to resolve a change, and the only way I know to attach real percentages to the numbers I chose is to run simulations as you suggested.
So we have a fixed number of openings, and the test runs to completion unless we hit a stopping rule. The stopping rule is that we stop if we are behind by X games or ahead by Y games. It turns out that we can minimize resources even more by modifying X and Y as the games progress, but not by too much. So Y is reduced linearly to 80% of its original value: in other words, if we start at 100, it will be 80 when the match ends. I'm not sure of the math principle behind this, but it clearly produces better results with less effort on average. I am sure that all I am doing is approximating the (more correct) math behind this. If the test runs to completion without either bound being hit, we don't keep the change.
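A rough sketch of a rule of that shape follows. The bound values, the win/draw probabilities, and the linear shrink of the accept bound to 80% are illustrative placeholders, not the actual numbers used:

```python
import random

def run_match(p_win, p_draw, n_games, behind_x=60, ahead_y=100):
    # Stop and reject as soon as the candidate falls behind_x games behind;
    # stop and accept as soon as it is ahead by the shrinking accept bound,
    # which decays linearly from ahead_y down to 0.8 * ahead_y over the match.
    net = 0
    for g in range(1, n_games + 1):
        r = random.random()
        if r < p_win:
            net += 1
        elif r >= p_win + p_draw:
            net -= 1  # a loss; draws leave the net score unchanged
        accept_bound = ahead_y * (1.0 - 0.2 * g / n_games)
        if net <= -behind_x:
            return "reject", g
        if net >= accept_bound:
            return "accept", g
    # Ran to completion without hitting a bound: the change is not kept.
    return "reject", n_games

random.seed(0)
print(run_match(0.45, 0.30, 2000))  # clearly stronger candidate
print(run_match(0.25, 0.30, 2000))  # clearly weaker candidate
```

A clearly better or clearly worse change hits a bound after a few hundred games, while a near-neutral change tends to wander between the bounds and be discarded at the end, which is exactly the "spend resources only where needed" behavior described above.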
One thing I forgot to mention is that we also generated draws, using the actual draw statistics that our equivalent real tests produce. I think the numbers change when the draw ratio changes.
The interesting part of this is that by using statistics from the simulations I can tell you what percentage of false positives or false negatives exist in our samples, how many small regressions we are likely to keep, and how many improvements (of a given magnitude) we are likely to discard. Another useful number is what percentage of neutral changes are kept. My thinking is that this number should be low; for example, the straightforward "play 50,000 games and keep it if it scores above 50%" rule will keep a lot of tiny regressions. So if you are not throwing away the really close results, you are keeping a lot of regressions.
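For the naive "play N games and keep it if it scores above 50%" rule, the keep rate for a neutral or slightly negative change can be estimated with a quick normal approximation. This is my own back-of-the-envelope sketch: draws are ignored, which only makes the variance estimate rough.

```python
import math

def keep_rate_naive(elo_diff, n_games=50000):
    # Normal approximation to P(win count > n/2) under the naive rule
    # "play n_games and keep the change if it scores above 50%".
    # Draws are ignored here, so the variance is only approximate.
    p = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
    mean = n_games * p
    sd = math.sqrt(n_games * p * (1 - p))
    z = (0.5 * n_games - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

print(keep_rate_naive(0.0))   # a perfectly neutral change is kept half the time
print(keep_rate_naive(-1.0))  # even a -1 Elo regression is kept fairly often
```

This makes the point concrete: with the naive rule, a dead-neutral change is kept exactly half the time, and even a small regression slips through a substantial fraction of the time unless you demand a margin above 50%.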
This comes down to how many games you are willing to play to decide whether to keep a change, and how much risk you are willing to accept. Regardless of your answer, you should at least attempt to make the most of your resources.
We are not actually using this right now; the simulations are based only on two-player matches, using the model of testing one change against the previous best version. That's not how we usually test. But it was a very interesting experiment.