margin of error

Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: margin of error

Post by Rémi Coulom »

Don wrote:
Rémi Coulom wrote:That paper is very probably not the best thing to read. I'll try to find that older thread if none of its participants gives us a link to it.

Even if you don't understand the deep theory, there are simple ways to test any stopping method you design: just simulate it. For each Elo point difference (1, 2, 3...) you can run many simulations of your early stopping method, and measure how often it makes the wrong decision.

Rémi
I did exactly that and came up with something pretty useful. One finding that should be pretty obvious once you think about it, but surprising if you don't, is illustrated by the following thought experiment:

Assume that 50% of your experiments are regressions and 50% are improvements, and that in total this is a zero-sum game (the total Elo, summing the regressions and improvements, comes out to zero). Also assume that you can generate these experiments at any rate of speed desired (in other words, no setup time between experiments). What is the ideal number of games per match to run to determine whether to keep a change or not before moving on to the next experiment? The answer is non-intuitive, at least to me.
One?

Rémi
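For anyone who wants to try exactly that, a minimal Python sketch of the measurement loop might look like the following. The stopping rule inside it (accept or reject once the score lead reaches a fixed threshold, otherwise decide on the final score) is only a placeholder, draws are ignored, and every number is an illustrative assumption; the point is the procedure: fix a true Elo difference, replay the rule many times, and count how often it reaches the wrong conclusion.

```python
import random

def win_prob(elo_diff):
    """Logistic Elo model: win probability for the side that is elo_diff stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def decide(elo_diff, max_games=20000, lead=30):
    """One simulated test. Returns True if the rule concludes 'new version is better'."""
    p = win_prob(elo_diff)
    score = 0
    for _ in range(max_games):
        score += 1 if random.random() < p else -1
        if score >= lead:       # far enough ahead: accept early
            return True
        if score <= -lead:      # far enough behind: reject early
            return False
    return score > 0            # ran to completion: decide on the final score

def wrong_rate(elo_diff, trials=1000):
    """How often the rule rejects a version that is truly elo_diff stronger."""
    return sum(1 for _ in range(trials) if not decide(elo_diff)) / trials

if __name__ == "__main__":
    for d in (1, 2, 3, 5, 10):
        print(f"+{d:2d} Elo: wrong decision rate ~ {wrong_rate(d):.3f}")
```

Running the same loop with a negative Elo difference gives the complementary error, the rate at which a regression slips through.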
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Rémi Coulom wrote:
Don wrote:
Rémi Coulom wrote:That paper is very probably not the best thing to read. I'll try to find that older thread if none of its participants gives us a link to it.

Even if you don't understand the deep theory, there are simple ways to test any stopping method you design: just simulate it. For each Elo point difference (1, 2, 3...) you can run many simulations of your early stopping method, and measure how often it makes the wrong decision.

Rémi
I did exactly that and came up with something pretty useful. One finding that should be pretty obvious once you think about it, but surprising if you don't, is illustrated by the following thought experiment:

Assume that 50% of your experiments are regressions and 50% are improvements, and that in total this is a zero-sum game (the total Elo, summing the regressions and improvements, comes out to zero). Also assume that you can generate these experiments at any rate of speed desired (in other words, no setup time between experiments). What is the ideal number of games per match to run to determine whether to keep a change or not before moving on to the next experiment? The answer is non-intuitive, at least to me.
One?

Rémi
Yes! It's one of those things that is not immediately intuitive, but once you think about it, it should be obvious.

But when the fraction of regressions is more than 50%, which is probably the case for most of us, the number of games required to resolve them goes up substantially. It's almost depressing, but if, for example, only 10% of your changes are actual improvements and they are all difficult to measure (small improvements or regressions), then unless you are very careful you are going to be overwhelmed by noise and accept more regressions than improvements. Even if 1 out of 3 experiments is a tiny improvement, you have to play tens of thousands of games.

The conclusion is that if you are at the point where the improvements are pretty small and so are the regressions, the percentage of good vs. bad experiments is a huge factor in the equation. If improvements are few and far between, then 100,000 games is probably not enough. The only shortcut in testing is to be willing to throw out changes that very well may be improvements. In other words, to avoid micro-regressions you must throw out anything that is not a clear improvement, unless you are willing to put in the testing resources.
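As a rough check of the thought experiment, and of the effect described above when good changes become rare, here is a Monte Carlo sketch under assumed numbers: every candidate change is +5 or -5 Elo, a fraction p_good of candidates are improvements, there is no setup cost, and a change is kept when it wins more games than it loses in an n-game match (exact ties decided by a coin flip). With p_good = 0.5 the Elo banked per game played comes out highest at n = 1; dropping p_good to 0.1 shows how the same short matches start accepting a net regression.

```python
import random

def win_prob(elo_diff):
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_per_game(n, delta=5.0, p_good=0.5, game_budget=500_000):
    """Average Elo kept per game played, when each decision uses an n-game match."""
    total_elo = 0.0
    games = 0
    while games < game_budget:
        true_diff = delta if random.random() < p_good else -delta
        p = win_prob(true_diff)
        wins = sum(random.random() < p for _ in range(n))
        games += n
        # Keep the change on a winning score; break exact ties with a coin flip.
        if 2 * wins > n or (2 * wins == n and random.random() < 0.5):
            total_elo += true_diff
    return total_elo / games

if __name__ == "__main__":
    for n in (1, 2, 10, 100, 1000):
        print(f"{n:5d} games per match: Elo kept per game ~ {elo_per_game(n):+.4f}")
    print(f"    1 game  per match, only 10% improvements: "
          f"{elo_per_game(1, p_good=0.1):+.4f}")
```

The per-game numbers are tiny, which is the point: with equal-sized improvements and regressions the single game is the most efficient filter, but as soon as regressions dominate, the same short matches dig a hole.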

So what we did was design a "Wald-like" system based on the simulation I did. The idea is to spend more resources when they are needed to resolve a change, and the only way I know to attach real percentages to the numbers I choose is to run simulations, as you suggested.

So we have a fixed number of openings, and the test runs to completion unless we hit a stopping rule. The stopping rule is that we stop if we are behind by X games or ahead by Y games. It turns out that we can minimize resources even more by modifying X and Y as the games progress, but not by too much. So Y is reduced linearly to 80% of its original value - in other words, if it starts at 100 it will be 80 when the test ends. I'm not sure of the math principle behind this, but it clearly produces better results with less effort on average. I am sure that all I am doing is approximating the (more correct) math behind this. If the test runs to completion without being stopped, we don't keep the change.

One thing I forgot to mention is that we also generated draws using the actual statistics that our equivalent tests are producing. I think the numbers change when the draw ratio changes.

The interesting part of this is that by using statistics from the simulations I can tell you what percentage of false positives or negatives exists in our samples, how many small regressions we are likely to keep, and how many improvements (of a given magnitude) we are likely to discard. Another useful number is what percentage of neutral changes is kept. My thinking is that this number should be low - for example, the straightforward "play 50,000 games and keep it if it scores above 50%" rule will keep a lot of tiny regressions. So if you are not throwing away the really close results, you are keeping a lot of regressions.
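To make that kind of bookkeeping concrete, here is a hedged sketch of a rule with the shape described above: a fixed number of games, reject when behind by X, accept when ahead by Y with Y tapering linearly to 80% of its starting value, draws generated at a fixed assumed ratio, and the change discarded if the run finishes without hitting either bound. Every constant in it (the 20,000-game limit, X = 50, Y = 100, the 64% draw ratio) is an illustrative assumption, not Komodo's actual setting; the output is the keep rate per true Elo difference, from which the false-positive, false-negative and neutral-change percentages can be read off.

```python
import random

MAX_GAMES = 20000    # assumed size of the fixed opening set
REJECT_X = 50        # stop and reject once the net score falls to -X
ACCEPT_Y0 = 100      # starting accept bound; tapers linearly to 80% of this
DRAW_RATIO = 0.64    # assumed draw rate, taken from "equivalent" real tests

def outcome_probs(elo_diff, draw_ratio=DRAW_RATIO):
    """Split the non-draw probability mass according to the logistic Elo model."""
    p_decisive_win = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
    p_win = (1.0 - draw_ratio) * p_decisive_win
    p_loss = (1.0 - draw_ratio) * (1.0 - p_decisive_win)
    return p_win, p_loss

def keep_change(elo_diff):
    """Simulate one test; True means the change would be kept."""
    p_win, p_loss = outcome_probs(elo_diff)
    net = 0
    for g in range(1, MAX_GAMES + 1):
        r = random.random()
        if r < p_win:
            net += 1
        elif r < p_win + p_loss:
            net -= 1
        accept_y = ACCEPT_Y0 * (1.0 - 0.2 * g / MAX_GAMES)   # 100 -> 80 linearly
        if net >= accept_y:
            return True
        if net <= -REJECT_X:
            return False
    return False   # ran to completion without a clear result: discard

def keep_rate(elo_diff, trials=200):
    return sum(keep_change(elo_diff) for _ in range(trials)) / trials

if __name__ == "__main__":
    for d in (-5, -2, 0, 2, 5):
        print(f"true difference {d:+d} Elo: change kept in ~{keep_rate(d):.0%} of tests")
```

The trial count is kept small so the sketch runs in seconds; with realistic bounds it also shows plainly how hard it is to separate a +2 Elo change from a neutral one in this many games.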

This comes down to how many games you are willing to play to decide whether to keep a change, and how much risk you are willing to accept. Regardless of your answer, you should at least attempt to make the most of your resources.

We are not actually using this right now; the simulations are based only on two-player matches, using the model of testing one change against the previous best version. That's not how we usually test, but it was a very interesting experiment.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Margin of error.

Post by Don »

Ajedrecista wrote: I am not an erudite or a bright person on this issue, but I just try to add my grain of salt. My humble tip is that setting a LOS threshold can be double-edged with a small number of games, especially if this threshold is not very high. I also think that you should use a double threshold (for example, LOS < 2% or LOS > 98%), just in case the version you expected to perform better actually performs worse.
Probability and statistics are not my forte either - I have to figure things out as I go and stumble around a bit before coming to the right conclusions. But here is what I know about this:

The LOS should never be interpreted while a test is in progress, and neither should the error margins. For example, we have noticed MANY times in our testing that some version gains a lead or falls behind by more than the stated error margin. The meaning assigned to the error margins and the LOS is based on the assumption that the test will run to completion, not on what may happen in between. If you watch the margins go up and down, you are essentially giving yourself many chances to see these anomalies. So you have to pick just ONE observation point - test completion time. It should not surprise you that if you make 100 different observations while the test runs, some of them will look odd.
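A quick way to see this effect is to simulate two genuinely equal versions and peek at the running score after every game, using the usual normal-approximation 95% margin for a 50% score (draws are left out to keep the sketch short, and the game and trial counts are arbitrary). At the single end-of-test observation point roughly 5% of runs sit outside the margin, as advertised, but a much larger fraction strays outside it at some intermediate point; a continuously watched LOS misbehaves in the same way.

```python
import math
import random

N_GAMES = 5000   # length of one simulated test
TRIALS = 1000    # number of simulated tests

def one_test(n_games=N_GAMES):
    """Return (ever outside the 95% margin mid-test, outside it at the end)."""
    wins = 0
    ever_outside = False
    outside_now = False
    for g in range(1, n_games + 1):
        wins += random.random() < 0.5          # equal opponents, no draws
        margin = 1.96 * math.sqrt(0.25 / g)    # ~95% margin on the score fraction
        outside_now = abs(wins / g - 0.5) > margin
        ever_outside = ever_outside or outside_now
    return ever_outside, outside_now

if __name__ == "__main__":
    ever = end = 0
    for _ in range(TRIALS):
        e, f = one_test()
        ever += e
        end += f
    print(f"outside the 95% margin at the final observation: {end / TRIALS:.1%}")
    print(f"outside the 95% margin at some point mid-test:   {ever / TRIALS:.1%}")
```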

Ajedrecista wrote: I see that self-tests are not the custom of Komodo development (I think that a variety of engines is better), but unfortunately I cannot write anything about non-head-to-head matches, and I hope that you will find better help from other forum members.

Regards from Spain.

Ajedrecista.
The jury is still out on this. You are talking about the so-called transitivity between opponents. If you look up into the sky on one of those days when the sky is blue but there are puffy white clouds, you can see all sorts of shapes, animals, people and other objects. But these are random structures. There have been times when I could swear I have seen intransitivity between players, but all of this is highly subject to statistical noise. What I think I am seeing may not be what I am actually seeing.

So our current thinking on this (always subject to change, of course) is that self-testing exaggerates the differences between players - which is usually a good thing, as it makes it easier to see improvements. Testing against other opponents protects us against intransitivities. But we don't even know whether that is a serious concern, so we do it partly out of superstition. We have no ironclad evidence that we should.

Yes, with thought experiments you can construct scenarios that violate transitivity, where some change makes your program play better against itself but worse against foreign programs. We are interested only in what works for us in a practical sense, as we are pragmatists, and in practice I don't remember ever seeing a result that proves intransitivity beyond statistical uncertainty.

Another issue here is that playing against a couple of foreign programs does not really address this much. Even playing against other programs is basically incestuous - in the big scheme of things all programs play alike, nothing like the way people play. So maybe this is like marrying your cousin instead of your sister; it still does not provide a whole lot of diversity in testing.

Still, we test against foreign opponents primarily out of superstition and the assumption that it is the right thing to do.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.