I am interested in some elaboration. I basically used Monte Carlo to tune the "acceptance" and "rejection" functions. We generally don't expect any single change to be very significant, and I would expect about 95% of our changes to fall within 4 ELO in either direction.
AlvaroBegue wrote:
Here's how I see the problem. Start with a prior distribution of the ELO difference for a proposed change to the engine (people with a lot of experience in automated testing may have a good idea of what this distribution should be).
A testing procedure with early stopping can be described as two functions: an acceptance function A(n), which indicates how many points you need to have after n games to accept the change, and a rejection function R(n), which indicates how many points you need to have after n games to reject the change.
Once you have these two ingredients (prior distribution and testing procedure), you can compute an average ELO improvement per game played (a tiny number). If anyone wants some details of how this could be done, I can try to explain it in some detail.
Now that we know what we are maximizing, some sort of optimization algorithm would let us design the optimal testing procedure.
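To make the quantity being maximized concrete, here is a minimal Monte Carlo sketch of "average ELO gained per game played" for a given prior and a given pair A(n), R(n). It is only an illustration under stated assumptions, not anyone's actual tester: the prior is taken to be normal with mean 0 and standard deviation 2 (so roughly 95% of changes fall within 4 ELO), draws are ignored, a test that reaches the game limit counts as a rejection, and the square-root-shaped bounds and all function names are hypothetical.

```python
import math
import random

def win_prob(elo_diff):
    """Expected score for an Elo difference (logistic model, draws ignored)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def run_test(elo_diff, accept, reject, max_games, rng):
    """Play games until a bound is hit. Returns (accepted, games_played)."""
    p = win_prob(elo_diff)
    points = 0
    for n in range(1, max_games + 1):
        if rng.random() < p:
            points += 1
        if points >= accept(n):      # A(n): enough points to accept
            return True, n
        if points <= reject(n):      # R(n): few enough points to reject
            return False, n
    return False, max_games          # assumption: undecided means rejected

def elo_gain_per_game(prior_sampler, accept, reject, max_games, trials, seed=0):
    """Estimate (total ELO of accepted changes) / (total games played)."""
    rng = random.Random(seed)
    total_gain = 0.0
    total_games = 0
    for _ in range(trials):
        true_elo = prior_sampler(rng)          # draw a change from the prior
        accepted, games = run_test(true_elo, accept, reject, max_games, rng)
        if accepted:
            total_gain += true_elo             # regressions count negatively
        total_games += games
    return total_gain / total_games

if __name__ == "__main__":
    # Hypothetical bounds: accept at +3 sigma, reject at -2 sigma of a fair match.
    accept = lambda n: 0.5 * n + 1.5 * math.sqrt(n)
    reject = lambda n: 0.5 * n - 1.0 * math.sqrt(n)
    prior = lambda rng: rng.gauss(0.0, 2.0)    # assumed prior, see lead-in
    print(elo_gain_per_game(prior, accept, reject, max_games=2000, trials=300))
```

An optimizer would then search over the shapes of `accept` and `reject` to maximize the returned value. With so few trials the estimate is very noisy, so a real run would need far more trials and games.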
When I ran the sim, I was basically just trying to minimize the number of zero ELO changes I would accept, knowing that a percentage of those would be even weaker. I really want to get the acceptance rate for 1 ELO regressions down to close to zero - which means I must reject a lot of 1 ELO improvements.
This has to be tuned or matched against some expectation of the mix of changes we actually submit to the tester. For example, if we submit 100 changes to the tester and 80 of them are 1/2 ELO regressions, we will probably accept enough of them to continually regress the program. How do you deal with that? One idea is to be very stubborn and have a really high acceptance criterion. Another idea is to classify the changes that we submit (which I think you are suggesting here). If an idea is "solid", that reflects our prior belief that it is likely to be a good change (or at least not likely to be damaging). It could be a minor speedup with very little risk. A more speculative change would require a much stricter acceptance criterion.
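As a rough illustration of the 80-regressions example, the sketch below estimates how many of 80 half-ELO regressions would slip through a fixed-length test, and how that changes as the acceptance bar is raised. Everything here is an assumption for illustration: the test length of 20000 games, the normal approximation to the score, the absence of draws, and the function names. It is a simplified fixed-length test, not the early-stopping procedure discussed above.

```python
import math

def expected_score(elo_diff):
    """Expected score for an Elo difference (logistic model, draws ignored)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def accept_probability(elo_diff, games, z_threshold):
    """P(accept) when a change is accepted if its score fraction exceeds 50%
    by z_threshold standard errors (normal approximation)."""
    p = expected_score(elo_diff)
    se_null = 0.5 / math.sqrt(games)            # std error of a 50% score
    cutoff = 0.5 + z_threshold * se_null        # required score fraction
    se = math.sqrt(p * (1.0 - p) / games)       # std error at the true strength
    z = (cutoff - p) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of standard normal

if __name__ == "__main__":
    regressions = 80      # from the example: 80 of 100 submitted changes
    elo_each = -0.5       # each one is a 1/2 ELO regression
    games = 20000         # hypothetical test length
    for z in (0.0, 1.0, 2.0, 3.0):
        prob = accept_probability(elo_each, games, z)
        accepted = regressions * prob
        print(f"z={z}: expected accepted={accepted:.2f}, "
              f"expected ELO lost={-elo_each * accepted:.2f}")
```

Raising `z_threshold` is the "stubborn" option; using a lower threshold only for changes classified as "solid" would correspond to the classification idea.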
