Orthogonal multi-testing

hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Orthogonal multi-testing

Post by hgm »

The changes to be tested simultaneously have to be "orthogonal", i.e. they should not be different changes to the same parameter. If you want to test 10 different parameter values to determine the optimum, you will have to make 10 runs, each with sufficiently many games.

The method can be adapted for doing such tunings for two independent parameters simultaneously, though:

If you want to do 10 gauntlets with different settings 0-9 for parameter A, each gauntlet with 10,000 games (say), you can at the same time run such a test for an independent parameter B: just subdivide the 10,000 games with one setting for A into 10 groups of 1,000 games, each group with a different setting for B. Then you will also have played each setting of B 10,000 times.
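
For concreteness, here is a minimal sketch of that bookkeeping in Python (play_game is a hypothetical stand-in for running one gauntlet game): every game is assigned an (A, B) pair, so each game counts simultaneously toward the totals of one A setting and one B setting.

Code: Select all

from collections import defaultdict
import random

SETTINGS = range(10)     # settings 0-9 for both parameters
GAMES_PER_CELL = 1000    # 10 x 10 x 1000 = 100,000 games in total

def play_game(a, b):
    # Placeholder: play one gauntlet game with parameter A = a and
    # B = b, returning our engine's score (1, 0.5 or 0).
    return random.choice((1, 0.5, 0))

score_a = defaultdict(float)  # total score per setting of A
score_b = defaultdict(float)  # total score per setting of B

for a in SETTINGS:
    for b in SETTINGS:
        for _ in range(GAMES_PER_CELL):
            s = play_game(a, b)
            score_a[a] += s   # this game counts for A = a ...
            score_b[b] += s   # ... and at the same time for B = b

# Each setting of A and of B now has 10 x 1000 = 10,000 games behind it.
games_per_setting = GAMES_PER_CELL * len(SETTINGS)
for a in SETTINGS:
    print(f"A={a}: {score_a[a] / games_per_setting:.3f}")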
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: Orthogonal multi-testing

Post by MattieShoes »

I was thinking one could save time by only doing enough games to determine that a change is bad... All we need to "know" is that it's worse, not exactly how MUCH worse, yes? Once one passes a certain confidence threshold, just give up. Sort of an alpha-beta for testing.

I was thinking that might work well with orthogonal multi-testing as well, since you're most likely just interested in the strongest combination -- one could more intelligently decide which combinations require more testing based on something like the LOS matrix in bayeselo.

But I don't know whether that's statistically sound. It just sounded like a neat way to perhaps allow significant testing for those of us without a cluster... :-) It'd be nifty if there was a program that did something like that automatically.
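
For what it's worth, the "give up once it's probably worse" rule can be sketched with the usual normal-approximation LOS computed from wins and losses (this is not bayeselo's computation, just the quick formula; the 5% threshold is an arbitrary choice):

Code: Select all

import math

def los(wins, losses):
    # Likelihood of superiority under a normal approximation,
    # ignoring draws (the usual quick-and-dirty formula).
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses)
                                 / math.sqrt(2.0 * (wins + losses))))

def keep_testing(wins, losses, give_up_below=0.05):
    # Abort the run as soon as we are ~95% confident the change is worse.
    return los(wins, losses) >= give_up_below

One caveat: checking after every game and stopping at a threshold inflates the false-positive rate, which is exactly the "statistically sound" worry above; a proper sequential test (an SPRT) widens its bounds to account for the repeated looks.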
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Orthogonal multi-testing

Post by bob »

MattieShoes wrote:I was thinking one could save time by only doing enough games to determine that a change is bad... All we need to "know" is that it's worse, not exactly how MUCH worse, yes? Once one passes a certain confidence threshold, just give up. Sort of an alpha-beta for testing.

I was thinking that might work well with orthogonal multi-testing as well, since you're most likely just interested in the strongest combination -- one could more intelligently decide which combinations require more testing based on something like the LOS matrix in bayeselo.

But I don't know whether that's statistically sound. It just sounded like a neat way to perhaps allow significant testing for those of us without a cluster... :-) It'd be nifty if there was a program that did something like that automatically.
The idea is good, but it really doesn't apply here very well. We are not talking about changes that are very good or very bad, but changes that are slightly better or slightly worse. The "slightly" is the killer, as it takes a ton of games to get the error bars small enough to be able to make that determination...
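
To put a rough number on "a ton" (a back-of-the-envelope sketch, assuming a roughly 30% draw rate and scores near 50%):

Code: Select all

import math

def games_needed(elo_resolution, draw_rate=0.30):
    # Games for a ~95% (2 sigma) error bar of +/- elo_resolution.
    # Near a 50% score, d(Elo)/d(score) = 400 / (ln 10 * p*(1-p)) ~ 695,
    # and the per-game score variance is (1 - draw_rate) / 4.
    elo_per_score = 400.0 / (math.log(10.0) * 0.25)
    sd_per_game = math.sqrt((1.0 - draw_rate) / 4.0)
    return math.ceil((2.0 * elo_per_score * sd_per_game / elo_resolution) ** 2)

print(games_needed(5))   # ~13,500 games for +/- 5 Elo
print(games_needed(2))   # ~84,500 games for +/- 2 Elo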
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Orthogonal multi-testing

Post by brianr »

bob wrote:
zamar wrote:Excellent!! Sun shines now a bit more brightly! Thank you for pointing this out :)
But only a bit. :) Still takes a ton of games.

I'm planning on a quick "validity test" for the idea. I'm going to try tweaking piece values. First a reduced number of games as HG suggested where I tweak rook and queen values independently. Then use his approach to compare just queen tweaks and just rook tweaks. And then run a normal test changing just one at a time to see how well the approaches correlate...
I would think that queen and rook values would likely be far more correlated than some other pairs, like queen and isolated pawns for example.

Bob, Crafty has a vast number of eval terms. I respectfully suggest picking another pair.

Brian
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: Orthogonal multi-testing

Post by MattieShoes »

Well, the idea was that since you're testing a bunch of small changes in every possible combination, the difference between best and worst would be much more stark. My own idea doesn't sound quite right to me though, since it uses data from the worst engines when calculating results for the others.

Perhaps it makes more sense to think of it in terms of the changes rather than the multitude of engines. It might quickly become obvious that mod A is bad while B, C, and D are still unclear. One could stop at that point, remove all 8 engines with mod A from the test, and continue with 8 engines instead of 16 to test the other 3 mods further, retaining the games you've already run between the engines without mod A (see the sketch below).

I don't know that it'd help identify those +3 Elo tweaks any faster, but it might save time on those innocent-looking changes that drop strength by 20 points.
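
A sketch of that pruning (hypothetical names; is_confidently_bad could be an LOS check like the one sketched earlier, pooled over all configurations containing the mod):

Code: Select all

from itertools import product

MODS = ("A", "B", "C", "D")

# Every subset of mods is one engine configuration: 2**4 = 16 of them.
configs = [frozenset(m for m, on in zip(MODS, bits) if on)
           for bits in product((0, 1), repeat=len(MODS))]

def is_confidently_bad(mod, results):
    # Placeholder: e.g. pooled "mod on" vs "mod off" LOS below 5%.
    return False

def prune(configs, results):
    # Drop every configuration containing a mod that has already tested
    # as confidently bad; games among the survivors are all retained.
    bad = {m for m in MODS if is_confidently_bad(m, results)}
    return [c for c in configs if not (c & bad)]

# If mod A tests bad, the 8 configurations containing A disappear and
# the 8 survivors keep every game they have already played together.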
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Orthogonal multi-testing

Post by Rémi Coulom »

Hi,

I have been working for a few months on a statistical tool for automatic parameter optimization. It includes multi-parameter optimization of continuous and discrete parameters.

This tool is not ready yet, but if you are curious you can read some slides I presented in Japan earlier this year:
http://remi.coulom.free.fr/QLR/QLRSlides.pdf

The orthogonal model (i.e., no correlation between parameters) is very effective if the hypothesis is correct. But if you wish to optimize piece values, I expect it is necessary to take correlations into consideration.

I did a lot of bibliographical research, and it turns out that the problem of automatic parameter optimization from paired comparisons (or binomial/trinomial response) has been studied a lot in the past. In particular, the food industry uses these statistical methods to optimize recipes: they ask people which of two recipes they prefer. From this data, they optimize the amounts of ingredients. This problem is formally equivalent to optimizing a game-playing program from game outcomes.
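
To make the equivalence concrete: with one parameter, "which recipe do you prefer" and "did the engine win" both reduce to fitting a logistic curve to binary outcomes. A toy quadratic logistic fit (a sketch in the spirit of the slides, not the actual tool; scikit-learn is used only as a convenient fitter, and the data here is a random stand-in for real game results):

Code: Select all

import numpy as np
from sklearn.linear_model import LogisticRegression

# x: parameter value used in each game; y: 1 = win, 0 = loss
# (a draw could be entered as half a win plus half a loss).
x = np.array([50, 75, 100, 125, 150] * 200, dtype=float)
y = (np.random.rand(x.size) < 0.5).astype(int)   # stand-in data

# Quadratic logistic model: logit P(win) = c + b*x + a*x**2,
# the same kind of model the food industry fits to paired comparisons.
X = np.column_stack([x, x ** 2])
model = LogisticRegression().fit(X, y)
(b, a), c = model.coef_[0], model.intercept_[0]

if a < 0:   # concave fit: a single interior optimum exists
    print("estimated optimum:", -b / (2.0 * a))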

Some relevant references:

If you can wait a few weeks, I expect to release draft versions of two papers I am currently writing on this topic. I will also release a tool for automatic optimization of xboard and gtp programs.

Rémi
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Orthogonal multi-testing

Post by bob »

brianr wrote:
bob wrote:
zamar wrote:Excellent!! Sun shines now a bit more brightly! Thank you for pointing this out :)
But only a bit. :) Still takes a ton of games.

I'm planning on a quick "validity test" for the idea. I'm going to try tweaking piece values. First a reduced number of games as HG suggested where I tweak rook and queen values independently. Then use his approach to compare just queen tweaks and just rook tweaks. And then run a normal test changing just one at a time to see how well the approaches correlate...
I would think that queen and rook values would likely be far more correlated than some other pairs, like queen and isolated pawns for example.

Bob, Crafty has a vast number of eval terms. I respectfully suggest picking another pair.

Brian
I actually believe that the entire idea might be flawed with respect to chess. The evaluation terms all interact to produce the best chess moves possible. It is very difficult to pick out any two terms that do not interact in a positive way, until you venture into odd cases where you compare a single eval term against a single search term: the search is more concerned with winning material or forcing mate, while the evaluation is more concerned with optimal piece placement. And even those are still correlated in a way, when you think about it...
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Orthogonal multi-testing

Post by Zach Wegner »

Rémi Coulom wrote:Hi,

I have been working for a few months on a statistical tool for automatic parameter optimization. It includes multi-parameter optimization of continuous and discrete parameters.

This tool is not ready yet, but if you are curious you can read some slides I presented in Japan earlier this year:
http://remi.coulom.free.fr/QLR/QLRSlides.pdf

The orthogonal model (i.e., no correlation between parameters) is very effective if the hypothesis is correct. But if you wish to optimize piece values, I expect it is necessary to take correlations into consideration.

I did a lot of bibliographical research, and it turns out that the problem of automatic parameter optimization from paired comparisons (or binomial/trinomial response) has been studied a lot in the past. In particular, the food industry uses these statistical methods to optimize recipes: they ask people which of two recipes they prefer. From this data, they optimize the amounts of ingredients. This problem is formally equivalent to optimizing a game-playing program from game outcomes.

Some relevant references:

If you can wait a few weeks, I expect to release draft versions of two papers I am currently writing on this topic. I will also release a tool for automatic optimization of xboard and gtp programs.

Rémi
Very interesting. It's too bad that I've spent a lot of time working on my own system to do something along the same lines. OTOH the framework of parameters is very much unique to my program, so I have a bit more flexibility. Just looking at the slides is giving me ideas, so your work will most likely influence mine. I look forward to any tools you come out with, though it might reduce any competitive advantage I have been building up. :)

Also, it appears that assuming quadratic models for the winning probabilities given a parameter is unnecessary, or is at least only there because of computational concerns. Wouldn't any n-degree polynomial be just about as easy to use for the model?
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Orthogonal multi-testing

Post by Rémi Coulom »

Zach Wegner wrote:Also, it appears that assuming quadratic models for the winning probabilities given a parameter is unnecessary, or is at least only there because of computational concerns. Wouldn't any n-degree polynomial be just about as easy to use for the model?
The main problems with n-degree polynomials are that they may have local optima, and they don't scale well to high dimensions. It seems reasonable to assume that there are no local optima; I have never observed any in my experience of parameter optimization. Also, I believe it is important to be able to optimize many parameters at the same time. It is always better than optimizing parameters one by one.

I thought about ways to have a more general model. Maybe a non-linear monotone transformation of each parameter could help. That is to say, use x_i^a_i instead of x_i. A quadratic regression would still have only one maximum, and it adds only one parameter to the model for each variable.
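
A sketch of that model (hypothetical names; assumes positive parameter values so the power transform is well defined):

Code: Select all

import numpy as np

def transformed_quadratic(x, p, A, b, c):
    # logit P(win) for the proposed model: warp each parameter with a
    # monotone power transform y_i = x_i ** p_i (one extra exponent per
    # variable), then take a quadratic form over the warped coordinates.
    # If A is negative definite the surface keeps a single maximum,
    # since a monotone warp of each axis cannot create new local optima.
    y = np.power(x, p)
    return y @ A @ y + b @ y + c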

Rémi
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Orthogonal multi-testing

Post by Zach Wegner »

Rémi Coulom wrote:The main problems with n-degree polynomials are that they may have local optima, and they don't scale well to high dimensions. It seems reasonable to assume that there are no local optima; I have never observed any in my experience of parameter optimization. Also, I believe it is important to be able to optimize many parameters at the same time. It is always better than optimizing parameters one by one.

I thought about ways to have a more general model. Maybe a non-linear monotone transformation of each parameter could help. That is to say, use x_i^a_i instead of x_i. A quadratic regression would still have only one maximum, and it adds only one parameter to the model for each variable.

Rémi
Perhaps I misunderstand, but it seemed to me that the regression was secondary, and the fitting of the model was only to predict the absolute maximum and to determine where to sample. IOW, you are trying to fit the quadratic model for each program parameter independently. But, rereading the note on slide 6 now makes it seem like you aren't. So is the algorithm actually trying to fit {a1,b1,c1,a2,b2,c2,...,ai,bi,ci} rather than {p1,p2,...,pi}?

Also, looking at slide 12, it looks like the model takes into account both parameters. How exactly does that happen? Is there a separate fit of x1 for every value of xi?