Orthogonal multi-testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Orthogonal multi-testing

Post by Rémi Coulom »

Zach Wegner wrote:Perhaps I misunderstand, but it seemed to me that the regression was secondary, and the fitting of the model was only to predict the absolute maximum and to determine where to sample. IOW, you are trying to fit the quadratic model for each program parameter independently. But, rereading the note on slide 6 now makes it seem like you aren't. So is the algorithm actually trying to fit {a1,b1,c1,a2,b2,c2,...,ai,bi,ci} rather than {p1,p2,...,pi}?

Also, looking at slide 12, it looks like the model takes into account both parameters. How exactly does that happen? Is there a separate fit of x1 for every value of xi?
I am not sure I understand your notation. In one dimension, the regression is a.x²+b.x+c. In dimension 2, it is a11.x1²+a22.x2²+a12.x1.x2+b1.x1+b2.x2+c. The term in x1.x2 makes it possible to take correlation into consideration. I don't fit a model for each parameter independently. In the more general n-dimensional case, the quadratic term is x'Ax, with A an n×n negative definite symmetric matrix.

So, with general quadratic regression, the number of model parameters to be estimated grows like the square of the dimension. It is possible to reduce the complexity by assuming independence between some parameters. This can be done by hand, or guided by statistical lack-of-fit tests that indicate when correlation must be taken into consideration.
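As a concrete sketch of the regression described above (assuming nothing about Rémi's actual implementation): the 2-D model a11.x1²+a22.x2²+a12.x1.x2+b1.x1+b2.x2+c can be fit by ordinary least squares. Here it is fit on synthetic noiseless data; the real tool fits noisy game outcomes.

```python
import numpy as np

# Least-squares fit of the 2-D quadratic model
#   f(x1, x2) = a11*x1^2 + a22*x2^2 + a12*x1*x2 + b1*x1 + b2*x2 + c
# Illustrative only: real data would be noisy win-rate observations.

def fit_quadratic_2d(x1, x2, y):
    # One design-matrix column per model coefficient.
    X = np.column_stack([x1**2, x2**2, x1 * x2, x1, x2, np.ones_like(x1)])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs  # [a11, a22, a12, b1, b2, c]

# Sample a 5x5 grid and evaluate a known concave quadratic
# (negative definite quadratic part, so there is a unique maximum).
g = np.linspace(-1.0, 1.0, 5)
x1, x2 = (a.ravel() for a in np.meshgrid(g, g))
true = np.array([-1.0, -2.0, 0.5, 0.3, -0.1, 4.0])
X = np.column_stack([x1**2, x2**2, x1 * x2, x1, x2, np.ones_like(x1)])
y = X @ true

fitted = fit_quadratic_2d(x1, x2, y)
print(fitted)  # recovers the true coefficients on noiseless data
```

In higher dimensions the design matrix simply gains one column per entry of A, b, and c, which is exactly why the parameter count grows quadratically with the dimension.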

Rémi
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Orthogonal multi-testing

Post by Zach Wegner »

Rémi Coulom wrote:I am not sure I understand your notation. In one dimension, the regression is a.x²+b.x+c. In dimension 2, it is a11.x1²+a22.x2²+a12.x1.x2+b1.x1+b2.x2+c. The term in x1.x2 makes it possible to take correlation into consideration. I don't fit a model for each parameter independently. In the more general n-dimensional case, the quadratic term is x'Ax, with A an n×n negative definite symmetric matrix.

So, with general quadratic regression, the number of model parameters to be estimated grows like the square of the dimension. It is possible to reduce the complexity by assuming independence between some parameters. This can be done by hand, or guided by statistical lack-of-fit tests that indicate when correlation must be taken into consideration.

Rémi
OK, that makes much more sense, thanks. My notation was assuming a fit on a straight multiply of the quadratic functions (a1x1^2*a2x2^2 + a1x1^2*b2x2 + ...). That seems like a much better method though. :)

Zach
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Orthogonal multi-testing

Post by Don »

Rémi Coulom wrote:Hi,

I have been working for a few months on a statistical tool for automatic parameter optimization. It includes multi-parameter optimization of continuous and discrete parameters.

This tool is not ready yet, but if you are curious you can read some slides I presented in Japan earlier this year:
http://remi.coulom.free.fr/QLR/QLRSlides.pdf

The orthogonal model (ie, no correlation between parameters) is very effective if the hypothesis is correct. But if you wish to optimize piece values, I expect it is necessary to take correlations into consideration.

I did a lot of bibliographical research, and it turns out that the problem of automatic parameter optimization from paired comparisons (or binomial/trinomial response) has been studied a lot in the past. In particular, the food industry uses these statistical methods to optimize recipes: they ask people which of two recipes they prefer. From this data, they optimize the amounts of ingredients. This problem is formally equivalent to optimizing a game-playing program from game outcomes.

Some relevant references:

If you can wait a few weeks, I expect to release draft versions of two papers I am currently writing on this topic. I will also release a tool for automatic optimization of xboard and gtp programs.

Rémi
That sounds awesome. I look forward to seeing these papers and the tool too.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Orthogonal multi-testing

Post by Don »

hgm wrote:Suppose you want to evaluate 4 possible changes to your engine, and your target is to play-test them to see if they deliver an improvement of at least 1% (score-wise) with 84% confidence ...
I added orthogonal multi-testing to my arsenal of tools. I already have a pretty sophisticated autotester that I built in C. It knows how to utilize a multiple-core machine, it supports as many programs as I want to configure and each program can run different time controls etc. It is built to be fast and is not graphical.

I think there is one issue that has to be dealt with if you do orthogonal multi-testing, but that depends on how your tester already works. It has to do with how your openings are assigned to the games.

My tester uses a set of almost 4000 openings, each played to a shallow depth of 10 ply. The problem is that the FIRST game is the same for any given pairing - for instance if Joe plays Fred, their first game will be the same as the first game for Jack and Bill. Since the tester does a round robin, if you have a lot of contestants you will be seeing only this one opening until each player has played both sides of it against each opponent.

I think that is a flaw with my tester in general and I don't know how others work, but it's especially bad for orthogonal multi-testing because if you are tuning, for instance, 7 evaluation terms (which is what I am doing now) you will get a LOT of data pretty quickly for each evaluation term (because of the beauty of orthogonal multi-testing), but it will all be based on a very limited set of openings, so the data will be biased.

For instance if the first few openings are double king pawn, you are essentially tuning the evaluation function to play double king pawn openings well unless you are willing to wait (for a very long time) for a huge variety of openings to be played with each combination of players. But the whole idea of multi-testing is an economy of effort, few games need to be played for a given pairing. So even though you might get a lot of data, you lack diversity of openings.

The solution for me was to modify the tester so that the openings look random. Basically I compute a deterministic ordering for each pair, based on a hash of the players. So when Fred plays Joe, they will likely play a different first game than when Horatio plays Phil. But of course they will never repeat the same opening; they will just play them in a different order. Also, even with the same two players the ordering is different when you play white than when you have black. So Joe may play e4 against Fred in his first game as white, but when Fred has white for the first time, he may play d4.

In this way, even with orthogonal multi-testing the first hundred games will represent a huge diversity of opening systems tested with each evaluation term of interest.

I did not want to maintain state between runs, so the tester is able to compute in a deterministic way what the opening sequence is for each color/pairing combination. If anyone is interested in the details, it works like a linear congruential random number generator where I take care to ensure that it has a full period (over the range that encompasses the number of openings I have), and the seed and characteristics of this generator are deterministic based on a hash of the players' names.
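A minimal sketch of that idea in Python - Don's exact generator isn't shown, so the hashing and step-selection details here are assumptions; only the property matters: a deterministic, full-period ordering of the openings for each ordered (white, black) pairing.

```python
import hashlib
from math import gcd

def opening_order(white, black, n_openings):
    # Deterministic seed from the ORDERED pairing, so swapping colors
    # yields a different sequence, as described above.
    digest = hashlib.sha256(f"{white}|{black}".encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    start = seed % n_openings
    # Any step coprime to n_openings visits every index exactly once
    # before repeating: a full period over the opening set.
    step = 1 + (seed >> 16) % (n_openings - 1)
    while gcd(step, n_openings) != 1:
        step += 1
    return [(start + i * step) % n_openings for i in range(n_openings)]

# Every pairing walks the whole opening set, just in its own order,
# and nothing needs to be stored between runs.
order = opening_order("Fred", "Joe", 4000)
```

This uses a permutation-by-coprime-step rather than a literal LCG, but it provides the same full-period guarantee Don mentions, with no state kept between runs.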
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Orthogonal multi-testing

Post by bob »

Don wrote:
hgm wrote:Suppose you want to evaluate 4 possible changes to your engine, and your target is to play-test them to see if they deliver an improvement of at least 1% (score-wise) with 84% confidence ...
I added orthogonal multi-testing to my arsenal of tools. I already have a pretty sophisticated autotester that I built in C. It knows how to utilize a multiple-core machine, it supports as many programs as I want to configure and each program can run different time controls etc. It is built to be fast and is not graphical.

I think there is one issue that has to be dealt with if you do orthogonal multi-testing, but that depends on how your tester already works. It has to do with how your openings are assigned to the games.

My tester uses a set of almost 4000 openings, each played to a shallow depth of 10 ply. The problem is that the FIRST game is the same for any given pairing - for instance if Joe plays Fred, their first game will be the same as the first game for Jack and Bill. Since the tester does a round robin, if you have a lot of contestants you will be seeing only this one opening until each player has played both sides of it against each opponent.

I think that is a flaw with my tester in general and I don't know how others work, but it's especially bad for orthogonal multi-testing because if you are tuning, for instance, 7 evaluation terms (which is what I am doing now) you will get a LOT of data pretty quickly for each evaluation term (because of the beauty of orthogonal multi-testing), but it will all be based on a very limited set of openings, so the data will be biased.

For instance if the first few openings are double king pawn, you are essentially tuning the evaluation function to play double king pawn openings well unless you are willing to wait (for a very long time) for a huge variety of openings to be played with each combination of players. But the whole idea of multi-testing is an economy of effort, few games need to be played for a given pairing. So even though you might get a lot of data, you lack diversity of openings.

The solution for me was to modify the tester so that the openings look random. Basically I compute a deterministic ordering for each pair, based on a hash of the players. So when Fred plays Joe, they will likely play a different first game than when Horatio plays Phil. But of course they will never repeat the same opening; they will just play them in a different order. Also, even with the same two players the ordering is different when you play white than when you have black. So Joe may play e4 against Fred in his first game as white, but when Fred has white for the first time, he may play d4.

In this way, even with orthogonal multi-testing the first hundred games will represent a huge diversity of opening systems tested with each evaluation term of interest.

I did not want to maintain state between runs, so the tester is able to compute in a deterministic way what the opening sequence is for each color/pairing combination. If anyone is interested in the details, it works like a linear congruential random number generator where I take care to ensure that it has a full period (over the range that encompasses the number of openings I have), and the seed and characteristics of this generator are deterministic based on a hash of the players' names.
I don't think that is safe. My tester plays each position twice against all opponents. I would not use a different set of opening positions for each test; that invites bias. In my testing script, I can tell the thing how many positions to use - rather than 4000, if I were doing two tests mixed together, I would cut the number of positions down accordingly.

In a normal test I use about 4,000 positions, vs 4 opponents, double-RR style, so I get 32,000 games for a test. If I test A on, then A off, then B on, then B off, that would be 128,000 games. Using the orthogonal approach, I would use a fraction of the positions, so that I have 4000 games with A on B off, 4000 with A on B on, 4000 with A off B on, and 4000 with A off B off. That gives me 8000 games with A on, 8000 games with A off, ditto for B - 1/2 the games of the original test to get the same accuracy, provided (a) there are two things to test at the same time and (b) they are truly independent.
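The bookkeeping above can be sanity-checked with a tiny enumeration (a sketch using the example's hypothetical 4000-games-per-cell figure):

```python
from itertools import product

GAMES_PER_CELL = 4000                            # games per (A, B) configuration
cells = list(product([True, False], repeat=2))   # A on/off x B on/off

total = GAMES_PER_CELL * len(cells)
games_A_on = sum(GAMES_PER_CELL for a, b in cells if a)
games_B_on = sum(GAMES_PER_CELL for a, b in cells if b)

# Every game is simultaneously evidence about A and about B: the same
# 16,000 games yield an 8,000-vs-8,000 comparison for each factor,
# instead of needing a separate run per factor.
print(total, games_A_on, games_B_on)
```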

I'm not a fan of different positions for different tests: how can you be sure that the positions don't introduce some sort of bias where set a is more dependent on one of the changes than set b?
krazyken

Re: Orthogonal multi-testing

Post by krazyken »

Statistically speaking, the set of opening positions you use in your testing needs to be representative of the openings that you will actually be playing. If the set of positions is representative, there should be no bias whether you choose to use all positions or a random sample of them for testing.
Using the same conditions repeatedly is something that will introduce bias. Introducing bias is not necessarily a bad thing: structured testing can provide faster convergence than random testing; you just have to be careful to check your assumptions to make sure you are testing for the right thing.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Orthogonal multi-testing

Post by Don »

bob wrote: I don't think that is safe. My tester plays each position twice against all opponents. I would not use a different set of opening positions for each test; that invites bias. In my testing script, I can tell the thing how many positions to use - rather than 4000, if I were doing two tests mixed together, I would cut the number of positions down accordingly.
I suspect that you misunderstand what I'm doing. A given player will play both sides of each opening for each opponent. No opening sequences will ever get repeated.

However, what is different is that when I do a new test, the openings will get played in a different order for each pair of players.
In a normal test I use about 4,000 positions, vs 4 opponents, double-RR style so I get 32,000 games for a test.
That's pretty much a perfect description of how I do it, although I sometimes have more or less than 4 players. With OMT I have 128 players as I test 7 different parameters!

If I test A on, then A off, then B on, then B off, that would be 128,000 games. Using the orthogonal approach, I would use a fraction of the positions, so that I have 4000 games with A on B off, 4000 with A on B on, 4000 with A off B on, and 4000 with A off B off. That gives me 8000 games with A on, 8000 games with A off, ditto for B - 1/2 the games of the original test to get the same accuracy, provided (a) there are two things to test at the same time and (b) they are truly independent.
Yes, that's exactly how I do it.
I'm not a fan of different positions for different tests: how can you be sure that the positions don't introduce some sort of bias where set a is more dependent on one of the changes than set b?
I think you are not understanding why I am doing it this way. The way I'm doing it is exactly the same as the way you are doing it, if I allow the full set of 4000 positions to be played with each possible matchup.

But what about the case where you stop early? Let's do the math. I have 7 parameters and 128 combinations of parameters. Each player plays each color against each opponent for each opening. So how many games is that for just 1 opening? Each player plays 127 games as white for a given opening, so we have 128 * 127 = 16256 games. (You might think to divide that by 2, since each game has two players, but because each pairing plays the opening twice, once for white and once for black, the total stays at 16256 games for just a single opening.)

For any given positional term, half of the players have the term ON and half of their opponents have the term OFF, so roughly 1/4 of the games can be used to measure how well that term works. So for a single positional term you have an effective sample size of about 4064 games after you have played just 1 opening!

As you know 4,064 games is not a large enough sample to measure small changes, but it's pretty impressive considering that it was achieved with a single opening! Also, it's at least starting to be enough games to notice major trends or problems.
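That fraction can be checked by enumerating the 2^7 = 128 configurations: counting ordered (white, black) pairs where White has a chosen term ON and Black has it OFF gives 64 × 64 = 4096 of the 16,256 games, almost exactly the quarter estimated above (a sketch, not Don's actual tally):

```python
from itertools import product

# All 128 configurations of 7 binary evaluation terms.
configs = list(product([0, 1], repeat=7))

term = 0  # pick any one term to examine
# Ordered pairs (white, black): 128 * 127 games per opening, since each
# pairing plays both colors and no one plays themselves.
total = sum(1 for w in configs for b in configs if w != b)
# Games that directly contrast the term: White has it ON, Black OFF.
usable = sum(1 for w in configs for b in configs
             if w != b and w[term] == 1 and b[term] == 0)

print(total, usable, usable / total)
```

The mirror-image games (term OFF for White, ON for Black) are equally informative in the other direction, so arguably half of all games bear on each term; either way, a single opening already yields thousands of relevant games.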

Since it's impractical for me to play 16256 * 4000 = 65 million plus games to get a result, I am probably going to stop this test early. In fact, I may have enough data to draw some tentative conclusions after only 3 or 4 openings have been played.

But as you yourself point out, do I really want to draw conclusions about how well some evaluation parameter does using just 4 openings? Even though that is over a 16,000 game sample, it may not be that valid, because those 4 openings may have made the parameter look especially good or bad and I could draw the wrong conclusion.

However, if the starting positions were chosen in a randomly distributed way, after 16,000 games I would know that each positional feature will have been tested with a huge variety of openings. It's not very important that a specific PROGRAM only saw 4 openings against some other specific program (which I think is what you are unduly worried about), but what is much more relevant is that each positional term gets representation over a variety of opening systems.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Orthogonal multi-testing

Post by bob »

krazyken wrote:Statistically speaking, the set of opening positions you use in your testing needs to be representative of the openings that you will actually be playing. If the set of positions is representative, there should be no bias whether you choose to use all positions or a random sample of them for testing.
Using the same conditions repeatedly is something that will introduce bias. Introducing bias is not necessarily a bad thing: structured testing can provide faster convergence than random testing; you just have to be careful to check your assumptions to make sure you are testing for the right thing.
The problem is that "representative" is impossibly vague. Two similar positions can lead a program to dramatically different endgames where different scoring terms will be important. If you use different subsets of positions, you do exactly the same as if you were using an opening book. Randomness goes up and that's exactly the opposite of what we want. I certainly do not want to test A on vs A off using two different sets of positions.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Orthogonal multi-testing

Post by bob »

Don wrote:
bob wrote: I don't think that is safe. My tester plays each position twice against all opponents. I would not use a different set of opening positions for each test; that invites bias. In my testing script, I can tell the thing how many positions to use - rather than 4000, if I were doing two tests mixed together, I would cut the number of positions down accordingly.
I suspect that you misunderstand what I'm doing. A given player will play both sides of each opening for each opponent. No opening sequences will ever get repeated.

However, what is different is that when I do a new test, the openings will get played in a different order for each pair of players.
Here's the question. You play some games with "A" on, and some games with "A" off, to determine whether the new idea "A" is any good. Do you play the _same_ positions for both? If not, that is what I disagree with. If so, then I misunderstood what you were saying.

In a normal test I use about 4,000 positions, vs 4 opponents, double-RR style so I get 32,000 games for a test.
That's pretty much a perfect description of how I do it, although I sometimes have more or less than 4 players. With OMT I have 128 players as I test 7 different parameters!

If I test A on, then A off, then B on, then B off, that would be 128,000 games. Using the orthogonal approach, I would use a fraction of the positions, so that I have 4000 games with A on B off, 4000 with A on B on, 4000 with A off B on, and 4000 with A off B off. That gives me 8000 games with A on, 8000 games with A off, ditto for B - 1/2 the games of the original test to get the same accuracy, provided (a) there are two things to test at the same time and (b) they are truly independent.
Yes, that's exactly how I do it.
I'm not a fan of different positions for different tests: how can you be sure that the positions don't introduce some sort of bias where set a is more dependent on one of the changes than set b?
I think you are not understanding why I am doing it this way. The way I'm doing it is exactly the same as the way you are doing it, if I allow the full set of 4000 positions to be played with each possible matchup.

But what about the case where you stop early?
"stop early"??? I don't do that. I've seen too many cases where something starts off real good, and then drops back to normal or sometimes even worse. And vice-versa. I just run the tests to conclusion and then look at the results. I try not to pay much attention mid-test to avoid drawing a conclusion that might be wrong.
Let's do the math. I have 7 parameters and 128 combinations of parameters. Each player plays each color against each opponent for each opening. So how many games is that for just 1 opening? Each player plays 127 games as white for a given opening, so we have 128 * 127 = 16256 games. (You might think to divide that by 2, since each game has two players, but because each pairing plays the opening twice, once for white and once for black, the total stays at 16256 games for just a single opening.)

For any given positional term, half of the players have the term ON and half of their opponents have the term OFF, so roughly 1/4 of the games can be used to measure how well that term works. So for a single positional term you have an effective sample size of about 4064 games after you have played just 1 opening!

As you know 4,064 games is not a large enough sample to measure small changes, but it's pretty impressive considering that it was achieved with a single opening! Also, it's at least starting to be enough games to notice major trends or problems.

Since it's impractical for me to play 16256 * 4000 = 65 million plus games to get a result, I am probably going to stop this test early. In fact, I may have enough data to draw some tentative conclusions after only 3 or 4 openings have been played.

But as you yourself point out, do I really want to draw conclusions about how well some evaluation parameter does using just 4 openings? Even though that is over a 16,000 game sample, it may not be that valid, because those 4 openings may have made the parameter look especially good or bad and I could draw the wrong conclusion.

However, if the starting positions were chosen in a randomly distributed way, after 16,000 games I would know that each positional feature will have been tested with a huge variety of openings. It's not very important that a specific PROGRAM only saw 4 openings against some other specific program (which I think is what you are unduly worried about), but what is much more relevant is that each positional term gets representation over a variety of opening systems.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Orthogonal multi-testing

Post by Don »

bob wrote: The problem is that "representative" is impossibly vague. Two similar positions can lead a program to dramatically different endgames where different scoring terms will be important. If you use different subsets of positions, you do exactly the same as if you were using an opening book. Randomness goes up and that's exactly the opposite of what we want. I certainly do not want to test A on vs A off using two different sets of positions.
Ok, so you just want to use 1 set of positions. If you test 7 parameters like I'm doing, that means your "set" of positions and the first 16,256 games of your test consist of a SINGLE opening. That doesn't bother you?

Even by the time I test 100,000 samples, I have only used 6 openings.

As I play more and more games, my testing becomes like yours, until it eventually converges to exactly the same test you are doing. But I would never trust the early results of your test, based on a single opening (or a few openings) played over and over again thousands of times.

Let's say I was testing the value of a pawn on e2, just to make up a silly but useful example to illustrate my point. If the first move of my first 6 openings were e3 or e4, then this feature would NEVER be tested until I had played 100,000 games. But with my system I can be sure it's being well exercised even after a few games.