Orthogonal multi-testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

krazyken

Re: Orthogonal multi-testing

Post by krazyken »

bob wrote:
krazyken wrote:Statistically speaking, the set of opening positions you use in your testing needs to be representative of the openings that you will actually be playing. If the set of positions is representative, there should be no bias as to whether you choose to use all positions, or a random sample of those positions for testing.
Using the same conditions repeatedly is something that will introduce bias. Introducing bias is not necessarily a bad thing: structured testing can provide faster convergence than random testing; you just have to be careful to check your assumptions to make sure you are testing for the right thing.
The problem is that "representative" is impossibly vague. Two similar positions can lead a program to dramatically different endgames where different scoring terms will be important. If you use different subsets of positions, you do exactly the same as if you were using an opening book. Randomness goes up and that's exactly the opposite of what we want. I certainly do not want to test A on vs A off using two different sets of positions.
Randomness is not a bad thing. Many of the statistical functions you use to evaluate changes are based on assumptions of randomness in the data. If you set up a run with the same positions and opponents, where the opening position, opponent, and color are all randomly chosen for each game, and still played 32,000 games, you would be able to come to the same conclusions with the same accuracy as one of your normal runs. The bias for both setups would be the same: the selection of opponents and positions to choose from.
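A minimal Python sketch of such a fully randomized schedule might look like this (the position and opponent pools here are just placeholders):

Code:

import random

# Placeholder pools; in practice these would be the same fixed FEN set and
# the same opponent list used for every run.
positions = ["fen-1", "fen-2", "fen-3"]
opponents = ["Opponent-A", "Opponent-B", "Opponent-C"]

# For each of the 32,000 games, draw the opening, the opponent and the color
# the test engine plays, all independently at random.
schedule = [(random.choice(positions),
             random.choice(opponents),
             random.choice(["white", "black"]))
            for _ in range(32000)]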
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Orthogonal multi-testing

Post by Rémi Coulom »

krazyken wrote:Randomness is not a bad thing. Many of the statistical functions you use to evaluate changes are based on assumptions of randomness in the data. If you set up a run with the same positions and opponents, where the opening position, opponent, and color are all randomly chosen for each game, and still played 32,000 games, you would be able to come to the same conclusions with the same accuracy as one of your normal runs. The bias for both setups would be the same: the selection of opponents and positions to choose from.
I agree that bias will be the same, but variance may be bigger. It may add noise if there is a big difference between the playing strengths of opponents.
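As a rough illustration (the 0.7 and 0.3 expected scores, and the absence of draws, are made-up assumptions), a small Monte Carlo in Python shows the extra run-to-run noise from picking the opponent at random instead of alternating:

Code:

import random

P = [0.7, 0.3]          # assumed expected score vs. a weak and a strong opponent
N_GAMES = 1000
N_RUNS = 2000

def one_run(random_opponents):
    total = 0.0
    for g in range(N_GAMES):
        # fixed schedule alternates opponents; random schedule draws one per game
        opp = random.randrange(2) if random_opponents else g % 2
        total += 1.0 if random.random() < P[opp] else 0.0
    return total / N_GAMES

def spread(random_opponents):
    scores = [one_run(random_opponents) for _ in range(N_RUNS)]
    mean = sum(scores) / N_RUNS
    return (sum((s - mean) ** 2 for s in scores) / (N_RUNS - 1)) ** 0.5

print("run-to-run std dev, fixed schedule:  ", spread(False))
print("run-to-run std dev, random opponents:", spread(True))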

Rémi
nevatre
Posts: 16
Joined: Fri Aug 01, 2008 6:09 pm
Location: UK

Re: Orthogonal multi-testing

Post by nevatre »

Don,

Did you know it is possible to test 7 evaluation terms with fewer than 128 combinations of parameters if you adopt a fractional factorial design, or other designs such as Plackett-Burman designs? For example, a quarter fraction would allow you to estimate the effects of all your parameters, and low-order interactions between them, with 32 combinations.

With the 128 combinations you are doing enough testing to accurately estimate a lot of higher-order interaction terms, and those terms are often negligible. For example, with the 128 combinations you are doing work to estimate one 7th-order term, 7 6th-order terms, 21 5th-order, 35 4th-order, and so on. It is very likely (unless your program is already close to optimal) that those terms are close to zero. With a quarter fraction (32 combinations), you can do enough to accurately estimate the effects of your 7 evaluation terms and most of the low-order interactions between them.
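Such a quarter fraction is easy to generate. Here is a rough Python sketch; the generators F = ABCD and G = ABDE are one standard resolution-IV choice, chosen here only for illustration:

Code:

from itertools import product

# Quarter fraction of a 2^7 design: full factorial in the first five factors,
# with the last two defined by the generators F = ABCD and G = ABDE.
runs = []
for a, b, c, d, e in product([-1, 1], repeat=5):
    f = a * b * c * d
    g = a * b * d * e
    runs.append((a, b, c, d, e, f, g))

print(len(runs))   # 32 combinations instead of 128
# Each row says which of the 7 evaluation terms to switch ON (+1) or OFF (-1)
# for one test version.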
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Orthogonal multi-testing

Post by Don »

nevatre wrote:Don,

Did you know it is possible to test 7 evaluation terms with fewer than 128 combinations of parameters if you adopt a fractional factorial design, or other designs such as Plackett-Burman designs? For example, a quarter fraction would allow you to estimate the effects of all your parameters, and low-order interactions between them, with 32 combinations.

With the 128 combinations you are doing enough testing to accurately estimate a lot of higher-order interaction terms, and those terms are often negligible. For example, with the 128 combinations you are doing work to estimate one 7th-order term, 7 6th-order terms, 21 5th-order, 35 4th-order, and so on. It is very likely (unless your program is already close to optimal) that those terms are close to zero. With a quarter fraction (32 combinations), you can do enough to accurately estimate the effects of your 7 evaluation terms and most of the low-order interactions between them.
It sounds interesting. Are there good papers on it? I'll see what I can find.
nevatre
Posts: 16
Joined: Fri Aug 01, 2008 6:09 pm
Location: UK

Re: Orthogonal multi-testing

Post by nevatre »

Not papers, but a good reference is "Statistics for Experimenters" by Box, Hunter and Hunter.

However, I am having second thoughts about fractions saving effort in chess testing, because I think the total number of games needs to stay the same in order to estimate the parameters with the same precision. With a fraction there will be fewer experimental runs, but more games (starting positions) per run, to keep the total number of games the same: if, say, the full 128-combination design plays 250 games per combination (32,000 in total), a 32-combination quarter fraction would need about 1,000 games per combination to reach the same total. It might still be worth attempting in order to get a larger sample of starting positions.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Orthogonal multi-testing

Post by bob »

MattieShoes wrote:Well the idea was that since you're testing a bunch of small changes with every possible combination, the difference between best and worst would be much more stark. My own idea doesn't sound quite right to me though, since it's using data from the worst engine's results when calculating the others.

Perhaps it makes more sense to think of it in terms of the changes rather than the multitude of engines. It might quickly become obvious that mod A is bad while B, C, and D are still unclear. One could stop at that point and remove all 8 engines with mod A from the test, then continue with 8 engines instead of 16 to test the other 3 mods further, retaining the games you've already run between the engines without mod A.

I don't know that it'd help identify those +3 Elo tweaks any faster, but it might save time on those innocent-looking changes that drop strength 20 points.
In the past year of testing, no single feature has accounted for more than 10 Elo in change. Most are much smaller, from 1-2 to 4-5 at best. To measure a change of 4-5 Elo, you need well beyond 50K games. The error bar for 32,000 games is +/-5. If the changes were big, things would be different. Unfortunately almost all are not big...
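As a rough back-of-the-envelope check (assuming a per-game score standard deviation of 0.5, i.e. no draws, and a 95% interval; draws and a different confidence level change the exact numbers):

Code:

import math

def elo_error_bar(n_games, score_sd=0.5, z=1.96):
    # standard error of the average score, converted to Elo near a 50% score
    se_score = score_sd / math.sqrt(n_games)
    elo_per_score_point = 400.0 / math.log(10) / 0.25   # slope of the Elo curve at p = 0.5
    return z * se_score * elo_per_score_point

for n in (1000, 10000, 32000, 100000):
    print(n, "games -> +/-", round(elo_error_bar(n), 1), "Elo")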
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Orthogonal multi-testing

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:Statistically speaking, the set of opening positions you use in your testing needs to be representative of the openings that you will actually be playing. If the set of positions is representative, there should be no bias as to whether you choose to use all positions, or a random sample of those positions for testing.
Using the same conditions repeatedly is something that will introduce bias. Introducing bias is not necessarily a bad thing: structured testing can provide faster convergence than random testing; you just have to be careful to check your assumptions to make sure you are testing for the right thing.
The problem is that "representative" is impossibly vague. Two similar positions can lead a program to dramatically different endgames where different scoring terms will be important. If you use different subsets of positions, you do exactly the same as if you were using an opening book. Randomness goes up and that's exactly the opposite of what we want. I certainly do not want to test A on vs A off using two different sets of positions.
Randomness is not a bad thing. Many of the statistical functions you use to evaluate changes are based on assumptions of randomness in the data. If you set up a run with the same positions and opponents, where the opening position, opponent, and color are all randomly chosen for each game, and still played 32,000 games, you would be able to come to the same conclusions with the same accuracy as one of your normal runs. The bias for both setups would be the same: the selection of opponents and positions to choose from.
If it were truly random, I would agree. But if set A tends toward result A', while set B tends toward result B', you have a problem. With many opponents, a small set of starting positions will produce plenty of games, but the performance in A' can be significantly different from the performance in B', solely because of the positions chosen. There is already a ton of randomness in how programs play, because of the timing issue I have discussed many times here. I feel like the more stable everything is, run-to-run, the better able I am to measure the effect of the change I am working on.
krazyken

Re: Orthogonal multi-testing

Post by krazyken »

bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:Statistically speaking, the set of opening positions you use in your testing needs to be representative of the openings that you will actually be playing. If the set of positions is representative, there should be no bias as to whether you choose to use all positions, or a random sample of those positions for testing.
Using the same conditions repeatedly is something that will introduce bias. Introducing bias is not necessarily a bad thing: structured testing can provide faster convergence than random testing; you just have to be careful to check your assumptions to make sure you are testing for the right thing.
The problem is that "representative" is impossibly vague. Two similar positions can lead a program to dramatically different endgames where different scoring terms will be important. If you use different subsets of positions, you do exactly the same as if you were using an opening book. Randomness goes up and that's exactly the opposite of what we want. I certainly do not want to test A on vs A off using two different sets of positions.
Randomness is not a bad thing. Many of the statistical functions you use to evaluate changes are based on assumptions of randomness in the data. If you set up a run with the same positions and opponents, where the opening position, opponent, and color are all randomly chosen for each game, and still played 32,000 games, you would be able to come to the same conclusions with the same accuracy as one of your normal runs. The bias for both setups would be the same: the selection of opponents and positions to choose from.
If it were truly random, I would agree. But if set A tends toward result A', while set B tends toward result B', you have a problem. With many opponents, a small set of starting positions will produce plenty of games, but the performance in A' can be significantly different from the performance in B', solely because of the positions chosen. There is already a ton of randomness in how programs play, because of the timing issue I have discussed many times here. I feel like the more stable everything is, run-to-run, the better able I am to measure the effect of the change I am working on.
Yeah, the way you are working is fine; large numbers work. Increasing the pool of opponents is about the only place you have left to go, but as you have said there are so few that fit your needs. The truth is, if you design your tests correctly you can reach your conclusions in far fewer games than random trials require, but that takes far more preparation work.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Orthogonal multi-testing

Post by bob »

Don wrote:
bob wrote: The problem is that "representative" is impossibly vague. Two similar positions can lead a program to dramatically different endgames where different scoring terms will be important. If you use different subsets of positions, you do exactly the same as if you were using an opening book. Randomness goes up and that's exactly the opposite of what we want. I certainly do not want to test A on vs A off using two different sets of positions.
Ok, so you just want to use 1 set of positions. If you test 7 parameters like I'm doing, that means your "set" of positions and the first 16,526 games of your test consist of a SINGLE opening. That doesn't bother you?

No, because I never do that. If I am going to restrict something to play fewer games, as when I want to test very long time controls, I would reduce the number of opponents rather than the number of starting positions. Or I would use a special set of positions I made that has fewer starting positions, but is still uniformly spread across all openings played in my huge PGN collection. For the record, here's how I created my opening position set.

1. I had Crafty read through several million games in a PGN collection. At move 11 for White, it spits out the FEN for that position, so I end up with several million FEN positions. I then sort them and run them through uniq -c, which gives me each unique FEN along with the number of times it occurred. I sort again on that counter, so that I have the positions ordered from most frequent to least frequent. I then take the first N of those (in my case, I used some threshold I don't recall, such as "keep all positions with a count >= 4"). I can make a smaller set that is still representative by raising the 4 to something bigger, if needed. But for all my testing, I use the same set of positions and the same set of opponents.
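Something along these lines, sketched in Python with the python-chess library instead of Crafty (the file name and the 20-half-move cutoff for "White's move 11" are illustrative):

Code:

import chess.pgn
from collections import Counter

counts = Counter()
with open("games.pgn") as pgn:                     # assumed input file
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:
            break
        board = game.board()
        for ply, move in enumerate(game.mainline_moves()):
            board.push(move)
            if ply == 19:                          # 20 half-moves played: White to move at move 11
                counts[board.epd()] += 1           # position without move counters
                break

# Keep every position seen at least MIN_COUNT times, most frequent first.
MIN_COUNT = 4
positions = [pos for pos, n in counts.most_common() if n >= MIN_COUNT]
print(len(positions))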

If I want to run very long games, I do as mentioned above to give me a smaller set of more frequently played positions, re-calibrate the control version against those positions and opponents, then start testing the modified versions.

I never test with a single opening. In fact, I never test with as few as 500 different positions. The smallest set I have used is 1,000 of the most popular positions. I don't terminate tests early, because I see way too many odd cases where after 1,000 games the new version looks horrible (or great), while after 32,000 games it is slightly better (or worse).

Even by the time I test 100,000 samples, I have only used 6 openings.
Again, I do not test like this. I run each change as a separate trial and play all games with version X before going on to version Y and version Z. I'd rather get an accurate assessment of X as soon as possible than get good results for X, Y and Z at the same time but require 3x as long to do so.

As I play more and more games, my testing becomes like yours until it eventually converges to the same exact test you are doing. But I would never trust the early results of your test based on a single opening (or few openings) played over and over again thousands of times.
I would never trust the early results of _any_ test, so I assume we agree there.

Let's say I was testing the value of a pawn on e2, just to make up a silly but useful example to illustrate my point. If the first move of my first 6 openings was e3 or e4, then this feature would NEVER be tested until I had played 100,000 games. But with my system I can be sure it's being well exercised even after a few games.
But you have the same problem. Most of your games don't show that effect, and drown out the results from the 7th position until you use enough of those to outweigh the wild variance a few games will show.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Orthogonal multi-testing

Post by Don »

Even by the time I test 100,000 samples, I have only used 6 openings.
Again, I do not test like this. I run each change as a separate trial and play all games with version X before going on to version Y and version Z. I'd rather get an accurate assessment of X as soon as possible than get good results for X, Y and Z at the same time but require 3x as long to do so.
I'm speaking only of orthogonal multi-testing. And what you are claiming makes no sense. You say you run each change as a separate trial, but HALF your players have that change and half do not. So you are still talking about an enormous amount of time, and while you are doing this you are actually testing ALL the changes together.

Now if you are not talking about multi-testing, then you are on the wrong thread.

Let's make sure we understand HG's multi-testing idea - perhaps we are speaking a different language here.

The idea is that you have N things to test which are either ON or OFF. In my case N is 7. So I create 128 individual players in every possible combination. For any given feature being tested, half the players have the change ON and half have the change OFF. In my implementation I just label the players with a binary code where each bit is 0 or 1, representing that the feature is ON or OFF. So player 0010001 has feature 0 on and feature 4 on and everything else off. 64 of the players have feature 4 on and 64 have feature 4 off.
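In code, the labeling and the per-feature comparison look roughly like this (the scores here are dummy values, not my real results):

Code:

N_FEATURES = 7
players = list(range(1 << N_FEATURES))          # 128 players, one per bit pattern

def has_feature(player, feature):
    return (player >> feature) & 1 == 1

# e.g. player 0b0010001 has feature 0 and feature 4 on, everything else off
assert has_feature(0b0010001, 0) and has_feature(0b0010001, 4)

# Dummy example data: average score per player (would come from the games).
scores = {p: 0.5 for p in players}

# Estimate one feature's effect by pooling the 64 players that have it ON
# against the 64 that have it OFF.
def feature_effect(feature):
    on  = [scores[p] for p in players if has_feature(p, feature)]
    off = [scores[p] for p in players if not has_feature(p, feature)]
    return sum(on) / len(on) - sum(off) / len(off)

print([round(feature_effect(f), 3) for f in range(N_FEATURES)])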

I hope H.G. Mueller can comment on whether this understanding is correct.

Now how do you test only one feature at a time when 64 players have that feature? And how do you test it without testing the other features? I think that whatever you are doing is not multi-testing, or else what I am doing is not multi-testing.
As I play more and more games, my testing becomes like yours until it eventually converges to the same exact test you are doing. But I would never trust the early results of your test based on a single opening (or few openings) played over and over again thousands of times.
I would never trust the early results of _any_ test, so I assume we agree there.
That sounds very noble, but whether the result is early or not is totally arbitrary. It's whatever you have decided in advance is adequate. I think YOUR testing is truncated early if you define a complete test by my standards, which is over 60 million games and would take me 16 months to run on a fast quad computer. When I "stop early" it means I do not do the full 16-month run (it's more like 24 hours or something like that).

So do you think I am wrong not to run each test for 16 months?

I think the only thing that's important is whether you run the test to adequate confidence, and you are not choosing your stopping point to conform to your belief system, i.e. "I like the results, so I think I'll stop before it changes!"
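The stopping-point issue is easy to demonstrate: if two versions are actually equal in strength and you stop whenever the score first looks "good enough", you fool yourself far more often than the nominal confidence level suggests. A rough simulation sketch (no draws, a crude 2-sigma test, made-up checkpoints):

Code:

import math
import random

def looks_significant(wins, games):
    # crude one-sided 2-sigma test of the win fraction against 50%
    return games >= 100 and (wins / games - 0.5) > 2 * math.sqrt(0.25 / games)

def one_match(peek, max_games=32000, check_every=1000):
    wins = 0
    for g in range(1, max_games + 1):
        wins += random.random() < 0.5           # the change is really worth 0 Elo
        if peek and g % check_every == 0 and looks_significant(wins, g):
            return True                         # stopped early, declared an improvement
    return looks_significant(wins, max_games)

trials = 200
print("fixed-length tests fooled: ", sum(one_match(False) for _ in range(trials)) / trials)
print("peek-and-stop tests fooled:", sum(one_match(True) for _ in range(trials)) / trials)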