Suppose you want to evaluate 4 possible changes to your engine, and your target is to play-test them to see if they deliver an improvement of at least 1% (score-wise) with 84% confidence. To get a score error of 1%, you need 1600 games in your standard gauntlet (40%/sqrt(1600) = 1%). But you need to base your verdict on the difference of two gauntlets (with and without the change), and the statistical error in that difference is sqrt(2) times as large as in the individual results. So you need 2 times as many games in each gauntlet, i.e. two gauntlets of 3200 games: one with the change, one without.
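The arithmetic above can be sketched in a few lines of Python. This is not from the post; it just encodes the post's assumption that the per-game standard deviation of the score is 40%:

```python
import math

SIGMA = 40.0  # assumed per-game score standard deviation, in %

def error_pct(n_games, sigma=SIGMA):
    """1-sigma error (in %) of a gauntlet score after n_games."""
    return sigma / math.sqrt(n_games)

def diff_error_pct(n1, n2, sigma=SIGMA):
    """Error of the difference of two independent gauntlet results:
    the individual errors add in quadrature."""
    return math.hypot(error_pct(n1, sigma), error_pct(n2, sigma))

print(error_pct(1600))             # 1.0 (%): a single 1600-game gauntlet
print(diff_error_pct(3200, 3200))  # ~1.0 (%): difference of two 3200-game gauntlets
```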
Now if you want to evaluate all four changes, you would have to repeat that 4 times. Well, would you really? It is enticing to reuse the result of the gauntlet without the change as a reference, and compare the results of the four 3200-game gauntlets with the changes to it, so that you do 5 x 3200 games in total. But this is very risky: if you happened to have a fluke in your reference gauntlet, so that it scored higher than it normally would, you would most likely reject every change you test for the rest of your life (or at least as long as you continue to compare to that reference, without re-doing it). If you are going to use a result multiple times, you had better have very high confidence in it.
So if you require the reference to have only half the error of the individual tests, it needs 4 times as many games as those. But in that case your acceptance or rejection of the change depends on the difference of two results with unequal errors. You have to add these 'according to Pythagoras': where two errors of 1% add to sqrt(2) = 1.4%, an error of 1% and one of 0.5% add to only sqrt(1*1 + 0.5*0.5) = sqrt(1.25) = 1.12%. So instead of 2 times as many games in the gauntlets (compared to 1600) you would need only 1.25 times as many to reduce the error in the difference back to 1%. So you could do four gauntlets of 2000 games, one with each of the changes, plus one of 8000 games as a reference, or 16000 games in total.
That happens to be the same number of games as when you would naively have used five 3200-game gauntlets, using the reference gauntlet 4 times, but it is much less sensitive to a total disaster due to a bad reference. You could get the same 1% error with even fewer games if you determined the reference not with 4 times the number of games, but with 2 times as many. The error in the reference would then be sqrt(0.5) times that of a single-length gauntlet, and the error in the differences would be sqrt(1.5) times as large as the latter, requiring 1.5 times as many games to push it back to 1% (2400 games). You would then only have to do 4 x 2400 + 1 x 4800 = 14400 games in total, not 16000. But the risk of correlated rejection of all changes due to a reference fluke would again be a bit larger.
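The two budgets above can be checked numerically. A small sketch (again assuming the post's 40% per-game standard deviation), showing that both schemes give a 1% error on the difference:

```python
import math

SIGMA = 40.0  # assumed per-game score standard deviation, in %

def err(n_games):
    return SIGMA / math.sqrt(n_games)

def diff_err(n_change, n_ref):
    # errors of independent results add 'according to Pythagoras'
    return math.hypot(err(n_change), err(n_ref))

# 4x-length reference (8000 games), change gauntlets of 2000 games:
print(diff_err(2000, 8000), 4 * 2000 + 8000)   # ~1.0 (%), 16000 games total
# 2x-length reference (4800 games), change gauntlets of 2400 games:
print(diff_err(2400, 4800), 4 * 2400 + 4800)   # ~1.0 (%), 14400 games total
```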
Now this is really the minimum number of games that is needed, right? Well, actually: wrong!
A more efficient way of testing would prepare all possible combinations of the four changes enabled or disabled. That is 2^4 = 16 engines. You let each engine play the same gauntlet of 400 games. That is only 6400 games in total. Now for any of the four changes there are 8 engines that had the change, and 8 that had not. So you have played 3200 games with the change enabled, and 3200 games with that change disabled. The statistical error in each of these 3200-game totalled results is sqrt(0.5)% (= 40%/sqrt(3200)); the statistical error in the difference is 1%, as desired. The fact that not all of the 3200 games were obtained with the same setting of the other changes does not affect the outcome (if the changes are independent and lead to additive improvement), as the reference had exactly the same mix of settings of the other changes. This is similar to testing against a number of different opponents to get more independent games; you can just as well play with a few different versions of the engine-under-test to create more diversity.
With this method you test 4 changes to the same accuracy with the same number of games as you would have used to test a single change! In other words, you tested 3 changes for free! And it does not stop there: you could just as easily have evaluated 6 changes to 1% accuracy in those 6400 games, by making 64 versions of your engine and having each play a 100-game gauntlet.
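The scheme is easy to simulate. Below is a hypothetical Monte Carlo sketch (not from the post): 16 engine versions, one per on/off combination of 4 changes, each play a 400-game gauntlet, and for each change we compare the pooled score of the 8 versions that have it against the 8 that do not. The per-change effect sizes are invented for the simulation:

```python
import itertools
import random

random.seed(1)

TRUE_EFFECT = [0.02, 0.0, 0.01, -0.01]  # made-up score effects of the 4 changes
BASE = 0.50                             # base expected score
GAMES = 400                             # games per version, 16 * 400 = 6400 total

def play_gauntlet(config, n_games):
    """Fake gauntlet: each game is won with probability BASE plus the
    summed effects of the enabled changes (independent, additive)."""
    p = BASE + sum(e for e, on in zip(TRUE_EFFECT, config) if on)
    return sum(random.random() < p for _ in range(n_games)) / n_games

configs = list(itertools.product([0, 1], repeat=4))   # 16 versions
scores = {c: play_gauntlet(c, GAMES) for c in configs}

for i in range(4):
    # 8 versions (3200 games) with change i vs. 8 versions without it
    with_i = sum(scores[c] for c in configs if c[i]) / 8
    without_i = sum(scores[c] for c in configs if not c[i]) / 8
    print(f"change {i}: estimated effect {with_i - without_i:+.3f}")
```

With 3200 games on each side, each estimate comes out within about 1% (1 sigma) of the invented true effect, as the post predicts.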
This really is a big time saver!
Orthogonal multi-testing
Moderators: hgm, Rebel, chrisw
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Orthogonal multi-testing
hgm wrote: "Suppose you want to evaluate 4 possible changes to your engine, and your target is to play-test them to see if they deliver an improvement of at least 1% (score-wise) with 84% confidence. [...] This really is a big time saver!"

At first, this seemed to be wrong, but after a little thought, it begins to make sense. I'll give this a whirl when I get a chance and report back. I can probably find recent versions where I changed 4 different things but tested each separately. That would give me a control, and then I could try this approach as well and see how closely they agree...
more later...
-
- Posts: 16
- Joined: Fri Aug 01, 2008 6:09 pm
- Location: UK
Re: Orthogonal multi-testing
Yes and you can, with the same experiment, measure interactions between different changes.
There are many methods from the theory of experimental design that can make your experiments/tests more efficient.
-
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: Orthogonal multi-testing
hgm wrote: "The fact that not all of the 3200 games were obtained with the same setting of the other changes does not affect the outcome (if the changes are independent and lead to additive improvement), as the reference had exactly the same mix of settings of the other changes. [...]"

Isn't that a really big assumption though, that the changes are actually independent and lead to additive improvement? If your 4 chosen changes are not actually independent, how would that affect the accuracy of the testing? (Can you detect this from the results?)
Re: Orthogonal multi-testing
Well, one thing is for sure: you would certainly not detect it if you tested all the changes in isolation, in separate 3200-game gauntlets. Orthogonal multi-testing does allow you to extract some clues about it. If you want to know whether change A and change B interact, you would divide your 6400 games into 4 groups of 1600: those that have both A and B, those that have A but not B, those that have B but not A, and those that have neither. (In each group there would still be variety w.r.t. the presence of changes C, D, etc.)
You can now determine the result in each group with a sigma of 1%, and you would want to test whether the average score of the A+B and 'neither' groups equals the average score of the A-but-not-B and B-but-not-A groups. This is again the difference of the results from two gauntlets of 3200 games, and this difference would have a 1-sigma error of 1%. So the accuracy with which you decide whether the changes interact is the same as the accuracy with which you determine whether the average effect of a change (averaged over all combinations with the other changes) is positive. From the same data set that you already have.
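This interaction test is straightforward to code. A hypothetical sketch (not from the thread), taking the per-version scores of an orthogonal multi-test as a dict that maps each on/off configuration of 4 changes to its gauntlet score; for purely additive changes the estimate comes out zero:

```python
import itertools

def interaction(scores, a, b):
    """Interaction estimate for changes a and b: mean score of the
    'both'/'neither' groups minus mean score of the 'exactly one'
    groups. Zero if the two changes are independent and additive."""
    groups = {(0, 0): [], (0, 1): [], (1, 0): [], (1, 1): []}
    for config, s in scores.items():
        groups[(config[a], config[b])].append(s)
    mean = lambda xs: sum(xs) / len(xs)
    same = (mean(groups[(1, 1)]) + mean(groups[(0, 0)])) / 2
    mixed = (mean(groups[(1, 0)]) + mean(groups[(0, 1)])) / 2
    return same - mixed

# Made-up additive scores: 0.5 base, +0.02 for change 0, +0.01 for change 1
fake = {c: 0.5 + 0.02 * c[0] + 0.01 * c[1]
        for c in itertools.product([0, 1], repeat=4)}
print(interaction(fake, 0, 1))   # ~0: no interaction for additive effects
```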
-
- Posts: 718
- Joined: Fri Mar 20, 2009 8:59 pm
Re: Orthogonal multi-testing
This sounds exactly like what I was thinking, only better thought out.
-
- Posts: 613
- Joined: Sun Jan 18, 2009 7:03 am
Re: Orthogonal multi-testing
Excellent!! Sun shines now a bit more brightly! Thank you for pointing this out
Joona Kiiski
Re: Orthogonal multi-testing
zamar wrote: "Excellent!! Sun shines now a bit more brightly! Thank you for pointing this out"

But only a bit. It still takes a ton of games.
I'm planning on a quick "validity test" for the idea. I'm going to try tweaking piece values. First a reduced number of games as HG suggested where I tweak rook and queen values independently. Then use his approach to compare just queen tweaks and just rook tweaks. And then run a normal test changing just one at a time to see how well the approaches correlate...
-
- Posts: 317
- Joined: Mon Jun 26, 2006 9:44 am
Re: Orthogonal multi-testing
If I understand it correctly, I think this idea is intended for changes where there are only 2 choices for each change, i.e. make the change or not. With a change that takes on a range of values, I don't see how this method can be practical unless the range of values to be tested is fairly small.
Re: Orthogonal multi-testing
rjgibert wrote: "If I understand it correctly, I think this idea is intended for changes where there are only 2 choices with each change i.e. make the change or not. [...]"

That's correct. But the range has to be fairly small in any case; for n choices, you need n times a huge number of games...