Orthogonal multi-testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Orthogonal multi-testing

Post by bob »

Don wrote:
Even by the time I test 100,000 samples, I have only used 6 openings.
Again, I do not test like this. I run each change as a separate trial and play all games with version X before going on to version Y and version Z. I'd rather get an accurate assessment of X as soon as possible than get good results for X, Y and Z at the same time but require 3x as long to do so.
I'm speaking only of orthogonal multi-testing. And what you are claiming makes no sense. You say you run each change as a separate trial, but HALF your players have that change and half do not. So you are still talking about an enormous amount of time, and while you are doing this you are actually testing ALL the changes together.

Now if you are not talking about multi-testing, then you are on the wrong thread.
Actually I am talking about _both_. Many of my tests simply don't pass the "uncorrelated features" smell test, and if the features are correlated, then this approach doesn't work. But some do pass it.

I have _never_ tried, nor will I ever try, testing more than 3-4 different ideas at once, because the chance of correlation goes up as you add more terms; they can interact in unexpected ways. So far, I have not done more than a simple AB test where A is on or off and B is on or off. Rather than using 4,000 starting positions, I use 2,000. But I use the same 2,000 for all the test runs.


Let's make sure we understand HG's multi-testing idea - perhaps we are speaking a different language here.

The idea is that you have N things to test which are either ON or OFF. In my case N is 7. So I create 128 individual players, one for every possible combination. For any given feature being tested, half the players have the change ON and half have the change OFF. In my implementation I just label the players with a binary code where each bit is 0 or 1, indicating whether that feature is ON or OFF. So player 0010001 has feature 0 on and feature 4 on and everything else off. 64 of the players have feature 4 on and 64 have feature 4 off.
We are talking about the same thing, except I believe that 7 is pushing things beyond reason. Are you _certain_ that there is no correlation between any pair of those 7 parameters? How sure are you? How carefully did you measure this?

I hope H.G. Mueller can comment on whether this understanding is correct.

Now how do you test only one feature at a time when 64 players have that feature? And how do you test it without testing the other features? I think that whatever you are doing is not multi-testing, or else what I am doing is not multi-testing.
I don't follow. Appears to me we are talking about exactly the same thing.
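Just to pin down what I think you are describing, here is a throwaway sketch (my own made-up names, nothing from anybody's actual test harness): bit f of the player number says whether feature f is ON, so with 7 features there are 128 players, and every feature is ON for exactly 64 of them.

```c
/* Throwaway sketch of the labeling scheme as I read it: bit f of the
   player index being 1 means feature f is ON.  With 7 features there
   are 2^7 = 128 players, and each feature is ON for exactly 64 of
   them.  Illustration only; the names are made up.                   */
#include <stdio.h>

#define NUM_FEATURES 7
#define NUM_PLAYERS  (1 << NUM_FEATURES)   /* 128 */

static int feature_on(int player, int feature) {
  return (player >> feature) & 1;
}

int main(void) {
  int feature, player;

  /* player 0010001 (binary) = 17: features 0 and 4 on, the rest off */
  printf("player 17: feature 0 = %d, feature 4 = %d, feature 5 = %d\n",
         feature_on(17, 0), feature_on(17, 4), feature_on(17, 5));

  /* every feature splits the 128 players exactly 64 / 64 */
  for (feature = 0; feature < NUM_FEATURES; feature++) {
    int on = 0;
    for (player = 0; player < NUM_PLAYERS; player++)
      on += feature_on(player, feature);
    printf("feature %d: ON for %d players, OFF for %d\n",
           feature, on, NUM_PLAYERS - on);
  }
  return 0;
}
```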

I have two concerns with this testing, one of which is related to what I responded to previously.

1. More terms introduce more opportunities for correlation. A simple example: add rook mobility along with a bonus for a rook on an open or half-open file. It turns out those are correlated, and in an obvious way, and this kind of testing won't work very well for those two things. However, there are other pairs whose correlation is not so obvious, and you don't discover it until you run the independent tests and notice the discrepancy. So I am probably far more conservative about what I "mix" using this approach.

2. I still use the same experimental setup. All the sub-matches use the same opponents and the same set of starting positions, so that the only thing that changes is the presence or absence of the parameter I am testing.

If I have two terms that I am pretty certain have no connection between them, for example a bishop pair bonus and a modification to my forward pruning, then I have two possibilities:

(1) test each idea separately. 32,000 games with pair bonus in, 32,000 games with it out. 32,000 games with new pruning idea in, 32,000 with it out. 4000 positions, 4 good opponents, two games per position.

(2) multi-test. 16,000 games with pair in / pruning off; 16,000 with pair in / pruning on; 16,000 with pair out / pruning off; 16,000 with pair out / pruning on. I have the same accuracy: 32,000 games with pair in and 32,000 games with pair out, 32,000 games with pruning on and 32,000 with pruning off. But I only played 64,000 games total. Half the effort.
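To spell the bookkeeping out, a quick sketch; nothing here but the 16,000-games-per-cell figure quoted above, hard-coded purely for illustration. Four cells, and each feature still ends up scored over 32,000 games on and 32,000 games off.

```c
/* Sketch of the bookkeeping for option (2): four 16,000-game cells
   (pair x pruning), yet each feature is still measured over 32,000
   games ON and 32,000 games OFF.  Numbers are the ones quoted above,
   hard-coded for illustration only.                                  */
#include <stdio.h>

int main(void) {
  const int games_per_cell = 16000;
  int pair, prune;
  int total = 0, pair_on = 0, pair_off = 0, prune_on = 0, prune_off = 0;

  for (pair = 0; pair <= 1; pair++)
    for (prune = 0; prune <= 1; prune++) {
      total += games_per_cell;
      if (pair)  pair_on  += games_per_cell; else pair_off  += games_per_cell;
      if (prune) prune_on += games_per_cell; else prune_off += games_per_cell;
    }

  printf("total games: %d\n", total);                       /* 64000         */
  printf("pair    on/off: %d / %d\n", pair_on, pair_off);   /* 32000 / 32000 */
  printf("pruning on/off: %d / %d\n", prune_on, prune_off); /* 32000 / 32000 */
  return 0;
}
```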

Are we talking apples-to-apples here?

As I play more and more games, my testing becomes like yours until it eventually converges to the exact same test you are doing. But I would never trust the early results of your test based on a single opening (or a few openings) played over and over again thousands of times.
I would never trust the early results of _any_ test, so I assume we agree there.
That sounds really noble, but whether the result is early or not is totally arbitrary. It's whatever you have decided in advance is adequate. I think YOUR testing is truncated early if you define a complete test by my standards, which is over 60 million games and would take me 16 months to run on a fast quad computer. When I "stop early" it means I do not do the full 16-month run (it's more like 24 hours or something like that).
I probably should write up some analysis of how test results evolve over time. I have seen tests where, at 1,000 games, a change looked very good; by 16,000 games it looked worse than the original; and by 32,000 games it had climbed back up to a modest gain. I just had one that looked very bad at 1,000 games. It slowly climbed back, and by 32,000 it was a worthwhile modification. Early termination would miss those cases. Yep, if you get a -200 Elo change, you don't need many games. But I try to avoid testing ideas that are obviously ridiculous on the surface. For me -20 is a really bad result, and +5 is a very good result. You can't detect those in a short test.
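To put rough numbers on that, a back-of-the-envelope sketch. The assumption (mine, for simplicity) is that every game is a pure win or loss; draws only shrink the real error bars. Elo(s) = 400*log10(s/(1-s)), its slope near a 50% score is about 695 Elo per unit of score, and the standard error of the score after N games is sqrt(0.25/N).

```c
/* Back-of-the-envelope: 2-sigma error bar on a measured Elo difference
   after N games, assuming every game is a win or a loss (draws would
   shrink the bar somewhat).  Elo(s) = 400*log10(s/(1-s)); its slope at
   s = 0.5 is 1600/ln(10), roughly 695 Elo per unit of score, and the
   score's standard error is sqrt(0.25/N).  Illustration only.         */
#include <stdio.h>
#include <math.h>

int main(void) {
  const int sizes[] = { 500, 1000, 4000, 16000, 32000 };
  const double slope = 1600.0 / log(10.0);   /* ~695 Elo per unit of score */
  int i;

  for (i = 0; i < 5; i++) {
    double sigma_score = sqrt(0.25 / sizes[i]);
    printf("%6d games: +/- %4.1f Elo (2 sigma)\n",
           sizes[i], 2.0 * slope * sigma_score);
  }
  return 0;
}
```

On these assumptions a -20 change only clears the 2-sigma bar somewhere past a thousand games, and a +5 change not until roughly 20,000.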


So do you think I am wrong not to run each test to 16 months?
Depends on how bad "bad" is. If it is only -20, yes, I think you are wrong, in that you are very likely missing some good results that take time to show up. To understand this, take a 32,000-game result written as a string of letters, one per game: A is a win, B is a draw, C is a loss. Now write a simple program to run through those 32,000 letters and find the 500 consecutive letters that produce the largest number of A's (a good result) and the 500 that produce the largest number of C's (a bad result). This is an illustrative test for anyone. After having seen that, decide how confident you would be that after 500 games you had not gotten the best 500 results, or the worst 500 results. Yes, I understand the central limit theorem. But a single sample of 500 from a population totaling 32,000 doesn't give you a very good idea if the opponents are all pretty equal in strength.
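Something along these lines, as a throwaway sketch; the 32,000 results here are random (40% wins, 30% draws, 30% losses, numbers I picked only so the program has something to scan):

```c
/* Sketch of the experiment described above: scan 32,000 game results
   (A = win, B = draw, C = loss) and report the 500-game stretch with
   the most A's and the one with the most C's.  Results are random,
   40% / 30% / 30%, purely for illustration.                          */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define GAMES  32000
#define WINDOW 500

int main(void) {
  static char result[GAMES];
  int i, wins = 0, losses = 0, best_wins = 0, worst_losses = 0;

  srand((unsigned) time(NULL));
  for (i = 0; i < GAMES; i++) {
    int r = rand() % 10;
    result[i] = (r < 4) ? 'A' : (r < 7) ? 'B' : 'C';
  }

  /* slide a 500-game window across the whole run */
  for (i = 0; i < GAMES; i++) {
    wins   += (result[i] == 'A');
    losses += (result[i] == 'C');
    if (i >= WINDOW) {                    /* drop the game leaving the window */
      wins   -= (result[i - WINDOW] == 'A');
      losses -= (result[i - WINDOW] == 'C');
    }
    if (i >= WINDOW - 1) {                /* window is full: record extremes  */
      if (wins   > best_wins)    best_wins = wins;
      if (losses > worst_losses) worst_losses = losses;
    }
  }

  printf("best  500-game stretch: %d wins   (%.1f%%)\n",
         best_wins, 100.0 * best_wins / WINDOW);
  printf("worst 500-game stretch: %d losses (%.1f%%)\n",
         worst_losses, 100.0 * worst_losses / WINDOW);
  return 0;
}
```

Run it a few times: the best and worst 500-game stretches typically land well above and well below the 40% / 30% rates that were fed in, which is exactly the point.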

I think the only thing that's important is whether you run the test to adequate confidence, and you are not choosing your stopping point to conform to your belief system, i.e. "I like the results, so I think I'll stop before it changes!"
The question is, "what is that stopping point?" I certainly prefer wasting extra test time over wasting development time: throwing away a change that really helps, but needed more games to show it.