bob wrote:
I don't think that is safe. My tester plays each position twice against all opponents. I would not use a different set of opening positions for each test, that invites bias. In my testing script, I can tell the thing how many positions to use. Rather than 4000, if I were doing two tests mixed together,
I suspect that you misunderstand what I'm doing. A given player will play both sides of each opening for each opponent. No opening sequences will ever get repeated.
However, what is different is that when I do a new test, the openings will get played in a different order for each pair of players.
In a normal test I use about 4,000 positions, vs 4 opponents, double-RR style so I get 32,000 games for a test.
That's pretty much a perfect description of how I do it, although I sometimes have more or less than 4 players. With OMT I have 128 players as I test 7 different parameters!
If I test A on, then A off, then B on, then B off, that would be 128,000 games. Using the orthogonal approach, I would use 1/2 the positions, so that I have 4000 games with A on B off, 4000 with A on B on, 4000 with A off B on, and 4000 with A off B off. That gives me 8000 games with A on, 8000 games with A off, ditto for B. 1/2 the games of the original test to get the same accuracy. If (a) there are two independent things to test at the same time and (b) they are truly independent.
Yes, that's exactly how I do it.
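To make the bookkeeping in the two-parameter example concrete, here is a minimal sketch that tallies games per setting when every on/off combination is played. The function name `orthogonal_counts` and the figure of 4000 games per combination are illustrative assumptions taken from the numbers quoted above, not anyone's actual test script.

```python
from itertools import product

def orthogonal_counts(params, games_per_config):
    """Tally games per on/off setting when every combination of params is played.

    Hypothetical helper for illustration only.
    """
    counts = {(p, state): 0 for p in params for state in (True, False)}
    total = 0
    # One "configuration" per on/off combination: 2 ** len(params) of them.
    for combo in product((True, False), repeat=len(params)):
        total += games_per_config
        for p, state in zip(params, combo):
            counts[(p, state)] += games_per_config
    return total, counts

total, counts = orthogonal_counts(["A", "B"], 4000)
print(total)                                       # 16000
print(counts[("A", True)], counts[("A", False)])   # 8000 8000
print(counts[("B", True)], counts[("B", False)])   # 8000 8000
```

Every game contributes to the tally of each parameter at once, which is where the saving comes from, provided the parameters really are independent.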
I'm not a fan of different positions for different tests, how can you be sure that the positions don't introduce some sort of bias where set a is more dependent on one of the changes than set b?
I think you are not understanding why I am doing it this way. The way I'm doing it is exactly the same as the way you are doing it, if I allow the full set of 4000 positions to be played with each possible matchup.
But what about the case where you stop early? Let's do the math. I have 7 parameters, which gives 2^7 = 128 combinations of parameters, i.e. 128 players. Each player plays each color against each opponent for each opening. So how many games is that for just 1 opening? Each player will play 127 games as white for a given opening, so we have 128 * 127 = 16256 games. If you count pairings rather than games you would divide that by 2, since each player is really playing half a game. But since each pairing plays the opening twice, once with each color, you are back to 16256 games for just a single opening.
For any given positional term, half of the players have the term ON and half of their opponents have the term OFF. So I think 1/4 of the games can be used to measure how well a given term works. So for a single positional term you have an effective sample size of 4064 games after you have played just 1 opening!
As you know 4,064 games is not a large enough sample to measure small changes, but it's pretty impressive considering that it was achieved with a single opening! Also, it's at least starting to be enough games to notice major trends or problems.
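The arithmetic above can be checked by brute-force enumeration. This is only a sketch: the 0/1 tuple encoding of players and the variable names are my own assumptions. The 1/4 figure is the estimate given above, while the last count (games in which exactly one side has a given term on) is added purely for comparison.

```python
from itertools import permutations, product

N_PARAMS = 7

# Each player is one on/off setting of the 7 parameters (hypothetical encoding).
players = list(product((0, 1), repeat=N_PARAMS))

# One game per ordered (white, black) pair covers both colors for one opening.
games = list(permutations(players, 2))

print(len(players))      # 128
print(len(games))        # 16256
print(len(games) // 4)   # 4064, the 1/4 estimate used above

# For comparison: games where exactly one side has term 0 switched on.
term = 0
cross = sum(1 for white, black in games if white[term] != black[term])
print(cross)             # 8192
```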
Since it's impractical for me to play 16256 * 4000 = 65 million plus games to get a result, I am probably going to stop this test early. In fact, I may have enough data to draw some tentative conclusions after only 3 or 4 openings have been played.
But as you yourself point out, do I really want to draw conclusions based on how well some evaluation parameter does using just 4 openings? Even though that is an effective sample of over 16,000 games, it may not be that valid, because those 4 openings may have made the parameter look especially good or bad and I could draw the wrong conclusion.
However, if the starting positions were chosen in a randomly distributed way, after 16,000 games I would know that each positional feature will have been tested with a huge variety of openings. It's not very important that a specific PROGRAM only saw 4 openings against some other specific program (which I think is what you are unduly worried about), but what is much more relevant is that each positional term gets representation over a variety of opening systems.
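One way to get that spread is to give every pair of players its own shuffled order of the opening set, so that stopping early still samples many different openings across the whole field. The following is a minimal sketch under that assumption; `opening_schedule`, the player labels and the seed handling are hypothetical, not my actual tester.

```python
import random
from itertools import combinations, islice

def opening_schedule(players, openings, seed):
    """Yield (white, black, opening) games, round by round.

    Each pair gets its own shuffled opening order, plays every opening
    with both colors, and never repeats an opening. Hypothetical sketch.
    """
    rng = random.Random(seed)  # a new seed per test gives a new order
    order = {}
    for pair in combinations(players, 2):
        shuffled = list(openings)
        rng.shuffle(shuffled)
        order[pair] = shuffled
    for round_no in range(len(openings)):
        for (a, b), shuffled in order.items():
            opening = shuffled[round_no]
            yield (a, b, opening)  # a has white
            yield (b, a, opening)  # colors reversed, same opening

players = [f"P{i}" for i in range(4)]   # 6 pairs
openings = list(range(4000))

# Stop after the first round: 6 pairs * 2 colors = 12 games.
first_round = list(islice(opening_schedule(players, openings, seed=1), 12))
print(len(first_round))
print(len({opening for _, _, opening in first_round}))  # distinct openings seen
```

With a fixed opening order, that first round would have used a single opening for everyone; with per-pair shuffles, each pair almost always draws a different one.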