different kinds of testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jesper_nielsen

Re: different kinds of testing

Post by jesper_nielsen »

bob wrote:
jesper_nielsen wrote:Speaking as a very resource-limited (un-)happy amateur, the question of testing for me becomes "how do I get the most bang for my buck?"

I have recently started using the "cutechess-cli" tool for running very fast games. Great tool by the way! :)

I am currently running tests with 40 Noomen positions against 8 different opponents at a 5+0.4 time control, giving 40*2*8 = 640 games. This is clearly not enough games to draw any kind of conclusion except for very big Elo jumps.

There are (at least! :) ) three ways to increase the number of games played.

1. Add more repetitions of the test runs.
Do _NOT_ do this. It is fraught with problems. Use more positions or more opponents, but do not play more games with the same positions.
2. Add more opponents.
3. Add more starting positions.

Which one of these options is "better"?
Or are they equal, meaning that the real value is only in the higher number of games?
More positions is easier. Finding more opponents can be problematic, in that some programs do _very_ poorly at fast time controls, which even I use on the cluster to get results faster. So fast games limit the potential candidates for testing opponents. But clearly more is better, and I am continually working on this issue. One important point is that you really want to test against stronger opponents for the most part, not weaker ones. That way you can recognize gains more quickly than if you are way ahead of your opponents. That is another limiting factor in choosing opponents.
That to me is an interesting question. :D

Kind regards,
Jesper

P.S.
I would hate to see anyone withdraw their contribution to this forum.
The diverse inputs are, I believe, one of the strengths of this place.
Even if the going gets a bit rough from time to time.
Ok! Thanks!

The reason option 1 looks tasty to me is that it gives the option of iteratively adding more precision.

So you can run the test, look at the results, and decide if you think the change is good, bad, or uncertain. Then, if uncertain, run the test again.

In this way there is an option to "break off" early, if a good or bad change is spotted, thereby saving some time.

But maybe, with a large number of start positions, you can break them up into chunks of a manageable size and then run the tests as needed?!
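As a rough illustration of that idea, here is a minimal Python sketch (not from anyone's actual tester) of playing the start positions in chunks and breaking off early once the score is clearly good or clearly bad. run_pair() is a hypothetical helper that plays one position with colors reversed, for example by driving cutechess-cli, and returns our engine's points out of 2.

Code: Select all

import math

def test_in_chunks(positions, run_pair, chunk=40, z=2.0):
    points, games = 0.0, 0
    for start in range(0, len(positions), chunk):
        for fen in positions[start:start + chunk]:
            points += run_pair(fen)          # 0, 0.5, 1, 1.5 or 2 points
            games += 2
        score = points / games
        # naive error bar; note it ignores the correlation inside each pair
        err = z * math.sqrt(score * (1.0 - score) / games)
        print(f"{games} games, score {score:.3f} +/- {err:.3f}")
        if score - err > 0.5:
            return "looks like an improvement"
        if score + err < 0.5:
            return "looks like a regression"
    return "still uncertain"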

How to pick the positions to use in the tests?
One idea could be to take the positions where your program left book in the tournaments it has played in.

Another idea could be to randomly generate them by using your own book. So basically let your program pick a move, like it would in a real game, and then follow the line to the end of the book.
The pro is that the positions are biased towards positions your program is likely to pick in a tournament game.
The con is that the testing then inherits the blind spots from the opening book.
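A minimal sketch of that book-walk idea, assuming a Polyglot-format book and the python-chess library (the file name is illustrative): pick weighted-random book moves until the book runs out, then use the resulting position as a test start position.

Code: Select all

import chess
import chess.polyglot

def book_walk(book_path="mybook.bin"):
    board = chess.Board()
    with chess.polyglot.open_reader(book_path) as reader:
        while True:
            try:
                entry = reader.weighted_choice(board)
            except IndexError:        # no entry for this position: out of book
                break
            board.push(entry.move)
    return board.fen()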

Thanks for the ideas! :)

Kind regards,
Jesper
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: different kinds of testing

Post by bob »

jesper_nielsen wrote:
bob wrote:
jesper_nielsen wrote:Speaking as a very resource-limited (un-)happy amateur, the question of testing for me becomes "how do I get the most bang for my buck?"

I have recently started using the "cutechess-cli" tool for running very fast games. Great tool by the way! :)

I am currently running tests with 40 Noomen positions against 8 different opponents at a 5+0.4 time control, giving 40*2*8 = 640 games. This is clearly not enough games to draw any kind of conclusion except for very big Elo jumps.

There are (at least! :) ) three ways to increase the number of games played.

1. Add more repetitions of the test runs.
Do _NOT_ do this. It is fraught with problems. Use more positions or more opponents, but do not play more games with the same positions.
2. Add more opponents.
3. Add more starting positions.

Which one of these options is "better"?
Or are they equal, meaning that the real value is only in the higher number of games?
More positions is easier. Finding more opponents can be problematic, in that some programs do _very_ poorly at fast time controls, which even I use on the cluster to get results faster. So fast games limit the potential candidates for testing opponents. But clearly more is better, and I am continually working on this issue. One important point is that you really want to test against stronger opponents for the most part, not weaker ones. That way you can recognize gains more quickly than if you are way ahead of your opponents. That is another limiting factor in choosing opponents.
That to me is an interesting question. :D

Kind regards,
Jesper

P.S.
I would hate to see anyone withdraw their contribution to this forum.
The diverse inputs are, I believe, one of the strengths of this place.
Even if the going gets a bit rough from time to time.
Ok! Thanks!

The reason option 1 looks tasty to me is that it gives the option of iteratively adding more precision.

So you can run the test, look at the results, and decide if you think the change is good, bad, or uncertain. Then, if uncertain, run the test again.
You can do this with several thousand positions. Just use the first N, and if you want more accuracy, run the next N, and the next N, etc. Much safer than running the same positions over and over. For example, suppose you somehow get perfect reproduction for each move played. If you replay every game, you have 2x as many games, but every pair of results is identical. BayesElo will report a lower error bar, but it is wrong because there is perfect correlation between pairs of games. Using more positions rather than repeating old ones eliminates this.
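A toy illustration of that last point (not from the post): duplicating a set of game results shrinks the naive standard error by about sqrt(2) even though nothing new was measured, which is exactly why the reported error bar becomes too optimistic.

Code: Select all

import math

def naive_error(results):
    n = len(results)
    mean = sum(results) / n
    var = sum((r - mean) ** 2 for r in results) / (n - 1)
    return math.sqrt(var / n)

results = [1, 0.5, 0, 1, 0.5, 0.5, 1, 0] * 50   # 400 game results
print(naive_error(results))        # error bar from the original games
print(naive_error(results * 2))    # same games replayed: ~1/sqrt(2) smaller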

In this way there is an option to "break off" early, if a good or bad change is spotted, thereby saving some time.

But maybe, with a large number of start positions, you can break them up into chunks of a manageable size and then run the tests as needed?!

How to pick the positions to use in the tests?
One idea could be to take the positions where your program left book in the tournaments it has played in.

Another idea could be to randomly generate them by using your own book. So basically let your program pick a move, like it would in a real game, and then follow the line to the end of the book.
The pro is that the positions are biased towards positions your program is likely to pick in a tournament game.
The con is that the testing then inherits the blind spots from the opening book.

Thanks for the ideas! :)

Kind regards,
Jesper
Here is what I did. I took a million or so high-quality games and had Crafty read through them. At ply=24, which would always be white to move, I spit out the FEN from that game and then go on to the next. I end up with a million (or whatever number of games you have in your PGN collection) positions. I sort 'em and then use "uniq" to eliminate duplicates and add a count for the number of times each position was duplicated. I then sort on this counter field and take the first N for my test positions. These are the N most popular positions.

Works well enough. You might find a better selection algorithm, but these positions seem to be working quite well for us.
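A rough Python sketch of that selection procedure, assuming the python-chess library (chess.pgn) for reading the games; the Counter plays the role of the sort/uniq pipeline, and the names and the 4000 cutoff are illustrative.

Code: Select all

from collections import Counter
import chess.pgn

def popular_positions(pgn_path, plies=20, top_n=4000):
    counts = Counter()
    with open(pgn_path) as pgn:
        while True:
            game = chess.pgn.read_game(pgn)
            if game is None:
                break
            moves = list(game.mainline_moves())
            if len(moves) < plies:
                continue                      # game too short, skip it
            board = game.board()
            for move in moves[:plies]:
                board.push(move)
            # EPD drops the move counters, so identical positions always
            # produce identical keys
            counts[board.epd()] += 1
    return [epd for epd, count in counts.most_common(top_n)]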
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: different kinds of testing

Post by Don »

bob wrote: Here is what I did. I took a million or so high-quality games and had Crafty read through them. At ply=24, which would always be white to move, I spit out the FEN from that game and then go on to the next. I end up with a million (or whatever number of games you have in your PGN collection) positions. I sort 'em and then use "uniq" to eliminate duplicates and add a count for the number of times each position was duplicated. I then sort on this counter field and take the first N for my test positions. These are the N most popular positions.

Works well enough. You might find a better selection algorithm, but these positions seem to be working quite well for us.
This is almost a perfect description of how I did it, the unix way.

One big difference is that I go only to ply 10. Is there any particular reason you picked ply 24?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: different kinds of testing

Post by bob »

Don wrote:
bob wrote: Here is what I did. I took a million or so high-quality games and had Crafty read through them. At ply=24, which would always be white to move, I spit out the FEN from that game and then go on to the next. I end up with a million (or whatever number of games you have in your PGN collection) positions. I sort 'em and then use "uniq" to eliminate duplicates and add a count for the number of times each position was duplicated. I then sort on this counter field and take the first N for my test positions. These are the N most popular positions.

Works well enough. You might find a better selection algorithm, but these positions seem to be working quite well for us.
This is almost a perfect description of how I did it, the unix way.

One big difference is that I go only to ply 10. Is there any particular reason you picked ply 24?
I tried several values. 10 always gave openings, and there are not really several thousand different positions there unless you go to _really_ off-the-beaten-path openings and include 1. g4 and such stuff. 24 even gives a few near-endgames. I believe at 24, almost 50% of the positions are uncastled, which is a nice balance, since what you do before castling is different from what you do after. I wanted some dynamic attacking-type positions, some quiet positions, etc. I won't begin to claim that 24 plies is the optimal number. It was just a number that seemed to give a reasonable balance. Even worse, I just looked at the code and I actually used 20, not 24. :) Not sure when I changed it. All my positions are WTM, move #11.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: different kinds of testing

Post by Don »

bob wrote:
Don wrote:
bob wrote: Here is what I did. I took a million or so high-quality games and had Crafty read through them. At ply=24, which would always be white to move, I spit out the FEN from that game and then go on to the next. I end up with a million (or whatever number of games you have in your PGN collection) positions. I sort 'em and then use "uniq" to eliminate duplicates and add a count for the number of times each position was duplicated. I then sort on this counter field and take the first N for my test positions. These are the N most popular positions.

Works well enough. You might find a better selection algorithm, but these positions seem to be working quite well for us.
This is almost a perfect description of how I did it, the unix way.

One big difference is that I go only to ply 10. Is there any particular reason you picked ply 24?
I tried several values. 10 always gave openings, and there are not really several thousand different positions there unless you go to _really_ off-the-beaten-path openings and include 1. g4 and such stuff. 24 even gives a few near-endgames. I believe at 24, almost 50% of the positions are uncastled, which is a nice balance, since what you do before castling is different from what you do after. I wanted some dynamic attacking-type positions, some quiet positions, etc. I won't begin to claim that 24 plies is the optimal number. It was just a number that seemed to give a reasonable balance. Even worse, I just looked at the code and I actually used 20, not 24. :) Not sure when I changed it. All my positions are WTM, move #11.
I'm not sure of what the best number is either. I used the million game database and picked 10, basically due to my desire to get the program out of book as early as reasonably possible. I think I probably do have a few off-beat openings as a result. I don't remember what N was, but it was fairly low.

I did do a couple of other things. I culled out duplicate games so that I would not get a false read. And I ran each opening through my chess program and attached the 64-bit position hash key to the final position. Using uniq I found a few transpositions, which I removed. Then I ran these positions again through the tester, but called it a draw after a few more moves. I removed any positions that were duplicates after a couple more plies, or something like that. In other words, in an imperfect way I tried to pick up some openings that were "likely" to transpose. I doubt this was very useful, but it made me feel better!
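Something along these lines, assuming python-chess, with zobrist_hash() standing in for the 64-bit position key; the "play a couple more moves and compare again" refinement would need an engine to choose those moves, so it is left out of this sketch.

Code: Select all

import chess
import chess.polyglot

def line_key(san_moves):
    # 64-bit key of the position at the end of one opening line
    board = chess.Board()
    for san in san_moves:
        board.push_san(san)
    return chess.polyglot.zobrist_hash(board)

def drop_transpositions(openings):
    # keep only the first opening line that reaches each final position
    seen, kept = set(), []
    for line in openings:                 # each line is a list of SAN moves
        key = line_key(line)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept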
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: different kinds of testing

Post by michiguel »

bob wrote:
jesper_nielsen wrote:
bob wrote:
jesper_nielsen wrote:Speaking as a very resource-limited (un-)happy amateur, the question of testing for me becomes "how do I get the most bang for my buck?"

I have recently started using the "cutechess-cli" tool for running very fast games. Great tool by the way! :)

I am currently running tests with 40 Noomen positions against 8 different opponents at a 5+0.4 time control, giving 40*2*8 = 640 games. This is clearly not enough games to draw any kind of conclusion except for very big Elo jumps.

There are (at least! :) ) three ways to increase the number of games played.

1. Add more repetitions of the test runs.
Do _NOT_ do this. It is fraught with problems. Use more positions or more opponents, but do not play more games with the same positions.
2. Add more opponents.
3. Add more starting positions.

Which one of these options is "better"?
Or are they equal, meaning that the real value is only in the higher number of games?
More positions is easier. Finding more opponents can be problematic, in that some programs do _very_ poorly at fast time controls, which even I use on the cluster to get results faster. So fast games limit the potential candidates for testing opponents. But clearly more is better, and I am continually working on this issue. One important point is that you really want to test against stronger opponents for the most part, not weaker ones. That way you can recognize gains more quickly than if you are way ahead of your opponents. That is another limiting factor in choosing opponents.
That to me is an interesting question. :D

Kind regards,
Jesper

P.S.
I would hate to see anyone withdraw their contribution to this forum.
The diverse inputs are, I believe, one of the strengths of this place.
Even if the going gets a bit rough from time to time.
Ok! Thanks!

The reason option 1 looks tasty to me is that it gives the option of iteratively adding more precision.

So you can run the test, look at the results, and decide if you think the change is good, bad, or uncertain. Then, if uncertain, run the test again.
You can do this with several thousand positions. Just use the first N, and if you want more accuracy, run the next N, and the next N, etc. Much safer than running the same positions over and over. For example, suppose you somehow get perfect reproduction for each move played. If you replay every game, you have 2x as many games, but every pair of results is identical. BayesElo will report a lower error bar, but it is wrong because there is perfect correlation between pairs of games. Using more positions rather than repeating old ones eliminates this.

In this way there is an option to "break off" early, if a good or bad change is spotted, thereby saving some time.

But maybe, with a large number of start positions, you can break them up into chunks of a manageable size and then run the tests as needed?!

How to pick the positions to use in the tests?
One idea could be to take the positions where your program left book in the tournaments it has played in.

Another idea could be to randomly generate them by using your own book. So basically let your program pick a move, like it would in a real game, and then follow the line to the end of the book.
The pro is that the positions are biased towards positions your program is likely to pick in a tournament game.
The con is that the testing then inherits the blind spots from the opening book.

Thanks for the ideas! :)

Kind regards,
Jesper
Here is what I did. I took a million or so high-quality games and had Crafty read through them. At ply=24, which would always be white to move, I spit out the FEN from that game and then go on to the next. I end up with a million (or whatever number of games you have in your PGN collection) positions. I sort 'em and then use "uniq" to eliminate duplicates and add a count for the number of times each position was duplicated. I then sort on this counter field and take the first N for my test positions. These are the N most popular positions.

Works well enough. You might find a better selection algorithm, but these positions seem to be working quite well for us.
In the first 40 to 50 positions you have a quasi-duplicate. Some time ago I downloaded your positions and went through the first ones out of curiosity. The positions are a French Exchange. They have a small difference, which is either that one is black to move and the other white to move, or something small like that (or one position in which one side lost a tempo or so, I can't remember exactly). Maybe it is not terribly important considering the size you are dealing with, but I figured you might like to know. You may have more cases like this. If you like, I can try to dig it up. The positions are technically different, but you would discard them if you saw them. Since they are ordered by FEN, they are not exactly together. Now that I say this, I think one position has a pawn on g5 and the other on g6 (when black should be playing g5).

I know it is very difficult to select these things. I started to choose positions manually from ECO. That would be a high-quality selection, but it takes forever. One day I will finish :-)
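A crude way to flag quasi-duplicates of that kind automatically (a hypothetical sketch, assuming python-chess): report any pair of positions whose piece placements differ on only a couple of squares, such as a pawn on g5 versus g6 or a single lost tempo.

Code: Select all

import chess

def board_diff(fen_a, fen_b):
    a = chess.Board(fen_a).piece_map()
    b = chess.Board(fen_b).piece_map()
    return sum(1 for sq in set(a) | set(b) if a.get(sq) != b.get(sq))

def quasi_duplicates(fens, max_diff=2):
    pairs = []                       # O(n^2), fine for a few thousand FENs
    for i in range(len(fens)):
        for j in range(i + 1, len(fens)):
            if board_diff(fens[i], fens[j]) <= max_diff:
                pairs.append((fens[i], fens[j]))
    return pairs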

Miguel
Dann Corbit
Posts: 12564
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: different kinds of testing

Post by Dann Corbit »

michiguel wrote:
bob wrote:
jesper_nielsen wrote:
bob wrote:
jesper_nielsen wrote:Speaking as a very resource-limited (un-)happy amateur, the question of testing for me becomes "how do I get the most bang for my buck?"

I have recently started using the "cutechess-cli" tool for running very fast games. Great tool by the way! :)

I am currently running tests with 40 Noomen positions against 8 different opponents at a 5+0.4 time control, giving 40*2*8 = 640 games. This is clearly not enough games to draw any kind of conclusion except for very big Elo jumps.

There are (at least! :) ) three ways to increase the number of games played.

1. Add more repetitions of the test runs.
Do _NOT_ do this. It is fraught with problems. Use more positions or more opponents, but do not play more games with the same positions.
2. Add more opponents.
3. Add more starting positions.

Which one of these options is "better"?
Or are they equal, meaning that the real value is only in the higher number of games?
More positions is easier. Finding more opponents can be problematic, in that some programs do _very_ poorly at fast time controls, which even I use on the cluster to get results faster. So fast games limit the potential candidates for testing opponents. But clearly more is better, and I am continually working on this issue. One important point is that you really want to test against stronger opponents for the most part, not weaker ones. That way you can recognize gains more quickly than if you are way ahead of your opponents. That is another limiting factor in choosing opponents.
That to me is an interesting question. :D

Kind regards,
Jesper

P.S.
I would hate to see anyone withdraw their contribution to this forum.
The diverse inputs are, I believe, one of the strengths of this place.
Even if the going gets a bit rough from time to time.
Ok! Thanks!

The reason option 1 looks tasty to me is that it gives the option of iteratively adding more precision.

So you can run the test, look at the results, and decide if you think the change is good, bad, or uncertain. Then, if uncertain, run the test again.
You can do this with several thousand positions. Just use the first N, and if you want more accuracy, run the next N, and the next N, etc. Much safer than running the same positions over and over. For example, suppose you somehow get perfect reproduction for each move played. If you replay every game, you have 2x as many games, but every pair of results is identical. BayesElo will report a lower error bar, but it is wrong because there is perfect correlation between pairs of games. Using more positions rather than repeating old ones eliminates this.

In this way there is an option to "break off" early, if a good or bad change is spotted, thereby saving some time.

But maybe, with a large number of start positions, you can break them up into chunks of a manageable size and then run the tests as needed?!

How to pick the positions to use in the tests?
One idea could be to take the positions where your program left book in the tournaments it has played in.

Another idea could be to randomly generate them by using your own book. So basically let your program pick a move, like it would in a real game, and then follow the line to the end of the book.
The pro is that the positions are biased towards positions your program is likely to pick in a tournament game.
The con is that the testing then inherits the blind spots from the opening book.

Thanks for the ideas! :)

Kind regards,
Jesper
Here is what I did. I took a million or so high-quality games and had Crafty read through them. At ply=24, which would always be white to move, I spit out the FEN from that game and then go on to the next. I end up with a million (or whatever number of games you have in your PGN collection) positions. I sort 'em and then use "uniq" to eliminate duplicates and add a count for the number of times each position was duplicated. I then sort on this counter field and take the first N for my test positions. These are the N most popular positions.

Works well enough. You might find a better selection algorithm, but these positions seem to be working quite well for us.
In the first 40 to 50 positions you have a quasi-duplicate. Some time ago I downloaded your positions and went through the first ones out of curiosity. The positions are a French Exchange. They have a small difference, which is either that one is black to move and the other white to move, or something small like that (or one position in which one side lost a tempo or so, I can't remember exactly). Maybe it is not terribly important considering the size you are dealing with, but I figured you might like to know. You may have more cases like this. If you like, I can try to dig it up. The positions are technically different, but you would discard them if you saw them. Since they are ordered by FEN, they are not exactly together. Now that I say this, I think one position has a pawn on g5 and the other on g6 (when black should be playing g5).

I know it is very difficult to select these things. I started to choose positions manually from ECO. That would be a high-quality selection, but it takes forever. One day I will finish :-)

Miguel
There is also exactly one STS problem in his set, this one:
[d]r2qkb1r/1b1n1ppp/p3pn2/1pp5/3PP3/2NB1N2/PP3PPP/R1BQ1RK1 w kq - acd 23; acn 2697583173; acs 676816; bm d5; ce 3; pv d5 Qc7 Bc2 Bd6 dxe6 fxe6 Ng5 Nf8 f4 O-O-O Qe2 h6 Nh3 e5 f5 c4 a4 b4 Nd1 Kb8 Ne3 Bc5 Nf2 Rd4 Rd1 Rxd1+ Nfxd1 N8d7 Qxc4 Rc8 Qe2; id "Undermine.093";
Dann Corbit
Posts: 12564
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: different kinds of testing

Post by Dann Corbit »

This position is an easy win for white:
[d]2kr1b1r/pp3ppp/n2Pqn2/1Np5/3Pp1b1/2N1Q3/PPP2PPP/R1B1KB1R w KQ -

Are positions like that intended to be in the test set?
It seems that the majority of the positions are pretty nearly balanced.

Code: Select all

89)                     
    Avoid move: 
    Best move (Rybka 3): d4xc5
    Not found in: 00:53
      2	00:00	         288	294.912	+1.98	d4xc5
      3	00:00	         496	507.904	+1.87	d4xc5
      4	00:00	         891	57.024	+1.82	d4xc5
      5	00:00	       1.545	98.880	+1.89	d4xc5 Na6b4
      6	00:00	       6.204	135.168	+1.59	d4xc5 Na6b4 Qe3d2 e4e3
      7	00:00	      10.379	168.699	+1.72	d4xc5 Na6b4 Nb5d4 Qe6e5 a2a3
      8	00:00	      19.993	186.116	+1.73	d4xc5 Na6b4 Nb5d4 Qe6e5 a2a3 Nb4d5 Nc3xd5
      9+	00:01	      36.459	170.474	+1.93	d4xc5
      9	00:01	      45.060	184.565	+1.92	d4xc5 Na6b4 Nb5d4 Qe6e5 a2a3 Nb4d5 Nc3xd5 Qe5xd5 b2b4 g7g6 c2c4
     10+	00:01	      88.681	153.135	+2.12	d4xc5
     10	00:01	     142.518	166.977	+1.79	d4xc5 Na6b4 Nb5d4 Qe6e5 a2a3 Nb4d5 Nc3xd5 Nf6xd5 Qe3d2 Bg4d7 Nd4b3 e4e3
     11+	00:02	     278.310	170.652	+1.99	d4xc5
     11+	00:04	     649.045	179.724	+2.19	d4xc5
     11	00:04	     726.839	180.695	+2.18	d4xc5 Na6b4 Nb5d4 Qe6e5 a2a3 Nb4d5 f2f4 Qe5h5 Nc3xd5 Nf6xd5 Qe3b3 e4e3 c5c6
     12	00:05	     859.653	181.427	+2.18	d4xc5 Na6b4 Nb5d4 Qe6e5 a2a3 Nb4d5 f2f4 Qe5h5 Nc3xd5 Nf6xd5 Qe3b3 e4e3 c5c6
     13+	00:13	   2.200.603	171.950	+2.38	d4xc5
     13+	00:20	   3.168.556	158.280	+2.58	d4xc5
     13	00:29	   4.513.016	156.734	+2.71	d4xc5 Na6b4 Nb5d4 Qe6e5 a2a3 Nb4d5 f2f4 Qe5h5 Nc3xd5 Qh5xd5 b2b4 Bg4d7 h2h3 Qd5h5 Bf1e2 Qh5h4+ Qe3f2 Qh4xf2+ Ke1xf2 g7g6 Be2c4 Bf8g7 c2c3 Rh8f8
     14	00:32	   5.007.718	158.337	+2.71	d4xc5 Na6b4 Nb5d4 Qe6e5 a2a3 Nb4d5 f2f4 Qe5h5 Nc3xd5 Qh5xd5 b2b4 Bg4d7 h2h3 Qd5h5 Bf1e2 Qh5h4+ Qe3f2 Qh4xf2+ Ke1xf2 g7g6 Be2c4 Bf8g7 c2c3 Rh8f8
     15	00:38	   6.059.560	161.097	+2.71	d4xc5 Na6b4 Nb5d4 Qe6e5 a2a3 Nb4d5 f2f4 Qe5h5 Nc3xd5 Qh5xd5 b2b4 Bg4d7 h2h3 Qd5h5 Bf1e2 Qh5h4+ Qe3f2 Qh4xf2+ Ke1xf2 g7g6 Be2c4 Bf8g7 c2c3 Rh8f8
   11/14/2009 1:05:52 AM, Time for this analysis: 00:00:53, Rated time: 1:18:37
User avatar
Eelco de Groot
Posts: 4576
Joined: Sun Mar 12, 2006 2:40 am

Re: different kinds of testing

Post by Eelco de Groot »

Are positions like that intended to be in the test set?
It seems that the majority of the positions are pretty nearly balanced.
I actually think there is no need at all to look for balanced positions only. It conflicts with the need for having a random set, without, as much as possible, any bias toward a certain type of position, unless you want to train a program to use its strong points better.

"Training" is not the same as testing, I presume, and in training you should not ignore the weak points either, so I assume you want no bias.

You just have to make sure you introduce no bias, and the best way I can think of to do that is not having a constant set, but periodically picking a new set.

Think of it as programming a semi-random number generator. It is really just the same: it is very easy to introduce bias, and if you are going to test this way, you have to do your utmost as a tester to avoid it.

I think testers do the same in that they do not always test with the same book. The test set has to be a reflection of the types of positions the program will encounter when actually playing tournaments. Unbalanced positions are okay if the program will also encounter unbalanced positions in practice, which you may hope it does, or it will only produce draws in every competition ever :)
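In that spirit, a minimal sketch (file name and numbers are illustrative) of drawing a fresh but reproducible subset from a large position pool for each test run, instead of keeping one constant set:

Code: Select all

import random

def pick_test_set(pool_file, size=1000, run_id=0):
    with open(pool_file) as f:
        pool = [line.strip() for line in f if line.strip()]
    rng = random.Random(run_id)      # new run_id -> new, reproducible subset
    return rng.sample(pool, min(size, len(pool)))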

Eelco
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: different kinds of testing

Post by Don »

I think your comments are insightful. My own test book is limited to a little less than 4000 positions, but I have arranged things so that I test a different subset of them each time. In order to get thousands of games I have to have more than just 2 or 3 players, but I think I'm pretty much getting the variety I want.

I once considered the possibility of throwing in a few random positions, generated with random (but legal) moves from the opening position. It sounds insane, and it probably is. The idea is that it might make the evaluation function more robust. Or you could start with Fischer Random openings and some descendant positions from each. In principle, a good evaluation function should play well no matter what is thrown at it. In practice, I don't think it works that way. Like it or not, I'll bet every chess program is tuned to play "normal" positions well. And it would probably be difficult to have a strong chess program if you didn't do that.
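A sketch of that "random but legal" idea, assuming python-chess; no attempt is made to keep the resulting position sane, which is exactly the doubt raised above.

Code: Select all

import random
import chess

def random_opening(plies=8, rng=None):
    rng = rng or random.Random()
    board = chess.Board()
    for _ in range(plies):
        moves = list(board.legal_moves)
        if not moves:                # mate or stalemate this early is rare
            break
        board.push(rng.choice(moves))
    return board.fen()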

- Don


Eelco de Groot wrote:
Are positions like that intended to be in the test set?
It seems that the majority of the positions are pretty nearly balanced.
I actually think there is no need at all to look for balanced positions only. It conflicts with the need for having a random set, without, as much as possible, any bias toward a certain type of position, unless you want to train a program to use its strong points better.

"Training" is not the same as testing, I presume, and in training you should not ignore the weak points either, so I assume you want no bias.

You just have to make sure you introduce no bias, and the best way I can think of to do that is not having a constant set, but periodically picking a new set.

Think of it as programming a semi-random number generator. It is really just the same: it is very easy to introduce bias, and if you are going to test this way, you have to do your utmost as a tester to avoid it.

I think testers do the same in that they do not always test with the same book. The test set has to be a reflection of the types of positions the program will encounter when actually playing tournaments. Unbalanced positions are okay if the program will also encounter unbalanced positions in practice, which you may hope it does, or it will only produce draws in every competition ever :)

Eelco