michiguel wrote:
Yes, I agree. The game will vary and you may lose both times in a different way if you do not understand the position. Seriously, you are right for most positions, where the chaos introduced is bigger than the consistent problems. BUT there are positions in which the chaos introduced is not enough to overcome other factors. Those positions may be few, but that does not mean they do not exist. For instance, I had a position (Sicilian Sveshnikov) in which my engine insisted on sacrificing a bishop on b5. It did not understand it and got hammered with both white and black. It rarely got a draw against engines of similar strength until I tuned a parameter in eval. The score improved in both the white and the black games, so the results of both games were strongly correlated and, statistically, they were worth only one game.
That's a good example and it well illustrates your point.
I think we can agree that, due to this phenomenon, not all positions have the same statistical relevance.
This gives me an idea. Why not keep statistics on which specific starting positions are producing results that are too consistent? For instance, white always wins, or it's always a draw. Then look at the games and see whether there is something to be improved in the program (or whether the book line ends in a won position for someone, or in a perpetual).
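Keeping such per-position statistics is easy to sketch. Here is a minimal illustration in Python (the function name, the result encoding, and the thresholds are all hypothetical choices, not from any existing tool):

```python
from collections import defaultdict

def flag_consistent_positions(results, min_games=8, threshold=0.9):
    """Flag starting positions whose outcomes are suspiciously one-sided.

    results: iterable of (position_id, outcome) pairs, where outcome is
    '1-0', '0-1', or '1/2-1/2' from White's point of view.
    Returns the position ids where a single outcome dominates.
    """
    tallies = defaultdict(lambda: {'1-0': 0, '0-1': 0, '1/2-1/2': 0})
    for pos, outcome in results:
        tallies[pos][outcome] += 1
    flagged = []
    for pos, counts in tallies.items():
        total = sum(counts.values())
        # Only flag positions with enough games to be meaningful.
        if total >= min_games and max(counts.values()) / total >= threshold:
            flagged.append(pos)
    return flagged
```

Positions that come back flagged would then be the ones worth inspecting by hand, or against many sparring engines as suggested below.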
Exactly my point, but you need many other sparring engines to make it significant without "manual inspection".
Miguel
If a position is always a win or always a draw, it doesn't add any information, so the position should be discarded from the test. I think a more interesting statistic would be whether a particular engine does relatively worse from both sides of a given position. This would indicate that the program has a hole in its evaluation in that position or in succeeding positions.
This is actually one thing I filter out in the results we produce. If we lose both sides, then there is apparently a key piece of evaluation that we are missing so that we don't know how to exploit that feature from one side, nor recognize and defend against it from the other.
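Detecting the "lost from both sides" cases described above is a simple filter. A minimal sketch (the data layout is illustrative; a real harness would read this from PGN results):

```python
def lost_both_sides(games):
    """Find positions an engine lost with both colors.

    games: iterable of (position_id, color, result) where color is
    'white' or 'black' and result is 'win', 'loss', or 'draw'
    from the engine's point of view.
    """
    losses = {}
    for pos, color, result in games:
        if result == 'loss':
            losses.setdefault(pos, set()).add(color)
    # A position lost with both colors hints at a missing piece
    # of evaluation: the engine neither exploits nor defends it.
    return [pos for pos, colors in losses.items()
            if colors == {'white', 'black'}]
```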
Splitting on a position (white always wins) does not necessarily mean the position needs to be eliminated. You need to test this against many opponents. And even then, equal positions are really useful, as a bad change can swing them quickly.
jesper_nielsen wrote: Speaking as a very resource-limited (un-)happy amateur, the question of testing for me becomes "how do I get the most bang for my buck?"
I have recently started using the "cutechess-cli" tool for running very fast games. Great tool by the way!
I am currently running tests with 40 Noomen positions against 8 different opponents at a 5+0.4 time control, giving 40*2*8 = 640 games. This is clearly not enough games to draw any conclusion except for very big Elo jumps.
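To get a feel for why 640 games only resolve big jumps, here is a rough back-of-the-envelope estimate of the 95% error bar in Elo for a match scoring near 50%. The draw ratio and the normal approximation are assumptions, and the formula assumes independent games, so correlated pairs make the true error even larger:

```python
import math

def elo_error_bar(n_games, draw_ratio=0.35, confidence_z=1.96):
    """Approximate 95% error bar (in Elo) for a match score near 50%.

    Uses the score variance of outcomes {1, 0.5, 0} at a 50% mean,
    which works out to (1 - draw_ratio) / 4 per game, and the slope
    of the logistic Elo curve at 50%: 400 / (ln(10) * 0.25).
    """
    var = (1.0 - draw_ratio) / 4.0
    se_score = math.sqrt(var / n_games)
    slope = 400.0 / (math.log(10) * 0.25)
    return confidence_z * slope * se_score
```

With the assumed 35% draw ratio this gives roughly a +/- 20 Elo bar for 640 games, so only changes well outside that range are visible. Quadrupling the games only halves the bar.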
There are (at least!) three ways to increase the number of games played.
1. Add more repetitions of the test runs.
Do _NOT_ do this. It is fraught with problems. Use more positions or more opponents, but do not play more games with the same positions.
2. Add more opponents.
3. Add more starting positions.
Which one of these options is "better"?
Or are they equal, meaning that the real value is only in the higher number of games?
More positions is easier. Finding more opponents can be problematic, in that some programs do _very_ poorly at fast time controls, which even I use on the cluster to get results faster. So fast games limit the pool of potential testing opponents. But clearly more is better, and I am continually working on this issue. One important point is that you really want to test against stronger opponents for the most part, not weaker ones. That way you can recognize gains quicker than if you are way ahead of your opponents. That is another limiting factor in choosing opponents.
That to me is an interesting question.
Kind regards,
Jesper
P.S.
I would hate to see anyone withdraw their contribution to this forum.
The diverse inputs are, I believe, one of the strengths of this place.
Even if the going gets a bit rough from time to time.
Ok! Thanks!
The reason option 1 looks tasty to me is that it gives the option of iteratively adding more precision.
So you can run the test, look at the results, and decide whether you think the change is good, bad, or uncertain. Then, if uncertain, run the test again.
You can do this with several thousand positions. Just use the first N, and if you want more accuracy, run the next N, and the next N, etc. Much safer than running the same positions over and over. For example, suppose you somehow get perfect reproduction for each move played. If you replay every game, you have 2x as many games, but every pair of results is identical. BayesElo will report a lower error bar, but it is wrong, because there is perfect correlation between the pairs of games. Using more positions rather than repeating old ones eliminates this.
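The effect described above is easy to demonstrate with a toy calculation. The result set below is made up; the point is only that duplicating perfectly reproduced games shrinks the naive error bar without adding any information:

```python
import math

def naive_error_bar(scores):
    """Standard error of the mean score, assuming independent games."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(var / n)

# Hypothetical result set: 0 = loss, 0.5 = draw, 1 = win.
original = [1, 0.5, 0, 1, 0.5, 0.5, 0, 1]
replayed = original + original  # perfect reproduction of every game

# The naive bar shrinks by roughly sqrt(2), but the duplicated games
# carry no new information, so the smaller bar is simply wrong.
```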
In this way there is an option to "break off" early, if a good or bad change is spotted, thereby saving some time.
But maybe, having a large number of start positions, you can break them up into chunks of a manageable number of positions, and then run the tests as needed?!
How to pick the positions to use in the tests?
One idea could be to take the positions where your program left book in the tournaments it has played in.
Another idea could be to randomly generate them by using your own book. So basically let your program pick a move, like it would in a real game, and then follow the line to the end of the book.
The pro is that the positions are biased towards positions your program is likely to pick in a tournament game.
The con is that the testing then inherits the blindspots from the opening book.
Thanks for the ideas!
Kind regards,
Jesper
Here is what I did. I took a million or so high-quality games and had Crafty read through them. At ply=24, which would always be white to move, I spit out the FEN from that game and then go on to the next. I end up with a million (or whatever number of games you have in your PGN collection) positions. I sort 'em and then use "uniq" to eliminate duplicates and add a count for the number of times each position was duplicated. I then sort on this counter field and take the first N for my test positions. These are the N most popular positions.
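The sort/uniq/count step above can be mirrored in a few lines of Python, for anyone without the Unix tools handy (this is just a sketch of the counting step, not of the Crafty FEN extraction):

```python
from collections import Counter

def most_popular_positions(fens, n):
    """Equivalent of `sort | uniq -c | sort -rn | head -n N` on a
    list of FEN strings: return the N positions occurring most often.
    """
    return [fen for fen, count in Counter(fens).most_common(n)]
```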
Works well enough. You might find a better selection algorithm, but these positions seem to be working quite well for us.
Ok!
I have made a variant of my book building program to gather test positions after 20 plies.
I like to have the moves from the starting position, as opposed to just the FEN, so it gathers the moves made the first time a position is seen, and then generates a PGN file with the resulting positions sorted by number of occurrences.
The problem now becomes finding a suitable PGN game collection to generate the data from! Time to look at my SCID TWIC database!