Using test suites -- when to say the position is solved?

JVMerlino
Posts: 1404
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Using test suites -- when to say the position is solved?

Post by JVMerlino »

When testing my latest version on test suites, I use GradualTest, which automates the process nicely. However, you can only tell the program to have the engine think for a specified number of seconds. If the engine doesn't have the correct move at the moment when time runs out, it counts as an unsolved position.

But the engine might have found the solution earlier and then discarded it before time ran out.

Looking at the site for one of these tests (the Eigenmann Endgame Test, found at http://glareanverlag.wordpress.com/2007 ... -endspiele), I see that many engines were run on that test, and the total time required to find the solution is also considered as a factor. But this means that the engine stopped looking at the position as soon as the solution was found, right?

I can see both sides of the argument:
1) Fastest to the solution is an important factor. But sometimes the engine might just stumble onto the correct move for a split second, especially at shallow depths, and that shouldn't count.
2) Not holding onto the solution indefinitely should be considered a failure. But time controls are also a factor. If the engine would have played the move in a shorter amount of time than was allotted, that should count for something, right?

So is the "solution" to use GradualTest and also implement an EPD parser within my engine, so I can get both kinds of results?

Or is one of these far more important than the other?
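
To make concrete what I mean by "both kinds of results", here's a rough sketch of the in-engine bookkeeping I have in mind (Python-style pseudocode, all names invented; bm/am/id are the standard EPD opcodes, and the per-iteration reports are whatever the search already prints):

def parse_epd(line):
    # The first four fields are the board description; the rest are EPD opcodes
    # such as bm (best move, possibly several moves), am (avoid move), and id.
    fields = line.split()
    fen = " ".join(fields[:4])
    ops = {}
    for op in " ".join(fields[4:]).split(";"):
        op = op.strip()
        if op:
            name, _, value = op.partition(" ")
            ops[name] = value.strip().strip('"')
    return fen, ops

def score_position(iteration_log, best_moves):
    # iteration_log: list of (elapsed_seconds, bestmove) pairs, one per completed iteration.
    # best_moves: the solution set taken from the bm opcode.
    first_seen = None      # first time any correct move showed up at all
    stable_since = None    # start of the final stretch where the engine stayed correct
    for elapsed, move in iteration_log:
        if move in best_moves:
            if first_seen is None:
                first_seen = elapsed
            if stable_since is None:
                stable_since = elapsed
        else:
            stable_since = None            # it switched away, so the stretch is broken
    held_at_end = stable_since is not None
    return first_seen, stable_since, held_at_end

The GradualTest-style pass/fail is then just held_at_end, and stable_since is the "it would have played the move with less time" number.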

Thoughts are appreciated....

jm
plattyaj

Re: Using test suites -- when to say the position is solved?

Post by plattyaj »

I like the way Arena does it - you can tell it to consider a move found if it's held for x consecutive plies (and you can tell it to start counting only from ply y). That's a good way to keep solid solutions from eating up the full time.
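
In rough Python terms (my names, not Arena's actual implementation - just a sketch of the rule), it amounts to something like this:

def solved_by_hold_rule(iteration_bestmoves, solutions, hold_plies=4, start_ply=1):
    # iteration_bestmoves: the best move reported at each completed iteration (ply 1, 2, ...).
    # Returns the 1-based iteration at which the hold criterion is first met, else None.
    run = 0
    for ply, move in enumerate(iteration_bestmoves, start=1):
        if ply < start_ply:
            continue                      # ignore the shallowest iterations entirely
        run = run + 1 if move in solutions else 0
        if run >= hold_plies:
            return ply
    return None

With hold_plies=4 and a sensible start_ply, that's the setting I described above.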

I did find one situation that was amusing ... I had set it to consider a move found after 4 consecutive plies. Right when Arena decided the criterion had been met, Schola changed its mind, sent the move, and Arena used that to declare the position failed ... even though there was plenty of time left for the analysis to continue. At first I considered it a bug in Arena, but I've decided it's just the luck of the draw!

Andy.
Richard Allbert
Posts: 794
Joined: Wed Jul 19, 2006 9:58 am

Re: Using test suites -- when to say the position is solved?

Post by Richard Allbert »

There's also an option in Arena to use the whole allocated time, regardless of the search result.

IIRC, the option isn't in the EPD dialog, but in the general settings window.

Richard
Glarean
Posts: 262
Joined: Sun Oct 05, 2008 1:04 pm
Location: Switzerland
Full name: Walter Eigenmann

Re: Using test suites -- when to say the position is solved?

Post by Glarean »

JVMerlino wrote: Looking at the site for one of these tests (the Eigenmann Endgame Test, found at http://glareanverlag.wordpress.com/2007 ... -endspiele), I see that many engines were run on that test, and the total time required to find the solution is also considered as a factor. But this means that the engine stopped looking at the position as soon as the solution was found, right?
No, each engine had exactly 60 seconds.
Found or not found, that was the question...

Regards: Walter
JVMerlino
Posts: 1404
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Re: Using test suites -- when to say the position is solved?

Post by JVMerlino »

So was the total time displayed for each engine calculated manually? If not, what tool was used to get those statistics? Arena?

Walter, your suite is brutal on my engine. :D Myrddin only gets 10 out of 100 at 60 seconds (on a P4-3.0GHz). At least your site shows that this result is not too far off from other engines of similar strength.

jm
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Using test suites -- when to say the position is solved?

Post by sje »

Note that if a sufficiently fast forced mate (or loss) is found, the program should exit the test early for that position.

I test a set of positions iteratively (starting at one second per position) with each test allowing twice the time of the prior test run. Problems solved in one run are not tried in succeeding runs.
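
In outline it looks roughly like this (a Python sketch; search(fen, seconds) stands in for the actual engine call, and MATE_SCORE for whatever mate-score convention the engine uses):

MATE_SCORE = 30000    # assumed convention for forced-mate scores, not any particular engine's

def run_suite(positions, search, max_seconds=256):
    # positions: dict of id -> (fen, solution_set); search(fen, seconds) -> (move, score).
    results, pending = {}, dict(positions)
    seconds = 1
    while pending and seconds <= max_seconds:
        for pid, (fen, solutions) in list(pending.items()):
            move, score = search(fen, seconds)
            if move in solutions:
                results[pid] = ("solved", seconds)
                del pending[pid]          # solved positions are skipped in later passes
            elif abs(score) >= MATE_SCORE - 100:
                results[pid] = ("forced mate/loss found", seconds)
                del pending[pid]          # a proven mate will not change with more time
        seconds *= 2                      # each pass doubles the time of the one before
    for pid in pending:
        results[pid] = ("unsolved", max_seconds)
    return results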
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Using test suites -- when to say the position is solved?

Post by bob »

sje wrote:Note that if a sufficiently fast forced mate (or loss) is found, the program should exit the test early for that position.

I test a set of positions iteratively (starting at one second per position) with each test allowing twice the time of the prior test run. Problems solved in one run are not tried in succeeding runs.
This can lead to misleading results. There are plenty of positions where a program will like the correct move at shallow depths, then change to an inferior move at deeper depths.

I have a mechanism in Crafty that lets me tell it to run each position in a test suite and, if it holds the correct move for N successive iterations, terminate that position and move on. For serious testing I always set N to 99 to prevent an early exit, which could hide a case where it later changes to a wrong move and I would not notice.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Using test suites -- when to say the position is solved?

Post by sje »

bob wrote:
sje wrote:Note that if a sufficiently fast forced mate (or loss) is found, the program should exit the test early for that position.

I test a set of positions iteratively (starting at one second per position) with each test allowing twice the time of the prior test run. Problems solved in one run are not tried in succeeding runs.
This can lead to misleading results. There are plenty of positions where a program will like the correct move at shallow depths, then change to an inferior move at deeper depths.
While this can happen, it doesn't happen all that often depending on the program and the position. And sometimes when the early correct response is overridden by a later response, that later response can itself be overridden by a re-emergence of the earlier, correct response.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Using test suites -- when to say the position is solved?

Post by bob »

sje wrote:
bob wrote:
sje wrote:Note that if a sufficiently fast forced mate (or loss) is found, the program should exit the test early for that position.

I test a set of positions iteratively (starting at one second per position) with each test allowing twice the time of the prior test run. Problems solved in one run are not tried in succeeding runs.
This can lead to misleading results. There are plenty of positions where a program will like the correct move at shallow depths, then change to an inferior move at deeper depths.
While this can happen, it doesn't happen all that often depending on the program and the position. And sometimes when the early correct response is overridden by a later response, that later response can itself be overridden by a re-emergence of the earlier, correct response.
Yes, but it still makes the "first solution" wrong because it was made for the wrong reason. The last time I tried Crafty on WAC, for example, and told it to quit after one iteration with the correct move, it got 299 correct in almost zero time. If I keep stepping the time up, it drops to 297, then back to 298, and eventually back to 299. Once I reach 10 seconds or so it also finds #230 at various plies...

I don't want the "lucky finds"...
Dann Corbit
Posts: 12792
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Using test suites -- when to say the position is solved?

Post by Dann Corbit »

bob wrote:
sje wrote:
bob wrote:
sje wrote:Note that if a sufficiently fast forced mate (or loss) is found, the program should exit the test early for that position.

I test a set of positions iteratively (starting at one second per position) with each test allowing twice the time of the prior test run. Problems solved in one run are not tried in succeeding runs.
This can lead to misleading results. There are plenty of positions where a program will like the correct move at shallow depths, then change to an inferior move at deeper depths.
While this can happen, it doesn't happen all that often depending on the program and the position. And sometimes when the early correct response is overridden by a later response, that later response can itself be overridden by a re-emergence of the earlier, correct response.
Yes, but it still makes the "first solution" wrong because it was made for the wrong reason. The last time I tried Crafty on WAC, for example, and told it to quit after one iteration with the correct move, it got 299 correct in almost zero time. If I keep stepping the time up, it drops to 297, then back to 298, and eventually back to 299. Once I reach 10 seconds or so it also finds #230 at various plies...

I don't want the "lucky finds"...
I see at least three kinds of solutions for EPD test suites that are independent of the actual engines used.
1. Absolute solutions.
An absolute solution is one where the shortest possible mate has been proven beyond any shadow of a doubt. This is the only kind of solution whose best move is beyond argument.
2. Timed solutions. These solutions are far more nebulous than those above.
A timed solution is the move reported after a given think time. So, if the think time is 10 seconds, your program suggests Nxe3, and the best move is Nxe3, then the position should be scored as correct, even if the ce score is bogus junk. The engine author may still have an action item here to improve the score, and it may be an accidental solution as well.
3. Timed solutions with early escape for iterative agreement (e.g. you get 30 seconds, but if the engine's choice agrees with the solution for seven consecutive plies, you can stop searching and go on to the next problem). These are actually even more tenuous than type 2, because on the 8th iteration the engine might have chosen a different move.

Other possibilities include:
4. Ply depth searches (e.g. search for 9 plies)
5. Node count solutions (e.g. search for 100 million nodes)

These two solution types (4 and 5) are totally engine-specific, in that there is no sensible way to compare them with the same sorts of solutions for other engines. They are probably useful to the engine author for some purposes, or to someone who is curious about a particular engine (e.g. if I tweak parameter dangerouspassedpawnvalue to twice its current value and then run pawntest.epd for 100 million nodes, will it solve it faster or better?). But node counts and depths would be a bad way to make comparisons between different engines.
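
One way to keep all of these scoring rules available is to log enough per position that any of them can be applied after the fact. A sketch (the names are mine and purely illustrative):

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Iteration:
    depth: int        # ply depth of the completed iteration
    nodes: int        # cumulative node count so far
    seconds: float    # elapsed wall-clock time
    bestmove: str     # the move the engine would play at this point
    score_cp: int     # ce-style centipawn score (possibly bogus, as noted above)

@dataclass
class PositionLog:
    epd_id: str
    solutions: Set[str]                          # taken from the bm opcode
    iterations: List[Iteration] = field(default_factory=list)

    def timed_solution(self, seconds):
        # Type 2: is the best move correct at the last iteration finished within the limit?
        done = [it for it in self.iterations if it.seconds <= seconds]
        return bool(done) and done[-1].bestmove in self.solutions

    def depth_solution(self, depth):
        # Type 4: is the best move correct at a fixed ply depth (engine-specific)?
        done = [it for it in self.iterations if it.depth <= depth]
        return bool(done) and done[-1].bestmove in self.solutions

Type 3 is the hold-for-N-iterations rule sketched earlier in the thread, type 5 is the same filter keyed on nodes instead of depth, and type 1 is settled outside the engine entirely.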

I am sure that there are lots of other methods that can be used.