Using test suites -- when to say the position is solved?

JVMerlino
Posts: 1404
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Using test suites -- when to say the position is solved?

Post by JVMerlino »

When testing my latest version on test suites, I use GradualTest, which automates the process nicely. However, you can only tell the program to have the engine think for a specified number of seconds. If the engine doesn't have the correct move at the moment when time runs out, it counts as an unsolved position.

But the engine might have found the solution earlier and then discarded it before time ran out.

Looking at the site for one of these tests (the Eigenmann Endgame Test, found at http://glareanverlag.wordpress.com/2007 ... -endspiele), I see that many engines were run on that test, and the total time required to find the solution is also considered as a factor. But this means that the engine stopped looking at the position as soon as the solution was found, right?

I can see both sides of the argument:
1) Fastest to the solution is an important factor. But sometimes the engine might just stumble onto the correct move for a split second, especially at shallow depths, and that shouldn't count.
2) Not holding onto the solution indefinitely should be considered a failure. But time controls are also a factor. If the engine would have played the move in a shorter amount of time than was allotted, that should count for something, right?

So is the "solution" to use GradualTest and also implement an EPD parser within my engine, so I can get both kinds of results?

Or is one of these far more important than the other?
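
To make concrete what I mean by "both kinds of results", here's a rough sketch of the in-engine bookkeeping I have in mind (Python-style pseudocode, all names invented; bm/am/id are the standard EPD opcodes, and the per-iteration reports are whatever the search already prints):

def parse_epd(line):
    # The first four fields are the board description; the rest are EPD opcodes
    # such as bm (best move, possibly several moves), am (avoid move), and id.
    fields = line.split()
    fen = " ".join(fields[:4])
    ops = {}
    for op in " ".join(fields[4:]).split(";"):
        op = op.strip()
        if op:
            name, _, value = op.partition(" ")
            ops[name] = value.strip().strip('"')
    return fen, ops

def score_position(iteration_log, best_moves):
    # iteration_log: list of (elapsed_seconds, bestmove) pairs, one per completed iteration.
    # best_moves: the solution set taken from the bm opcode.
    first_seen = None      # first time any correct move showed up at all
    stable_since = None    # start of the final stretch where the engine stayed correct
    for elapsed, move in iteration_log:
        if move in best_moves:
            if first_seen is None:
                first_seen = elapsed
            if stable_since is None:
                stable_since = elapsed
        else:
            stable_since = None            # it switched away, so the stretch is broken
    held_at_end = stable_since is not None
    return first_seen, stable_since, held_at_end

The GradualTest-style pass/fail is then just held_at_end, and stable_since is the "it would have played the move with less time" number.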

Thoughts are appreciated....

jm
plattyaj

Re: Using test suites -- when to say the position is solved?

Post by plattyaj »

I like the way Arena does it - you can tell it to consider a move found if it's held for x consecutive plies (and you can tell it to start counting only from ply y). That's a good way to keep solid solutions from eating up the full time.
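
In rough Python terms (my names, not Arena's actual implementation - just a sketch of the rule), it amounts to something like this:

def solved_by_hold_rule(iteration_bestmoves, solutions, hold_plies=4, start_ply=1):
    # iteration_bestmoves: the best move reported at each completed iteration (ply 1, 2, ...).
    # Returns the 1-based iteration at which the hold criterion is first met, else None.
    run = 0
    for ply, move in enumerate(iteration_bestmoves, start=1):
        if ply < start_ply:
            continue                      # ignore the shallowest iterations entirely
        run = run + 1 if move in solutions else 0
        if run >= hold_plies:
            return ply
    return None

With hold_plies=4 and a sensible start_ply, that's the setting I described above.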

I did find one situation that was amusing ... I had set it to consider a move found after 4 consecutive plies. Right when Arena decided the criterion had been met, Schola changed its mind, sent the move, and Arena used that to declare the position failed ... even though there was plenty of time left for the analysis to continue. At first I considered it a bug in Arena, but I've decided it's just the luck of the draw!

Andy.
Richard Allbert
Posts: 794
Joined: Wed Jul 19, 2006 9:58 am

Re: Using test suites -- when to say the position is solved?

Post by Richard Allbert »

There's also an option in Arena to use the whole allocated time, regardless of the search result.

IIRC, the option isn't in the EPD dialog, but in the general settings window.

Richard
Glarean
Posts: 262
Joined: Sun Oct 05, 2008 1:04 pm
Location: Switzerland
Full name: Walter Eigenmann

Re: Using test suites -- when to say the position is solved?

Post by Glarean »

JVMerlino wrote: Looking at the site for one of these tests (the Eigenmann Endgame Test, found at http://glareanverlag.wordpress.com/2007 ... -endspiele), I see that many engines were run on that test, and the total time required to find the solution is also considered as a factor. But this means that the engine stopped looking at the position as soon as the solution was found, right?
No, each engine had exactly 60 seconds.
Found or not found, that was the question...

Regards: Walter
JVMerlino
Posts: 1404
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Re: Using test suites -- when to say the position is solved?

Post by JVMerlino »

So was the total time displayed for each engine calculated manually? If not, what tool was used to get those statistics? Arena?

Walter, your suite is brutal on my engine. :D Myrddin only gets 10 out of 100 at 60 seconds (on a P4-3.0GHz). At least your site shows that this result is not too far off from other engines of similar strength.

jm
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Using test suites -- when to say the position is solved?

Post by sje »

Note that if a sufficiently fast forced mate (or loss) is found, the program should exit the test early for that position.

I test a set of positions iteratively (starting at one second per position) with each test allowing twice the time of the prior test run. Problems solved in one run are not tried in succeeding runs.
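
In outline it looks roughly like this (a Python sketch; search(fen, seconds) stands in for the actual engine call, and MATE_SCORE for whatever mate-score convention the engine uses):

MATE_SCORE = 30000    # assumed convention for forced-mate scores, not any particular engine's

def run_suite(positions, search, max_seconds=256):
    # positions: dict of id -> (fen, solution_set); search(fen, seconds) -> (move, score).
    results, pending = {}, dict(positions)
    seconds = 1
    while pending and seconds <= max_seconds:
        for pid, (fen, solutions) in list(pending.items()):
            move, score = search(fen, seconds)
            if move in solutions:
                results[pid] = ("solved", seconds)
                del pending[pid]          # solved positions are skipped in later passes
            elif abs(score) >= MATE_SCORE - 100:
                results[pid] = ("forced mate/loss found", seconds)
                del pending[pid]          # a proven mate will not change with more time
        seconds *= 2                      # each pass doubles the time of the one before
    for pid in pending:
        results[pid] = ("unsolved", max_seconds)
    return results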
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Using test suites -- when to say the position is solved?

Post by bob »

sje wrote:Note that if a sufficiently fast forced mate (or loss) is found, the program should exit the test early for that position.

I test a set of positions iteratively (starting at one second per position) with each test allowing twice the time of the prior test run. Problems solved in one run are not tried in succeeding runs.
This can lead to misleading results. There are plenty of positions where a program will like the correct move at shallow depths, then change to an inferior move at deeper depths.

I have a mechanism in Crafty that lets me tell it to run each position in a test suite and, if it holds the correct move for N successive iterations, terminate that position and move on. For serious testing I always set N to 99 to prevent an early exit, which could hide a case where it later changes to a wrong move and I would not notice.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Using test suites -- when to say the position is solved?

Post by sje »

bob wrote:
sje wrote:Note that if a sufficiently fast forced mate (or loss) is found, the program should exit the test early for that position.

I test a set of positions iteratively (starting at one second per position) with each test allowing twice the time of the prior test run. Problems solved in one run are not tried in succeeding runs.
This can lead to misleading results. There are plenty of positions where a program will like the correct move at shallow depths, then change to an inferior move at deeper depths.
While this can happen, it doesn't happen all that often depending on the program and the position. And sometimes when the early correct response is overridden by a later response, that later response can itself be overridden by a re-emergence of the earlier, correct response.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Using test suites -- when to say the position is solved?

Post by bob »

sje wrote:
bob wrote:
sje wrote:Note that if a sufficiently fast forced mate (or loss) is found, the program should exit the test early for that position.

I test a set of positions iteratively (starting at one second per position) with each test allowing twice the time of the prior test run. Problems solved in one run are not tried in succeeding runs.
This can lead to misleading results. There are plenty of positions where a program will like the correct move at shallow depths, then change to an inferior move at deeper depths.
While this can happen, it doesn't happen all that often depending on the program and the position. And sometimes when the early correct response is overridden by a later response, that later response can itself be overridden by a re-emergence of the earlier, correct response.
Yes, but it still makes the "first solution" wrong because it was made for the wrong reason. The last time I tried Crafty on WAC, for example, and told it to quit after one iteration with the correct move, it got 299 correct in almost zero time. If I keep stepping the time up, it drops to 297, then back to 298, and eventually back to 299. Once I reach 10 seconds or so it also finds #230 at various plies...

I don't want the "lucky finds"...
Dann Corbit
Posts: 12792
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Using test suites -- when to say the position is solved?

Post by Dann Corbit »

bob wrote:
sje wrote:
bob wrote:
sje wrote:Note that if a sufficiently fast forced mate (or loss) is found, the program should exit the test early for that position.

I test a set of positions iteratively (starting at one second per position) with each test allowing twice the time of the prior test run. Problems solved in one run are not tried in succeeding runs.
This can lead to misleading results. There are plenty of positions where a program will like the correct move at shallow depths, then change to an inferior move at deeper depths.
While this can happen, it doesn't happen all that often depending on the program and the position. And sometimes when the early correct response is overridden by a later response, that later response can itself be overridden by a re-emergence of the earlier, correct response.
Yes, but it still makes the "first solution" wrong because it was made for the wrong reason. The last time I tried Crafty on WAC, for example, and told it to quit after one iteration with the correct move, it got 299 correct in almost zero time. If I keep stepping the time up, it drops to 297, then back to 298, and eventually back to 299. Once I reach 10 seconds or so it also finds #230 at various plies...

I don't want the "lucky finds"...
I see at least three kinds of solutions for EPD test suites that are independent of the actual engines used.
1. Absolute solutions.
An absolute solution is one where the shortest possible mate has been proven beyond any shadow of a doubt. This is the only kind of solution whose best move is beyond argument.
2. Timed solutions. These solutions are far more nebulous than those above.
A timed solution is the move reported after a given think time. So, if the think time is 10 seconds, your program suggests Nxe3, and the best move is Nxe3, then the position should be scored as correct, even if the ce score is bogus junk. The engine author may still have an action item here to improve the score, and it may be an accidental solution as well.
3. Timed solutions with early escape for iterative agreement (e.g. you get 30 seconds, but if the engine's choice agrees with the solution for seven consecutive plies, you can stop searching and go on to the next problem). These are actually even more tenuous than type 2, because on the 8th iteration the engine might have chosen a different move.

Other possibilities include:
4. Ply depth searches (e.g. search for 9 plies)
5. Node count solutions (e.g. search for 100 million nodes)

These two solution types (4 and 5) are totally engine-specific, in that there is no sensible way to compare them with the same sorts of solutions for other engines. They are probably useful to the engine author for some purposes, or to someone who is curious about a particular engine (e.g. if I tweak parameter dangerouspassedpawnvalue to twice its current value and then run pawntest.epd for 100 million nodes, will it solve it faster or better?). But node counts and depths would be a bad way to make comparisons between different engines.
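
One way to keep all of these scoring rules available is to log enough per position that any of them can be applied after the fact. A sketch (the names are mine and purely illustrative):

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Iteration:
    depth: int        # ply depth of the completed iteration
    nodes: int        # cumulative node count so far
    seconds: float    # elapsed wall-clock time
    bestmove: str     # the move the engine would play at this point
    score_cp: int     # ce-style centipawn score (possibly bogus, as noted above)

@dataclass
class PositionLog:
    epd_id: str
    solutions: Set[str]                          # taken from the bm opcode
    iterations: List[Iteration] = field(default_factory=list)

    def timed_solution(self, seconds):
        # Type 2: is the best move correct at the last iteration finished within the limit?
        done = [it for it in self.iterations if it.seconds <= seconds]
        return bool(done) and done[-1].bestmove in self.solutions

    def depth_solution(self, depth):
        # Type 4: is the best move correct at a fixed ply depth (engine-specific)?
        done = [it for it in self.iterations if it.depth <= depth]
        return bool(done) and done[-1].bestmove in self.solutions

Type 3 is the hold-for-N-iterations rule sketched earlier in the thread, type 5 is the same filter keyed on nodes instead of depth, and type 1 is settled outside the engine entirely.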

I am sure that there are lots of other methods that can be used.