Looking for a tactical test suite

Uri Blass
Posts: 10267
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Looking for a tactical test suite

Post by Uri Blass »

Dann Corbit wrote: Fri Jan 18, 2019 8:10 pm
Look wrote: Fri Jan 18, 2019 12:05 pm
Dann Corbit wrote: Thu Jan 17, 2019 9:20 pm It is incredibly difficult to produce a tactical test suite with 1000 positions which is thoroughly debugged.
Even 100 positions is rather difficult.
For instance, what if the line is confirmed by several top engines?
I have done that. I worked with Swaminathan in verification of the STS test suite. I took the three strongest engines of the day and ran for a full hour for each position. If the engines were not in agreement, I rejected the position and Swaminathan would give me a new position that fit the criteria of the test set we were working on. However, when I started the test set the strongest engine I had was 32 bit Rybka 2.3 and I had 4 core machines. Today, ten percent of the positions in STS are not valid. The reason is simple. The engines are exponentially stronger. The machines are exponentially stronger, and vastly better and deeper searches found yet better answers with a much deeper look.
Did you test for the difference between the best move and the second-best move?

I think that even if the engines agree about the best move, you may reject the position if the difference in evaluation between the best move and the second-best move is less than 0.5 pawns.

Here is how to produce a thoroughly debugged tactical test suite, provided you do not insist that the test also be hard for Stockfish. The initial post said that the test is
Intended for second-tier engine testing - not Stockfish.

In this case you should take a PGN file (say, 1000 chess games) and have Stockfish analyze every position in it for a short fixed time with MultiPV=2 in order to find candidates for the test.

You can decide that candidates are only positions where Stockfish finds a difference of more than 0.5 pawns between the best move and the second-best move.
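As a rough illustration of this filter, the sketch below reads the UCI `info` lines an engine prints when MultiPV is set to 2 and computes the centipawn gap between the first and second lines. The function names, the 50 cp default and the constant used for mate scores are assumptions made for the example; they are not part of Stockfish or of the UCI protocol.

```python
import re

# Pull multipv rank, score and first PV move out of a UCI "info" line.
# Assumes the usual field order: multipv, then score, then pv.
INFO_RE = re.compile(
    r"\bmultipv (\d+)\b.*?\bscore (cp|mate) (-?\d+)\b.*?\bpv (\S+)"
)

# Assumption: represent "mate in N" as a very large centipawn value so that
# mate scores and ordinary scores can be compared with plain subtraction.
MATE_CP = 100_000


def parse_multipv(lines):
    """Return {rank: (first_move, score_cp)} using the last line seen per rank."""
    latest = {}
    for line in lines:
        m = INFO_RE.search(line)
        if m is None:
            continue  # e.g. "bestmove ..." or info lines without a PV
        rank, kind, value, move = int(m.group(1)), m.group(2), int(m.group(3)), m.group(4)
        if kind == "mate":
            score = MATE_CP - value if value > 0 else -MATE_CP - value
        else:
            score = value
        latest[rank] = (move, score)
    return latest


def gap_cp(lines):
    """Centipawn gap between best and second-best line, or None if unavailable."""
    pv = parse_multipv(lines)
    if 1 not in pv or 2 not in pv:
        return None  # e.g. only one legal move
    return pv[1][1] - pv[2][1]


def is_candidate(lines, threshold_cp=50):
    """True if the best move beats the second best by MORE than threshold_cp."""
    gap = gap_cp(lines)
    return gap is not None and gap > threshold_cp
```

Note that the comparison is strict: a gap of exactly 50 cp is not "more than 0.5 pawns", so such a position would not become a candidate unless the threshold is lowered.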

Now test every one of the candidate positions with weaker engines such as Critter.
Most of the candidates will of course be easy for the weaker engines, but you can keep only the candidates that are not easy for them,
and I believe that you will get more than 1000 candidates that are not easy for at least some of the weaker engines that you use.

The last step is to verify that the candidates are good, and for that you can run Stockfish for more time with MultiPV=2.
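The whole procedure can be sketched roughly as follows. This is only an outline under stated assumptions: the engines are passed in as plain Python callables (hypothetical wrappers around whatever UCI driver you use), and the time limits and the 50 cp gap are placeholder values, not recommendations.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical wrapper around the strong engine run with MultiPV=2:
# (fen, seconds) -> (best_move, gap_to_second_best_cp), or None on failure.
StrongAnalyse = Callable[[str, float], Optional[Tuple[str, int]]]

# Hypothetical wrapper around a weaker engine: (fen, seconds) -> chosen move.
WeakPick = Callable[[str, float], str]


def build_suite(
    fens: Iterable[str],
    strong: StrongAnalyse,
    weak_engines: List[WeakPick],
    quick_s: float = 0.1,    # assumption: short fixed time for the first pass
    weak_s: float = 1.0,     # assumption: time given to each weaker engine
    verify_s: float = 60.0,  # assumption: longer time for the final check
    min_gap_cp: int = 50,
) -> List[Tuple[str, str]]:
    """Return (fen, best_move) pairs that survive all three stages."""
    suite = []
    for fen in fens:
        # Stage 1: quick pass, keep only positions with a clear best move.
        quick = strong(fen, quick_s)
        if quick is None or quick[1] <= min_gap_cp:
            continue
        best = quick[0]

        # Stage 2: drop positions that every weaker engine already solves.
        # (With an empty weak_engines list, all() is True and nothing is kept.)
        if all(weak(fen, weak_s) == best for weak in weak_engines):
            continue

        # Stage 3: longer verification; the best move and the gap must hold up.
        slow = strong(fen, verify_s)
        if slow is None or slow[0] != best or slow[1] <= min_gap_cp:
            continue

        suite.append((fen, best))
    return suite
```

A position is kept here when at least one of the weaker engines misses the move; requiring that most or all of them miss it would give a harder but smaller suite.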
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Looking for a tactical test suite

Post by Dann Corbit »

Uri Blass wrote: Fri Jan 18, 2019 8:54 pm
Dann Corbit wrote: Fri Jan 18, 2019 8:10 pm
Look wrote: Fri Jan 18, 2019 12:05 pm
Dann Corbit wrote: Thu Jan 17, 2019 9:20 pm It is incredibly difficult to produce a tactical test suite with 1000 positions which is thoroughly debugged.
Even 100 positions is rather difficult.
For instance, what if the line is confirmed by several top engines?
I have done that. I worked with Swaminathan in verification of the STS test suite. I took the three strongest engines of the day and ran for a full hour for each position. If the engines were not in agreement, I rejected the position and Swaminathan would give me a new position that fit the criteria of the test set we were working on. However, when I started the test set the strongest engine I had was 32 bit Rybka 2.3 and I had 4 core machines. Today, ten percent of the positions in STS are not valid. The reason is simple. The engines are exponentially stronger. The machines are exponentially stronger, and vastly better and deeper searches found yet better answers with a much deeper look.
Did you test for the difference between the best move and the second-best move?

I think that even if the engines agree about the best move, you may reject the position if the difference in evaluation between the best move and the second-best move is less than 0.5 pawns.
That makes sense if the score is big.
But what if the score is very small?
If one position has a score of +44 centipawns and the other has a score of -6 centipawns, I would suggest that the first position is really better.
Here is how to produce a thoroughly debugged tactical test suite, provided you do not insist that the test also be hard for Stockfish. The initial post said that the test is
Intended for second-tier engine testing - not Stockfish.
A test suite is independent of the engines if it is correct. However, how hard or easy it is depends on both the engines and the hardware used.
On this machine:
Nodes/second: 4,801,341,606 NPS
CPU cores/threads: 128 CPUs x 32 threads, cluster system, 4096 threads, used by vondele
from here:
http://www.ipmanchess.yolasite.com/amd- ... -bench.php
the weakest SMP program will solve the problem in a heartbeat,
whereas on a single-core cell phone it may take days to solve the same problem, if it ever solves it.
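To put rough numbers on that gap, here is a back-of-the-envelope calculation. The cluster figure is the one quoted above; the phone speed of one million nodes per second is purely an assumption for illustration, and the model ignores the fact that parallel search does not scale linearly with thread count.

```python
CLUSTER_NPS = 4_801_341_606  # nodes/second, the figure quoted in the post
PHONE_NPS = 1_000_000        # assumption: a hypothetical single-core phone


def equivalent_seconds(cluster_seconds, fast_nps=CLUSTER_NPS, slow_nps=PHONE_NPS):
    """Time the slow machine needs to visit the nodes the fast one visits."""
    nodes = cluster_seconds * fast_nps
    return nodes / slow_nps


def as_days(seconds):
    """Convert seconds to days."""
    return seconds / 86_400


if __name__ == "__main__":
    for t in (1.0, 60.0):
        slow = equivalent_seconds(t)
        print(f"{t:>5.0f} s on the cluster ~ {slow:,.0f} s ({as_days(slow):.2f} days) on the phone")
```

Under these assumptions, one minute of cluster time corresponds to a little over three days on the phone, so whether a position counts as "hard" depends heavily on what it is run on.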
In this case you should take a PGN file (say, 1000 chess games) and have Stockfish analyze every position in it for a short fixed time with MultiPV=2 in order to find candidates for the test.
For the STS test, I ran both single-PV and multi-PV analysis with all three engines, which is mainly how I obtained the various scores for the alternative moves. I agree that this is a good idea and very important for analysis of ANY test suite.
You can decide that candidates are only positions where Stockfish finds a difference of more than 0.5 pawns between the best move and the second-best move.
Any threshold is arbitrary. I think you have to decide for every position by using your brain, in this case.
If your goal is to have no problems with scores close to zero, then the arbitrary half-pawn level might be OK.
Now test every one of the candidate positions with weaker engines such as Critter.
Most of the candidates will of course be easy for the weaker engines, but you can keep only the candidates that are not easy for them,
and I believe that you will get more than 1000 candidates that are not easy for at least some of the weaker engines that you use.
Sometimes, the weaker engines solve the problem faster.
Sometimes, the weaker engines find a better solution than the strong engine did.
The last step is to verify that the candidates are good, and for that you can run Stockfish for more time with MultiPV=2.
I did exactly that with STS, and yet the solutions age over time until some of them are simply wrong. I do not think that there are any sure and perfect solutions.
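Since solutions age, one practical habit is to re-audit a suite from time to time against fresh analysis. The sketch below is only an illustration of that idea: it reads the `bm` (best move) operation from EPD lines and reports entries where a newer engine's choice is no longer among the listed best moves. The function names and the `fresh_best` mapping (position to SAN move from a current engine run) are assumptions for the example.

```python
def parse_epd_bm(line):
    """Return (position, [best moves in SAN]) from an EPD line, or None.

    Simplification: operations are split on ';', so a quoted operand that
    itself contains ';' would be mis-parsed.
    """
    fields = line.strip().split(None, 4)
    if len(fields) < 5:
        return None  # no operations after the four position fields
    position = " ".join(fields[:4])
    for op in fields[4].split(";"):
        op = op.strip()
        if op.startswith("bm "):
            return position, op[3:].split()
    return None


def normalise(san):
    """Strip check, mate and annotation suffixes so 'Bb5+' equals 'Bb5'."""
    return san.rstrip("+#!?")


def stale_entries(epd_lines, fresh_best):
    """List (position, expected_moves, fresh_move) where the suite disagrees
    with fresh analysis. Positions missing from fresh_best are skipped."""
    stale = []
    for line in epd_lines:
        parsed = parse_epd_bm(line)
        if parsed is None:
            continue
        position, expected = parsed
        fresh = fresh_best.get(position)
        if fresh is None:
            continue
        if normalise(fresh) not in {normalise(m) for m in expected}:
            stale.append((position, expected, fresh))
    return stale
```

A flagged entry is not automatically wrong; it is a position that deserves another look, ideally with more than one engine.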
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.