Did you test for difference between the best move and second best move?

Dann Corbit wrote: Fri Jan 18, 2019 8:10 pm
I have done that. I worked with Swaminathan in verification of the STS test suite. I took the three strongest engines of the day and ran for a full hour for each position. If the engines were not in agreement, I rejected the position and Swaminathan would give me a new position that fit the criteria of the test set we were working on. However, when I started the test set the strongest engine I had was 32 bit Rybka 2.3 and I had 4 core machines. Today, ten percent of the positions in STS are not valid. The reason is simple. The engines are exponentially stronger. The machines are exponentially stronger, and vastly better and deeper searches found yet better answers with a much deeper look.

Look wrote: Fri Jan 18, 2019 12:05 pm
For instance, what if the line is confirmed by several top engines?

Dann Corbit wrote: Thu Jan 17, 2019 9:20 pm
It is incredibly difficult to produce a tactical test suite with 1000 positions which is thoroughly debugged.
Even 100 positions is rather difficult.

I think that even if the engines agree on the best move, you may want to reject the position when the difference in evaluation between the best move and the second-best move is less than 0.5 pawns.
Here is how to produce a thoroughly debugged tactical test suite, provided you do not insist that the test also be hard for Stockfish. The initial post said that the test is
"intended for second-tier engine testing - not Stockfish."
In this case you should take some PGN file (say, 1000 chess games) and ask Stockfish to analyze every position in it for a short fixed time with MultiPV=2 in order to find candidates for the test.
You can decide that the candidates are only the positions where Stockfish finds a difference of more than 0.5 pawns between the best move and the second-best move.
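To make the candidate test concrete, here is a rough Python sketch. It assumes the engine is driven over UCI and that its `info` lines contain `multipv`, `score cp` and `pv` in that order, which is what Stockfish normally prints. The function names and the sample lines are invented for illustration, and mate scores are simply skipped.

```python
import re

# Matches the parts of a UCI "info" line we care about. Mate scores
# ("score mate N") are deliberately not handled in this sketch.
INFO_RE = re.compile(r"\bmultipv (\d+)\b.*?\bscore cp (-?\d+)\b.*?\bpv (\S+)")


def parse_multipv(info_lines):
    """Return {multipv_index: (score_cp, first_move)}, keeping the last line seen per index."""
    result = {}
    for line in info_lines:
        m = INFO_RE.search(line)
        if m:
            result[int(m.group(1))] = (int(m.group(2)), m.group(3))
    return result


def is_candidate(info_lines, margin_cp=50):
    """True if the best move beats the second-best move by more than margin_cp centipawns."""
    lines = parse_multipv(info_lines)
    if 1 not in lines or 2 not in lines:
        return False  # only one line available (or a mate score): skip the position
    return lines[1][0] - lines[2][0] > margin_cp


# Made-up engine output for one position.
sample = [
    "info depth 12 multipv 1 score cp 135 nodes 90000 pv d4d5 e6d5",
    "info depth 12 multipv 2 score cp 40 nodes 90000 pv c1g5 h7h6",
]
print(is_candidate(sample))
```

The 50 centipawn default corresponds to the 0.5 pawn threshold; a stricter margin only means fewer candidates.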
Now test every one of the candidate positions with weaker engines like Critter.
Most of the candidates will of course be easy for the weaker engines, but you can keep only the candidates that are not easy for them,
and I believe that you are going to get more than 1000 candidates that are not easy for at least some of the weaker engines that you use.
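The filtering step could be sketched like this in Python. The weaker engines are represented by plain functions that return a move for a position; these are hypothetical stand-ins for real engine calls at a short time limit, and `min_failures` is an extra knob that is not part of the description above.

```python
def hard_candidates(candidates, weak_engines, min_failures=1):
    """Keep candidates that at least min_failures weaker engines get wrong.

    candidates:   list of (position_id, best_move) pairs from the strong engine
    weak_engines: dict name -> function(position_id) returning that engine's move
                  (hypothetical stand-ins for real engine calls)
    """
    kept = []
    for position, best_move in candidates:
        failed = [name for name, choose in weak_engines.items()
                  if choose(position) != best_move]
        if len(failed) >= min_failures:
            kept.append((position, best_move, failed))
    return kept


# Toy stand-ins: each "engine" is just a lookup table of the move it would play.
candidates = [("pos1", "d4d5"), ("pos2", "g2g4"), ("pos3", "f7f5")]
engine_a = {"pos1": "d4d5", "pos2": "h2h3", "pos3": "f7f5"}.get
engine_b = {"pos1": "d4d5", "pos2": "g2g4", "pos3": "e8g8"}.get

for row in hard_candidates(candidates, {"A": engine_a, "B": engine_b}):
    print(row)
```

Keeping the list of engines that failed makes it easy to grade the positions later, for example by how many of the weaker engines miss the move.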
The last step is to verify that the candidates are good ones; to do that, you can run Stockfish for more time, again with MultiPV=2.
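One possible reading of the verification step, as a Python sketch: a candidate survives only if the longer search still picks the same move and the gap to the second-best move is still above the threshold. Requiring the same best move in both runs is an assumption of this sketch, and the tuple layout is made up for the example.

```python
def verify(short_run, long_run, margin_cp=50):
    """Accept a candidate only if the longer search confirms it.

    Each run is ((best_move, best_cp), (second_move, second_cp)) as read from
    MultiPV=2 output. Requiring the same best move in both runs is an
    assumption of this sketch.
    """
    (short_best, _), _ = short_run
    (long_best, long_best_cp), (_, long_second_cp) = long_run
    same_move = short_best == long_best
    still_clear = long_best_cp - long_second_cp > margin_cp
    return same_move and still_clear


# Made-up numbers: the short search, then three possible outcomes of the long one.
short = (("d4d5", 135), ("c1g5", 40))
confirmed = (("d4d5", 150), ("c1g5", 20))   # same move, gap still large
too_close = (("d4d5", 60), ("c1g5", 45))    # same move, gap has shrunk
refuted = (("c1g5", 120), ("d4d5", 30))     # deeper search prefers another move

for long_run in (confirmed, too_close, refuted):
    print(verify(short, long_run))
```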