Dann Corbit wrote: ↑Mon Aug 31, 2020 4:18 am
> If ShashChess really is as strong as Stockfish at game playing and really is as strong as Houdini at tactical solving, then that is having your cake and eating it too.
> It would be very soothing to see the results of a 1000 game contest, as far as confidence goes.
> One other thing, the strongest tactical program is Bluefish XI-LP FD (Tactical=2; defensive=off)

I know there are statistical fluctuations, but I don't agree with a testing methodology based on a large number of games, at least at long time controls.
Very often, for example, Stockfish patches are validated and later reverted. Similarly, if you play even 50000 games of the Sicilian poisoned pawn variation, you'll get all draws, because a top engine plays this variation "perfectly".
The starting positions for testing must be carefully chosen, not random. As every true chess player knows, the characteristics concept is very powerful not only for practical players, but also for testers selecting sample positions.
Given a position, its characteristics are defined by its pawn structure, the kings' positions and the bishop pair. The idea is that if two quiescent positions share the same characteristics, the plans/manoeuvres are the same.
For this, I chose the 10 most frequent characteristics and, for each of them, the sharpest variations.
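To make the idea concrete, here is a minimal, hypothetical sketch (not taken from any engine, all names are illustrative): it reduces a plain FEN string to a characteristics key of pawn structure, king squares and bishop-pair status, so two positions can be grouped together when their keys match.

```python
# Hypothetical illustration of the "characteristics" concept:
# bucket positions by pawn structure, king locations and bishop pair.

def characteristics(fen: str):
    """Return a hashable characteristics key for a FEN position."""
    board = fen.split()[0]
    pawns, kings, bishops = [], {}, {"w": 0, "b": 0}
    for rank_idx, rank in enumerate(board.split("/")):
        file_idx = 0
        for ch in rank:
            if ch.isdigit():          # run of empty squares
                file_idx += int(ch)
                continue
            square = (8 - rank_idx, file_idx)  # (rank 8..1, file 0..7)
            if ch in "Pp":
                pawns.append((ch, square))     # pawn structure
            elif ch == "K":
                kings["w"] = square
            elif ch == "k":
                kings["b"] = square
            elif ch == "B":
                bishops["w"] += 1
            elif ch == "b":
                bishops["b"] += 1
            file_idx += 1
    bishop_pair = (bishops["w"] >= 2, bishops["b"] >= 2)
    return (tuple(sorted(pawns)), kings["w"], kings["b"], bishop_pair)

# Two positions that differ only in knight placement share the key;
# a pawn move changes the pawn structure and therefore the key.
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
knights = "r1bqkbnr/pppppppp/2n5/8/8/2N5/PPPPPPPP/R1BQKBNR w KQkq - 2 2"
after_e4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1"

print(characteristics(start) == characteristics(knights))   # same bucket
print(characteristics(start) == characteristics(after_e4))  # different bucket
```

In practice one would compute such keys over a large game database, count how often each key occurs, and pick sharp lines from the most frequent buckets.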
This is my humble opinion.
Finally, about Bluefish XI-LP FD, I imagine you tested this engine on the same battery of positions to come to this conclusion.
I will do the same, because I have never used this engine.
Thanks anyway for your observations.