Is there ANY useful testsuite?

Jouni
Posts: 3227
Joined: Wed Mar 08, 2006 8:15 pm

Is there ANY useful testsuite?

Post by Jouni »

Test suites are fun and nice entertainment! But is there any useful one? A test suite must estimate engine strength faster/more precisely than using the same time for playing games against other engines. I doubt there is any.
Jouni
Magnum
Posts: 162
Joined: Thu Feb 04, 2021 10:24 pm
Full name: Arnold Magnum

Re: Is there ANY useful testsuite?

Post by Magnum »

Jouni wrote: Tue Nov 15, 2022 1:02 pm Test suites are fun and nice entertainment! But is there any useful one? A test suite must estimate engine strength faster/more precisely than using the same time for playing games against other engines. I doubt there is any.
You are 20 years too late.
Here you will find all the positions that Stockfish can't solve in 1 second:
Stockfish Test Suite 2022: https://www.mediafire.com/file/dg8q0qcf ... 2.pgn/file
peter
Posts: 3167
Joined: Sat Feb 16, 2008 7:38 am
Full name: Peter Martan

Re: Is there ANY useful testsuite?

Post by peter »

Jouni wrote: Tue Nov 15, 2022 1:02 pm Test suites are fun and nice entertainment! But is there any useful one? A test suite must estimate engine strength faster/more precisely than using the same time for playing games against other engines. I doubt there is any.
I'd say that depends on what you mean by "faster/more precise", Jouni.

How many games, between which (and how many) engines, do you have to play before the estimate becomes more precise than the statistical error bar of the results?
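To put a number on that error bar, here is a minimal Python sketch, assuming the usual logistic Elo formula and a normal approximation for the match score. The function names and the win/draw/loss counts are made up for illustration only.

```python
import math


def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_elo(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, low, high) for a match result; z=1.96 is roughly 95%."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win=1, draw=0.5, loss=0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    stderr = math.sqrt(var / n)
    # Convert the score interval to Elo; only valid while it stays in (0, 1).
    return (elo_from_score(score),
            elo_from_score(score - z * stderr),
            elo_from_score(score + z * stderr))


# Hypothetical 1000-game match, numbers invented for the example.
elo, low, high = match_elo(wins=300, draws=450, losses=250)
print(f"Elo {elo:+.1f}, interval [{low:+.1f}, {high:+.1f}]")
```

With these invented counts the interval still spans roughly 16 Elo on either side of the estimate after 1000 games, which is the kind of error bar meant above. The margin shrinks only with the square root of the number of games.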

With most of the test suites around you get estimates more precise than that error bar very much faster. I have to confess that some of them are meaningless when compared with any more reasonable kind of measurement, to say the least. But if you stick to the questions that good test suites can answer at all (well run and well evaluated too, of course), you'll get the answers to those questions much sooner and much more precisely.

E.g. I'm using at least three to four suites most of the time. Lately I've added Ed Schröder's and Ferdinand Mosca's new version of STS with Ferdy's fine MEA tool. Instead of the full 1500 set I've made a smaller suite of 888 positions for my personal use. It takes 594 STS positions, not as easy as the rest of the 1500, and runs them together with Eret suite positions, positions from the Arasan 21.0 test suite and some more from my own database to fill up to the 888 here:

https://www.dropbox.com/s/1m3cnrnqtq01q ... 8.epd?dl=0

Note: those are meant to be evaluated with the MEA tool.
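To show what points-based evaluation of such a suite can look like, here is a rough Python sketch. It assumes each EPD line carries a `c0` comment of the form "move=points, move=points", as STS-style files do. It is not the MEA tool itself, the function names are hypothetical, and the sample line is invented.

```python
import re


def parse_move_points(epd_line: str) -> dict:
    """Extract {move: points} from a c0 "m1=10, m2=7" operation, if present."""
    match = re.search(r'c0\s+"([^"]*)"', epd_line)
    if not match:
        return {}
    points = {}
    for item in match.group(1).split(","):
        # rpartition keeps promotions such as e8=Q intact as the move part.
        move, sep, value = item.strip().rpartition("=")
        if sep and value.strip().isdigit():
            points[move.strip()] = int(value.strip())
    return points


def suite_score(epd_lines, engine_moves):
    """Return (points earned, maximum points) over all scorable positions."""
    earned = total = 0
    for line, move in zip(epd_lines, engine_moves):
        table = parse_move_points(line)
        if not table:
            continue  # no points table on this line, so skip it
        earned += table.get(move, 0)
        total += max(table.values())
    return earned, total


# Invented sample line, not taken from any real suite.
demo = ('rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - '
        'id "demo.001"; c0 "e4=10, d4=9, Nf3=7";')
print(suite_score([demo], ["d4"]))
```

The point of scoring several moves per position is that an engine choosing the second-best move still earns partial credit, which is why such suites can discriminate at very short time per position.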

As for single-best-move suites, I have several more of them in use. To stick to the ones meant for short TC too (not very short like the 888, which are meant for about 200 msec/pos.; those positions are easy enough to be evaluated almost without any search, just out of "static eval"), and not to talk about LTC test positions, which I'd rather judge one by one only, here are the two tactical single-best-move suites I use most:

https://www.dropbox.com/s/804b7chwli13laf/1284.epd?dl=0

and

https://www.dropbox.com/s/lpg29zoyvh03dza/256.epd?dl=0

Those are to be run at 15"/pos. (the first link, 128 pos.) or at 5"/pos. (the 256), using SMP on modern hardware. The "strategic" test suites of at least 888 pos. at VSTC can be run single-threaded too.
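For a single-best-move suite the scoring is simpler: a position counts as solved if the engine's move is one of the `bm` moves. Here is a minimal Python sketch of that bookkeeping. The function names are hypothetical, the `pick_move` callback stands in for whatever engine interface you use, and the sample line is invented.

```python
def parse_bm(epd_line: str):
    """Split an EPD line into (position string, set of best moves in SAN)."""
    fields = epd_line.split(None, 4)  # 4 position fields, then the operations
    position = " ".join(fields[:4])
    operations = fields[4] if len(fields) > 4 else ""
    best = set()
    # Naive split on ';' (assumes no ';' inside quoted comment strings).
    for op in operations.split(";"):
        parts = op.split()
        if parts and parts[0] == "bm":
            best.update(parts[1:])
    return position, best


def run_suite(epd_lines, pick_move, seconds_per_pos):
    """Count positions where pick_move(position, seconds) returns a bm move."""
    solved = 0
    for line in epd_lines:
        position, best = parse_bm(line)
        if best and pick_move(position, seconds_per_pos) in best:
            solved += 1
    return solved


# Invented sample line and a dummy "engine" that always answers e4.
demo = ('rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - '
        'bm e4 d4; id "demo.1";')
print(run_suite([demo], lambda position, seconds: "e4", 15))
```

In a real run `pick_move` would start the engine on the position with the given time per position (15 or 5 seconds here) and return its move in the same notation as the `bm` field.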

The more difficult the positions are (tactically and in time to solution), the longer the TC has to be set of course, also depending on the engines you want to compare with each other. (I don't include positions that are too difficult for STC on modern hardware and engines either.)
The answers you get that way of course don't answer questions about "overall playing strength" the way the results of game playing do. Then again, that term never had any definable meaning: playing at which TC, against which engine pool, from which set of opening test positions?

Yet I'd say (and can give examples of ranking and rating lists I made that way) that the very-short-TC suites show results similar to very-short-TC game playing. The tactical single-best-move suites show almost the opposite: in those, the engines (and settings) that are good at problem solving prevail, which isn't as important in STC game playing. So you'd probably need more than one suite, each result seen together with the results of other suites, TCs and engines, to get a more "overall" view. But that's true in principle of eng-eng matches too, isn't it?
In particular, neither the absolute heights nor the differences of Elo results from matches with different engines, hardware-TC and openings are really transitive to each other anymore (they never were, but some years ago still much more so than nowadays), are they?

And then the crucial point of personal interest: I myself am even more interested in the results of tactical test suites than in those of STC game playing when it comes to analysing difficult single positions deeply, not to mention puzzles and DTM questions.

Lately almost every rating and ranking match is run only with unbalanced opening positions of some kind, to get any Elo differences at all out of the draw death of eng-eng games. Any rating and ranking is strictly position-dependent anyhow. So to get precise answers to precise questions sooner and more precisely, I'd prefer test suites that don't have to be played out to full games at all. (As for the hardware time needed, it's not really a question at all: the costs in hardware time and electricity are just not comparable.)

And to get really precise answers about time to solution, time to best output line and time to best eval, you always have to stick to single positions anyhow. There and only there you see how the "playing strength" of engines (and human players) depends on single chess positions. You don't have to look (closely) at this ever-present strict position dependency of playing-strength measurements, but if you want to look at it (closely), you have to look closely at the single positions: of analysis, of automatically run test suites and of opening sets for game playing as well.
That means you need some kind of positional testing of your test positions before running them, for opening sets for game playing as well as for test suites, to learn how usable they are for the chosen aim. So what?

The questions maybe have to be better defined for suites than for game playing, and yes, the answers are maybe even less transitive, but as results of their own (which all eng-eng matches by game playing are anyhow too) they are rather useful to me, as eng-eng matches of course still are too.
None of what I wrote means that I don't run eng-eng games of my own, let alone that I would disregard the results of well-made game-playing ranking lists.
Maybe the results of game playing and of positional testing shouldn't be set in competition with each other on and on; rather they should be seen as supplementary to each other.
Suites are only one small part of positional testing, which, by strict definition, game playing from opening test positions is as well: just positional testing played out. From my personal pov I would e.g. start to have test positions from midgames and endgames played out too now and then; otherwise opening strength is always tested much more than midgame and endgame strength by eng-eng game playing anyhow.

Have to stop that here, always get somewhat effusive about that issue
:)
Peter.
Jouni
Posts: 3227
Joined: Wed Mar 08, 2006 8:15 pm

Re: Is there ANY useful testsuite?

Post by Jouni »

Thanks Peter. I will try your suites.
Jouni