STS 1-8 12.26.09

kingliveson · Post by **kingliveson** » Sun Dec 27, 2009 5:42 am

STS 1 - 8 Test Suites 12.26.09

Engine                      1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  Total

Rybka 3 x64                  96   89   90   89   91   91   88   79    713
RobboLito 0085g3 x64         95   89   89   86   89   84   81   71    684
Naum 4 x64                   90   88   82   81   82   81   75   68    647
Stockfish 1.6-ja x64         90   88   81   87   82   79   73   64    644
Deep Shredder 12 x64         83   79   74   77   79   74   83   71    620
Zappa Mexico II x64          88   78   77   72   85   69   71   64    604

1.0: Undermining
2.0: Open Files and Diagonals
3.0: Knight Outposts
4.0: Square Vacancy
5.0: Bishop vs Knight
6.0: Re-Capturing
7.0: Offer of Simplification
8.0: Advancement of f/g/h pawns

Conditions:

-Arena 2.01 GUI
-10 seconds per position
-64 MB hash
-All engines use 1 CPU
-AMD Phenom II @ 3.6 GHz

See Analysis.log file for more details.

The test suites are available at http://sites.google.com/site/strategictestsuite/

swami · Post by **swami** » Sun Dec 27, 2009 7:13 am

Thanks for running the tests. Very interesting results, I would think there's need for 20 more STS suites in order to lessen the error probability.

Though 3.Naum 4.Stockfish 5.Shredder 6.Zappa

seems to be in correct order.
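
To put a rough number on the "error probability" mentioned above, here is a minimal sketch that treats every position as an independent pass/fail trial. That is a simplifying assumption: the positions are hand-picked rather than random and all engines ran the same ones. The figure of 100 positions per suite with one point per solved position is also assumed, since the thread does not state it.

```python
import math

def score_std_error(solved, positions):
    """Binomial standard error, in points, of `solved` out of `positions`."""
    p = solved / positions
    return math.sqrt(positions * p * (1 - p))

# ASSUMPTION: 8 suites x 100 positions, one point per solved position.
POSITIONS = 8 * 100

# Totals taken from the 12.26.09 table.
naum_se = score_std_error(647, POSITIONS)
stockfish_se = score_std_error(644, POSITIONS)

# Standard error of the difference, treating the two scores as independent
# (an approximation, because both engines saw the same positions).
diff_se = math.hypot(naum_se, stockfish_se)

print(f"Naum 4 total:      647 +/- {naum_se:.1f}")
print(f"Stockfish total:   644 +/- {stockfish_se:.1f}")
print(f"Gap of 3 points, standard error of the gap: {diff_se:.1f}")
```

Under these assumptions each total carries a standard error of roughly 11 points, so the 3-point gap between Naum and Stockfish is well inside the noise.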

Uri Blass · Post by **Uri Blass** » Sun Dec 27, 2009 7:20 am

swami wrote:Thanks for running the tests. Very interesting results, I would think there's need for 20 more STS suites in order to lessen the error probability.

Though 3.Naum 4.Stockfish 5.Shredder 6.Zappa

seems to be in correct order.

Not exactly

Stockfish seems to be stronger than Naum

Uri

swami · Post by **swami** » Sun Dec 27, 2009 7:26 am

Uri Blass wrote:
swami wrote:Thanks for running the tests. Very interesting results, I would think there's need for 20 more STS suites in order to lessen the error probability.

Though 3.Naum 4.Stockfish 5.Shredder 6.Zappa

seems to be in correct order.

Not exactly

Stockfish seems to be stronger than Naum

Uri

Perhaps slightly stronger. About 20 more suites are still needed for a reliable comparison. At least a partial picture can be had from these suites. For example, Shredder and Zappa are in the right place, and Amyan is the best in the Division 4 that I organized, etc.

Also, the individual importance of certain test suites plays a greater role than the overall score. Stockfish scored a lot more than Naum in "Square Vacancy" (87 vs. 81), which by the way is a more important strategic theme than, say, Knight Outposts or Offer of Simplification.

kingliveson · Post by **kingliveson** » Sun Dec 27, 2009 8:03 am

swami wrote:
Uri Blass wrote:
swami wrote:Thanks for running the tests. Very interesting results, I would think there's need for 20 more STS suites in order to lessen the error probability.

Though 3.Naum 4.Stockfish 5.Shredder 6.Zappa

seems to be in correct order.

Not exactly

Stockfish seems to be stronger than Naum

Uri
Perhaps slightly stronger. About 20 more suites are still needed for a reliable comparison.

At least a partial picture can be had from these suites. For example, Shredder and Zappa are in the right place, and Amyan is the best in the Division 4 that I organized, etc.

Also, the individual importance of certain test suites plays a greater role than the overall score.

Stockfish scored a lot more than Naum in "Square Vacancy", which by the way is a more important strategic theme than, say, Knight Outposts or Offer of Simplification.

Thanks for the strategy test suites; it must be a lot of work. I am a bit surprised by the scores and am not sure what could account for the final outcome. As for Stockfish 1.6 being stronger than Naum 4, it may seem so, but I think more games are needed. I am running a series of tournaments with no book, and Stockfish is showing great promise.

mcostalba · Post by **mcostalba** » Sun Dec 27, 2009 9:06 am

swami wrote: Also, the individual importance of certain test suites plays a greater role than the overall score.

Yes, this is a sensible point. If you really want to get an idea of engine strength from test scores, then I would think you need to weight the scores according to their importance.

It should not be difficult to find the weights, because you can use the official rating lists as a reference and adjust the score weights until the weighted test results reflect (more or less) the official lists.

It could also be interesting to see which of the various themes matters most in chess play.
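
As a rough illustration of the weighting idea, the sketch below recomputes the first run's totals with one weight per suite. The weights shown are hypothetical and chosen only to illustrate the effect; they are not fitted to any rating list, and with six engines and eight weights such a fit would not have a unique answer anyway.

```python
# Per-suite scores from the 12.26.09 run (64 MB hash), suites 1.0 to 8.0.
SCORES = {
    "Rybka 3":          [96, 89, 90, 89, 91, 91, 88, 79],
    "RobboLito 0085g3": [95, 89, 89, 86, 89, 84, 81, 71],
    "Naum 4":           [90, 88, 82, 81, 82, 81, 75, 68],
    "Stockfish 1.6-ja": [90, 88, 81, 87, 82, 79, 73, 64],
    "Deep Shredder 12": [83, 79, 74, 77, 79, 74, 83, 71],
    "Zappa Mexico II":  [88, 78, 77, 72, 85, 69, 71, 64],
}

def weighted_totals(scores, weights):
    """Return {engine: sum(weight_i * score_i)}, one weight per suite."""
    return {
        engine: sum(w * s for w, s in zip(weights, per_suite))
        for engine, per_suite in scores.items()
    }

def ranking(totals):
    """Engine names sorted from highest to lowest total."""
    return sorted(totals, key=totals.get, reverse=True)

# Equal weights reproduce the posted totals.
equal = weighted_totals(SCORES, [1] * 8)

# HYPOTHETICAL weights: suite 4.0 (Square Vacancy) counted twice.
vacancy_heavy = weighted_totals(SCORES, [1, 1, 1, 2, 1, 1, 1, 1])

print(ranking(equal))
print(ranking(vacancy_heavy))
```

With equal weights Naum (647) stays ahead of Stockfish (644). Counting Square Vacancy twice gives Stockfish 731 against Naum's 728, so even a small change in weights is enough to swap those two places.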

kingliveson · Post by **kingliveson** » Sun Dec 27, 2009 11:29 pm

I am running this test again. I would like to see how consistent the results are. The only change is hash size from 64 MB to 128 MB. I should have the results in a few hours.

swami · Post by **swami** » Mon Dec 28, 2009 3:37 am

kingliveson wrote:I am running this test again. I would like to see how consistent the results are. The only change is hash size from 64 MB to 128 MB. I should have the results in a few hours.

Thanks, I'm looking forward to that!

kingliveson · Post by **kingliveson** » Mon Dec 28, 2009 6:42 am

STS 1 - 8 Short 12.27.09

Engine                      1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  Total

Rybka 3 x64                  95   89   92   86   90   93   87   78    710
RobboLito 0085g3 x64         94   90   91   85   82   86   84   71    683
Naum 4 x64                   89   86   89   84   81   81   83   73    666
Stockfish 1.6-ja x64         91   89   77   88   82   79   76   64    646
Deep Shredder 12 x64         85   77   76   80   82   71   75   72    618
Zappa Mexico II x64          91   77   77   73   86   69   73   64    610

1.0: Undermining
2.0: Open Files and Diagonals
3.0: Knight Outposts
4.0: Square Vacancy
5.0: Bishop vs Knight
6.0: Re-Capturing
7.0: Offer of Simplification
8.0: Advancement of f/g/h pawns

Conditions:

-Arena 2.01 GUI
-10 seconds per position
-128 MB hash
-All engines use 1 CPU
-AMD Phenom II 940 @ 3.6 GHz

See Analyses.log file for more details.

The test suites are available at http://sites.google.com/site/strategictestsuite/

swami · Post by **swami** » Mon Dec 28, 2009 6:47 am

Thank you for the test. The design is interesting.

The rank order remains the same with more hash, but we learn that Naum tends to play better with more hash (647 to 666). It is the engine most affected by hash size relative to the rest.

Other Points to consider:

Eight tests obviously won't give a clear picture, though they can give a rough idea of the engines' strength.

More hash will have an effect only at longer time controls. At 10 seconds per position, a smaller hash works, as it makes the search faster.

These tests are obviously strategic. So perhaps Stockfish's edge over Naum lies in tactics rather than strategy.
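
For the hash-size comparison, a small sketch like the following lines up the totals of the two runs and lists the change per engine. The totals are copied from the two tables in this thread; the function name is made up for the example.

```python
# Totals from the two runs posted in this thread.
RUN_64MB = {
    "Rybka 3": 713, "RobboLito 0085g3": 684, "Naum 4": 647,
    "Stockfish 1.6-ja": 644, "Deep Shredder 12": 620, "Zappa Mexico II": 604,
}
RUN_128MB = {
    "Rybka 3": 710, "RobboLito 0085g3": 683, "Naum 4": 666,
    "Stockfish 1.6-ja": 646, "Deep Shredder 12": 618, "Zappa Mexico II": 610,
}

def total_changes(before, after):
    """Return {engine: after - before} for every engine in `before`."""
    return {engine: after[engine] - before[engine] for engine in before}

changes = total_changes(RUN_64MB, RUN_128MB)

# List engines by size of change, largest first.
for engine, delta in sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{engine:<20} {delta:+d}")
```

Naum gains 19 points, Zappa 6, and every other engine moves by 3 points or fewer. A single repeat cannot separate the effect of the larger hash from ordinary run-to-run variation, so this is only a hint.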
