'STS' Test Suite (v2.0): Open Files and Diagonals.. Released


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by bob »

Spock wrote:Crafty 23.0 x64 4CPU
Quad Core Opteron 1352
20 secs per move

80 of 100 matching moves
04/04/2009 23:11:27, Total time: 00:35:47
Rated time: 08:25 = 505 Seconds

I wonder if Bob came along with his 8-way Xeon whether Crafty would improve on that even further
yep, check my other post. :)
bob

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by bob »

swami wrote:
bob wrote:I am running this for fun, but I do _not_ consider these a "positional test suite". For example, in undermine #2, c5 is a tactical move. +4.0 is not the mark of a great "positional test move".

I think that a positional test suite should be one where it is irrelevant what other programs think in general, they should be about ideas that are actually positional in nature and where tactics don't play a role at all.. For example, 1/2 of the original kopec-bratko test positions were pawn lever positions that were not tactical in nature...
OK, I shall stop calling it a positional test suite. From now on I'll just call the suites by name: "Open Files and Diagonals" and "Undermining".

Looking forward to Crafty's results on the faster hardware you have.
already posted here...
swami
Posts: 6658
Joined: Thu Mar 09, 2006 4:21 am

Re: results

Post by swami »

Whoa. That's a pretty high score from Crafty. 8-)
swami

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by swami »

Testing still in progress but results so far:

Q6600 2.4Ghz, 32 bits, All engines use 1 CPU.
Open Files and Diagonals.
10 sec each move:

Fruit - 85
TwistedLogic - 80
Toga - 80


Bright - 79
ETChess - 78
Glaurung - 78
Hamsters - 76
The King TrailBlazer - 75
Movei - 74
Delfi - 72
Alaric - 72
Pharaon - 71


Zappa 1.1 - 69
Cerebro - 68
Scorpio - 68
Crafty - 67 
Chiron - 67
Tao - 67
Kiwi - 67
The Baron - 66
NOW - 66
Arion - 65
Aristarch - 65
Slowchess - 65
BugChess - 65
List512 - 65
Jonny - 64
Deep Patzer - 64
Alfil - 63
Pro Deo - 63
Natwarlal - 63
Queen - 62
Abrok - 62
Delphil - 62
LearningLemming - 61
Green Light Chess - 61
Comet - 61
Yace - 61
Gaia - 61
Lambchop - 61
Ufim - 60
Trace - 60


Arasan - 59
Asterisk - 59 
Nejmet - 59
Amyan - 58
Romichess - 57
King of Kings - 56
Rotor - 56
Phalanx - 54
Pepito - 53
Knight Dreamer - 53
Horizon - 51
Alarm - 51
Booot - 50


Little Thought - 49
ZcT - 40
Bestia - 40 


Beowulf - 39
RDChess - 36
Last edited by swami on Sun Apr 05, 2009 6:47 pm, edited 1 time in total.
bob

Re: results

Post by bob »

swami wrote:Whoa. That's a pretty high score from Crafty. 8-)
I think much of this is tactical in nature. What I've always looked for is positions such as 1. e4 c5 2. Nf3 any 3. d4, where d4 is a pretty obvious move to control the center, nullify c5 attacking d4, etc. There are other moves that are perfectly playable, but d4 strikes right to the crux of the position without being a move that wins anything. The BK test pawn lever positions are similar. Either a program "gets it" or it doesn't. Depth is not particularly important, although some positions require some depth to see the ultimate point of the correct move. I think the way you screened these is backward. I'd toss out positions where the best move scores significantly better than the next-best, if you are using a computer to choose them. Some positional scores might well be 0.2 to 0.3 better (if they don't include king safety issues), but most are a razor's edge away from the second-best, which is what makes a GM's best move better than my best move.

I'll try to look at these in some detail when I have time to see which look like the kind of positional tests I'd like to keep for eval testing and tuning...
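
A minimal sketch of that screening idea, in Python. The 0.50-pawn cutoff, the `Candidate` record and the sample scores are all assumptions made for illustration; the post names no exact threshold, only that positional margins tend to be around 0.2 to 0.3 at most.

```python
from dataclasses import dataclass

# Hypothetical cutoff in pawns: larger margins are treated as tactical.
# The post gives no exact figure; 0.50 is an assumption for illustration.
TACTICAL_MARGIN = 0.50


@dataclass
class Candidate:
    name: str            # label for the position (hypothetical)
    best_score: float    # engine score of the best move, in pawns
    second_score: float  # engine score of the second-best move, in pawns


def looks_positional(c: Candidate, cutoff: float = TACTICAL_MARGIN) -> bool:
    """Keep a position only if the best move is not far ahead of the runner-up."""
    return (c.best_score - c.second_score) <= cutoff


candidates = [
    Candidate("quiet pawn lever", 0.35, 0.10),  # small margin: kept
    Candidate("wins material", 4.00, 0.20),     # large margin: tossed
]
kept = [c.name for c in candidates if looks_positional(c)]
print(kept)
```

A filter like this only removes the obviously tactical cases; whether the remaining positions test real positional knowledge still needs a human look.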
Spock

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by Spock »

swami wrote:Testing still in progress but results so far:

Q6600 2.4Ghz, 32 bits, All engines use 1 CPU.
Open Files and Diagonals.
10 sec each move:

Fruit - 85
TwistedLogic - 80
Toga - 80

<snip>

Great performance by Twisted Logic; it has really improved dramatically recently. Well done, Edsel!
swami

Re: results

Post by swami »

bob wrote:
swami wrote:Whoa. That's a pretty high score from Crafty. 8-)
I think much of this is tactical in nature. What I've always looked for is positions such as 1. e4 c5 2. Nf3 any 3. d4, where d4 is a pretty obvious move to control the center, nullify c5 attacking d4, etc. There are other moves that are perfectly playable, but d4 strikes right to the crux of the position without being a move that wins anything. The BK test pawn lever positions are similar. Either a program "gets it" or it doesn't. Depth is not particularly important, although some positions require some depth to see the ultimate point of the correct move. I think the way you screened these is backward. I'd toss out positions where the best move scores significantly better than the next-best, if you are using a computer to choose them. Some positional scores might well be 0.2 to 0.3 better (if they don't include king safety issues), but most are a razor's edge away from the second-best, which is what makes a GM's best move better than my best move.

I'll try to look at these in some detail when I have time to see which look like the kind of positional tests I'd like to keep for eval testing and tuning...
Well, I have chosen only positions where the evaluation score for the best move is at least 0.20 higher than that of the second-best move, and this has been verified after 5 hours of analysis by Dann. These score differences are agreed on by Rybka/Zappa/Naum in unison; otherwise the positions wouldn't pass the criteria.

You have a point that the +4 scores in some tests are really tactical in nature, although there were only a few such positions. I should cease to call the test suite positional. I should rather call it a puzzle set where undermining occurs. That would make more sense.

I don't trust GMs' moves. I took a look at a database of GM games and had a tough time trying to find any good positions; it took me very long to come up with a few. It's like sitting by the river trying to catch a fish when there are hardly any.

The next day, I took a look at Rybka's games and easily found many positions that could make it into a good test suite. All I had to do was check the score difference between the best and second-best moves from Rybka, and see whether the position in question would qualify as an "undermining" pattern. If it qualified, I sent it to Dann, who would then run a deep analysis for hours with the top 3 engines, and if they all agreed in unison, he'd put the position on the 'Qualified' list. That was fun, really.

I'd think that an easier and quicker way to create more positions is to study correspondence games, especially those played with the help of computers over days. I don't know where those games can be downloaded, so I'll have to ask around.

I do see some engines clearly doing better in undermining but doing fantastically badly in open files and diagonals, while others did better in the latter than in the former. I hope to get the 3rd test suite ready. It's a good hobby, I should tell you, I really enjoyed every moment of it! :wink:
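
The qualification rule described above could be sketched roughly like this in Python. Only the 0.20 margin and the three-engine agreement come from the description; the function names, the centipawn representation and the sample scores are hypothetical and are not Dann's actual tooling.

```python
# Margin from the description above: at least 0.20 pawns, kept as
# centipawns so the comparison uses exact integers.
MIN_MARGIN_CP = 20


def engine_agrees(scores, expected_move, min_margin=MIN_MARGIN_CP):
    """scores maps move -> evaluation in centipawns for a single engine."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return False  # no runner-up to measure the margin against
    (best_move, best), (_, second) = ranked[0], ranked[1]
    return best_move == expected_move and (best - second) >= min_margin


def qualifies(per_engine_scores, expected_move):
    """A position qualifies only if every engine agrees on the move and margin."""
    return all(engine_agrees(s, expected_move) for s in per_engine_scores.values())


# Made-up scores for one candidate position (illustration only).
agreed = {
    "Rybka": {"c5": 65, "Rd1": 30, "h3": 10},
    "Zappa": {"c5": 58, "Rd1": 35},
    "Naum": {"c5": 70, "Rd1": 41},
}
disputed = dict(agreed, Naum={"c5": 50, "Rd1": 41})  # Naum's margin is only 9 cp

print(qualifies(agreed, "c5"), qualifies(disputed, "c5"))
```

Using integers for the scores avoids floating-point surprises when a margin sits exactly on the 0.20 boundary.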
bob

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by bob »

10 secs/move on 8-core nehalem:

total positions searched..........         100
number right......................          90
number wrong......................          10
percentage right..................          90
percentage wrong..................          10
total nodes searched..............  2348666119
average search depth..............         6.3
nodes per second..................    14446217
total time........................        2:42

same test again, but just using one thread/core:

total positions searched..........         100
number right......................          85
number wrong......................          15
percentage right..................          85
percentage wrong..................          15
total nodes searched..............   506426758
average search depth..............         6.1
nodes per second..................     2343049
total time........................        3:36

Note that this box is 2.26ghz, and shows no real speed improvement over the previous core-2 processor family at the same clock speed. At one thread, the triple-channel memory is useless for Crafty as it is not a memory-bandwidth hog, by design.

Not sure what kind of processor you are running on, but this is not a lot faster than my current core-2 2.0ghz laptop:

total positions searched..........         100
number right......................          85
number wrong......................          15
percentage right..................          85
percentage wrong..................          15
total nodes searched..............   478852595
average search depth..............         6.0
nodes per second..................     2211381
total time........................        3:36

The "total time" is really bounded by the number wrong, as this will always include number_wrong * 10secs + time for finding rest of right answers...

That last is a dell D620 2.0ghz laptop (core2 duo) using 1 cpu. 384mb of hash, which is the same that I used on all Nehalem runs for consistency...
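
As a rough check of that "total time" remark, the arithmetic can be sketched in Python. The helper names are made up for this example, and it assumes each missed position costs exactly the 10-second limit while a solved one stops early.

```python
def mmss_to_seconds(text: str) -> int:
    """Convert a 'minutes:seconds' string such as '2:42' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def time_on_solved(total_time: str, number_wrong: int, secs_per_move: int = 10) -> int:
    """Seconds left for the solved positions after charging each miss the full limit."""
    return mmss_to_seconds(total_time) - number_wrong * secs_per_move


# (total time, number wrong) taken from the three runs reported above.
runs = {
    "Nehalem, 8 threads": ("2:42", 10),
    "Nehalem, 1 thread": ("3:36", 15),
    "D620 laptop, 1 CPU": ("3:36", 15),
}
for label, (total, wrong) in runs.items():
    print(f"{label}: {time_on_solved(total, wrong)} s spent on solved positions")
```

Under those assumptions the misses account for 100 of the 162 seconds in the 8-thread run and 150 of the 216 seconds in each single-thread run, leaving 62 and 66 seconds for the solved positions.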
swami

Re: 'STS' Test Suite (v2.0): Open Files and Diagonals.. Rele

Post by swami »

Spock wrote:
swami wrote:Testing still in progress but results so far:

Q6600 2.4Ghz, 32 bits, All engines use 1 CPU.
Open Files and Diagonals.
10 sec each move:

Fruit - 85
TwistedLogic - 80
Toga - 80

<snip>

Great performance by Twisted Logic; it has really improved dramatically recently. Well done, Edsel!
I think this was the beta version of TwistedLogic; it's the version from July 29th. Edsel did say that these recent versions are very similar in strength to the last public version. Perhaps I ought to run the June 20th public version to see how it fares in both suites.
bob

Re: results

Post by bob »

swami wrote:
bob wrote:
swami wrote:Whoa. That's a pretty high score from Crafty. 8-)
I think much of this is tactical in nature. What I've always looked for is positions such as 1. e4 c5 2. Nf3 any 3. d4, where d4 is a pretty obvious move to control the center, nullify c5 attacking d4, etc. There are other moves that are perfectly playable, but d4 strikes right to the crux of the position without being a move that wins anything. The BK test pawn lever positions are similar. Either a program "gets it" or it doesn't. Depth is not particularly important, although some positions require some depth to see the ultimate point of the correct move. I think the way you screened these is backward. I'd toss out positions where the best move scores significantly better than the next-best, if you are using a computer to choose them. Some positional scores might well be 0.2 to 0.3 better (if they don't include king safety issues), but most are a razor's edge away from the second-best, which is what makes a GM's best move better than my best move.

I'll try to look at these in some detail when I have time to see which look like the kind of positional tests I'd like to keep for eval testing and tuning...
Well, I have chosen only positions where the evaluation score for the best move is at least 0.20 higher than that of the second-best move, and this has been verified after 5 hours of analysis by Dann. These score differences are agreed on by Rybka/Zappa/Naum in unison; otherwise the positions wouldn't pass the criteria.

You have a point that the +4 scores in some tests are really tactical in nature, although there were only a few such positions. I should cease to call the test suite positional. I should rather call it a puzzle set where undermining occurs. That would make more sense.

I don't trust GMs' moves. I took a look at a database of GM games and had a tough time trying to find any good positions; it took me very long to come up with a few. It's like sitting by the river trying to catch a fish when there are hardly any.

The next day, I took a look at Rybka's games and easily found many positions that could make it into a good test suite. All I had to do was check the score difference between the best and second-best moves from Rybka, and see whether the position in question would qualify as an "undermining" pattern. If it qualified, I sent it to Dann, who would then run a deep analysis for hours with the top 3 engines, and if they all agreed in unison, he'd put the position on the 'Qualified' list. That was fun, really.

I'd think that an easier and quicker way to create more positions is to study correspondence games, especially those played with the help of computers over days. I don't know where those games can be downloaded, so I'll have to ask around.

I do see some engines clearly doing better in undermining but doing fantastically badly in open files and diagonals, while others did better in the latter than in the former. I hope to get the 3rd test suite ready. It's a good hobby, I should tell you, I really enjoyed every moment of it! :wink:
This has been the "Holy Grail" of testing for years. It is a tough problem. One fairly good indicator that a suite is tactical is that faster hardware produces better results, while in a true positional test this would not be the case. Either you have the knowledge or you don't. For example, take a position where you can capture black's a-pawn and give yourself a "distant pawn majority" (which eventually turns into a distant passed pawn), or capture black's g-pawn, which weakens his pawns a bit but is worth not nearly as much as the majority. The right position won't be depth-sensitive; it will simply determine whether the program understands majorities or not. A book like PPD or something similar might give some good positions...