Tony's positional test suite

Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Sample regression

Post by Ferdy »

pedrox wrote:
Ferdy wrote:

A. Processor
Brand          : Intel(R) Celeron(R) CPU B800 @ 1.50GHz
Arch           : X86_64
Count          : 2

B. Engine settings
Threads        : 1
Hash (mb)      : 128
Time(s)/pos    : 30.0

C. Test set
Filename       : tony-dcc-caleb.epd
NumPos         : 16

D. Results
Engine                   : Rating   Best  Score  SRate  Elap(s)
Stockfish 8 64           :   3334     10     86   0.82      451
Fire 5 x64               :   3132      8     82   0.78      451
Komodo 9.02 64-bit       :   3200      8     75   0.71      450
Bobcat v8.0              :   2816      8     70   0.67      428
Texel 1.06               :   2947      7     69   0.66      451
Hannibal 1.7 x64         :   2981      8     67   0.64      451
Cheng 4.39               :   2785      6     67   0.64      451
Deuterium v2017.1.35.431 :   2760      6     63   0.60      451
Arasan 20.2              :   2880      5     62   0.59      450
Rhetoric 1.4.3 x64       :   2631      6     61   0.58      429
Ethereal 8.19            :   2506      7     59   0.56      451
spark-1.0                :   2778      5     58   0.55      450
Gaviota v1.0             :   2716      4     55   0.52      450
Alaric 707               :   2479      3     54   0.51      453
Arminius 2014-01-18      :   2346      4     53   0.50      450
Cheese 1.9 64 bits       :   2558      4     52   0.50      450
Maverick 1.5 x64         :   2380      3     43   0.41      451
Linear regression.
Estimated Rating = (2443 x ScoreRate) + 1306
ScoreRate = totalScore/maxScore

For maxScore it seems that you have used 104; however, adding up the scores in the EPD file, I think I get 114.
Your observation is correct. There is a bug in my script: the last position (the 16th) was not included.
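
As a rough illustration of the regression discussed above, here is a minimal Python sketch. It is not Ferdy's script, which has not been published; the names `fit_line` and `estimated_rating` are hypothetical. The data are the Rating and SRate columns copied from the 30 s/pos table, so a fit on these rounded values should land close to the quoted coefficients but need not match them exactly.

```python
# (rating, score_rate) pairs copied from the 30 s/pos results table above.
RESULTS = [
    (3334, 0.82), (3132, 0.78), (3200, 0.71), (2816, 0.67), (2947, 0.66),
    (2981, 0.64), (2785, 0.64), (2760, 0.60), (2880, 0.59), (2631, 0.58),
    (2506, 0.56), (2778, 0.55), (2716, 0.52), (2479, 0.51), (2346, 0.50),
    (2558, 0.50), (2380, 0.41),
]


def fit_line(points):
    """Ordinary least-squares fit of rating = slope * score_rate + intercept."""
    n = len(points)
    mean_x = sum(rate for _, rate in points) / n
    mean_y = sum(rating for rating, _ in points) / n
    sxx = sum((rate - mean_x) ** 2 for _, rate in points)
    sxy = sum((rate - mean_x) * (rating - mean_y) for rating, rate in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimated_rating(total_score, max_score, slope=2443, intercept=1306):
    """Apply the quoted formula; the defaults are the coefficients from the post."""
    score_rate = total_score / max_score
    return slope * score_rate + intercept


slope, intercept = fit_line(RESULTS)
print(f"Estimated Rating = ({slope:.0f} x ScoreRate) + {intercept:.0f}")
```

Note that `max_score` matters: the same total score gives a different ScoreRate, and therefore a different estimate, depending on whether all 16 positions are counted.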
zenpawn
Posts: 349
Joined: Sat Aug 06, 2016 8:31 pm
Location: United States

Re: Sample regression

Post by zenpawn »

Dann Corbit wrote: Typically, there is a very poor regression between engine strength and EPD test suites.

I remember back in the day, when Shredder topped the Elo charts, it scored 285/300 on WAC which was very average.
It's encouraging to read this as RookieMonster had gotten up to nearly 60% on the STS, but recently dropped to 58% while simultaneously performing better than before in gauntlets against other engines. I kept those changes, but it was still disappointing to not see both measures improve.
first25plus5
Posts: 11
Joined: Sat Jul 22, 2017 2:50 am
Location: New Zealand

Re: Sample regression

Post by first25plus5 »

Excellent work, many thanks for producing this.
Rebel
Posts: 7257
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Tony's positional test suite

Post by Rebel »

Ferdy wrote:This is now fully converted. Duplicates are also removed. Illegal moves are discarded and not replaced; if a position has only one move and it is illegal, the EPD is removed.

r3r1k1/1p3nqp/2pp4/p4p2/Pn3P1Q/2N4P/1PPR2P1/3R1BK1 w - - bm Ne2; c0 "positional scores are: Ne2=10, g4=6, Bd3=5, Rxd6=2, Re1=2, Qh5=1, Kh2=1, Be2=1"; id "rebel.pos.01";
4rrk1/pp1b2pp/5n2/3p1N2/8/2QB1qP1/PP3P1P/4RRK1 w - - bm Rxe8; c0 "positional scores are: Rxe8=10, Ne7+=7, Re3=6, Nd4=4"; id "rebel.pos.02";
r6r/p6p/1pnpkn2/q1p2p1p/2P5/2P1P3/P4PP1/1RBQKB1R w K - bm Rb3; c0 "positional scores are: Rb3=10, Qc2=7, Rxh5=7, Be2=7, Bd3=2, g4=2, e4=2, Rb5=1"; id "rebel.pos.03";
Download rebel.epd
https://drive.google.com/file/d/0BwAOsu ... sp=sharing
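
The `c0` comment in each record carries the per-move points, so scoring an engine's answer comes down to a lookup. The following Python sketch shows one way to parse that field. The function names are hypothetical, and two things are assumptions rather than facts from the thread: that the engine's move is already available in the same SAN spelling the EPD uses, and that the maximum score is the sum of the best listed score per position.

```python
import re

# Matches the c0 comment used in rebel.epd, e.g.
# c0 "positional scores are: Ne2=10, g4=6, Bd3=5";
SCORE_RE = re.compile(r'c0\s+"positional scores are:\s*([^"]*)"')


def parse_scores(epd_line):
    """Return {san_move: points} from the c0 field, or {} if it is absent."""
    match = SCORE_RE.search(epd_line)
    if match is None:
        return {}
    scores = {}
    for item in match.group(1).split(","):
        # rpartition keeps promotions such as "e8=Q=10" intact.
        move, _, points = item.strip().rpartition("=")
        if move and points.isdigit():
            scores[move] = int(points)
    return scores


def position_score(epd_line, engine_move):
    """Points for the engine's move; moves not listed score 0."""
    return parse_scores(epd_line).get(engine_move, 0)


def max_score(epd_lines):
    """Assumed definition: sum of the highest listed score in each position."""
    return sum(max(parse_scores(line).values(), default=0) for line in epd_lines)


SAMPLE = [
    'r3r1k1/1p3nqp/2pp4/p4p2/Pn3P1Q/2N4P/1PPR2P1/3R1BK1 w - - bm Ne2; c0 "positional scores are: Ne2=10, g4=6, Bd3=5, Rxd6=2, Re1=2, Qh5=1, Kh2=1, Be2=1"; id "rebel.pos.01";',
    '4rrk1/pp1b2pp/5n2/3p1N2/8/2QB1qP1/PP3P1P/4RRK1 w - - bm Rxe8; c0 "positional scores are: Rxe8=10, Ne7+=7, Re3=6, Nd4=4"; id "rebel.pos.02";',
    'r6r/p6p/1pnpkn2/q1p2p1p/2P5/2P1P3/P4PP1/1RBQKB1R w K - bm Rb3; c0 "positional scores are: Rb3=10, Qc2=7, Rxh5=7, Be2=7, Bd3=2, g4=2, e4=2, Rb5=1"; id "rebel.pos.03";',
]

print(position_score(SAMPLE[1], "Ne7+"), max_score(SAMPLE))
```

A real runner would also have to convert the engine's coordinate-notation output to SAN before the lookup; that step is left out here.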

Sample run at 1s/pos

A. Processor
Brand          : Intel(R) Celeron(R) CPU B800 @ 1.50GHz
Arch           : X86_64
Count          : 2

B. Engine settings
Threads        : 1
Hash (mb)      : 128
Time(s)/pos    : 1.0

C. Test set
Filename       : rebel.epd
NumPos         : 657

D. Results
Engine                   : Rating   Best  Score  SRate  Elap(s)

Stockfish 8 64           :   3334    345   3193   0.64      674
Deuterium v2017.1.35.431 :   2760    278   2650   0.53      673
Thanks for doing this. BTW, which interface (util) is used to run those EPD sets?
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Tony's positional test suite

Post by Ferdy »

Rebel wrote:Thanks for doing this. BTW, which interface (util) is used to run those EPD sets?
I am using a script. I hope to release it after improving some output and command line arguments.
first25plus5
Posts: 11
Joined: Sat Jul 22, 2017 2:50 am
Location: New Zealand

Re: Tony's positional test suite

Post by first25plus5 »

Robin Smith’s book “Modern Chess Analysis” (Gambit Books, 2004) points out that ‘ruler flat’ evaluations, or a tendency of the evaluation to settle at roughly one value, indicate fortress draws.
This evaluation behavior is examined further, with later engines, in the paper “Detecting Fortresses in Chess” (Guid & Bratko, 2012).
For example, if an evaluation eventually stabilizes at, say, approximately +2.24 and stays there for some time, this strongly indicates a fortress draw despite the high evaluation for White.
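
To make the 'ruler flat' idea concrete, here is a minimal Python sketch. It is not taken from the book or the paper; the function name and every threshold (`window`, `tolerance`, `min_advantage`) are hypothetical values chosen for illustration. It assumes a list of evaluations in pawns, one per successive search depth.

```python
def is_ruler_flat(evals, window=8, tolerance=0.10, min_advantage=1.0):
    """Heuristic check for a flat but clearly non-zero evaluation.

    evals         -- evaluations in pawns, one per successive search depth
    window        -- how many of the latest depths must agree (assumed value)
    tolerance     -- largest allowed spread inside the window (assumed value)
    min_advantage -- ignore evaluations that are already near equality
    """
    if len(evals) < window:
        return False  # not enough depths to judge
    recent = evals[-window:]
    spread = max(recent) - min(recent)
    mean = sum(recent) / window
    return spread <= tolerance and abs(mean) >= min_advantage


# Evaluation climbs and then settles near +2.24: suspicious.
settled = [1.10, 1.60, 2.05, 2.20, 2.24, 2.25, 2.24, 2.23, 2.24, 2.24, 2.25, 2.24]
# Evaluation keeps growing with depth: the engine is making progress.
growing = [0.50, 0.80, 1.20, 1.70, 2.30, 3.00, 3.90, 5.00]

print(is_ruler_flat(settled), is_ruler_flat(growing))
```

A flat line is only a hint, not a proof: a position can also show a stable evaluation simply because the winning plan lies beyond the search horizon.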
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Tony's positional test suite

Post by jwes »

zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Tony's positional test suite

Post by zullil »

first25plus5 wrote:Robin Smith’s book “Modern Chess Analysis” (Gambit Books, 2004) points out that ‘ruler flat’ evaluations, or a tendency of the evaluation to settle at roughly one value, indicate fortress draws.
This evaluation behavior is examined further, with later engines, in the paper “Detecting Fortresses in Chess” (Guid & Bratko, 2012).
For example, if an evaluation eventually stabilizes at, say, approximately +2.24 and stays there for some time, this strongly indicates a fortress draw despite the high evaluation for White.
From the article:
6 CONCLUSIONS

We introduce a novel idea for detecting fortresses in the
game of chess. We demonstrate that a heuristic-search-based
program is able to detect fortresses on the basis of
backed-up values obtained at different levels of search.
If a particular position is a fortress, the program is not
able to show any progress towards a win and thus the
backed-up values cease to change significantly from a
certain search depth on.
Calling this idea "novel" in 2012 seems dubious, at best. :cry: Probably should not comment further...
BeyondCritics
Posts: 407
Joined: Sat May 05, 2012 2:48 pm
Full name: Oliver Roese

Re: Tony's positional test suite

Post by BeyondCritics »

Thank you for that.
I skimmed the test suite with analysis and diagrams on the web (http://privat.bahnhof.se/wb432434/pos.htm). These are all open positions, except for #14. That means that in the remaining 15 positions Stockfish should be irrefutable by humans. I checked that conjecture, and indeed in 8(!) out of 15 cases the commentators got it wrong or backwards. How many points would you give for that??
I personally enjoyed this rebuttal the most:

[d]1rN1r1k1/1pq2pp1/2p1nn1p/p2p1B2/3P4/4P2P/PPQ1NPP1/2R2RK1 b - - 0 1

1...Rbxc8 2.Nf4 (allegedly the refutation) Nxf4! 3.Bxc8 Nxg2!

In #14 the alleged best move 1.Nb1, played by Kasparov, is neutralized outright by 1...b5, and Black is fine.
[d]r3r1k1/ppqbbpp1/2pp1nnp/3Pp3/2P1P3/5N1P/PPBN1PP1/R1BQR1K1 w - - 0 1


In #16, after 34.Qxc5 (Stockfish), resigning is an option.
[d]2r2k2/5p2/2Bp1b1r/2qPp1pp/PpN1P3/1P2Q3/5PPP/4R1K1 w - - 0 1

Interestingly, with the help of Stockfish you might save even this position against a strong human master: after 34.Rc1(?) Qxe3 35.Nxe3(?!) Bd8 36.Rc4(?!) Ba5 37.Nc2(?!) g4 38.Nxb4(??) there follows 38...Rb8 39.Bb5 Bxb4 40.Rxb4 f5! and White is only minimally better (Stockfish).

Never trust your test suite.
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Tony's positional test suite

Post by Evert »

zullil wrote: Calling this idea "novel" in 2012 seems dubious, at best. :cry: Probably should not comment further...
Yes... it's one of those things that make me wonder how it got past the referee. As it is, the paper points out some obvious points and proceeds to offer no real idea for how to handle fortress detection.
Saying that the engines "detect" the fortress by having a flat eval seems rather generous; I'd call not returning a draw score a sign of not detecting the fortress.
Still, the paper has a list of interesting fortress positions that I might use if/when I go back to tinkering with fortress detection.