Tony's positional test suite

Ferdy · Post by **Ferdy** » Sun Jul 30, 2017 7:34 pm

pedrox wrote:

Ferdy wrote:

A. Processor
Brand          : Intel(R) Celeron(R) CPU B800 @ 1.50GHz
Arch           : X86_64
Count          : 2

B. Engine settings
Threads        : 1
Hash (mb)      : 128
Time(s)/pos    : 30.0

C. Test set
Filename       : tony-dcc-caleb.epd
NumPos         : 16

D. Results
Engine                   : Rating   Best  Score  SRate  Elap(s)
Stockfish 8 64           :   3334     10     86   0.82      451
Fire 5 x64               :   3132      8     82   0.78      451
Komodo 9.02 64-bit       :   3200      8     75   0.71      450
Bobcat v8.0              :   2816      8     70   0.67      428
Texel 1.06               :   2947      7     69   0.66      451
Hannibal 1.7 x64         :   2981      8     67   0.64      451
Cheng 4.39               :   2785      6     67   0.64      451
Deuterium v2017.1.35.431 :   2760      6     63   0.60      451
Arasan 20.2              :   2880      5     62   0.59      450
Rhetoric 1.4.3 x64       :   2631      6     61   0.58      429
Ethereal 8.19            :   2506      7     59   0.56      451
spark-1.0                :   2778      5     58   0.55      450
Gaviota v1.0             :   2716      4     55   0.52      450
Alaric 707               :   2479      3     54   0.51      453
Arminius 2014-01-18      :   2346      4     53   0.50      450
Cheese 1.9 64 bits       :   2558      4     52   0.50      450
Maverick 1.5 x64         :   2380      3     43   0.41      451

Linear regression.
Estimated Rating = (2443 x ScoreRate) + 1306
ScoreRate = totalScore/maxScore

For maxScore it seems that you used 104; however, adding up the scores in the EPD file, I think I get 114.

Your observation is correct. There is a bug in my script: the last position (the 16th) was not included.
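For readers who want to apply the regression quoted above, here is a minimal Python sketch. The slope (2443) and intercept (1306) are the values posted in the thread; the function names are hypothetical, and `max_score` is left as a parameter because its correct value is exactly what is being questioned here.

```python
# Coefficients as posted in the thread: Estimated Rating = (2443 x ScoreRate) + 1306
SLOPE = 2443
INTERCEPT = 1306


def score_rate(total_score, max_score):
    """ScoreRate = totalScore / maxScore (max_score is supplied by the caller)."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return total_score / max_score


def estimated_rating(rate):
    """Apply the posted linear regression to a score rate between 0 and 1."""
    return SLOPE * rate + INTERCEPT


if __name__ == "__main__":
    # Score rates taken from the SRate column of the results table.
    for engine, rate in [("Stockfish 8 64", 0.82), ("Maverick 1.5 x64", 0.41)]:
        print(f"{engine}: {estimated_rating(rate):.0f}")
```

With only 16 positions the estimate is coarse: one position's worth of points moves the score rate, and therefore the estimated rating, by a noticeable step.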

zenpawn · Post by **zenpawn** » Sun Jul 30, 2017 8:11 pm

Dann Corbit wrote: Typically, there is a very poor regression between engine strength and EPD test suites.

I remember back in the day, when Shredder topped the Elo charts, it scored 285/300 on WAC which was very average.

It's encouraging to read this as RookieMonster had gotten up to nearly 60% on the STS, but recently dropped to 58% while simultaneously performing better than before in gauntlets against other engines. I kept those changes, but it was still disappointing to not see both measures improve.

first25plus5 · Post by **first25plus5** » Mon Jul 31, 2017 6:13 am

Excellent work, many thanks for producing this.

Rebel · Post by **Rebel** » Mon Jul 31, 2017 7:35 am

Ferdy wrote: This is now fully converted. Duplicates are also removed. Illegal moves are discarded and not replaced; if there is only one move and it is illegal, the EPD is removed.

r3r1k1/1p3nqp/2pp4/p4p2/Pn3P1Q/2N4P/1PPR2P1/3R1BK1 w - - bm Ne2; c0 "positional scores are: Ne2=10, g4=6, Bd3=5, Rxd6=2, Re1=2, Qh5=1, Kh2=1, Be2=1"; id "rebel.pos.01";
4rrk1/pp1b2pp/5n2/3p1N2/8/2QB1qP1/PP3P1P/4RRK1 w - - bm Rxe8; c0 "positional scores are: Rxe8=10, Ne7+=7, Re3=6, Nd4=4"; id "rebel.pos.02";
r6r/p6p/1pnpkn2/q1p2p1p/2P5/2P1P3/P4PP1/1RBQKB1R w K - bm Rb3; c0 "positional scores are: Rb3=10, Qc2=7, Rxh5=7, Be2=7, Bd3=2, g4=2, e4=2, Rb5=1"; id "rebel.pos.03";

Download rebel.epd
https://drive.google.com/file/d/0BwAOsu ... sp=sharing
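The per-move points in the c0 comment of the records above can be read with a few lines of Python. This is only a sketch of one way to do it, not the unreleased script used for the runs in this thread; the function names are hypothetical, and giving 0 points to a move that is not listed is an assumption.

```python
import re

# First record from the sample above, used here as input.
SAMPLE_EPD = (
    'r3r1k1/1p3nqp/2pp4/p4p2/Pn3P1Q/2N4P/1PPR2P1/3R1BK1 w - - bm Ne2; '
    'c0 "positional scores are: Ne2=10, g4=6, Bd3=5, Rxd6=2, Re1=2, '
    'Qh5=1, Kh2=1, Be2=1"; id "rebel.pos.01";'
)


def parse_move_scores(epd_line):
    """Return {move: points} read from the c0 comment of one EPD record."""
    match = re.search(r'c0 "([^"]*)"', epd_line)
    if match is None:
        return {}
    # Each entry looks like "Ne2=10"; the points are the digits after the last "=".
    pairs = re.findall(r'([^\s,]+)=(\d+)', match.group(1))
    return {move: int(points) for move, points in pairs}


def score_move(epd_line, engine_move):
    """Points earned by engine_move; unlisted moves earn 0 (an assumption)."""
    return parse_move_scores(epd_line).get(engine_move, 0)


if __name__ == "__main__":
    print(parse_move_scores(SAMPLE_EPD))
    print(score_move(SAMPLE_EPD, "g4"))
```

The engine's move has to be in the same SAN form as the comment (for example "Ne7+" with the check sign), so a real runner would normalise the move text before the lookup.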

Sample run at 1s/pos

A. Processor
Brand          : Intel(R) Celeron(R) CPU B800 @ 1.50GHz
Arch           : X86_64
Count          : 2

B. Engine settings
Threads        : 1
Hash (mb)      : 128
Time(s)/pos    : 1.0

C. Test set
Filename       : rebel.epd
NumPos         : 657

D. Results
Engine                   : Rating   Best  Score  SRate  Elap(s)

Stockfish 8 64           :   3334    345   3193   0.64      674
Deuterium v2017.1.35.431 :   2760    278   2650   0.53      673

Thanks for doing this. BTW, which interface (util) is used to run those EPD sets?

Ferdy · Post by **Ferdy** » Mon Jul 31, 2017 9:46 am

Rebel wrote:Thanks for doing this. BTW, which interface (util) is used to run those EPD sets?

I am using a script. I hope to release it after improving some output and command line arguments.

first25plus5 · Post by **first25plus5** » Mon Jul 31, 2017 11:31 pm

Robin Smith’s book “Modern Chess Analysis” (Gambit books, 2004) points out that ‘ruler flat’ evaluations (or evaluations that tend to ‘settle’ at approximately one value) indicate fortress draws.
This evaluation behavior is examined further, with later engines, in the paper “Detecting Fortresses in Chess” (Guid & Bratko, 2012).
For example, if an evaluation eventually stabilizes at, say, approximately +2.24 and stays there for some time, this strongly indicates a fortress draw, despite the high evaluation for White.
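The 'ruler flat' idea can be sketched in Python as a check on the evaluations an engine reports at successive search depths. This is an illustration only, not the method from the book or the paper: the function name and the `window`, `tolerance`, and `min_advantage` values are assumptions chosen for the example.

```python
def looks_like_fortress(evals_by_depth, window=6, tolerance=0.10, min_advantage=1.0):
    """Heuristic: a clearly non-zero evaluation that has stopped changing.

    evals_by_depth: evaluations in pawns, one per search depth, shallow to deep.
    window: how many of the deepest evaluations must be flat (assumed value).
    tolerance: largest spread, in pawns, still counted as flat (assumed value).
    min_advantage: smallest absolute evaluation considered (assumed value).
    """
    if len(evals_by_depth) < window:
        return False  # not enough depths to judge
    tail = evals_by_depth[-window:]
    is_flat = max(tail) - min(tail) <= tolerance
    is_advantage = min(abs(value) for value in tail) >= min_advantage
    return is_flat and is_advantage


if __name__ == "__main__":
    settled = [1.50, 1.90, 2.10, 2.24, 2.25, 2.24, 2.23, 2.24, 2.24]
    climbing = [0.50, 0.90, 1.40, 2.00, 2.70, 3.50, 4.40]
    print(looks_like_fortress(settled))
    print(looks_like_fortress(climbing))
```

A flat evaluation is only a hint. A won position can also show a stable score for many plies before the engine finds the breakthrough, so the thresholds would need tuning against known fortress positions.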

jwes · Post by **jwes** » Tue Aug 01, 2017 1:17 am

https://ailab.si/matej/doc/Detecting_Fo ... _Chess.pdf

zullil · Post by **zullil** » Tue Aug 01, 2017 12:12 pm

first25plus5 wrote: Robin Smith’s book “Modern Chess Analysis” (Gambit books, 2004) points out that ‘ruler flat’ evaluations (or evaluations that tend to ‘settle’ at approximately one value) indicate fortress draws.
This evaluation behavior is examined further, with later engines, in the paper “Detecting Fortresses in Chess” (Guid & Bratko, 2012).
For example, if an evaluation eventually stabilizes at, say, approximately +2.24 and stays there for some time, this strongly indicates a fortress draw, despite the high evaluation for White.

From the article:

6 CONCLUSIONS

We introduce a novel idea for detecting fortresses in the game of chess. We demonstrate that a heuristic-search-based program is able to detect fortresses on the basis of backed-up values obtained at different levels of search. If a particular position is a fortress, the program is not able to show any progress towards a win and thus the backed-up values cease to change significantly from a certain search depth on.

Calling this idea "novel" in 2012 seems dubious, at best.

Probably should not comment further...

BeyondCritics · Post by **BeyondCritics** » Tue Aug 01, 2017 11:17 pm

Thank you for that.
I glanced over the test suite, with analysis and diagrams, on the web (http://privat.bahnhof.se/wb432434/pos.htm). These are all open positions, except for #14. That means that in the remaining 15 positions Stockfish should be irrefutable by humans. I checked that conjecture, and indeed in 8(!) out of 15 cases the commentators got it wrong or backwards. How many points would you give for that??
I personally enjoyed this rebuttal the most:

[d]1rN1r1k1/1pq2pp1/2p1nn1p/p2p1B2/3P4/4P2P/PPQ1NPP1/2R2RK1 b - - 0 1

1...Rbxc8 2.Nf4 (allegedly the refutation) Nxf4! 3.Bxc8 Nxg2!

In #14 the alleged best move 1.Nb1, played by Kasparov, is neutralized outright by 1...b5, and Black is fine.
[d]r3r1k1/ppqbbpp1/2pp1nnp/3Pp3/2P1P3/5N1P/PPBN1PP1/R1BQR1K1 w - - 0 1

In #16, after 34.Qxc5 (Stockfish), resigning is an option.
[d]2r2k2/5p2/2Bp1b1r/2qPp1pp/PpN1P3/1P2Q3/5PPP/4R1K1 w - - 0 1

Interestingly, with the help of Stockfish you might save even this position against a strong human master, since after 34.Rc1(?) Qxe3 35.Nxe3(?!) Bd8 36.Rc4(?!) Ba5 37.Nc2(?!) g4 38.Nxb4(??) there follows 38...Rb8 39.Bb5 Bxb4 40.Rxb4 f5! and White is only minimally better (Stockfish).

Never trust your test suite.

Evert · Post by **Evert** » Thu Aug 03, 2017 10:28 am

zullil wrote: Calling this idea "novel" in 2012 seems dubious, at best. Probably should not comment further...

Yes... it's one of those things that make me wonder how it got past the referee. As it is, the paper points out some obvious points and proceeds to offer no real idea for how to handle fortress detection.
Saying that the engines "detect" the fortress by having a flat eval seems rather generous; I'd call not returning a draw score a sign of not detecting the fortress.
Still, the paper has a list of interesting fortress positions that I might use if/when I go back to tinkering with fortress detection.
