In response to Uri's thread about positional understanding

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: In response to Uri's thread about positional understanding

Post by Dann Corbit »

I agree that statistics are extremely important.
That is why I have both a ce value (computer centipawns) and a cce value (computed centipawns as a function of wins/losses/draws) for each position.

If I find that the scores do not agree, and especially if the scores disagree and the computer-evaluated move differs from the most frequently played move, then I reanalyze to greater and greater depth.

Eventually, I either get agreement, a debunking, or at least a better evaluation of the position.

Here is a sample query I used to look for one type of discrepancy:

Code:

select   
round((coef * 444.0 ),0) as oce,
ce, 
round(-(coef * 444.0 - ce),0) as distance, 
e.Epd, 
acd, 
pv, 
bm, 
e.pm, 
white_wins, 
black_wins, 
draws, 
(white_wins+black_wins+draws) as games, 
acs,
id, 
Opening 
from Epd e   
where round(-(coef * 444.0 - ce),0) < 0  /* Computer score is better */
AND abs(ce) < 30000 /* When the computer found a checkmate, who cares */
AND len(Epd) >= 43 /* Opening positions to early midgame */
AND games >= 27  /* Skip positions that are not played much */
AND (NOT dbo.GetFirstWord(pv) = dbo.GetFirstWord(pm))  /* engine move not the same as the most commonly played move */
AND acd < 37 /* For the current pass, I am looking at shallow data */
order by games desc, acd desc, acs desc, round(-(coef * 444.0 - ce),0), Epd
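The cce value (centipawns computed from wins/losses/draws) maps game statistics onto the engine's centipawn scale so the two can be compared directly. The exact formula behind `coef` is not shown in the post, so the sketch below is only an illustrative reconstruction using the standard logistic (Elo-style) score-to-centipawn mapping; the 444.0 factor in the query plays a scaling role similar to `scale` here.

```python
import math

def cce(white_wins, black_wins, draws, scale=400.0):
    """Centipawn-equivalent eval from game statistics (White's view).

    Uses the standard logistic mapping cp = scale * log10(s / (1 - s)),
    where s is the expected score from the database. This is a guess at
    the idea behind the query's coef * 444.0 term, not Dann's formula.
    """
    games = white_wins + black_wins + draws
    if games == 0:
        return 0.0
    s = (white_wins + 0.5 * draws) / games
    # Clamp to avoid infinities for 100% / 0% scores.
    s = min(max(s, 1e-6), 1.0 - 1e-6)
    return scale * math.log10(s / (1.0 - s))

# A position where White scored 60 wins, 20 losses, 20 draws:
print(round(cce(60, 20, 20)))  # 147 (centipawns)
```

With ce and cce on the same scale, the `distance` column in the query is just their difference, and large gaps flag positions worth deeper reanalysis.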
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: In response to Uri's thread about positional understanding

Post by Laskos »

matejst wrote:Thanks, Kai. Very interesting testing, as always.

Could you write what version of Giraffe you used?

Then, there are a few engines I believe play a good positional brand of chess, so if you could test engines like Wasp and iCE, I would be very grateful.

Finally, did you test engines' understanding of endings? I noticed a trend of removing endgame knowledge lately, and I would be very interested in your findings.

Code:

iCE 3.0 (internal book enabled, 1 thread)
score=690/1000 [averages on correct positions: depth=80.5 time=0.07 nodes=98175]

Komodo 11.2.2
score=666/1000 [averages on correct positions: depth=13.2 time=0.71 nodes=3853146]

Houdini 6.03
score=656/1000 [averages on correct positions: depth=14.3 time=0.87 nodes=6708492]

Stockfish 9
score=641/1000 [averages on correct positions: depth=14.0 time=0.74 nodes=4793389]

Stockfish 8
score=628/1000 [averages on correct positions: depth=13.9 time=0.80 nodes=4802227]

Andscacs 0.93
score=598/1000 [averages on correct positions: depth=12.2 time=0.69 nodes=3202933]

Shredder 13
score=573/1000 [averages on correct positions: depth=14.3 time=0.79 nodes=4678511]

Texel 1.08a8
score=489/1000 [averages on correct positions: depth=10.3 time=0.53 nodes=3053861]

iCE 3.0 (no internal book, 1 thread)
score=472/1000 [averages on correct positions: depth=11.8 time=0.73 nodes=992618]

Wasp 2.60
score=429/1000 [averages on correct positions: depth=8.9 time=0.50 nodes=2259584]

Giraffe 161023 (1 thread)
score=410/1000 [averages on correct positions: depth=10.0 time=0.68 nodes=167994]

RomiChessP3n default (1 thread)
score=392/1000 [averages on correct positions: depth=11.7 time=0.88 nodes=4934412]
Interesting results. I might start endgame experiments later today; some of them are straightforward.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: In response to Uri's thread about positional understanding

Post by Laskos »

Dann Corbit wrote:I agree that statistics are extremely important.
That is why I have both a ce value (computer centipawns) and a cce value (computed centipawns as a function of wins/losses/draws) for each position.

If I find that the scores do not agree, and especially if the scores disagree and the computer-evaluated move differs from the most frequently played move, then I reanalyze to greater and greater depth.

Eventually, I either get agreement, a debunking, or at least a better evaluation of the position.

Here is a sample query I used to look for one type of discrepancy:

Code:

select   
round((coef * 444.0 ),0) as oce,
ce,
round(-(coef * 444.0 - ce),0) as distance,
e.Epd,
acd,
pv,
bm,
e.pm,
white_wins,
black_wins,
draws,
(white_wins+black_wins+draws) as games,
acs,
id,
Opening
from Epd e
where round(-(coef * 444.0 - ce),0) < 0  /* Computer score is better */
AND abs(ce) < 30000 /* When the computer found a checkmate, who cares */
AND len(Epd) >= 43 /* Opening positions to early midgame */
AND games >= 27  /* Skip positions that are not played much */
AND (NOT dbo.GetFirstWord(pv) = dbo.GetFirstWord(pm))  /* engine move not the same as the most commonly played move */
AND acd < 37 /* For the current pass, I am looking at shallow data */
order by games desc, acd desc, acs desc, round(-(coef * 444.0 - ce),0), Epd
My problem is that, for the openings, I would rather have say 20 positions too hard for any engine plus 20 wrong solutions than a total of say 30 wrong solutions which engines "solve" for the wrong reasons at some very long time control. The top 3 engines nowadays are not that dissimilar: all have Sim results above 50% among themselves, so there is some convergence of evals (not suspect, though, until say above 60-65%). They might therefore go for the same wrong solution, and that amounts to overfitting the test suite to these engines. I used short-time-control engine analysis only to eliminate positions that are too easy for engines and to avoid some tactical positions. For the very hard positional opening positions, I rely mostly on the statistics of outcomes in large databases.
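The "Sim" figures mentioned here come from move-matching tests: give two engines the same set of positions and count how often they choose the same move. A minimal sketch of that idea (the move lists are made up for illustration; the real Sim tool runs engines at fixed depth over thousands of positions):

```python
def similarity(moves_a, moves_b):
    """Fraction of positions where two engines pick the same move.

    Inputs are lists of chosen moves over the same positions, in the
    same order -- an assumed format for this illustration.
    """
    assert len(moves_a) == len(moves_b)
    agree = sum(1 for a, b in zip(moves_a, moves_b) if a == b)
    return agree / len(moves_a)

# Hypothetical best-move lists for two engines over five positions:
a = ["e2e4", "g1f3", "d2d4", "c2c4", "b1c3"]
b = ["e2e4", "g1f3", "d2d4", "g2g3", "b1c3"]
print(f"{similarity(a, b):.0%}")  # 80%
```

High pairwise agreement among the top engines is exactly why a suite can overfit: positions they all get wrong for the same reason look like legitimate failures.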
matejst
Posts: 364
Joined: Mon May 14, 2007 8:20 pm
Full name: Boban Stanojević

Re: In response to Uri's thread about positional understanding

Post by matejst »

Once again, thank you, Kai. I'll be following this thread for more, of course.

Could you make your database available? I would like to test some engines myself but under different conditions. I usually test engines in QG and similar positions that are in my own repertoire, but I am curious to see what they can achieve in a set of positions of different nature, and I am sure that you have carefully chosen your openings.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: In response to Uri's thread about positional understanding

Post by Laskos »

matejst wrote:Once again, thank you, Kai. I'll be following this thread for more, of course.

Could you make your database available? I would like to test some engines myself but under different conditions. I usually test engines in QG and similar positions that are in my own repertoire, but I am curious to see what they can achieve in a set of positions of different nature, and I am sure that you have carefully chosen your openings.
Well, the suite is not that high quality, and the openings are not even very representative. It's a bit of a sloppy one-week job, but I am still not that discontent with this positional opening suite.
Here is the beta7 200-opening suite:
http://s000.tinyupload.com/?file_id=699 ... 7085783885

I have also tested the late endgame phase.

1/ Hard 6-men wins
Time control: 0.5 s/move.

Code:

Rank Name                          ELO     +/-   Games   Score   Draws
   1 SF_6_men                       92      19    1000     63%     26%
   2 SF                              7      18    1000     51%     30%
   3 Houdini                        -8      18    1000     49%     31%
   4 Shredder                      -18      18    1000     47%     33%
   5 Komodo                        -26      18    1000     46%     34%
   6 Andscacs                      -45      18    1000     44%     34%
Finished match
SF_6_men is SF with 6-men Syzygy tablebases, a perfect player here. All other entries have no TBs enabled. The surprise is the weak performance of Komodo on these hard 6-men wins, and the pretty strong performance of Shredder. The correct pentanomial error margins are 2-3 times smaller than the trinomial ones shown in Cutechess.
It's still not clear whether these wins are mostly tactical or positional.
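The pentanomial-vs-trinomial point can be made concrete. With paired openings (each opening played once per color), the natural unit is the game pair, whose score is 0, 0.5, 1, 1.5, or 2; scoring pairs captures the correlation between the two games of a pair that a per-game (trinomial) model ignores. A sketch with hypothetical numbers, chosen so that both functions describe the same 1000 games:

```python
import math

def trinomial_se(wins, draws, losses):
    """Standard error of the mean score, treating each game as independent."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    var = (wins * 1.0 + draws * 0.25) / n - mean ** 2
    return math.sqrt(var / n)

def pentanomial_se(pair_counts):
    """Standard error from game-pair outcomes.

    pair_counts = [n0, n05, n1, n15, n2]: number of opening pairs scored
    0, 0.5, 1, 1.5, 2 points. Scoring pairs captures the within-pair
    correlation that the trinomial model misses.
    """
    n = sum(pair_counts)
    scores = [0.0, 0.25, 0.5, 0.75, 1.0]  # pair score / 2 = per-game mean
    mean = sum(c * s for c, s in zip(pair_counts, scores)) / n
    var = sum(c * s * s for c, s in zip(pair_counts, scores)) / n - mean ** 2
    return math.sqrt(var / n)

# Hypothetical 500-pair (1000-game) match with unbalanced openings.
# The pair counts describe the same games as the trinomial call:
# the 400 score-1 pairs split into 290 win+loss and 110 draw+draw pairs.
tri = trinomial_se(350, 300, 350)
pent = pentanomial_se([10, 40, 400, 40, 10])
print(f"trinomial SE {tri:.4f}, pentanomial SE {pent:.4f}")
```

With unbalanced openings most pairs finish 1-1, so pair scores cluster tightly around 1.0 and the pentanomial error bar shrinks, here by roughly a factor of 2, in line with the 2-3x quoted above.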

2/ 6-men Fortresses
These are 11 6-men fortresses by Ferdinand Mosca:

Code:

6k1/8/6PP/3B1K2/8/2b5/8/8 b - - 0 1 
8/8/r5kP/6P1/1R3K2/8/8/8 w - - 0 1 
7k/R7/7P/6K1/8/8/2b5/8 w - - 0 1 
8/8/5k2/8/8/4qBB1/6K1/8 w - - 0 1 
8/8/8/3K4/8/4Q3/2p5/1k6 w - - 0 1 
8/8/4nn2/4k3/8/Q4K2/8/8 w - - 0 1 
8/k7/p7/Pr6/K1Q5/8/8/8 w - - 0 1 
k7/p4R2/P7/1K6/8/6b1/8/8 w - - 0 1
6k1/6Pp/7P/8/3BK3/8/8/8 w - - 0 1 
6k1/7p/5K1P/8/8/7P/6P1/8 w - - 0 1 
8/8/8/6k1/2q3p1/4R3/5PK1/8 w - - 0 1
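Each fortress above is a 6-man position (kings included). As a quick sanity check, the board field of a FEN can be scanned for piece letters; a small stdlib-only sketch over a subset of the listed positions:

```python
def piece_count(fen):
    """Count pieces (both kings included) in a FEN's board field."""
    board = fen.split()[0]
    return sum(1 for ch in board if ch.isalpha())

# Three of the eleven fortress positions from the list above:
fortresses = [
    "6k1/8/6PP/3B1K2/8/2b5/8/8 b - - 0 1",
    "8/8/r5kP/6P1/1R3K2/8/8/8 w - - 0 1",
    "8/8/8/6k1/2q3p1/4R3/5PK1/8 w - - 0 1",
]
print([piece_count(f) for f in fortresses])  # [6, 6, 6]
```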
The mistakes here are more positional than tactical. I played games from these positions at 0.5 s/move, with 2 threads per engine (since there are only 11 opening positions, 2 threads give more game diversity):

Code:

Rank Name                          ELO     +/-   Games   Score   Draws
   1 SF_6_men                       38      24     100     56%     87%
   2 SF                             24      24     100     54%     87%
   3 Shredder                       10      26     100     52%     85%
   4 Houdini                        -3      28     100     50%     83%
   5 Komodo                        -24      29     100     46%     81%
   6 Andscacs                      -45      29     100     44%     81%
Finished match
SF_6_men is the perfect player here. Again a weak performance from Komodo and a stronger-than-expected one from Shredder, as before. This hints that both results simply reflect the engines' late-endgame understanding.
matejst
Posts: 364
Joined: Mon May 14, 2007 8:20 pm
Full name: Boban Stanojević

Re: In response to Uri's thread about positional understanding

Post by matejst »

I downloaded the database. Thanks.

How much do you think depth influences the evaluation in the endgame testing (you gave the average depth for the openings)?
Jouni
Posts: 3283
Joined: Wed Mar 08, 2006 8:15 pm

Re: In response to Uri's thread about positional understanding

Post by Jouni »

I tried Kai's original 200-position set with 3 top engines. Interestingly, ALL get almost the same score/time with 1 and 4 cores! So it's really a positional test? But I guess SF is simply better than opening theory in the same openings. Maybe it's better to PLAY engines from these openings. But I know already which engine wins :) .
Jouni
matejst
Posts: 364
Joined: Mon May 14, 2007 8:20 pm
Full name: Boban Stanojević

Re: In response to Uri's thread about positional understanding

Post by matejst »

Dear Jouni,

It is indeed a positional test set, imho. But testing automatically, with short time controls, can distort the results a bit. A human look is also important here. I tested Wasp 2.6 and Komodo 9 on the first 20 positions, and I noticed that Komodo was better -- or Wasp worse -- in dynamic positions where the engine had to evaluate the quality of a pawn sacrifice, for example. I am just at the beginning of this comparison, and I am not sure when it will be finished, but the differences in eval seem to be smaller at longer time controls, something that should be expected.

The character of the chosen positions is a solid mix of quiet and dynamic, and it can reveal the weaknesses of the tested engines. I can see -- but it was clear already -- that Wasp, just like Zarkov, underestimates tactical threats.

I will continue with iCE, The Baron and SF, and I am very interested to see, in difficult positions, how deep the engines have to go to find one of the solutions.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: In response to Uri's thread about positional understanding

Post by Laskos »

Laskos wrote:
matejst wrote:Thanks, Kai. Very interesting testing, as always.

Could you write what version of Giraffe you used?

Then, there are a few engines I believe play a good positional brand of chess, so if you could test engines like Wasp and iCE, I would be very grateful.

Finally, did you test engines' understanding of endings? I noticed a trend of removing endgame knowledge lately, and I would be very interested in your findings.

Code:

iCE 3.0 (internal book enabled, 1 thread)
score=690/1000 [averages on correct positions: depth=80.5 time=0.07 nodes=98175]

Komodo 11.2.2
score=666/1000 [averages on correct positions: depth=13.2 time=0.71 nodes=3853146]

Houdini 6.03
score=656/1000 [averages on correct positions: depth=14.3 time=0.87 nodes=6708492]

Stockfish 9
score=641/1000 [averages on correct positions: depth=14.0 time=0.74 nodes=4793389]

Stockfish 8
score=628/1000 [averages on correct positions: depth=13.9 time=0.80 nodes=4802227]

Andscacs 0.93
score=598/1000 [averages on correct positions: depth=12.2 time=0.69 nodes=3202933]

Shredder 13
score=573/1000 [averages on correct positions: depth=14.3 time=0.79 nodes=4678511]

Texel 1.08a8
score=489/1000 [averages on correct positions: depth=10.3 time=0.53 nodes=3053861]

iCE 3.0 (no internal book, 1 thread)
score=472/1000 [averages on correct positions: depth=11.8 time=0.73 nodes=992618]

Wasp 2.60
score=429/1000 [averages on correct positions: depth=8.9 time=0.50 nodes=2259584]

Giraffe 161023 (1 thread)
score=410/1000 [averages on correct positions: depth=10.0 time=0.68 nodes=167994]

RomiChessP3n default (1 thread)
score=392/1000 [averages on correct positions: depth=11.7 time=0.88 nodes=4934412]
Interesting results. I might start endgame experiments later today; some of them are straightforward.
I also tested the newest incarnations of the top engines, and they do seem to improve a bit. I am pretty satisfied with the suite: it will probably flatten out at some 850/1000 or so, and then the results will become pretty irrelevant, but it's a long way to that. For now it seems to work; it is a difficult suite for all engines, and it has not been hyper-analyzed solely by engines. The standard deviation of the results is 9-10 points, so for now Komodo 11.3.1 seems definitely the best at this opening suite, excluding engines containing an opening book (iCE 3.0).

Code:

iCE 3.0 (internal book enabled, 1 thread)
score=690/1000 [averages on correct positions: depth=80.5 time=0.07 nodes=98175]

Komodo 11.3.1
score=674/1000 [averages on correct positions: depth=13.3 time=0.83 nodes=3971363]

Komodo 11.2.2
score=666/1000 [averages on correct positions: depth=13.2 time=0.71 nodes=3853146]

Houdini 6.03
score=656/1000 [averages on correct positions: depth=14.3 time=0.87 nodes=6708492]

BrainFish 180313
score=650/1000 [averages on correct positions: depth=13.3 time=0.76 nodes=3993087]

Stockfish 9
score=641/1000 [averages on correct positions: depth=14.0 time=0.74 nodes=4793389]

Stockfish 8
score=628/1000 [averages on correct positions: depth=13.9 time=0.80 nodes=4802227]

Andscacs 0.93
score=598/1000 [averages on correct positions: depth=12.2 time=0.69 nodes=3202933]

Shredder 13
score=573/1000 [averages on correct positions: depth=14.3 time=0.79 nodes=4678511]

Texel 1.08a8
score=489/1000 [averages on correct positions: depth=10.3 time=0.53 nodes=3053861]

iCE 3.0 (no internal book, 1 thread)
score=472/1000 [averages on correct positions: depth=11.8 time=0.73 nodes=992618]

Wasp 2.60
score=429/1000 [averages on correct positions: depth=8.9 time=0.50 nodes=2259584]

Giraffe 161023 (1 thread)
score=410/1000 [averages on correct positions: depth=10.0 time=0.68 nodes=167994]

RomiChessP3n default (1 thread)
score=392/1000 [averages on correct positions: depth=11.7 time=0.88 nodes=4934412]