Testing endgame strength

AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Testing endgame strength

Post by AlvaroBegue »

Hi,

I want to start working on the late-endgame evaluation in RuyDos. I have a lot of ideas of what to try, but no good way to test them. The standard mechanism of playing a gazillion bullet games has the problem that most bullet games (at least at my engine's current skill level) don't make it to the late endgame, so from the point of view of what I am trying to test, the signal to noise is dismal.

Do you guys have a good collection of starting positions to test endgame strength?

Thanks,
Álvaro.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Testing endgame strength

Post by Dann Corbit »

What exactly are you looking for?
Tablebase positions which just require lookup?
Late middle game?
Early endgame with 10 pieces or more?
Something else?
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Testing endgame strength

Post by AlvaroBegue »

Dann Corbit wrote:What exactly are you looking for?
Tablebase positions which just require lookup?
Late middle game?
Early endgame with 10 pieces or more?
Something else?
I want to test how well the engine evaluates positions with 8 or fewer pieces. So starting from positions with 10 or 12 pieces would probably be good, but they should be "interesting": the result shouldn't be completely clear, and there should be a representative variety of the endgames that actually occur in games.

The only ideas I have for how to collect such a dataset seem very expensive, so I was hoping someone had a ready-made set, or perhaps just better ideas.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Testing endgame strength

Post by Laskos »

AlvaroBegue wrote:Hi,

I want to start working on the late-endgame evaluation in RuyDos. I have a lot of ideas of what to try, but no good way to test them. The standard mechanism of playing a gazillion bullet games has the problem that most bullet games (at least at my engine's current skill level) don't make it to the late endgame, so from the point of view of what I am trying to test, the signal to noise is dismal.

Do you guys have a good collection of starting positions to test endgame strength?

Thanks,
Álvaro.
http://s000.tinyupload.com/?file_id=637 ... 6449648210

These are endgame suites with varying degrees of imbalance, taken from real games, with roughly the number of pieces you ask for (on average). End_02_05.epd means that the imbalance is 20-50 cp according to Stockfish. I collected the positions from games between strong engines and analyzed them for imbalance with Stockfish. Positions that are too balanced give too high a draw rate with strong engines, so the sensitivity of the test decreases. If they are too unbalanced, you will have to use the pentanomial model to assess the variance (sigma) of the result, which is not yet implemented in Cutechess.
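
For anyone who wants to build a similar suite, the selection step might look like the sketch below. It assumes every EPD record carries a `ce` (centipawn evaluation) opcode; whether the files in the archive do is not stated here. The function names, the 20-50 cp window, and the piece limit are illustrative choices, not the procedure actually used.

```python
import re

def piece_count(epd_line):
    """Count all pieces (kings and pawns included) in the board field."""
    board = epd_line.split()[0]
    return sum(ch.isalpha() for ch in board)

def centipawn_eval(epd_line):
    """Return the value of the `ce` opcode, or None if it is absent."""
    # Require whitespace or ';' before "ce" so that "cce" is not matched.
    match = re.search(r"(?:^|[\s;])ce\s+(-?\d+)\s*;", epd_line)
    return int(match.group(1)) if match else None

def select_positions(epd_lines, min_cp=20, max_cp=50, max_pieces=12):
    """Keep records with min_cp <= |ce| <= max_cp and at most max_pieces pieces."""
    kept = []
    for line in epd_lines:
        line = line.strip()
        if not line:
            continue
        ce = centipawn_eval(line)
        if ce is None:
            continue  # no evaluation recorded, so the imbalance is unknown
        if min_cp <= abs(ce) <= max_cp and piece_count(line) <= max_pieces:
            kept.append(line)
    return kept
```

The absolute value is used so that positions favoring either side are kept; a suite meant to be played with both colors would not need to distinguish them.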
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Testing endgame strength

Post by AlvaroBegue »

Laskos wrote: http://s000.tinyupload.com/?file_id=637 ... 6449648210

These are endgame suites with varying degrees of imbalance, taken from real games, with roughly the number of pieces you ask for (on average). End_02_05.epd means that the imbalance is 20-50 cp according to Stockfish. I collected the positions from games between strong engines and analyzed them for imbalance with Stockfish. Positions that are too balanced give too high a draw rate with strong engines, so the sensitivity of the test decreases. If they are too unbalanced, you will have to use the pentanomial model to assess the variance (sigma) of the result, which is not yet implemented in Cutechess.
Thanks, Kai! I'll try to use that and see if I get meaningful results.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Testing endgame strength

Post by AlvaroBegue »

I am testing the change of using w/d/l statistics from the database of positions I use in RuyTune to assign a value to material configurations with 7 or fewer pieces for which I have at least 31 samples.
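
A rough sketch of that kind of table, for readers who want to try the same idea. It assumes the database can be read as (FEN, result) pairs with the result given from White's point of view (1, 0.5 or 0); the material-key format and the function names are made up for illustration, and how the average score is then turned into an evaluation term is left open.

```python
from collections import Counter, defaultdict

def material_key(fen):
    """Build a key such as 'KRPkr' from the board field of a FEN string."""
    board = fen.split()[0]
    counts = Counter(ch for ch in board if ch.isalpha())
    return "".join(piece * counts[piece] for piece in "KQRBNPkqrbnp")

def material_table(samples, max_pieces=7, min_samples=31):
    """Average White score per material configuration.

    Only configurations with at most max_pieces pieces and at least
    min_samples samples are kept.
    """
    totals = defaultdict(lambda: [0, 0.0])  # key -> [sample count, score sum]
    for fen, result in samples:
        key = material_key(fen)
        if len(key) <= max_pieces:
            totals[key][0] += 1
            totals[key][1] += result
    return {key: score / n for key, (n, score) in totals.items() if n >= min_samples}
```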

With my usual set of opening positions, the benefit is lost in the noise:
1394-1355-2361, +3 Elo, LOS=0.771512

With Kai's End_02_05.epd so far I have:
270-215-380, +22 Elo, LOS: 0.993745

So indeed it looks like the signal to noise is much improved.
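
For reference, the Elo and LOS figures can be recomputed from the raw counts with the small sketch below. It assumes the triples are wins-losses-draws and uses the usual logistic Elo formula together with the normal approximation for LOS (likelihood of superiority), in which draws do not enter.

```python
import math

def elo_and_los(wins, losses, draws):
    """Return (Elo difference, likelihood of superiority) for a W-L-D record."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Logistic Elo model: expected score -> rating difference.
    elo = 400.0 * math.log10(score / (1.0 - score))
    # Normal approximation: LOS = Phi((W - L) / sqrt(W + L)).
    los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))
    return elo, los

for record in [(1394, 1355, 2361), (270, 215, 380)]:
    elo, los = elo_and_los(*record)
    print(record, f"{elo:+.1f} Elo, LOS: {los:.4f}")
```

Under that reading, the first record comes out at about +2.7 Elo with LOS 0.7715 and the second at about +22.1 Elo with LOS 0.9937, which matches the numbers quoted above.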

Thanks again!
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Testing endgame strength

Post by jwes »

Laskos wrote:
http://s000.tinyupload.com/?file_id=637 ... 6449648210

These are endgame suites with varying degrees of imbalance, taken from real games, with roughly the number of pieces you ask for (on average). End_02_05.epd means that the imbalance is 20-50 cp according to Stockfish. I collected the positions from games between strong engines and analyzed them for imbalance with Stockfish. Positions that are too balanced give too high a draw rate with strong engines, so the sensitivity of the test decreases. If they are too unbalanced, you will have to use the pentanomial model to assess the variance (sigma) of the result, which is not yet implemented in Cutechess.
One idea to find useful positions is to run a tournament with engines that vary significantly in strength and discard positions that weaker engines can win or can draw with both colors. This should leave you positions where there is play.
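
A minimal sketch of that filter, assuming the tournament results have already been reduced to (position, color of the weaker engine, score of the weaker engine) records. The record layout and the function name are hypothetical, and the rule is one reading of the idea: drop a position if the weaker engine ever wins from it, or if it avoids losing with both colors.

```python
from collections import defaultdict

def playable_positions(results):
    """Filter positions using games between a weaker and a stronger engine.

    results: iterable of (position, weak_color, weak_score), where weak_color
    is "white" or "black" and weak_score is 1, 0.5 or 0.
    """
    by_position = defaultdict(lambda: {"white": [], "black": []})
    for position, weak_color, weak_score in results:
        by_position[position][weak_color].append(weak_score)

    kept = []
    for position, scores in by_position.items():
        white, black = scores["white"], scores["black"]
        if not white or not black:
            continue  # the weaker engine must have played both colors
        weak_won = any(s == 1 for s in white + black)
        held_both = all(s >= 0.5 for s in white) and all(s >= 0.5 for s in black)
        if weak_won or held_both:
            continue  # too lopsided or too drawish to discriminate
        kept.append(position)
    return kept
```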
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Testing endgame strength

Post by AlvaroBegue »

jwes wrote:
One idea to find useful positions is to run a tournament with engines that vary significantly in strength and discard positions that weaker engines can win or can draw with both colors. This should leave you positions where there is play.
Yes, that's the essence of how I wanted to build my own collection of positions. There seem to be quite a few positions like that in Kai's file, so maybe I'll start by filtering those out.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Testing endgame strength

Post by Laskos »

jwes wrote:
One idea to find useful positions is to run a tournament with engines that vary significantly in strength and discard positions that weaker engines can win or can draw with both colors. This should leave you positions where there is play.
That is somewhat related, at least statistically, to Stockfish's eval of these positions. In the endgame, positions with a Stockfish eval of roughly 0.3 to 1.7 contain a high proportion of playable positions. Above about 1.0-1.2 it is useful to apply the pentanomial variance, and I can do that on the final result (not game by game; sadly Cutechess doesn't do it, but Richard Delorme has started working on a tool).

For regular opening-phase positions the corresponding range is lower: even 0.0 positions are playable, up to about 0.9; beyond that the efficiency (signal-to-noise ratio, or t-value) decreases.

It also depends very much on the strength of the engines and on the time control.
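
To illustrate why the pentanomial variance matters for unbalanced positions, here is a small sketch. It assumes each starting position is played twice with colors reversed, so results come in pairs; the function names are made up, and this is not how any particular tool computes it. Treating every game as independent (trinomial) ignores the fact that the two games of a pair are correlated, while the pentanomial approach takes the pair as the statistical unit.

```python
import math

def trinomial_sigma(game_scores):
    """Standard error of the mean score, treating every game as independent."""
    n = len(game_scores)
    mean = sum(game_scores) / n
    variance = sum((s - mean) ** 2 for s in game_scores) / n
    return math.sqrt(variance / n)

def pentanomial_sigma(pairs):
    """Standard error of the mean per-game score, using game pairs as the unit.

    Each pair holds one engine's two scores from the same position,
    once with each color, so a pair averages 0, 0.25, 0.5, 0.75 or 1.
    """
    n_pairs = len(pairs)
    pair_means = [(a + b) / 2.0 for a, b in pairs]
    mean = sum(pair_means) / n_pairs
    variance = sum((m - mean) ** 2 for m in pair_means) / n_pairs
    return math.sqrt(variance / n_pairs)
```

In the extreme case where the favored side wins every game, each pair is a win plus a loss: the pair-based sigma is zero, while the per-game formula still reports 0.5/sqrt(number of games), so it overstates the noise.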
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Testing endgame strength

Post by Dann Corbit »

There are a few positions in 3moves_gm that are very polar. This appears to be the worst one... score is +210 for the side to move:

[d]rnb1kbnr/pppp1ppp/8/4P3/q7/5N2/PPP1PPPP/RNBQKB1R w KQkq - acd 37; acs 900; bm Nc3; cce 104; ce 210; pm Nc3 {30}; pv Nc3 Bb4; white_wins 18; black_wins 11; draws 1;