brtzsnr wrote: 3moves_Elo2200.epd needed about 3x more games.

The number of games required by an SPRT is very variable. You cannot draw any conclusions from a single trial. It would be like comparing two engines on the basis of one game.
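To see how spread out SPRT stopping times are, here is a toy simulation (a sketch only: a plain Wald SPRT on win/loss trials, not the trinomial test that engine testers actually run):

Code:
import random
from math import log

def sprt_stop(p_true, p0, p1, alpha=0.03, beta=0.15):
    # Run one Wald SPRT on Bernoulli (win/loss) outcomes and return
    # the number of games played before a boundary is hit.
    lower, upper = log(beta / (1 - alpha)), log((1 - beta) / alpha)
    llr, n = 0.0, 0
    while lower < llr < upper:
        n += 1
        if random.random() < p_true:
            llr += log(p1 / p0)
        else:
            llr += log((1 - p1) / (1 - p0))
    return n

# Stopping times scatter widely from run to run:
print(sorted(sprt_stop(0.53, 0.50, 0.53) for _ in range(20)))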
brtzsnr wrote: Here is the first data point

Code:
2moves_v1.pgn
Score of ./zurichess vs ./master: 655 - 693 - 1132 [0.492] 2480
Elo difference: -5.32 +/- 10.07
SPRT: llr -1.89, lbound -1.87, ubound 3.34 - H0 was accepted
Finished match

3moves_Elo2200.epd
Score of ./zurichess vs ./master: 1822 - 1814 - 3614 [0.501] 7250
Elo difference: 0.38 +/- 5.66
SPRT: llr -1.88, lbound -1.87, ubound 3.34 - H0 was accepted
Finished match

3moves_Elo2200.epd needed about 3x more games. The bounds I use are

Code:
Elo0: 0.00  Elo1: 6.00  Alpha: 0.03  Beta: 0.15  LLR: -?.??  [-1.87:+3.34]

Interesting. One run says little (observe the confidence intervals), but if SPRT stops systematically show fewer games with 2moves_v1, then it probably has the better signal-to-noise ratio. I have simulated such SPRT stops in the past; the number of games varies a lot from run to run. But it is possible that for small Elo differences my test suite behaves differently than in my tests with the huge Elo differences that come from doubling the time.
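For reference, those lbound/ubound numbers follow from Alpha and Beta via Wald's approximation; a minimal sketch:

Code:
from math import log

def sprt_bounds(alpha, beta):
    # Wald's approximate SPRT thresholds: accept H0 once LLR <= lower,
    # accept H1 once LLR >= upper.
    lower = log(beta / (1 - alpha))
    upper = log((1 - beta) / alpha)
    return lower, upper

print(sprt_bounds(0.03, 0.15))  # ~(-1.87, +3.34), as reported above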
Dariusz Orzechowski wrote: If someone is interested, I've created a 5-ply (2.5 moves) book in a similar manner to my 2moves_v1 book. It contains over 97 thousand positions, so there is plenty of room to cut it down and optimize it for better properties.

As for "reasonable" openings - this is a very vague term. What would be a rule for filtering out "unreasonable" ones? I cannot think of anything good right now. That a given opening is not played by humans is not a good enough argument. In go, after the recent AlphaGo matches, people started to play openings they had long deemed unreasonable or just bad (for example, the very early 3-3 point invasion). Now they think it's fine to play like that.

5ply_v1.epd book (link expires in 7 days): http://dropmefiles.com/3jk3U

It seems like an interesting research question to find a method to improve the signal-to-noise ratio of a book. The SNR can be measured with Kai's method (a self-match with time odds).
brtzsnr wrote: Here is the first data point

Code:
2moves_v1.pgn
Score of ./zurichess vs ./master: 655 - 693 - 1132 [0.492] 2480
Elo difference: -5.32 +/- 10.07
SPRT: llr -1.89, lbound -1.87, ubound 3.34 - H0 was accepted
Finished match

3moves_Elo2200.epd
Score of ./zurichess vs ./master: 1822 - 1814 - 3614 [0.501] 7250
Elo difference: 0.38 +/- 5.66
SPRT: llr -1.88, lbound -1.87, ubound 3.34 - H0 was accepted
Finished match

3moves_Elo2200.epd needed about 3x more games. The bounds I use are

Code:
Elo0: 0.00  Elo1: 6.00  Alpha: 0.03  Beta: 0.15  LLR: -?.??  [-1.87:+3.34]

I would like to add that, testing almost equal engines this way, you will need a gazillion games to see a difference in sensitivity between opening suites from SPRT stops. I have taken two engines well separated, by some 50 Elo points:
Code:
{164, 167, 60, 149, 435, 315, 180, 252, 269, 140, 307, 150, 108, 218, 138, 230, 419, 185, 128, 167}

Average: 209 +/- 42 games
Median: 173.5 games

Code:
{304, 124, 353, 152, 61, 196, 103, 145, 63, 192, 158, 157, 287, 244, 386, 86, 373, 200, 71, 382}

Average: 202 +/- 47 games
Median: 175 games
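The summary numbers can be reproduced along these lines (a sketch; the +/- appears to be a 95% half-width on the mean):

Code:
import statistics

stops = [164, 167, 60, 149, 435, 315, 180, 252, 269, 140,
         307, 150, 108, 218, 138, 230, 419, 185, 128, 167]
mean = statistics.mean(stops)                               # 209.05
half = 1.96 * statistics.pstdev(stops) / len(stops) ** 0.5  # ~42
median = statistics.median(stops)                           # 173.5
print("Average: %.0f +/- %.0f games, median: %s games" % (mean, half, median))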
2moves_v1.epd:

Code:
Score of Stockfish_210617_2x vs Stockfish_210617_1x: 947 - 101 - 952 [0.712] 2000
ELO difference: 156.81 +/- 10.88

5ply_v1.epd:

Code:
Score of Stockfish_210617_2x vs Stockfish_210617_1x: 976 - 113 - 911 [0.716] 2000
ELO difference: 160.42 +/- 11.19

3moves_Elo2200.epd:

Code:
Score of Stockfish_210617_2x vs Stockfish_210617_1x: 910 - 82 - 1008 [0.707] 2000
ELO difference: 153.02 +/- 10.46
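These Elo figures follow from the standard logistic score-to-Elo conversion; a sketch reproducing the first one:

Code:
from math import log10

def elo(score):
    # Logistic model: Elo difference implied by an average match score.
    return -400 * log10(1 / score - 1)

s = (947 + 0.5 * 952) / 2000  # wins + draws/2 over 2000 games
print(round(elo(s), 2))       # 156.81, matching the value reported above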
Code:
from __future__ import division  # true division on Python 2

def sens(W=None, D=None, L=None):
    # Sensitivity of a book measured from a time-odds self-match:
    # the normalized score (s - 1/2) / sigma per game, bracketed by a
    # 95% confidence interval based on the N games.
    N = W + D + L
    (w, d, l) = (W / N, D / N, L / N)
    s = w + d / 2
    var = w * (1 - s) ** 2 + d * (1 / 2 - s) ** 2 + l * (0 - s) ** 2
    sigma = var ** .5
    return ((s - 1 / 2) / sigma - 1.96 / N ** .5,
            (s - 1 / 2) / sigma,
            (s - 1 / 2) / sigma + 1.96 / N ** .5)
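Applied to the three matches above (W, D, L per book), this presumably produced the tuples below:

Code:
for book, (W, D, L) in [('2moves_v1.epd', (947, 952, 101)),
                        ('5ply_v1.epd', (976, 911, 113)),
                        ('3moves_Elo2200.epd', (910, 1008, 82))]:
    print((book, sens(W=W, D=D, L=L)))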
('2moves_v1.epd', (0.676262000367134, 0.7200889327261298, 0.7639158650851257))
('5ply_v1.epd', (0.677036008282355, 0.7208629406413508, 0.7646898730003466))
('3moves_Elo2200.epd', (0.6828199381932489, 0.7266468705522449, 0.7704738029112408))

On the basis of this test the three books are indistinguishable: their 95% intervals overlap almost completely.
Dariusz Orzechowski wrote: If someone is interested, I've created a 5-ply (2.5 moves) book in a similar manner to my 2moves_v1 book. It contains over 97 thousand positions, so there is plenty of room to cut it down and optimize it for better properties.

As for "reasonable" openings - this is a very vague term. What would be a rule for filtering out "unreasonable" ones? I cannot think of anything good right now. That a given opening is not played by humans is not a good enough argument. In go, after the recent AlphaGo matches, people started to play openings they had long deemed unreasonable or just bad (for example, the very early 3-3 point invasion). Now they think it's fine to play like that.

5ply_v1.epd book (link expires in 7 days): http://dropmefiles.com/3jk3U

AlphaGo was not trained on random openings. Stockfish is literally trained on random 2-movers, which distorts its opening play for some 10 moves. One example showing the distortion is here (for moves 1-12 after the book).
Laskos wrote: AlphaGo was not trained on random openings. Stockfish is literally trained on random 2-movers, which distorts its opening play for some 10 moves.

AlphaGo was just an example showing that we don't really know what "reasonable" means. AG plays inhuman moves while being extremely strong at the same time. If we could prove that a "reasonable" book is better and provide some definition of "reasonability", we could create a better book. The problem is I have no idea how to do this. A "reasonable" book is obviously better for tournament play, but not necessarily for engine development.
Laskos wrote: My goal was to create a suite containing many openings, to have sensitivity on par with (or better than) 2moves_v1.epd, and to contain moves played by humans rated above Elo 2200. Humans at that level are not so crazy as to often play random or very weak moves.

You certainly achieved this goal. But the question now is whether using "crazy" positions in a development book has any harmful effect on playing strength. I don't know how to measure it.