Opening testing suites efficiency

Discussion of chess software programming and technical issues.


Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Opening testing suites efficiency

Post by Laskos »

I wanted to test (with Komodo) the sensitivity to a doubling of the time control for the various opening suites known to me. The time controls were 6''+0.06'' versus 3''+0.03''. One overlooked fact, with for example the Stockfish Framework standard opening suite 2moves_v1.epd, is that those 2 random moves distort the entire opening phase of the game, as shown in this thread:

http://www.talkchess.com/forum/viewtopi ... =0&t=63763

I separated the suites into those that distort the openings of standard chess (these have higher sensitivity here) and those that play reasonable openings. The important thing to look at here is the t-value (sensitivity) of the difference, and the better indicator is the "Normalized ELO" t-value, not the "ELO" t-value.

"Normalized ELO" is proposed recently by Michel Van den Bergh here:
http://talkchess.com/forum/viewtopic.ph ... t&start=20
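
As an aside, here is a minimal sketch (a small Python helper written for this post, not part of any testing framework) of how ELO, Normalized ELO and their 95% intervals are obtained from the raw win/draw/loss counts under the trinomial model; the t-values quoted below are simply the estimates divided by their 95% half-widths.

Code: Select all

# Helper: ELO, Normalized ELO and 95% intervals from W/D/L counts (trinomial model).
import math

def match_stats(wins, draws, losses):
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n                        # mean score
    elo = 400.0 * math.log10(s / (1.0 - s))             # logistic ELO from the score
    # per-game standard deviation of the score
    var = (wins * (1.0 - s)**2 + draws * (0.5 - s)**2 + losses * s**2) / n
    sigma = math.sqrt(var)
    nelo = (s - 0.5) / sigma                            # Normalized ELO = mu/sigma per game
    s_err95 = 1.96 * sigma / math.sqrt(n)               # 95% half-width of the score
    elo_err95 = 400.0 / (math.log(10.0) * s * (1.0 - s)) * s_err95
    nelo_err95 = 1.96 / math.sqrt(n)                    # approximate 95% half-width of Normalized ELO
    # the t-values in the tables are estimate / 95% half-width
    return elo, elo_err95, elo / elo_err95, nelo, nelo_err95, nelo / nelo_err95

# Chess960.epd run below: 1172 wins, 709 draws, 119 losses out of 2000 games
print(match_stats(1172, 709, 119))   # ELO ~203.4 +/- 12.8 (t ~15.9), nELO ~0.868 +/- 0.044 (t ~19.8)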

Here are the results:

Opening suites which distort the opening phase

Code: Select all

Chess960.epd
Score of K2 vs K1: 1172 - 119 - 709  [0.763] 2000

ELO difference: 203.35 +/- 12.78
t-value = 15.9

Normalized ELO difference: 0.868
t-value = 19.8



2moves_v1.epd
Score of K2 vs K1: 1164 - 144 - 692  [0.755] 2000

ELO difference: 195.51 +/- 12.91
t-value = 15.1

Normalized ELO difference: 0.813
t-value = 18.6
Opening suites with reasonable openings

Code: Select all

3moves_GM.epd
Score of K2 vs K1: 1085 - 126 - 789  [0.740] 2000

ELO difference: 181.48 +/- 12.11
t-value = 15.0

Normalized ELO difference: 0.782
t-value = 17.8



8moves_v3.pgn
Score of K2 vs K1: 1013 - 113 - 874  [0.725] 2000

ELO difference: 168.40 +/- 11.45
t-value = 14.7

Normalized ELO difference: 0.749
t-value = 17.1



8moves_GM.pgn
Score of K2 vs K1: 1002 - 137 - 861  [0.716] 2000

ELO difference: 160.85 +/- 11.57
t-value = 13.9

Normalized ELO difference: 0.699
t-value = 15.9
The shortcomings of the two top suites: Chess960 has only 960 starting positions, and 3moves_GM has only 1170 positions. In fast games that shouldn't be a problem: even with tens of thousands of games, no game will be an exact repeat.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

I created a new opening suite with very good properties, consisting of 6533 opening 3-mover positions played by humans rated above ELO 2200. The database was "KingBase Lite", about 1 million human games (all above ELO 2200). As mentioned before, some suites like the Stockfish Framework's 2moves_v1.epd have a good signal to noise ratio but ruin the opening phase of the game (the first, say, 10 moves are completely abnormal due to the first 2 silly random moves), while other suites obeying normal opening play have low sensitivity (signal to noise ratio, or t-value) or are too small. It seems my new "3moves_Elo2200.epd" suite has:

1/ A high signal to noise ratio, even higher than 2moves_v1
2/ Mostly reasonable openings, so the opening phase of the game can be tested too
3/ A sufficient number of unique opening 3-mover positions to be used in many thousands of games, as developers need

The suite is uploaded here:
http://s000.tinyupload.com/?file_id=687 ... 2789470066

I built this suite as follows: from that million human games I extracted about 9000 unique 3-move positions. Then I analyzed all of them with Stockfish, for 1 second each. Then I kept the final 6533 positions whose Stockfish eval lies between [-0.40, 0.60], so as not to have very unbalanced or outright wrong openings. If using the pentanomial model for calculating the variance, one might actually like the unbalanced ones, but that is another topic.
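
For those who want to reproduce or adapt the construction, here is a minimal sketch of such a pipeline using python-chess; the file names, engine path and thresholds are placeholders, and the actual run I did may differ in details.

Code: Select all

# Sketch of the filtering pipeline (python-chess); paths and thresholds are illustrative.
import chess
import chess.engine
import chess.pgn

PGN_PATH = "KingBaseLite.pgn"      # placeholder for the ~1M-game human database
OUT_PATH = "3moves_filtered.epd"
engine = chess.engine.SimpleEngine.popen_uci("stockfish")   # placeholder engine path

# 1. Collect the unique positions reached after 3 full moves (6 plies).
unique_epds = set()
with open(PGN_PATH) as pgn:
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:
            break
        board = game.board()
        plies = 0
        for move in game.mainline_moves():
            if plies == 6:
                break
            board.push(move)
            plies += 1
        if plies == 6:
            unique_epds.add(board.epd())

# 2. Keep only positions whose 1-second eval lies in [-0.40, +0.60] pawns (White's view).
with open(OUT_PATH, "w") as out:
    for epd in unique_epds:
        board = chess.Board()
        board.set_epd(epd)
        info = engine.analyse(board, chess.engine.Limit(time=1.0))
        cp = info["score"].white().score(mate_score=100000)
        if -40 <= cp <= 60:
            out.write(epd + "\n")

engine.quit()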

So, this suite has a clear advantage over 2moves_v1 in that it mostly has reasonable openings. The harder goal to achieve is a high sensitivity (t-value), as shown in the previous post. But this goal seems to have been achieved too, as this suite compares well with 2moves_v1 even in this department. It is a bit more drawish, but the win/loss ratio is significantly higher.

Here are the signal to noise ratios of the two suites for Stockfish dev and Komodo, in ELO and in Normalized ELO (which seems to be more relevant than ELO), computed here from the benefit of a doubling in time in self-play games.



Stockfish 210617:

Code: Select all

2000 games each run, 6''+ 0.06'' vs 3''+ 0.03''

                                     ELO     t-value (ELO)   Normalized ELO   t-value (Norm. ELO)
===================================================================================================
2moves_v1.epd (40456) |            164.50        25.8             0.714             31.9
3moves_Elo2200 (6533) |            169.49        27.0             0.760             34.0
===================================================================================================

Komodo 11.01:

Code: Select all

2000 games each run, 6''+ 0.06'' vs 3''+ 0.03''

                                     ELO     t-value (ELO)   Normalized ELO   t-value (Norm. ELO)
===================================================================================================
2moves_v1.epd (40456) |            195.51        27.4             0.813             36.5
3moves_Elo2200 (6533) |            192.24        29.1             0.854             38.3
===================================================================================================
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Opening testing suites efficiency

Post by Michel »

Interesting!

I like your (standard) terminology "signal to noise ratio" as an alternative to normalized elo. It conveys the idea better. Note that the S/N ratio in engineering would be the square of what we are using here, since in engineering the square (power) of a signal is used. But let's ignore this point.

From your earlier post I had concluded that it was mainly the number of moves in the book that determines the S/N ratio. However, if it is true that your latest 3-move book is better than 2moves_v1.epd (you did not post confidence intervals), then that conclusion is wrong.

Of course, since the positions in 2moves_v1.epd are unbalanced, one should strictly speaking use the pentanomial model to measure the S/N ratio. However, since fishtest does not use the pentanomial model, this is a bit pointless.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

Michel wrote:Interesting!

I like your (standard) terminology "signal to noise ratio" as an alternative to normalized elo. It conveys the idea better. Note that the S/N ratio in engineering would be the square of what we are using here, since in engineering the square (power) of a signal is used. But let's ignore this point.

From your earlier post I had concluded that it was mainly the number of moves in the book that determines the S/N ratio. However, if it is true that your latest 3-move book is better than 2moves_v1.epd (you did not post confidence intervals), then that conclusion is wrong.

Of course, since the positions in 2moves_v1.epd are unbalanced, one should strictly speaking use the pentanomial model to measure the S/N ratio. However, since fishtest does not use the pentanomial model, this is a bit pointless.
That would mean that the signal to noise ratio, as defined in engineering, is inversely proportional to the number of games needed for the same LOS value. I post here the data with 95% (1.96 SD) confidence intervals:


Stockfish 210617:

Code: Select all

2000 games each run, 6''+ 0.06'' vs 3''+ 0.03'' 

                                                                                    
                                          ELO               Normalized ELO    
============================================================================
2moves_v1.epd (40456) |            164.50 +/- 11.67         0.714 +/- 0.044             
3moves_Elo2200 (6533) |            169.49 +/- 11.39         0.760 +/- 0.044            
============================================================================ 


Komodo 11.01

Code: Select all

2000 games each run, 6''+ 0.06'' vs 3''+ 0.03'' 

                                                                                    
                                          ELO               Normalized ELO    
============================================================================
2moves_v1.epd (40456) |            195.51 +/- 12.91         0.813 +/- 0.044             
3moves_Elo2200 (6533) |            192.24 +/- 12.00         0.854 +/- 0.044            
============================================================================
So, 3moves_Elo2200 compares favorably with 2moves_v1 even in sensitivity. 2moves_v1 can probably be optimized to have a bit higher sensitivity, as I too believe that the number of moves is the most important factor. But the main advantage of my suite is that it leads to reasonable openings, while 2moves_v1 ruins the openings for some 10 moves.

Now, there is one issue even with my suite. It does play reasonable openings, but their representativity of usual human play is not that good. I collected them from 1 million human games as about 9000 unique 3-mover positions, all weighted equally. Each of them appears only once in my suite (and was analyzed by Stockfish to be moderately balanced). Human play does not show such democracy among these 3-mover openings: some appear in 50000 games, some in 1 game. Here is the distribution of 3-mover openings in 1 million human ELO 2200+ games:

[Image: frequency distribution of 3-mover opening positions in 1 million human ELO 2200+ games]

About 80% of the one million openings are concentrated in just 100 3-mover positions, so an engine provided with a 1-million-game PGN book of human games cut to 3 moves will mostly play the same 100 3-mover openings. That is not enough for a test suite. Reconciling human representativity and diversity is hard. But at least my suite plays _reasonable_ openings, unlike 2moves_v1: not mostly the best or the most frequently played by humans, but reasonable.
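
The concentration itself is easy to measure; here is a small sketch (same 3-move extraction as in the previous post, file name again illustrative) that counts how often each 3-mover position occurs and what share of the games the 100 most common ones cover.

Code: Select all

# Sketch: how concentrated human play is on a few 3-mover positions.
from collections import Counter
import chess.pgn

counts = Counter()
with open("KingBaseLite.pgn") as pgn:            # placeholder for the human database
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:
            break
        board = game.board()
        plies = 0
        for move in game.mainline_moves():
            if plies == 6:
                break
            board.push(move)
            plies += 1
        if plies == 6:
            counts[board.epd()] += 1

total = sum(counts.values())
top100 = sum(c for _, c in counts.most_common(100))
print(f"{len(counts)} unique 3-mover positions, "
      f"top 100 cover {100.0 * top100 / total:.1f}% of games")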
Kotlov
Posts: 266
Joined: Fri Jul 10, 2015 9:23 pm
Location: Russia

Re: Opening testing suites efficiency

Post by Kotlov »

I created a program to generate random openings with a ~0.00 score.

out.epd
http://dropmefiles.com/ymu8c
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Opening testing suites efficiency

Post by Michel »

That would mean that the signal to noise ratio, as defined in engineering, is inversely proportional to the number of games needed for the same LOS value.
Yes. And this makes good sense!

Wikipedia has a long article about SNR.

https://en.wikipedia.org/wiki/Signal-to-noise_ratio

The definition mu/sigma ("normalized elo") is mentioned as an alternative definition of SNR (about halfway through the article).
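
To make the scaling explicit (same notation as above, with \(\mu = s - \tfrac12\) the per-game score margin and \(t^{*}\) the fixed threshold corresponding to the chosen LOS):

\[
t \;=\; \sqrt{N}\,\frac{\mu}{\sigma} \;=\; \sqrt{N}\cdot \mathrm{nElo}
\quad\Longrightarrow\quad
N_{\mathrm{needed}} \;=\; \Bigl(\frac{t^{*}}{\mathrm{nElo}}\Bigr)^{2},
\qquad
\mathrm{SNR}_{\mathrm{power}} \;=\; \mathrm{nElo}^{2} \;\propto\; \frac{1}{N_{\mathrm{needed}}}.
\]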
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

Kotlov wrote:I created a program to generate random openings with a ~0.00 score.

out.epd
http://dropmefiles.com/ymu8c
I checked it: it comes out with the worst sensitivity of all (worse than 8moves_GM), and it has some weird openings.
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: Opening testing suites efficiency

Post by brtzsnr »

Laskos wrote: The suite is uploaded here:
http://s000.tinyupload.com/?file_id=687 ... 2789470066
Great! Thank you for this work, very important for engine testing.

Do you think others can distribute this file?
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

brtzsnr wrote:
Laskos wrote: The suite is uploaded here:
http://s000.tinyupload.com/?file_id=687 ... 2789470066
Great! Thank you for this work, very important for engine testing.

Do you think others can distribute this file?
Sure, I would be glad. It probably reduces the number of necessary games for a LOS or SPRT stop by some 10%, and it will also test the normal opening phase of the game (compared with 2moves_v1.epd).
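
The ~10% figure follows from the number of games scaling as \(1/\mathrm{nElo}^{2}\); with the Normalized ELO values measured above,

\[
\Bigl(\tfrac{0.714}{0.760}\Bigr)^{2} \approx 0.88 \;\;\text{(Stockfish dev)},
\qquad
\Bigl(\tfrac{0.813}{0.854}\Bigr)^{2} \approx 0.91 \;\;\text{(Komodo)},
\]

i.e. roughly 9-12% fewer games for the same confidence.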
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: Opening testing suites efficiency

Post by brtzsnr »

Here is the first data point:

Code: Select all

2moves_v1.pgn
Score of ./zurichess vs ./master: 655 - 693 - 1132  [0.492] 2480
Elo difference: -5.32 +/- 10.07
SPRT: llr -1.89, lbound -1.87, ubound 3.34 - H0 was accepted
Finished match

3moves_Elo2200.epd
Score of ./zurichess vs ./master: 1822 - 1814 - 3614  [0.501] 7250
Elo difference: 0.38 +/- 5.66
SPRT: llr -1.88, lbound -1.87, ubound 3.34 - H0 was accepted
Finished match
The bounds I use are

Code: Select all

Elo0: 0.00 Elo1: 6.00
Alpha: 0.03 Beta: 0.15
LLR:-?.?? [-1.87:+3.34
3moves_Elo2200.epd needed about 3x more games.
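
(Those lbound/ubound values are just the standard Wald SPRT thresholds for the stated error rates; a quick check in Python, not part of the tool output:)

Code: Select all

# Wald SPRT log-likelihood-ratio bounds from the error rates alpha and beta.
import math

alpha, beta = 0.03, 0.15
lower = math.log(beta / (1.0 - alpha))     # H0 accepted if the LLR falls below this
upper = math.log((1.0 - beta) / alpha)     # H1 accepted if the LLR rises above this
print(f"lbound = {lower:.2f}, ubound = {upper:.2f}")   # -1.87, 3.34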