Opening testing suites efficiency

Discussion of chess software programming and technical issues.


Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

Dariusz Orzechowski wrote:
Laskos wrote:AlphaGo was not trained on random openings. Stockfish is literally trained on random 2-movers, which distorts its opening play for some 10 moves.
AlphaGo was just an example showing that we don't really know what "reasonable" means. AG plays inhuman moves while being extremely strong at the same time. If we could prove that a "reasonable" book is better and provide some definition of "reasonability", we could create a better book. The problem is I have no idea how to do this. A "reasonable" book is obviously better for tournament play but not necessarily for engine development.
Laskos wrote:My goal was to create a suite containing many openings, with sensitivity on par with (or better than) 2moves_v1.epd, and containing human moves from players above Elo 2200. Humans at that level are not so crazy as to often play random or very weak moves.
You certainly achieved this goal. But the question now is whether using "crazy" positions in a development book has any harmful effect on playing strength. I don't know how to measure it.
I tested overnight (these tests take a lot of time), at the more respectable time control of 15''+0.15'' versus 7.5''+0.075'' (close to the Stockfish testing STC), the sensitivity (SNR) of 2moves_v1.epd and 3moves_Elo2200.epd. 2000 games are not enough, but I wanted to get a picture at a longer time control. I got a very similar result to yours:

Code: Select all

2moves_v1.epd:

Score of SF2 vs SF1: 818 - 143 - 1039  [0.669] 2000
ELO difference: 122.04 +/- 10.39
Finished match



3moves_Elo2200.epd:

Score of SF2 vs SF1: 753 - 96 - 1151  [0.664] 2000
ELO difference: 118.53 +/- 9.59
Finished match
Elo-wise, 2moves_v1.epd has a slight advantage, but it is within noise. What we are interested in is the SNR (or normalized Elo):

Code: Select all

2moves_v1.epd:       0.5574 +/- 0.0438
3moves_Elo2200.epd:  0.5838 +/- 0.0438
A slight indication that 3moves_Elo2200.epd has a better SNR, but that too is within noise. From all the data collected so far, it seems that 3moves_Elo2200.epd has at least as good an SNR as 2moves_v1.epd, and it plays reasonable openings.
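
For reference, here is how these numbers come out of the raw counts; a minimal Python sketch (the function name is mine), which reproduces both lines above:

Code: Select all

import math

def normalized_elo(wins, losses, draws):
    # SNR per game: (score - 0.5) / sigma, from trinomial W/L/D counts.
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n               # mean score per game
    # E[X^2] for X in {1, 1/2, 0}: a win contributes 1, a draw 1/4
    var = (wins + 0.25 * draws) / n - s * s
    t = (s - 0.5) / math.sqrt(var)
    err95 = 1.96 / math.sqrt(n)                # ~95% interval, valid for small t
    return t, err95

print(normalized_elo(818, 143, 1039))   # 2moves_v1.epd      -> 0.5574 +/- 0.0438
print(normalized_elo(753,  96, 1151))   # 3moves_Elo2200.epd -> 0.5838 +/- 0.0438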


Your 5ply_v1.epd openings are excellent for building unbalanced early opening positions and for using pentanomial variance for the LLR, which is the future of engine testing (see here: http://www.talkchess.com/forum/viewtopic.php?t=61245 ). The requirement that openings be "reasonable" then becomes irrelevant; computer chess itself will become "unreasonable" because of the very high draw rates with normal openings. When testers (like the Stockfish framework) reach draw rates above 80-85% with balanced openings, or in Bayeselo terminology eloDraw above 500 or so, the optimal openings are defined by eloBias = eloDraw, giving a draw rate of 50%. Compared to balanced positions, the reduction in the number of games needed to reach the same LOS or SPRT stop ranges from a factor of ~4 at an 85% draw rate to an order of magnitude above a 90% draw rate. This sort of testing with unbalanced positions already happened in checkers more than a decade ago.
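
To make the eloBias = eloDraw point concrete, here is a small sketch of the Bayeselo-style draw model (treating the opening imbalance as a fixed eloBias offset is my simplification):

Code: Select all

def L(x):
    # logistic curve in Elo units
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def wdl(elo_bias, elo_draw):
    # Win/draw/loss probabilities for two equal engines: the opening gives
    # White elo_bias, and a draw "costs" elo_draw from each winning chance.
    w = L(elo_bias - elo_draw)    # White wins
    l = L(-elo_bias - elo_draw)   # Black wins
    return w, 1.0 - w - l, l

print(wdl(0,   500))   # balanced openings: ~89% draws
print(wdl(500, 500))   # eloBias = eloDraw: White wins ~50% of games, ~50% draws

With eloBias = eloDraw, half of the games become White wins and the draw rate drops to about 50%, matching the optimum described above.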

I analyzed your 5ply_v1.epd file with Stockfish and uploaded the openings with 70cp-110cp unbalance, suited for a wide range of draw rates above 80% (or eloBias above 500 or so).
http://s000.tinyupload.com/?file_id=079 ... 7904268641
This 5ply_v1_unbalanced.epd contains 15083 unique unbalanced positions (70cp-110cp) for future testing, for the time when Cutechess-Cli or a similar tool uses pentanomial variance and draw rates with balanced openings climb above 80% or so. That is not so far in the future, at least for Stockfish. The openings are skewed towards the lower values of the [70,110] cp unbalance interval.
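
For anyone who wants to reproduce this kind of filtering, a rough sketch using python-chess (the tool, depth and file names are my illustrative choices, not necessarily what was actually used):

Code: Select all

import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
kept = []
with open("5ply_v1.epd") as f:
    for line in f:
        board = chess.Board()
        board.set_epd(line.strip())                  # load the EPD position
        info = engine.analyse(board, chess.engine.Limit(depth=16))
        cp = info["score"].white().score(mate_score=100000)
        if 70 <= abs(cp) <= 110:                     # keep 70-110 cp unbalance
            kept.append(line.strip())
engine.quit()

with open("5ply_v1_unbalanced.epd", "w") as out:
    out.write("\n".join(kept) + "\n")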
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Opening testing suites efficiency

Post by Michel »

I posted a small document about "normalized elo"

http://hardy.uhasselt.be/Toga/normalized_elo.pdf

explaining why it is the correct quantity for comparing different books.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

Michel wrote:I posted a small document about "normalized elo"

http://hardy.uhasselt.be/Toga/normalized_elo.pdf

explaining why it is the correct quantity for comparing different books.
Very good, very clear. Thanks!
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Opening testing suites efficiency

Post by Adam Hair »

Michel wrote:I posted a small document about "normalized elo"

http://hardy.uhasselt.be/Toga/normalized_elo.pdf

explaining why it is the correct quantity for comparing different books.
I also thank you, Michel :)
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

Dariusz Orzechowski wrote:If someone is interested, I've created a 5-ply (2.5-move) book in a similar manner to my 2moves_v1 book. It contains over 97 thousand positions, so there is plenty of room to cut it down and optimize for better properties.

As for "reasonable" openings - this is a very vague term. What would be a rule to filter out "unreasonable" ones? I cannot think of anything good right now. That a given opening is not played by humans is not a good enough argument. In go, after the recent AlphaGo matches, people started to play openings they had long deemed unreasonable or just bad (for example, the very early 3-3 point invasion). Now they think it's fine to play like that.

5ply_v1.epd book (link expires in 7 days): http://dropmefiles.com/3jk3U
Here are the raw data used for our test-suites.


From KingBase Elo 2200+ games I had 8256 unique 3-mover (6-ply) openings. Analyzed with Stockfish, their eval distribution is the following:
[Histogram: Stockfish eval distribution of the 3-mover openings]
Mean: 26.52cp
Median: 23cp
Standard Deviation: 46.6cp

For creating the 3moves_Elo2200.epd suite, I cut it to the [-0.4, 0.6] eval interval.



The 5ply_v1.epd has 97448 unique 5-ply openings, with the following Stockfish eval distribution:
[Histogram: Stockfish eval distribution of the 5ply_v1.epd openings]
Mean: 21.77cp
Median: 32cp
Standard Deviation: 47.6cp

Do you have any idea why the shape of the eval distribution is so curious in the 5ply_v1.epd case? For building an unbalanced suite from this, I cut it to the [-1.1,-0.7] and [0.7,1.1] eval intervals. When people start using unbalanced positions, your dataset will be one of the best to optimize for them.
Dariusz Orzechowski
Posts: 44
Joined: Thu May 02, 2013 5:23 pm

Re: Opening testing suites efficiency

Post by Dariusz Orzechowski »

Laskos wrote:Your 5ply_v1.epd openings are excellent for building unbalanced early opening positions and for using pentanomial variance for the LLR, which is the future of engine testing (see here: http://www.talkchess.com/forum/viewtopic.php?t=61245 ). The requirement that openings be "reasonable" then becomes irrelevant; computer chess itself will become "unreasonable" because of the very high draw rates with normal openings. When testers (like the Stockfish framework) reach draw rates above 80-85% with balanced openings, or in Bayeselo terminology eloDraw above 500 or so, the optimal openings are defined by eloBias = eloDraw, giving a draw rate of 50%. Compared to balanced positions, the reduction in the number of games needed to reach the same LOS or SPRT stop ranges from a factor of ~4 at an 85% draw rate to an order of magnitude above a 90% draw rate. This sort of testing with unbalanced positions already happened in checkers more than a decade ago.

I analyzed your 5ply_v1.epd file with Stockfish and uploaded the openings with 70cp-110cp unbalance, suited for a wide range of draw rates above 80% (or eloBias above 500 or so).
http://s000.tinyupload.com/?file_id=079 ... 7904268641
This 5ply_v1_unbalanced.epd contains 15083 unique unbalanced positions (70cp-110cp) for future testing, for the time when Cutechess-Cli or a similar tool uses pentanomial variance and draw rates with balanced openings climb above 80% or so. That is not so far in the future, at least for Stockfish. The openings are skewed towards the lower values of the [70,110] cp unbalance interval.
I'm uploading my unbalanced set of 5-ply openings that I got as a "by-product" of my work. Maybe it could be useful. It contains over 127k positions, most of which should be in the interesting 70-110 cp range, although the filtering was very crude, so I would expect that some completely lopsided positions have also slipped in. I tried to work on a 5-ply book last year but stopped due to lack of time (and ideas).


5ply_unbalanced_127k.epd book (link expires in 7 days): http://dropmefiles.com/VDUNG
Michel wrote:I posted a small document about "normalized elo"

http://hardy.uhasselt.be/Toga/normalized_elo.pdf

explaining why it is the correct quantity for comparing different books.
Very nice! If you could also add the formula for the pentanomial case, it would be great to have it all in one place.
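
In the meantime, here is my guess at the pentanomial version, as a sketch; the bookkeeping in per-pair scores and the sqrt(2) factor to land on the per-game scale are my own assumptions, so please correct it if the document ends up saying otherwise:

Code: Select all

import math

def normalized_elo_pentanomial(counts):
    # counts = [n0, n1, n2, n3, n4]: numbers of game PAIRS in which the
    # first engine scored 0, 0.5, 1, 1.5, 2 points, i.e. per-pair scores
    # x = 0, 1/4, 1/2, 3/4, 1.
    n = sum(counts)
    xs = [0.0, 0.25, 0.5, 0.75, 1.0]
    s = sum(c * x for c, x in zip(counts, xs)) / n
    var = sum(c * x * x for c, x in zip(counts, xs)) / n - s * s
    # sqrt(2) converts the per-pair figure to the per-game scale, so it can
    # be compared directly with the trinomial numbers in this thread.
    return (s - 0.5) / (math.sqrt(2.0) * math.sqrt(var))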
Dariusz Orzechowski
Posts: 44
Joined: Thu May 02, 2013 5:23 pm

Re: Opening testing suites efficiency

Post by Dariusz Orzechowski »

Laskos wrote:Do you have any idea why the shape of the eval is so curious in 5ply_v1.epd case? For building unbalanced suite from this, I cut it to [-1.1,-0.7] and [0.7,1.1] intervals for the eval. When people will start using unbalanced positions, your dataset will be one of the best to optimize for them.
This is by design. I filtered out dead-even positions around 0.00. My goal was to have evals mostly around 0.3-0.8 out of the book. In the unbalanced set I mentioned above they should be mostly 0.8-1.3 (I'm not sure I remember the upper limit correctly now; it may be a bit higher, like 1.5).
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

Michel wrote:Perhaps the sensitivity of the book is not so much determined by the number of moves but rather by the amount of material (game phase) that is still on the board?
Interesting; I tested your theory. From the 8moves_GM book I took only those positions with full material. I collected all my results with Komodo 11.01 for the different test suites, and 8moves_GM_Full_Mat came in significantly above its parent book 8moves_GM, but below the shortest of the books. The effect seems to be a combination: most material + less advanced openings (+ maybe reasonable play?). The error bars are a bit too large, but I don't have a 32-core monster.

2000 games per run, Komodo 11.01 self-play at 6''+0.06'' versus 3''+0.03''.

Normalized Elo difference (the 95% confidence error is 0.0438 for all):

Code: Select all

Chess960:               0.868 
3moves_Elo2200:         0.854
2moves_v1:              0.813
3moves_GM:              0.782  
8moves_GM_Full_Mat:     0.771
8moves_v3:              0.749
8moves_GM:              0.699
Dariusz Orzechowski
Posts: 44
Joined: Thu May 02, 2013 5:23 pm

Re: Opening testing suites efficiency

Post by Dariusz Orzechowski »

Michel wrote:I posted a small document about "normalized elo"

http://hardy.uhasselt.be/Toga/normalized_elo.pdf

explaining why it is the correct quantity for comparing different books.
It looks like the first formula on the second page (unfortunately it's not numbered) is wrong. Sigma_0 should be

Code: Select all

sqrt((w + 1/4*d) - s^2)
instead of

Code: Select all

sqrt((w + 1/4*d)^2 - s^2)
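
The per-game score X takes the values 1, 1/2 and 0 with probabilities w, d and l, so E[X^2] = w + d/4, and the variance is (w + d/4) - s^2; the extra square looks like a typo. A quick numerical check against the 2000-game 2moves_v1 run earlier in this thread:

Code: Select all

import math

w, l, d = 818 / 2000, 143 / 2000, 1039 / 2000
s = w + d / 2
sigma = math.sqrt((w + d / 4) - s ** 2)     # corrected formula
print((s - 0.5) / sigma)                    # 0.5574, as reported above
# With the extra square, (w + d/4) ** 2 - s ** 2 is negative here,
# so the printed version cannot be a variance.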
Dariusz Orzechowski
Posts: 44
Joined: Thu May 02, 2013 5:23 pm

Re: Opening testing suites efficiency

Post by Dariusz Orzechowski »

Laskos wrote: [eval distribution histogram]
Does it show the static eval or the eval after some search? I don't know what 'V6' means here. In my book I filtered positions based on the eval at depth 12. By the way, it would probably be better to use abs(eval) instead of eval.