brtzsnr wrote: 3moves_Elo2200.epd needed about 3x more games.

The number of games required by an SPRT is very variable. You cannot draw any conclusions from a single trial. It would be like comparing two engines on the basis of one game.
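To see how spread out SPRT stopping times are, here is a toy simulation (a sketch only: a plain Wald SPRT on win/loss trials, not the trinomial test that engine testers actually run):

Code:
import random
from math import log

def sprt_stop(p_true, p0, p1, alpha=0.03, beta=0.15):
    # Run one Wald SPRT on Bernoulli (win/loss) outcomes and return
    # the number of games played before a boundary is hit.
    lower, upper = log(beta / (1 - alpha)), log((1 - beta) / alpha)
    llr, n = 0.0, 0
    while lower < llr < upper:
        n += 1
        if random.random() < p_true:
            llr += log(p1 / p0)
        else:
            llr += log((1 - p1) / (1 - p0))
    return n

# Stopping times scatter widely from run to run:
print(sorted(sprt_stop(0.53, 0.50, 0.53) for _ in range(20)))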
brtzsnr wrote: Here is the first data point

Code:
2moves_v1.pgn
Score of ./zurichess vs ./master: 655 - 693 - 1132 [0.492] 2480
Elo difference: -5.32 +/- 10.07
SPRT: llr -1.89, lbound -1.87, ubound 3.34 - H0 was accepted
Finished match

3moves_Elo2200.epd
Score of ./zurichess vs ./master: 1822 - 1814 - 3614 [0.501] 7250
Elo difference: 0.38 +/- 5.66
SPRT: llr -1.88, lbound -1.87, ubound 3.34 - H0 was accepted
Finished match

3moves_Elo2200.epd needed about 3x more games. The bounds I use are

Code:
Elo0: 0.00  Elo1: 6.00  Alpha: 0.03  Beta: 0.15  LLR: -?.??  [-1.87:+3.34]

Interesting. One run says little (observe the confidence intervals), but if SPRT stops systematically show fewer games with 2moves_v1, then it probably has the better signal-to-noise ratio. I have simulated such SPRT stops in the past; the number of games varies a lot from run to run. But it is possible that for small Elo differences my test suite behaves differently than in my tests with the huge Elo differences that come from doubling the time.
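For reference, those lbound/ubound numbers follow from Alpha and Beta via Wald's approximation; a minimal sketch:

Code:
from math import log

def sprt_bounds(alpha, beta):
    # Wald's approximate SPRT thresholds: accept H0 once LLR <= lower,
    # accept H1 once LLR >= upper.
    lower = log(beta / (1 - alpha))
    upper = log((1 - beta) / alpha)
    return lower, upper

print(sprt_bounds(0.03, 0.15))  # ~(-1.87, +3.34), as reported above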
Dariusz Orzechowski wrote: If someone is interested, I've created a 5-ply (2.5 moves) book in a similar manner to my 2moves_v1 book. It contains over 97 thousand positions, so there is plenty of room to cut it down and optimize it for better properties.

As for "reasonable" openings - this is a very vague term. What would be a rule for filtering out "unreasonable" ones? I cannot think of anything good right now. That a given opening is not played by humans is not a good enough argument. In go, after the recent AlphaGo matches, people started to play openings they had long deemed unreasonable or just bad (for example, the very early 3-3 point invasion). Now they think it's fine to play like that.

5ply_v1.epd book (link expires in 7 days): http://dropmefiles.com/3jk3U

It seems like an interesting research question to find a method to improve the signal-to-noise ratio of a book. The SNR can be measured with Kai's method (a self-match with time odds).
brtzsnr wrote: Here is the first data point

Code:
2moves_v1.pgn
Score of ./zurichess vs ./master: 655 - 693 - 1132 [0.492] 2480
Elo difference: -5.32 +/- 10.07
SPRT: llr -1.89, lbound -1.87, ubound 3.34 - H0 was accepted
Finished match

3moves_Elo2200.epd
Score of ./zurichess vs ./master: 1822 - 1814 - 3614 [0.501] 7250
Elo difference: 0.38 +/- 5.66
SPRT: llr -1.88, lbound -1.87, ubound 3.34 - H0 was accepted
Finished match

3moves_Elo2200.epd needed about 3x more games. The bounds I use are

Code:
Elo0: 0.00  Elo1: 6.00  Alpha: 0.03  Beta: 0.15  LLR: -?.??  [-1.87:+3.34]

I would like to add that, testing almost equal engines this way, you will need a gazillion games to see a difference in sensitivity between opening suites from SPRT stops. I have taken two engines well separated, by some 50 Elo points:
Code:
{164, 167, 60, 149, 435, 315, 180, 252, 269, 140, 307, 150, 108, 218, 138, 230, 419, 185, 128, 167}

Average: 209 +/- 42 games
Median: 173.5 games

Code:
{304, 124, 353, 152, 61, 196, 103, 145, 63, 192, 158, 157, 287, 244, 386, 86, 373, 200, 71, 382}

Average: 202 +/- 47 games
Median: 175 games
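The summary numbers can be reproduced along these lines (a sketch; the +/- appears to be a 95% half-width on the mean):

Code:
import statistics

stops = [164, 167, 60, 149, 435, 315, 180, 252, 269, 140,
         307, 150, 108, 218, 138, 230, 419, 185, 128, 167]
mean = statistics.mean(stops)                               # 209.05
half = 1.96 * statistics.pstdev(stops) / len(stops) ** 0.5  # ~42
median = statistics.median(stops)                           # 173.5
print("Average: %.0f +/- %.0f games, median: %s games" % (mean, half, median))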
2moves_v1.epd:

Code:
Score of Stockfish_210617_2x vs Stockfish_210617_1x: 947 - 101 - 952 [0.712] 2000
ELO difference: 156.81 +/- 10.88

5ply_v1.epd:

Code:
Score of Stockfish_210617_2x vs Stockfish_210617_1x: 976 - 113 - 911 [0.716] 2000
ELO difference: 160.42 +/- 11.19

3moves_Elo2200.epd:

Code:
Score of Stockfish_210617_2x vs Stockfish_210617_1x: 910 - 82 - 1008 [0.707] 2000
ELO difference: 153.02 +/- 10.46
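These Elo figures follow from the standard logistic score-to-Elo conversion; a sketch reproducing the first one:

Code:
from math import log10

def elo(score):
    # Logistic model: Elo difference implied by an average match score.
    return -400 * log10(1 / score - 1)

s = (947 + 0.5 * 952) / 2000  # wins + draws/2 over 2000 games
print(round(elo(s), 2))       # 156.81, matching the value reported above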
Code:
from __future__ import division  # true division on Python 2

def sens(W=None, D=None, L=None):
    # Sensitivity of a book measured from a time-odds self-match:
    # the normalized score (s - 1/2) / sigma per game, bracketed by a
    # 95% confidence interval based on the N games.
    N = W + D + L
    (w, d, l) = (W / N, D / N, L / N)
    s = w + d / 2
    var = w * (1 - s) ** 2 + d * (1 / 2 - s) ** 2 + l * (0 - s) ** 2
    sigma = var ** .5
    return ((s - 1 / 2) / sigma - 1.96 / N ** .5,
            (s - 1 / 2) / sigma,
            (s - 1 / 2) / sigma + 1.96 / N ** .5)
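Applied to the three matches above (W, D, L per book), this presumably produced the tuples below:

Code:
for book, (W, D, L) in [('2moves_v1.epd', (947, 952, 101)),
                        ('5ply_v1.epd', (976, 911, 113)),
                        ('3moves_Elo2200.epd', (910, 1008, 82))]:
    print((book, sens(W=W, D=D, L=L)))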
('2moves_v1.epd', (0.676262000367134, 0.7200889327261298, 0.7639158650851257))
('5ply_v1.epd', (0.677036008282355, 0.7208629406413508, 0.7646898730003466))
('3moves_Elo2200.epd', (0.6828199381932489, 0.7266468705522449, 0.7704738029112408))

On the basis of this test the three books are indistinguishable: their 95% intervals overlap almost completely.
Dariusz Orzechowski wrote: If someone is interested, I've created a 5-ply (2.5 moves) book in a similar manner to my 2moves_v1 book. It contains over 97 thousand positions, so there is plenty of room to cut it down and optimize it for better properties.

As for "reasonable" openings - this is a very vague term. What would be a rule for filtering out "unreasonable" ones? I cannot think of anything good right now. That a given opening is not played by humans is not a good enough argument. In go, after the recent AlphaGo matches, people started to play openings they had long deemed unreasonable or just bad (for example, the very early 3-3 point invasion). Now they think it's fine to play like that.

5ply_v1.epd book (link expires in 7 days): http://dropmefiles.com/3jk3U

AlphaGo was not trained on random openings. Stockfish is literally trained on random 2-movers, which distorts its opening play for some 10 moves. One example showing the distortion is here (for moves 1-12 after the book).
Laskos wrote: AlphaGo was not trained on random openings. Stockfish is literally trained on random 2-movers, which distorts its opening play for some 10 moves.

AlphaGo was just an example showing that we don't really know what "reasonable" means. AG plays inhuman moves while being extremely strong at the same time. If we could prove that a "reasonable" book is better and provide some definition of "reasonability", we could create a better book. The problem is I have no idea how to do this. A "reasonable" book is obviously better for tournament play, but not necessarily for engine development.
Laskos wrote: My goal was to create a suite containing many openings, to have sensitivity on par with (or better than) 2moves_v1.epd, and to contain moves played by humans rated above Elo 2200. Humans at that level are not so crazy as to often play random or very weak moves.

You certainly achieved this goal. But the question now is whether using "crazy" positions in a development book has any harmful effect on playing strength. I don't know how to measure it.