Opening book from a statistical point of view

Zenmastur · Post by **Zenmastur** » Sat Jul 30, 2016 11:08 am

Dann Corbit wrote:First, if you have less than 32 games, the statistical significance of wins and losses is so small you should ignore it and rely purely on the computer evaluation.

I disagree. For example, if one move has been played 4 times and scored 4 points, while some other move from the same position has been played 4 times and scored 0 points, I believe the data is statistically significant and the former move should be preferred over the latter. The Wilson 95% score interval for the first is 0.51 to 1.00, while for the latter it's 0.00 to 0.49. This is enough to exclude the second move from consideration. This assumes all games were played by the same engine under the same time controls against approximately equal opponents. Any move whose interval's maximum value is less than the minimum value of the best move's interval can safely be excluded from further consideration. When using a 95% confidence interval, as few as 8 games are useful for excluding moves from further consideration in this way.
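
A minimal Python sketch of the exclusion rule described above. The Wilson formula itself is standard; the function names, the move labels, and the choice to treat a score that may contain half points as if it were a plain binomial proportion are assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(points, games, confidence=0.95):
    """Return (low, high) of the Wilson score interval for points/games."""
    if games <= 0:
        raise ValueError("need at least one game")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided normal quantile
    p = points / games
    denom = 1.0 + z * z / games
    centre = (p + z * z / (2.0 * games)) / denom
    half = (z / denom) * sqrt(p * (1.0 - p) / games + z * z / (4.0 * games * games))
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


def surviving_moves(stats, confidence=0.95):
    """Drop every move whose upper bound is below the best lower bound.

    stats maps a move name to (points, games).
    """
    intervals = {m: wilson_interval(p, g, confidence) for m, (p, g) in stats.items()}
    best_low = max(low for low, _ in intervals.values())
    return [m for m, (_, high) in intervals.items() if high >= best_low]


# Hypothetical move names; the counts are the 4/4 and 0/4 case from the text.
example = {"move_a": (4, 4), "move_b": (0, 4)}
for move, (points, games) in example.items():
    low, high = wilson_interval(points, games)
    print(f"{move}: {low:.2f} to {high:.2f}")
print(surviving_moves(example))
```

With the 4/4 and 0/4 counts, the two intervals do not overlap, so only the first move survives the filter.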

There is no requirement to use a 95% confidence interval. A 50% interval is adequate. The advantage of using such a loose interval is that it requires fewer games played from any position to be useful. Some may object to this, but I would note that it's by far the most efficient use of resources. Analyzing opening positions is very time-consuming and resource-heavy. Playing full games is even more resource-dependent. A good example of using a 50% confidence interval in a critical situation is the US military's almost exclusive use of a 50% interval in their Joint Munitions Effectiveness Manuals (JMEMs), which are used to calculate how many weapons should be used and how they are employed to achieve the CINC's intent. Military resources are at least as restricted as an engine programmer's, and few situations are more critical than when someone's life is on the line.
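
As a rough illustration of why a looser interval needs fewer games, consider the simple normal approximation (an assumption made here for readability; the Wilson interval behaves similarly once there are more than a handful of games). For a score fraction p after n games, the half-width h of a two-sided interval with normal quantile z satisfies:

```latex
h = z \sqrt{\frac{p(1-p)}{n}}
\quad\Longrightarrow\quad
n = \frac{z^{2}\, p(1-p)}{h^{2}},
\qquad
\frac{n_{95\%}}{n_{50\%}}
  = \left(\frac{z_{95\%}}{z_{50\%}}\right)^{2}
  \approx \left(\frac{1.960}{0.674}\right)^{2}
  \approx 8.4
```

So, for the same interval width, a 50% interval needs roughly one eighth of the games that a 95% interval does. The price is that the true value is expected to fall outside the interval about half of the time.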

Dann Corbit wrote:If the computer evaluation is so shallow that at the current time control you can out-think it, then you are officially out of book.

I think the question you are asking is: how do you know where the breakpoint is for trusting the data?

Clearly, this depends on how good your data is.
For instance from human games, is the data from correspondence chess between world championship candidates? Is it from FICS games between 'beanhead' and 'gizmo' with Elo of 800 and 750?

Is the data from bullet games? Is the data from TCEC games?

If you separate the data you collect into wins/losses and draws by type, then you can gain a lot more from it.

I'm not a big fan of statistical methods applied to human games. The primary problem is that the data violates most, if not ALL, of the assumptions that most statistical methods make about their data, so the results are of dubious value. This is, no doubt, one of the reasons books based on human games are chock-full of losing lines of play.

Dann Corbit wrote:You should also store your own engine's wins/losses/draws from the given position. Even if it is a good position, you may want to avoid it if your engine does not play it well.

A very good point! You should never mix other games in with the engine's games in the statistics area of the book. Always keep your engine's statistics separate from all other games. Mixing dilutes them to the point that they become as worthless and as dubious as the rest of the data.
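
One possible way to keep the counters apart is to store a separate win/draw/loss record per data source for every book move. The sketch below is only an illustration: the class names and the source labels ("own_engine", "human_otb") are hypothetical and do not come from any existing book format.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class WDL:
    wins: int = 0
    draws: int = 0
    losses: int = 0

    def add(self, result):
        """result is 'win', 'draw' or 'loss' from the mover's point of view."""
        if result == "win":
            self.wins += 1
        elif result == "draw":
            self.draws += 1
        elif result == "loss":
            self.losses += 1
        else:
            raise ValueError(f"unknown result: {result!r}")

    @property
    def games(self):
        return self.wins + self.draws + self.losses


@dataclass
class BookMove:
    # One counter per data source; the counters are never merged.
    by_source: dict = field(default_factory=lambda: defaultdict(WDL))

    def record(self, source, result):
        self.by_source[source].add(result)


move = BookMove()
move.record("own_engine", "win")
move.record("own_engine", "draw")
move.record("human_otb", "loss")
print(move.by_source["own_engine"], move.by_source["human_otb"])
```

A move-selection routine can then read only the engine's own counter, or weight the sources differently, without the larger collections swamping the engine's own results.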

Regards,

Forrest

stegemma · Post by **stegemma** » Sat Jul 30, 2016 12:09 pm

Here are the results of the first real test. For historical reasons, I test against LarsenVB, which is a good program, stronger than Satana but not so strong that it always wins:

Code: Select all

SATANA/LVB	NO	BK	TT
NOTHING    	2	16	11
BK         	16	*	9
TT         	11	9	*

With no book and no TT, Satana scores only 2 points. With the book only, it scores 16 points in 40 games (60 s + 100 ms). The TT is bugged, so I don't care about those results. A total of 120 games is not very many, but I think it gives a hint about the direction to take.

That means that having the book seems to be better than having no book at all, for Satana vs LVB. It also means that the book implementation and the collected data are good enough, in this context.

Ajedrecista · Post by **Ajedrecista** » Sat Jul 30, 2016 12:22 pm

Hello again:

Folkert's post refers to what I already wrote. Edmund's solution looks more difficult at first glance but it can be given a go eventually.

I used the online calculator that I linked to yesterday. The moves are ordered from highest to lowest lower bound of the Wilson score interval (95% confidence level, rounded to 0.01%), from White's point of view:

Code: Select all

N/A: not available.

lb Wsi  hash                            white_score     black_score        %

57.17%  # A6F5BE14996CE34B Ng1-f3        141781          105371          57.37%
56.02%  # A9D5495D9934AFB0 Pg2-g3          4782            3596          57.08%
55.39%  # 80DCEC6D58853A63 Pd2-d4        686201          550747          55.48%
54.84%  # 9DE3F5BD01E05A54 Pc2-c4        198730          162554          55.01%
53.36%  # 98A8776C1A9B3E65 Pe2-e4       1271383         1108327          53.43%
50.10%  # 46D5AEC46E7F602A Pb2-b3         11438           11100          50.75%
50.08%  # A9CD43D265BB27C1 Nb1-c3          8670            8386          50.83%
46.38%  # 9DD7B0A819627060 Pf2-f4         13996           15814          46.95%
45.50%  # 9B31EA80413BAD98 Pb2-b4          3965            4551          46.56%
44.09%  # B6D55CBAA963D796 Pd2-d3            40              32          55.56%
38.10%  # B6CD5B84A78F4849 Pc2-c3            32              32          50.00%
34.24%  # A6CABC12D080A7CD Pf2-f3             2               0         100.00%
34.24%  # A65545D9385B5A3B Ph2-h3             2               0         100.00%
30.06%  # 26B24248B0B213A7 Ng1-h3             3               1          75.00%
26.82%  # A6CB76195BE75793 Pe2-e3            33              59          35.87%
 N/A    # A2D57A5C5D03A1B4 Nb1-a3             0               0           N/A
 N/A    # E6B53D23459CEBA5 Pa2-a3             0               0           N/A
 N/A    # 9EEDB481C8A96A61 Pa2-a4             0               0           N/A
 N/A    # 804C915C8920E580 Pg2-g4             0               0           N/A
 N/A    # 9FCA61F1E64C0CF0 Ph2-h4             0               0           N/A

I hope there are no typos. It is curious to see the poor score of 1.e3, since it can easily transpose to d4 lines with solid play... I know, too few games.

It would be nice to compare these results with Edmund's proposal.
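
For anyone who wants to reproduce the ordering without the online calculator, here is a small Python sketch. It takes white_score and black_score exactly as they appear in the table and uses their sum as the sample size, which is an assumption about how the table was computed. The function name is hypothetical, and only a few rows are included to keep it short.

```python
from math import sqrt

Z_95 = 1.959964  # two-sided 95% normal quantile


def wilson_lower_bound(white, black, z=Z_95):
    """Lower bound of the Wilson score interval for white / (white + black)."""
    n = white + black
    if n == 0:
        return None  # no data, shown as N/A in the table
    p = white / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n)


# (white_score, black_score) for a few moves from the table above
scores = {
    "Ng1-f3": (141781, 105371),
    "Pe2-e4": (1271383, 1108327),
    "Pd2-d3": (40, 32),
    "Pf2-f3": (2, 0),
    "Ng1-h3": (3, 1),
    "Nb1-a3": (0, 0),
}

bounds = {move: wilson_lower_bound(w, b) for move, (w, b) in scores.items()}
ranked = sorted((m for m in bounds if bounds[m] is not None),
                key=lambda m: bounds[m], reverse=True)
for move in ranked:
    print(f"{100 * bounds[move]:6.2f}%  {move}")
```

Moves without any games return None and are left out of the ranking, which matches the N/A rows.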

--------------------------------------------------

jdart wrote:I use something like:

frequency*winloss/(total frequency)

as the relative weight. But in addition I apply a "selectivity" value. If that value is high, weights of moves besides the best move are decreased in relative score; if the value is low they are boosted. And I also decrease even further the weight of moves with very bad winloss, in an amount based on the selectivity value.

If I understood correctly, and choosing only the frequency formula for relative weights, using Excel and rounding to 1e-4 (except for numbers < 1e-4):

Code: Select all

SUM(relative weights) ~ 0.542963818
(Normalized relative weight) = (Relative weight)/SUM(relative weights)

Rel.W.  N.R.W.   Move

0.2949  0.5431  Pe2-e4
0.1592  0.2931  Pd2-d4
0.0461  0.0849  Pc2-c4
0.0329  0.0606  Ng1-f3
0.0032  0.0060  Pf2-f4
0.0027  0.0049  Pb2-b3
0.0020  0.0037  Nb1-c3
0.0011  0.0020  Pg2-g3
0.0009  0.0017  Pb2-b4
9.3e-6  1.7e-5  Pd2-d3
7.7e-6  1.4e-5  Pe2-e3
7.4e-6  1.4e-5  Pc2-c3
7.0e-7  1.3e-6  Ng1-h3
4.6e-7  8.5e-7  Pf2-f3
4.6e-7  8.5e-7  Ph2-h3
 N/A     N/A    Nb1-a3
 N/A     N/A    Pa2-a3
 N/A     N/A    Pa2-a4
 N/A     N/A    Pg2-g4
 N/A     N/A    Ph2-h4
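
The spreadsheet calculation can be sketched in Python as follows. This follows the reading of the formula used in this post (frequency = white_score + black_score, winloss = white_score / frequency), which is an interpretation and not necessarily how jdart's engine computes it. The selectivity adjustment is not modelled, and moves without games are left out.

```python
# (white_score, black_score) per move, copied from the first table above
scores = {
    "Ng1-f3": (141781, 105371),
    "Pg2-g3": (4782, 3596),
    "Pd2-d4": (686201, 550747),
    "Pc2-c4": (198730, 162554),
    "Pe2-e4": (1271383, 1108327),
    "Pb2-b3": (11438, 11100),
    "Nb1-c3": (8670, 8386),
    "Pf2-f4": (13996, 15814),
    "Pb2-b4": (3965, 4551),
    "Pd2-d3": (40, 32),
    "Pc2-c3": (32, 32),
    "Pf2-f3": (2, 0),
    "Ph2-h3": (2, 0),
    "Ng1-h3": (3, 1),
    "Pe2-e3": (33, 59),
}

total = sum(w + b for w, b in scores.values())  # "total frequency"


def relative_weight(white, black):
    """frequency * winloss / (total frequency)."""
    frequency = white + black
    winloss = white / frequency
    return frequency * winloss / total


weights = {move: relative_weight(w, b) for move, (w, b) in scores.items()}
weight_sum = sum(weights.values())
normalized = {move: wt / weight_sum for move, wt in weights.items()}

for move in sorted(weights, key=weights.get, reverse=True)[:4]:
    print(f"{weights[move]:.4f}  {normalized[move]:.4f}  {move}")
```

Under this reading the relative weight reduces to white_score divided by the total frequency, so a move's weight is driven mostly by how often it was played.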

I see a drawback here: the relative frequency can dominate the weight. A made-up example:

Code: Select all

Move   White   Black
A       9900    9900
B        120      80

Rel.W.(A) = 0.495; N.R.W.(A) ~ 0.988
Rel.W.(B) = 0.006; N.R.W.(B) ~ 0.012

lb Wsi (A, 95%) ~ 49.30%
lb Wsi (B, 95%) ~ 53.08%

It can be biased towards very popular lines. I know that Jon's idea can be more accurate with selectivity values and the other things he explained, and that it may be intended more for a multi-move choosing algorithm than for a best-move assessment algorithm.

Regards from Spain.

Ajedrecista.

Edmund · Post by **Edmund** » Sat Jul 30, 2016 1:10 pm

Ajedrecista wrote:Hello again:

Folkert's post refers to what I already wrote. Edmund's solution looks more difficult at first glance but it can be given a go eventually.

One advantage of my proposal is that you can take into account the relative strength of the players. I.e. if a weaker player manages to win with a certain move it has to be an even stronger move.

Anyway for comparison purposes we can also assume equal strength opponents. If you provide me with win/draw/loss counts per move for your sample I could plug them into my model and compare with your output.

Ajedrecista · Post by **Ajedrecista** » Sat Jul 30, 2016 1:24 pm

Hello Edmund:

Edmund wrote:Anyway for comparison purposes we can also assume equal strength opponents. If you provide me with win/draw/loss counts per move for your sample I could plug them into my model and compare with your output.

It is up to Stefano because they are his stats. I only did calculations with the data provided by Stefano.

I have a doubt about his stats: do white and black mean white points and black points with the usual {win, draw, lose} = {1, 0.5, 0} scoring system, or is he using {2, 1, 0} or something like that? I ask because it is very strange to find only integer values.

Regards from Spain.

Ajedrecista.

stegemma · Post by **stegemma** » Sat Jul 30, 2016 3:16 pm

Ajedrecista wrote:Hello Edmund:

Edmund wrote:Anyway for comparison purposes we can also assume equal strength opponents. If you provide me with win/draw/loss counts per move for your sample I could plug them into my model and compare with your output.
It is up to Stefano because they are his stats. I only did calculations with the data provided by Stefano.

I have a doubt about his stats: do white and black mean white points and black points with the usual {win, draw, lose} = {1, 0.5, 0} scoring system, or is he using {2, 1, 0} or something like that? I ask because it is very strange to find only integer values.

Regards from Spain.

Ajedrecista.

I use { +2, +1, 0 }, and even this could be something to think about. If we want to reward winning moves, we could use { +3, +1, 0 }, for example, but that is another story.

Edmund · Post by **Edmund** » Sat Jul 30, 2016 3:19 pm

stegemma wrote:I use { +2, +1, 0 }, and even this could be something to think about. If we want to reward winning moves, we could use { +3, +1, 0 }, for example, but that is another story.

Could you maybe report win/draw/losses separately so that I can run my model to replicate results?

stegemma · Post by **stegemma** » Sat Jul 30, 2016 3:26 pm

Edmund wrote:
stegemma wrote:I use { +2, +1, 0 }, and even this could be something to think about. If we want to reward winning moves, we could use { +3, +1, 0 }, for example, but that is another story.
Could you maybe report win/draw/losses separately so that I can run my model to replicate results?

Sure... just let me finish a test tourney and I can provide any data you like.

I was thinking of providing the code as a DLL, if somebody wants to experiment with this kind of book.

Dann Corbit · Post by **Dann Corbit** » Sat Jul 30, 2016 6:27 pm

Zenmastur wrote:
Dann Corbit wrote:First, if you have less than 32 games, the statistical significance of wins and losses is so small you should ignore it and rely purely on the computer evaluation.
I disagree. For example, if one move has been played 4 times and scored 4 points, while some other move from the same position has been played 4 times and scored 0 points, I believe the data is statistically significant and the former move should be preferred over the latter. The Wilson 95% score interval for the first is 0.51 to 1.00, while for the latter it's 0.00 to 0.49. This is enough to exclude the second move from consideration. This assumes all games were played by the same engine under the same time controls against approximately equal opponents. Any move whose interval's maximum value is less than the minimum value of the best move's interval can safely be excluded from further consideration. When using a 95% confidence interval, as few as 8 games are useful for excluding moves from further consideration in this way.

There is no requirement to use a 95% confidence interval. A 50% interval is adequate. The advantage of using such a loose interval is that it requires fewer games played from any position to be useful. Some may object to this, but I would note that it's by far the most efficient use of resources. Analyzing opening positions is very time-consuming and resource-heavy. Playing full games is even more resource-dependent. A good example of using a 50% confidence interval in a critical situation is the US military's almost exclusive use of a 50% interval in their Joint Munitions Effectiveness Manuals (JMEMs), which are used to calculate how many weapons should be used and how they are employed to achieve the CINC's intent. Military resources are at least as restricted as an engine programmer's, and few situations are more critical than when someone's life is on the line.
{snip}

I have found thousands of instances where a move is played simply due to tradition but in reality, there are much better alternatives or the game is even lost somewhere along the traditional chain.

People have played a sequence of moves in a standardized opening for years and years and a simple 36 ply search would have shown that there is a serious error somewhere in the sequence of moves that renders the entire chain utterly invalid.

stegemma · Post by **stegemma** » Sat Jul 30, 2016 7:24 pm

Edmund wrote:
stegemma wrote:I use { +2, +1, 0 }, and even this could be something to think about. If we want to reward winning moves, we could use { +3, +1, 0 }, for example, but that is another story.
Could you maybe report win/draw/losses separately so that I can run my model to replicate results?

I've added the draw count to my book (it could be useful in the future, maybe):

Code: Select all

hash	move	win	draw	lose
26B24248B0B213A7	Ng1-h3	1	1	0
A6F5BE14996CE34B	Ng1-f3	47608	46584	29403
A9CD43D265BB27C1	Nb1-c3	3233	2204	3091
A2D57A5C5D03A1B4	Nb1-a3	0	0	0
A65545D9385B5A3B	Ph2-h3	1	0	0
A9D5495D9934AFB0	Pg2-g3	1581	1620	988
A6CABC12D080A7CD	Pf2-f3	1	0	0
A6CB76195BE75793	Pe2-e3	8	17	21
B6D55CBAA963D796	Pd2-d3	15	10	11
B6CD5B84A78F4849	Pc2-c3	10	12	10
46D5AEC46E7F602A	Pb2-b3	4062	3314	3893
E6B53D23459CEBA5	Pa2-a3	0	0	0
9FCA61F1E64C0CF0	Ph2-h4	0	0	0
804C915C8920E580	Pg2-g4	0	0	0
9DD7B0A819627060	Pf2-f4	5112	3775	6021
98A8776C1A9B3E65	Pe2-e4	453627	364323	372098
80DCEC6D58853A63	Pd2-d4	239725	206864	171998
9DE3F5BD01E05A54	Pc2-c4	68170	62414	50082
9B31EA80413BAD98	Pb2-b4	1446	1074	1739
9EEDB481C8A96A61	Pa2-a4	0	0	0

Some counts may differ slightly from the previous ones.
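
To relate these win/draw/lose counts to the white and black scores posted earlier in the thread, the points can be recomputed under any scoring scheme. The sketch below is only an illustration: the function name and the scheme parameter are hypothetical, and only a few rows of the table are included.

```python
def points(win, draw, lose, scheme=(2, 1, 0)):
    """Return (white_points, black_points) for a (win, draw, loss) scoring scheme."""
    w, d, l = scheme
    white = w * win + d * draw + l * lose
    black = w * lose + d * draw + l * win
    return white, black


# (win, draw, lose) rows from the table above
rows = {
    "Ng1-h3": (1, 1, 0),
    "Pe2-e3": (8, 17, 21),
    "Pd2-d3": (15, 10, 11),
    "Pc2-c3": (10, 12, 10),
}

for move, (win, draw, lose) in rows.items():
    games = win + draw + lose
    white, black = points(win, draw, lose)
    alt = points(win, draw, lose, (3, 1, 0))
    print(f"{move}: {games} games, {white}-{black} under (2, 1, 0), "
          f"{alt[0]}-{alt[1]} under (3, 1, 0)")
```

Under (2, 1, 0) the two totals always add up to twice the number of games, which is worth keeping in mind when a points total is used as the sample size in an interval formula.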
