Perft(13) betting pool

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Perft(13) betting pool

Post by petero2 »

Michel wrote:
This was just an experiment to see how far you could get with non-uniform sampling. For the record, I ran a longer test overnight and with a bigger hash table. The standard deviation for the average-of-1000 walks seems to have stabilized at about 3.75e16, compared to 1.05e17 for uniform sampling.
Yes thanks for the experiment!

I agree that the point was to see how much better non-uniform sampling can be, all performance issues aside.

Your SD reduction factor seems to be 2.8.

I am curious to see if the node expansion approach can do better. Theoretically I think it cannot, but who knows.

EDIT:

Just to be absolutely sure: I assume you keep the hash table between the 1000-walk runs, so that the later runs use a hash table with "ideal" weights? (I know the question will sound stupid.)
Yes, I kept the hash table between the 1000-walk runs. The weights seemed to have stabilized after about 70000 1000-walk runs.
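To make the method concrete, here is a minimal, self-contained sketch of this kind of weighted random walk. The toy tree and the helper names are stand-ins, not the linked implementation; a tuned version would read its per-move probabilities from the hash table:

Code: Select all

import java.util.Arrays;
import java.util.Random;

// Sketch of a weighted random-walk tree-size estimator. A toy tree
// stands in for the chess tree; in a perft estimator the children
// would come from the move generator.
public class RandomWalkPerft {
    static final Random rnd = new Random(42);

    // Toy tree: the node with this id has 2 + (id % 4) children.
    static int numChildren(long id) { return 2 + (int) (id % 4); }
    static long child(long id, int i) { return id * 37 + i + 1; }

    // One walk to depth d. The result is an unbiased estimate of the
    // number of depth-d nodes below "id": at each node pick child i
    // with probability p[i] and divide the estimate by p[i].
    static double walk(long id, int d) {
        double estimate = 1.0;
        for (int ply = 0; ply < d; ply++) {
            int n = numChildren(id);
            double[] p = weights(id, n);
            int i = pick(p);
            estimate /= p[i];
            id = child(id, i);
        }
        return estimate;
    }

    // Uniform probabilities as a placeholder; the variance reduction
    // comes from making these non-uniform (hash-table weights).
    static double[] weights(long id, int n) {
        double[] p = new double[n];
        Arrays.fill(p, 1.0 / n);
        return p;
    }

    // Sample an index according to the probability vector p.
    static int pick(double[] p) {
        double r = rnd.nextDouble(), acc = 0.0;
        for (int i = 0; i < p.length; i++)
            if (r < (acc += p[i])) return i;
        return p.length - 1; // guard against rounding
    }

    public static void main(String[] args) {
        int walks = 100000;
        double sum = 0;
        for (int k = 0; k < walks; k++)
            sum += walk(1, 6);
        System.out.println("estimated tree size: " + sum / walks);
    }
}

With uniform weights this is plain Monte Carlo perft; replacing weights() with probabilities proportional to sqrt(x_i^2 + s_i^2), as derived earlier in the thread, is what the hash-table learning aims for.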

I have now tried to use node expansion to combine random walks with subtree splitting where the variance is high. The method doesn't seem to work, but here is what I did:

Each tree node contains an estimate of the subtree size below the node and an estimate of the standard deviation for the subtree size estimate.

1. Start with a single node representing the root position.

2. Find a leaf node by starting at the root and repeatedly moving to the child node with the highest estimated variance, until a leaf node is reached.

3. Split the selected leaf node by adding child nodes for each possible move.

4. Use N random walks for each new child node to get estimates for the tree size and standard deviation. (I have tried N=100 and N=1000.)

5. Propagate the new information back up to the root node, by summing the data for the child nodes.

6. Repeat from step 2.

In step 5, I have also tried to combine the child node data with the data from random walks made before the child nodes were added. I did this as a weighted average of the children's information and the node's own information from random walks.
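A compilable skeleton of that bookkeeping, with estimateByWalks() stubbed out in place of the N real random walks (the actual code is linked at the end of this post):

Code: Select all

import java.util.ArrayList;
import java.util.List;

// Skeleton of the node-expansion bookkeeping in steps 1-6.
public class ExpansionTree {
    static class Node {
        double mean;                 // estimated subtree size
        double sd;                   // standard deviation of that estimate
        List<Node> children = new ArrayList<>();
        boolean isLeaf() { return children.isEmpty(); }
    }

    // Step 2: descend to the leaf with the highest estimated variance.
    static Node selectLeaf(Node root) {
        Node n = root;
        while (!n.isLeaf()) {
            Node best = null;
            for (Node c : n.children)
                if (best == null || c.sd > best.sd)
                    best = c;
            n = best;
        }
        return n;
    }

    // Steps 3-4: add one child per move; estimate each by N walks.
    static void expand(Node leaf, int numMoves, int N) {
        for (int m = 0; m < numMoves; m++) {
            Node c = new Node();
            double[] stats = estimateByWalks(N); // stub, see below
            c.mean = stats[0];
            c.sd = stats[1];
            leaf.children.add(c);
        }
    }

    // Step 5: the children partition the subtree, so means add and
    // variances add. The real code repeats this up to the root.
    static void backup(Node n) {
        double mean = 0, var = 0;
        for (Node c : n.children) {
            mean += c.mean;
            var += c.sd * c.sd;
        }
        n.mean = mean;
        n.sd = Math.sqrt(var);
    }

    // Placeholder for N random walks returning {mean, sd}.
    static double[] estimateByWalks(int N) {
        return new double[] { 1.0, 1.0 };
    }
}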

The problem with this algorithm is that I consistently get estimates that are too small when I try to estimate perft(12). When using N=100 in step 4, I quickly get estimates that are more than 5 standard deviations too low. When I use N=1000 it takes longer for the problem to show up, but eventually it happens in this case too.

If I modify step 2 so that it picks a random child node repeatedly until a leaf node is found, the bias seems to go away. (But this makes the method quite pointless; in that case you might as well just do repeated random walks instead.)

My guess is that step 2 introduces bias because subtrees that happen to have their size underestimated may also tend to have their variance underestimated, thereby causing them to stay around longer as leaf nodes before they are further subdivided. The effect is that at any given point in time, the tree will on average have an over-representation of nodes with underestimated subtree sizes.

Unless someone can come up with an idea how to get an unbiased estimator using subtree splitting, I am going to change my mind and instead believe that an optimized version of my dynamic weight adjustment method will be more efficient.

For the record, here is the code I used: http://web.comhem.se/petero2home/PerfTE ... 2.java.txt
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Perft(13) betting pool

Post by Michel »

My guess is that step 2 introduces bias because subtrees that happen to have their size underestimated may also tend to have their variance underestimated, thereby causing them to stay around longer as leaf nodes before they are further subdivided. The effect is that at any given point in time, the tree will on average have an over-representation of nodes with underestimated subtree sizes.
That seems a reasonable explanation, but it is weird anyway.

I would have stayed with non-uniform weight selection at all costs. The way I see it is the following.

Ultimately you would like to keep statistics for _every_ node you visit (i.e. your hash table approach). Now this is expensive in CPU cycles. So you would somehow like to limit the number of nodes for which you keep statistics to those nodes that are likely to substantially influence the result. This is basically a tradeoff.

Intuition _suggests_ that you should keep statistics at the direct children of nodes which correspond to a large tree, so that in those nodes we can select optimal weights.

I think there are various ways to achieve this:

(1) Keep statistics for all nodes at a certain prechosen distance from the root.

(2) Keep statistics for the children of the nodes whose estimated subtree size is above a certain prechosen threshold (a small sketch follows below).

(3) Dynamically adjust the set of nodes for which stats are kept by some sort of node expansion algorithm.

(1) and (2) seem easy to implement; (3) (the UCT-like approach) seems less clear.
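A minimal sketch of option (2), assuming a hypothetical 64-bit position key and an externally supplied subtree-size estimate:

Code: Select all

import java.util.HashMap;
import java.util.Map;

// Option (2): keep per-move statistics only below nodes whose
// estimated subtree size exceeds a prechosen threshold.
public class ThresholdStats {
    static final double THRESHOLD = 1e9;  // prechosen cutoff
    static final Map<Long, double[]> stats = new HashMap<>();

    // Returns the per-move statistics slot for this node, or null if
    // the node is too small to be worth tracking (use uniform weights).
    static double[] statsFor(long posKey, double estSubtreeSize, int nMoves) {
        if (estSubtreeSize < THRESHOLD)
            return null;
        return stats.computeIfAbsent(posKey, k -> new double[nMoves]);
    }
}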
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Perft(13) betting pool

Post by Daniel Shawul »

Rein Halbersma wrote:
Michel wrote: The optimal weight p_i for move i is proportional to sqrt(x_i^2+s_i^2), where x_i is the size of the resulting subtree and s_i^2 is the variance of the estimator measuring that subtree.

This is a mathematical fact. No assumptions are necessary for
deriving this formula.
Relating to your observation that the sum-rule makes reasoning about MC-perft easier than the min-max rules in MC-search, I have a question.

Is it really true that you don't have to make any assumptions to derive the optimal weights? Not even some central limit theorem kind of thing? After all, the CLT is also a kind of sum-rule for independent stochastic variables. And, if that is somehow relevant, could the extreme value theorem (a kind of min/max rule for independent stochastic variables) be used to infer optimal weights for min-max MC-trees? Just speculation, but it would be nice, no?
I think that an extreme value distribution is appropriate for min-max. Some time ago we discussed how a search with "random" evaluation, with all leaves extending to the same depth, can give a strong engine. The idea is that if you have a node with 20 leaves and another one with only 5, and you take the maximum of random numbers for each leaf, the first node is likely to have a bigger score. So it is a sort of mobility evaluation, informally known as the "Beal effect". There was a long debate about it some time ago.
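A toy simulation of that effect, using nothing chess-specific: with continuous random values the 20-leaf node should outscore the 5-leaf node 20/25 = 80% of the time.

Code: Select all

import java.util.Random;

// "Beal effect" toy: a node maximizing over 20 random leaf values
// usually beats one maximizing over only 5, so a random evaluation
// implicitly rewards mobility.
public class BealEffect {
    public static void main(String[] args) {
        Random rnd = new Random(1);
        int trials = 1000000, wins20 = 0;
        for (int t = 0; t < trials; t++) {
            double max20 = 0, max5 = 0;
            for (int i = 0; i < 20; i++)
                max20 = Math.max(max20, rnd.nextDouble());
            for (int i = 0; i < 5; i++)
                max5 = Math.max(max5, rnd.nextDouble());
            if (max20 > max5) wins20++;
        }
        System.out.println("20-leaf node wins: " + (double) wins20 / trials);
    }
}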
Ajedrecista
Posts: 1968
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Perft(13) betting pool

Post by Ajedrecista »

Hi Daniel:

I took a look at your new method:

http://talkchess.com/forum/viewtopic.ph ... 49&t=39678

The Perft count seems very good (it almost fits in my new interval (1.9804681 ± 0.0002024)e+18, which is a total surprise for me!). I have calculated (I hope there are no errors) the percentage of each of the twenty first possible moves in your sample, and also did the same with Perft(11) and Perft(12) (finally, there is no 'odd-even effect' in the percentages, maybe only in the BFs). More or less, the percentages are these:

Moves: %[Perft(11)] %[Perft(12)] %(Shawul's sample)

a3 2.879568278% 2.904139798% 2.792%
a4 4.054742232% 4.092857514% 4.147%
b3 3.790445879% 3.830269713% 3.662%
b4 3.833779233% 3.837974861% 3.771%
c3 4.397087674% 4.377817668% 4.144%
c4 4.939128103% 4.963636424% 5.003%
d3 7.239429777% 7.300932114% 6.943%
d4 10.08667334% 10.06586933% 10.809%
e3 11.49259878% 11.39230718% 12.158%
e4 11.71984731% 11.55618883% 12.495%
f3 2.460575949% 2.470542707% 2.26%
f4 3.259476824% 3.262699557% 3.203%
g3 3.945500296% 3.975182931% 3.67%
g4 3.526143586% 3.528380922% 3.307%
h3 2.865008494% 2.886435552% 2.836%
h4 4.135097853% 4.169312795% 3.905%
Na3 3.340918005% 3.34358958% 3.165%
Nc3 4.359712572% 4.345721062% 4.421%
Nf3 4.287321687% 4.302520666% 4.048%
Nh3 3.386944137% 3.393620795% 3.262%

SUM: ~ 100% ~ 100% 100.001%

Due to rounding: 100.001% instead of 100%; but the most important thing is the percentage of each initial move. It is noticeable that the percentages of the same move for Perft(11) and Perft(12) are almost the same, without exception, and I have no reason to think that this will change in Perft(13). Although the percentages of Shawul's sample compare quite well with the others, they are a little far from what is expected (IMO); but overall they give a great estimate. I think there is a general consensus on Perft(13) and, almost 100% sure: 1.979e+18 < Perft(13) < 1.982e+18.

So, in my meaningless opinion, Shawul's sample is quite accurate and he should run this method more than once... if he wants, obviously! This topic has brought very interesting estimates from everybody, so please keep up the good work.

Regards from Spain.

Ajedrecista.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Perft(13) betting pool

Post by Daniel Shawul »

Ajedrecista wrote: Hi Daniel:

[...]

So, in my meaningless opinion, Shawul's sample is quite accurate and he should run this method more than once... if he wants, obviously!
Hi Jesus,

Thanks for the encouragement. I think it is very much workable; I just do not have time to tune the parameters well. The major ones are:

Code: Select all

UCTK : parameter that balances exploration and exploitation
UCTN : number of simulations before node expansion
K    : number of simulations for each move at node creation
I did a quick test with up to depth 3 expansion and there doesn't seem to be a problem at all. Of course I will keep working on it and update the results. The result is still pretty close.
Edit: The output is very long so here is the link https://sites.google.com/site/dshawul/log.txt
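For readers unfamiliar with UCT, here is a hedged sketch of how parameters like these usually enter the selection rule; the roles and the constants are guesses for illustration, not Daniel's actual code:

Code: Select all

// Guessed roles of the three parameters in a UCT-style loop.
public class UctPerftParams {
    static final double UCTK = 0.44; // exploration/exploitation balance
    static final int UCTN = 100;     // simulations before a node expands
    static final int K = 20;         // simulations per move at creation

    // UCB-style selection score for a child (parentVisits >= 1):
    // current value plus an exploration bonus that shrinks with visits.
    static double selectionScore(double childValue,
                                 int parentVisits, int childVisits) {
        return childValue
             + UCTK * Math.sqrt(Math.log(parentVisits) / (childVisits + 1));
    }
}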

cheers
Daniel
Ajedrecista
Posts: 1968
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Perft(13) betting pool

Post by Ajedrecista »

Hi again Daniel:

Thanks for your quick response. When I read your post I can do nothing but laugh! I did not understand anything; it is like Chinese to me (no offense to Chinese people or language). I know nothing about programming, but thanks for trying to explain it to somebody that maybe understands it.

I post these percentages because I think they could be useful for comparing, especially when dividing Perft among the first twenty moves. Now your new estimate 1.980549e+18 (I do not know your new percentages for each move, but I sure will not calculate them!) fits in my interval (it is ~ 0.004% greater than my central value). I have a good feeling about it... wait and see, there is nothing more I can do.

Thank you very much for pointing out that your method is tunable, but the question now is: how much does it cost and how much is the benefit? Anyway, great work as usual. This topic has brought nice methods and I am glad I can participate here slightly.

Regards from Spain.

Ajedrecista.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

A test point for the Monte Carlo practitioners

Post by sje »

A test point for the Monte Carlo practitioners:

I don't have perft(13) yet, but I've got parts of it.

[D]r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3

For the above ply 4 position, perft(9) is 23,814,128,415,915.

How well do your approximation algorithms do with this?
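One way to use this test point is to run an estimator on the position and report its error in units of its own claimed standard deviation. A sketch with stubbed calls (estimate() and sd() are hypothetical, standing in for whatever method is being tested):

Code: Select all

// Scoring an approximation algorithm against the known perft(9) value
// of the ply-4 position above.
public class TestPoint {
    static final String FEN =
        "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3";
    static final double TRUE_PERFT9 = 23814128415915.0;

    public static void main(String[] args) {
        double est = estimate(FEN, 9); // hypothetical estimator call
        double sd = sd(FEN, 9);        // its reported standard deviation
        System.out.printf("error: %.2f standard deviations%n",
                          (est - TRUE_PERFT9) / sd);
    }

    // Stubs; a real test plugs in the method under study.
    static double estimate(String fen, int depth) { return TRUE_PERFT9; }
    static double sd(String fen, int depth) { return 1e9; }
}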
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Perft(13) betting pool

Post by Adam Hair »

Adam Hair wrote: I have another Hocus Pocus estimation: 1.9803e+18

I ran a regression on the odd branching factors to obtain an overfitted equation, then used this equation to predict perft(13)/perft(12) = ~31.510522.

I am also deriving an estimate from sampling actual games. It will be much less accurate than Peter's and HGM's estimates, but may shed a little light on the subset of the total tree that has actually been played. For example, I have 37,643 unique positions at ply 12 derived from the rating lists. Mostly these positions come directly from the opening books and starting positions used over the years (most are 6 moves deep or more). The perft(1) calculation performed with these positions yielded an average of ~34.755, higher than the pretty certain estimate of ~31.5 from everybody's calculations (including François Labelle). Just a factoid.

Adam
I revise my guess to 1.98058e+18. This is based on the following overfitted equation, which gives an estimate of the branching factor y at ply x:

y = a + b*ln(x) + c*ln(x)^2 + d*ln(x)^3 + f*ln(x)^4 + g*ln(x)^5

a = 2.0000000000000000E+01
b = 1.1777189669120742E+00
c = 1.7001877870841087E-01
d = 4.6899880199910882E-01
f = 1.7030901119191894E-01
g = -7.1253155574964339E-02
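For the record, evaluating the fitted curve at x = 13 and multiplying by the exact perft(12) reproduces an estimate of this kind (the coefficients are those above; perft(12) is the published exact count):

Code: Select all

// Branching-factor fit: y = a + b*ln(x) + ... + g*ln(x)^5, evaluated
// at ply x = 13, then scaled by the exact perft(12).
public class BranchingFactorFit {
    public static void main(String[] args) {
        double a = 2.0000000000000000E+01, b = 1.1777189669120742E+00,
               c = 1.7001877870841087E-01, d = 4.6899880199910882E-01,
               f = 1.7030901119191894E-01, g = -7.1253155574964339E-02;
        double u = Math.log(13);
        double bf = a + b * u + c * u * u + d * Math.pow(u, 3)
                  + f * Math.pow(u, 4) + g * Math.pow(u, 5);
        double perft12 = 62854969236701747.0; // exact published value
        System.out.println("predicted BF(13):   " + bf);
        System.out.println("perft(13) estimate: " + bf * perft12);
    }
}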


I also checked samples from the FICS games Marcel van Kervinck has provided. Each sample consisted of ~ 50,000 unique positions at ply 12.
I then computed perft(1) for each position. The average perft(1) was greater than 34 for each sample.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Perft(13) betting pool

Post by Michel »

Edit: The output is very long so here is the link https://sites.google.com/site/dshawul/log.txt
Thanks for the data.

It still bothers me that it seems hard to find mathematical reasons for selecting which node to split... Of course one should split the node whose expected influence on the variance will be biggest. But I cannot really quantify this, because it seems hard to predict how much smaller the variance will be after splitting.

Intuitively of course one should split a node with a big tree (or big weight).
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Perft(13) betting pool

Post by Michel »

I thought some more about a good node splitting heuristic. Here is what I came up with.

The idea is the following: we have proved that with ideal weights MC perft gives a variance of 0 (!). Of course choosing ideal weights depends on perfect knowledge of the tree, which we don't have (it is like absolute zero temperature, which requires infinite energy to reach). Nonetheless zero variance is the holy grail.

Suppose we are in a node with subtree sizes x_1,...,x_n. Our estimators for the subtrees have standard deviations s_1,...,s_n.

One can check that with optimal weights (proportional to l_i = sqrt(x_i^2+s_i^2)) the variance of the MC estimator in the node will be

(l_1+...+l_n)^2 - (x_1+...+x_n)^2

Note that if all the s_i are 0 then this propagated variance is indeed zero.

We cannot change x_i, but we can change l_i by tree splitting.

By how much? The very best we can do is bring the variance down to zero; in that case l_i becomes x_i.

So it seems reasonable to choose the tree-splitting node in such a way that

(l_1+...+l_n)^2 - (l_1+...+l_{i-1}+x_i+l_{i+1}+...+l_n)^2

is maximal.

A sensible approximation of this is maximizing

l_i-x_i.

If the s_i are big compared to the x_i then this amounts to splitting the vertex with the biggest variance. If the s_i are small compared to the x_i then this amounts to splitting the vertex with the biggest ratio s_i^2/x_i, since by a Taylor series expansion sqrt(x_i^2+s_i^2) is approximately x_i + s_i^2/(2 x_i). So it is a somewhat subtle heuristic.
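In code form, the selection rule and the propagated-variance formula look as follows (a small sketch; the arrays x and s hold the subtree sizes and the standard deviations of their estimators):

Code: Select all

// Node-splitting heuristic: with l_i = sqrt(x_i^2 + s_i^2), split the
// child with the largest gap l_i - x_i, i.e. the child whose variance,
// if eliminated, would shrink the propagated variance the most.
public class SplitHeuristic {
    static int chooseSplit(double[] x, double[] s) {
        int best = 0;
        double bestGap = -1;
        for (int i = 0; i < x.length; i++) {
            double gap = Math.sqrt(x[i] * x[i] + s[i] * s[i]) - x[i];
            if (gap > bestGap) { bestGap = gap; best = i; }
        }
        return best;
    }

    // Variance of the node's MC estimator under optimal weights:
    // (l_1+...+l_n)^2 - (x_1+...+x_n)^2, zero when all s_i are zero.
    static double propagatedVariance(double[] x, double[] s) {
        double sumL = 0, sumX = 0;
        for (int i = 0; i < x.length; i++) {
            sumL += Math.sqrt(x[i] * x[i] + s[i] * s[i]);
            sumX += x[i];
        }
        return sumL * sumL - sumX * sumX;
    }
}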

I have some free time coming up, so I will implement this (to test the theory).
Last edited by Michel on Sat Jul 30, 2011 9:26 am, edited 1 time in total.