Perft(13) betting pool

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Defect of a node?

Post by Michel »

Wait a second, are you saying the weights are summed too? In that case you shouldn't have said I did it right when testing your formula, because those weights do not have that behavior.
Well they should have this behavior in the limit.

But in finite time the two methods for updating the weights give different results.
I don't see any reason why one method should be better than the other since they are both approximations.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Defect of a node?

Post by hgm »

Daniel Shawul wrote:I think it is easier to resolve this issue by running tests. Can you run one test with your default acceptance probability of 1/32, and another with Peter's modification of selecting exactly one random move and multiplying by the number of legal moves? The test should use a fixed time control, selected so that it is equivalent to the time it takes you to run 100 million simulations. One test starting from depth=0 and another from depth=5. Is that alright?
I agree about the tests, but it is not so easy for me to implement Peter's method, since qperft counts in the leaves, and does not propagate counts to the root. But I don't think it is important to normalize by time; it should be good enough to count the number of nodes (move generations), and give the error as a function of that. We can easily do that each in our own methods.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Defect of a node?

Post by Daniel Shawul »

Michel wrote:
Wait a second, are you saying the weights are summed too? In that case you shouldn't have said I did it right when testing your formula, because those weights do not have that behavior.
Well they should have this behavior in the limit.

But in finite time the two methods for updating the weights give different results.
I don't see any reason why one method should be better than the other since they are both approximations.
I think they are two different weighting schemes and cannot become the same in the limit. I may be wrong, but take the following example with a weighting scheme of sqrt(mean):

Mean nodes count:
ply 3: (4,4,4,4) and (8,8)
ply 2: (16, 16)
ply 1: (32)

Weight updates:
ply 3: (2,2,2,2) and (sqrt(8), sqrt(8))
ply 2: (8, 2*sqrt(8))

Weight calculated from scratch.
ply 3: (2,2,2,2) and (sqrt(8), sqrt(8))
ply 2: (4,4)

So at ply 2 we have equal weights for both moves when calculated from scratch, but significantly different weights when back-propagated.
from scratch: 4/8 = 0.5
back propagated: 8/(8 + 5.66) and 5.66 / (8 + 5.66)
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Defect of a node?

Post by Daniel Shawul »

hgm wrote:
Daniel Shawul wrote:I think it is easier to resolve this issue by running tests. Can you run one test with your default acceptance probability of 1/32, and another with Peter's modification of selecting exactly one random move and multiplying by the number of legal moves? The test should use a fixed time control, selected so that it is equivalent to the time it takes you to run 100 million simulations. One test starting from depth=0 and another from depth=5. Is that alright?
I agree about the tests, but it is not so easy for me to implement Peter's method, since qperft counts in the leaves, and does not propagate counts to the root. But I don't think it is important to normalize by time; it should be good enough to count the number of nodes (move generations), and give the error as a function of that. We can easily do that each in our own methods.
Ok. I will try to implement your method and do the test with both methods in the weekend. There should be some optimal value of N at least for the current perft(13) competition.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Defect of a node?

Post by Michel »

I think they are two different weighting schemes and cannot become the same in the limit.


Let us define the perfect weight of a random variable as sqrt(mean^2+variance).

I can prove that if you have a series of random variables

X_1, ..., X_n

and you apply the non-uniform version of Peter's algorithm with the perfect weights
for the X_i, then the perfect weight of the resulting random variable will be the sum of the perfect weights of the X_i.

Concretely if the weights of the children of a node converge to the perfect weights
then the weight of the node (updated using the propagated mean and variance) will converge to the sum of those weights.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Defect of a node?

Post by Daniel Shawul »

Michel wrote:
I think they are two different weighting schemes and cannot become the same in the limit.


Let us define the perfect weight of a random variable as sqrt(mean^2+variance).

I can prove that if you have a series of random variables

X_1, ..., X_n

and you apply the non-uniform version of Peter's algorithm with the perfect weights
for the X_i, then the perfect weight of the resulting random variable will be the sum of the perfect weights of the X_i.

Concretely if the weights of the children of a node converge to the perfect weights
then the weight of the node (updated using the propagated mean and variance) will converge to the sum of those weights.
Well you said the sqrt is irrelevant but clearly the example says otherwise (unless you find something wrong with it). That was my objection. I know you want to discuss the case where variance goes to zero, so that you would have { sqrt(mean^2 + variance) = mean }, thereby avoiding the sqrt! But that is just academic, since we never reach that point. If you keep using the wrong weights in the meantime, how are you going to bring down the variance quickly?
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Defect of a node?

Post by petero2 »

hgm wrote:
Daniel Shawul wrote:I think it is easier to resolve this issue by running tests. Can you run one test with your default acceptance probability of 1/32, and another with Peter's modification of selecting exactly one random move and multiplying by the number of legal moves? The test should use a fixed time control, selected so that it is equivalent to the time it takes you to run 100 million simulations. One test starting from depth=0 and another from depth=5. Is that alright?
I agree about the tests, but it is not so easy for me to implement Peter's method, since qperft counts in the leaves, and does not propagate counts to the root. But I don't think it is important to normalize by time; it should be good enough to count the number of nodes (move generations), and give the error as a function of that. We can easily do that each in our own methods.
I would like to try your method. If I understood correctly, your estimate at a node is computed like this:

(number of legal moves) * (sum of visited children) / (number of visited children)

However, with a fixed move acceptance probability, it can happen that "number of visited children" is zero, so the above formula would give 0/0. How do you deal with that? Do you return a special value and then handle it in the parent node?
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Defect of a node?

Post by hgm »

No, I don't do any node estimates. The nodes are completely unaware of anything. I just have a global counter that counts leaf nodes:

Code:

int accepted[10], rejected[10], count, nodes;

perft(n)
{
    nodes++;
    for(ALL_MOVES) {
      if(n==1) count++; else
      if(n <= SELECTIVITY && (random() & 31) != 0) rejected[n]++; else {
        accepted[n]++;
        MakeMove();
        perft(n-1);
        UnMake();
      }
    }
}
and afterwards I multiply count by (1 + rejected[n]/accepted[n]) for all n. As I never apply selectivity in the root, the chance that all moves are rejected at an entire level is astronomically small.

Btw, just to be sure, I implemented the Mersenne Twister (for again a 20% slowdown) instead of my crappy PRNG, and obtained the following empirical results with 500M nodes:

Code:

1/32 acceptance

3-ply full width:
14021. mean = 6.286692e+16 SD =   2.664580% (  0.022503%) nodes=499991084
14022. mean = 6.286672e+16 SD =   2.664774% (  0.022504%) nodes=500025967

4-ply full width:
 587. mean = 6.284372e+16 SD =   0.424240% (  0.017510%) nodes=499938176
 588. mean = 6.284380e+16 SD =   0.423888% (  0.017481%) nodes=500789873

5-ply full-width:
  23. mean = 6.286163e+16 SD =   0.058986% (  0.012299%) nodes=479736315
  24. mean = 6.286130e+16 SD =   0.057748% (  0.011788%) nodes=500595954

6-ply full-width:
   1. mean = 6.284515e+16 SD =       -nan% (      -nan%) nodes=510055595
SD is the empirical standard deviation of a single perft estimate; in parentheses behind it is the implied SD for the average of the entire run (i.e. divided by sqrt(N), with N the number of runs, which is the first number on the line). The 6-ply full-width run I did only once, so there is no empirical SD for it.

As you can see, the randomness at the lower levels contributes significantly to the variance. If you force all moves at ply 4 to be picked equally often (by making that ply full-width), the SD drops from 0.0225% to 0.0175% for the same number of samples. That is a variance reduction of 37%.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Defect of a node?

Post by Daniel Shawul »

My 10-hour run is not going to finish anytime soon, now that I have restarted it midway through because I realized I had the wrong parameter set. How long did it take you to do 500 million simulations from depth = 0? For depth 0, I think the SD will be at least 1e13 for perft(12). I had 8.5e12 in a previous run when I expanded only the root moves. That means (8.5e12 / 6.285e16) * 100% = 0.0135%, which would put it between the 4-ply and 5-ply results?? But I doubt that is the case. Can you explain the formula you used to calculate the SD? Note that in our case back-propagating variance to internal nodes assumes the CLT, so that may cause some differences from your estimates. That is why I wanted to do 0 ply, to avoid the back-propagation step. If only the root is expanded we have 5 million simulations for each move, so summing is fine, I guess. The result will probably be due tomorrow if all goes well.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Defect of a node?

Post by Michel »

Well you said the sqrt is irrelevant but clearly the example says otherwise (unless you find something wrong with it).
The example is wrong since it does only one step. My theorem is for what happens in the limit.

The limit does not imply variance=0.

I could write out the proof (which is trivial), but mathematics on this forum is unreadable since it does not support MathML.
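For completeness, here is a reconstruction of the proof Michel alludes to, written in LaTeX (my notation, filled in from the definitions he gives earlier in the thread):

```latex
% Perfect weight, as Michel defines it:
\[
  w(X) \;=\; \sqrt{\mathbb{E}[X]^2 + \operatorname{Var}(X)} \;=\; \sqrt{\mathbb{E}[X^2]}.
\]
% Let X_1, ..., X_n have perfect weights w_i = w(X_i), and put W = sum_i w_i.
% The non-uniform step picks index i with probability p_i = w_i / W and
% returns Y = X_i / p_i, an unbiased estimate of sum_i E[X_i]. Then
\[
  \mathbb{E}[Y^2] \;=\; \sum_i p_i \,\frac{\mathbb{E}[X_i^2]}{p_i^2}
  \;=\; \sum_i \frac{w_i^2}{\,w_i/W\,}
  \;=\; W \sum_i w_i \;=\; W^2,
\]
% so w(Y) = sqrt(E[Y^2]) = W: the perfect weight of the combined estimator
% is exactly the sum of the children's perfect weights, as claimed.
```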