Perft(13) betting pool

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Defect of a node?

Post by petero2 »

Daniel Shawul wrote: This has become the new number one! It improved the leader of both the limited splitting version and the free splitting version by the tiniest of margins possible :) But it clearly makes the point that the accepted concept of "picking the node with the biggest variance" is wrong. It should have been defined as "pick the node that results in the biggest reduction in variance". The smallest node could be picked, depending on the number of visits. The improvement is minute, but it is still an improvement.

Limited split version:
Mean   6.284999E+16
StdDev 5.023038E+12
Error  4.908E+12

Free split version:
Mean   6.284786E+16
StdDev 4.227828E+12
Error  7.04E+12

For comparison, the previous best had 5.024237E+12 and 4.228939E+12, so you can see the improvements are only in the third significant digit in both cases! What matters is that they are consistent.
With my method, http://talkchess.com/forum/viewtopic.ph ... 25&t=39678, I get a standard deviation of 3.9826e+12 after 100e6 random walks. If I split the data in 100 chunks with 1e6 random walks in each chunk and compute the standard deviation for each chunk, I get:

[Image: plot of the standard deviation for each 1e6-walk chunk]
This shows that the weights are still improving after 100e6 random walks. If I could make the algorithm learn the optimum weights faster, my standard deviation would be even lower.
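For reference, the identity that makes any choice of positive weights legitimate is standard importance sampling (my notation here, not from the linked post): if the walk picks move $m_k$ at ply $k$ with probability $p_{m_k} > 0$, then

$$X = \prod_{k=1}^{d} \frac{1}{p_{m_k}}, \qquad E[X] = \sum_{\text{paths}} \Big(\prod_{k=1}^{d} p_{m_k}\Big) \prod_{k=1}^{d} \frac{1}{p_{m_k}} = \sum_{\text{paths}} 1 = \mathrm{perft}(d).$$

So the weights affect only the variance, never the mean; better weights just tighten the error bars.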
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Yet another piece of perft(13)

Post by sje »

And the second perft(11) subtotal:
[D]rnbqkbnr/1ppppppp/p7/8/8/2N5/PPPPPPPP/R1BQKBNR w KQkq - 0 2
perft(11): 2,458,021,022,756,805

That's two down and 398 to go.

Transposition table stores after 119 hours:

0 draft 12
2 draft 11
100 draft 10
1847 draft 9
28391 draft 8

Predicted completion date: December 2011
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Defect of a node?

Post by Daniel Shawul »

petero2 wrote: With my method, http://talkchess.com/forum/viewtopic.ph ... 25&t=39678, I get a standard deviation of 3.9826e+12 after 100e6 random walks. If I split the data in 100 chunks with 1e6 random walks in each chunk and compute the standard deviation for each chunk, I get: [plot omitted]
This shows that the weights are still improving after 100e6 random walks. If I could make the algorithm learn the optimum weights faster, my standard deviation would be even lower.
First of all, thanks for the data and plot. It helps to know I am on the right track. I will try to extract the standard deviation from the log file I have and make a plot. I suppose your method is the one which uses the hash tables for storing weights in the MC part of the tree alone. But I think that in your case too you don't use full-width search depth (starting from the root), so you can also get some of the benefits.
Since the algorithms used are different, I think that a time comparison would be more appropriate. But I understand that is difficult to do right now, and it is better to do the comparison with a fixed number of simulations. My method doesn't actually use all of the 100 mil simulations. If you look at the detailed data, it only used 72 mil simulations, with about 30% wasted. I may be able to avoid that wastage, because the only reason I did it was to avoid bias of the mean in the previous method. If that works, it should get the stddev down to the 3.9 range, I guess, but I haven't tested it with the new version yet. Also, the splitting parameters are not tuned well, and the current max depth reached by all methods is <= 5. Anyway, the method is not finished yet and clearly I have some problems to solve.
I suppose you also have issues to resolve, such as the slowdown due to the HT. Have you compared it with the method which doesn't use the HT at fixed time? Also, after the disastrous results I started getting by using the same sequence of random numbers for calculating the mean, I wonder if your method is affected by that too?
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Defect of a node?

Post by Daniel Shawul »

hgm wrote:Well, like I said, I completely lost track of what you are doing. I was just curious what the errors you quote mean.
I admit my posts are hard to read since I post results as I get them; 100 mil simulations take me hours here. Anyway, I think I have mentioned somewhere that the "stddev" is the standard deviation of the sampling distribution (aka the standard error). And "Error" is just the error from the true mean (6.28549e16). The latter doesn't mean much, because sometimes I get surprisingly good estimates from methods with higher stddev. I have it there just to keep an eye on the effect on the mean estimate, which can get tricky sometimes. Mean and stddev are calculated by a "moving" average formula for each leaf node and then backed up by summation over the leaf nodes.
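A minimal sketch of what I mean by the "moving" average update (a Welford-style online update; illustrative only, not my actual code):

Code:

// Running mean/variance for one leaf node (Welford's online update).
// Illustrative sketch only; class and field names are made up.
final class NodeStats {
    private long n;       // number of simulations seen by this node
    private double mean;  // running mean of the estimates
    private double m2;    // running sum of squared deviations

    void update(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;         // the "moving" average step
        m2 += delta * (x - mean);  // uses the updated mean on purpose
    }

    double getMean()     { return mean; }
    double getVariance() { return n > 1 ? m2 / (n - 1) : 0.0; }
    // Standard error of this node's mean estimate:
    double getStdErr()   { return Math.sqrt(getVariance() / Math.max(1, n)); }
}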
I will try to summarize in a table all the results I have including the ones I abandoned now.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Defect of a node?

Post by Daniel Shawul »

I got the plot now, and it is surprisingly smooth. I guess that in your case the HT replacement is the cause of the hectic nature of the plot.
[Image: plot of the standard deviation over iterations]
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Defect of a node?

Post by Daniel Shawul »

Here is the all-in-one plot of the five free splitting versions I tested. All of them follow a 1/sqrt(N) trend, but you can see that the uniform splitter stands out from the rest, and it has a "bump" somewhere around 4 million iterations for some reason. The rest, except the log-tree-size formula, pretty much overlap. I guess log-tree-size is not appropriate, since it has a larger stddev in the earlier iterations, but it was giving a surprisingly low error from the actual mean for both the free splitting and limited splitting versions.
[Image: combined plot of the five free splitting versions]
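(The 1/sqrt(N) trend is the usual Monte Carlo rate: with $N$ independent simulations whose per-simulation standard deviation is $\sigma$, the standard error of the mean is $\sigma_N = \sigma/\sqrt{N}$, so each 100x increase in simulations buys one more decimal digit.)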
petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Defect of a node?

Post by petero2 »

Daniel Shawul wrote:
petero2 wrote: With my method, http://talkchess.com/forum/viewtopic.ph ... 25&t=39678, I get a standard deviation of 3.9826e+12 after 100e6 random walks. If I split the data in 100 chunks with 1e6 random walks in each chunk and compute the standard deviation for each chunk, I get: [plot omitted]
This shows that the weights are still improving after 100e6 random walks. If I could make the algorithm learn the optimum weights faster, my standard deviation would be even lower.
First of all, thanks for the data and plot. It helps to know I am on the right track. I will try to extract the standard deviation from the log file I have and make a plot. I suppose your method is the one which uses the hash tables for storing weights in the MC part of the tree alone. But I think that in your case too you don't use full-width search depth (starting from the root), so you can also get some of the benefits.
Since the algorithms used are different, I think that a time comparison would be more appropriate. But I understand that is difficult to do right now, and it is better to do the comparison with a fixed number of simulations. My method doesn't actually use all of the 100 mil simulations. If you look at the detailed data, it only used 72 mil simulations, with about 30% wasted. I may be able to avoid that wastage, because the only reason I did it was to avoid bias of the mean in the previous method. If that works, it should get the stddev down to the 3.9 range, I guess, but I haven't tested it with the new version yet. Also, the splitting parameters are not tuned well, and the current max depth reached by all methods is <= 5. Anyway, the method is not finished yet and clearly I have some problems to solve.
I suppose you also have issues to resolve, such as the slowdown due to the HT. Have you compared it with the method which doesn't use the HT at fixed time? Also, after the disastrous results I started getting by using the same sequence of random numbers for calculating the mean, I wonder if your method is affected by that too?
Actually I gave up on the hash table and now instead grow a tree from the root position. When I reach a leaf position, I first check if this is a good split candidate, and if it is, I split. When I reach a leaf node that I don't want to split, I do a random walk from there. Each tree node contains information which makes it possible to estimate "optimal weights", as described here http://talkchess.com/forum/viewtopic.ph ... 71&t=39678. When there is not enough information in the tree, I just use uniform sampling. This method has the advantage that it is unbiased even if I use the same walks to compute the sampling probabilities and to collect perft estimates.
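A sketch of why this stays unbiased even with adaptive weights (informal, my own argument): the weights used for walk $t$ depend only on walks $1, \dots, t-1$, so conditioned on the past each walk still satisfies $E[X_t \mid X_1, \dots, X_{t-1}] = \mathrm{perft}(d)$, and therefore the average of the $X_t$ has expectation $\mathrm{perft}(d)$ no matter how the weights evolve.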

Earlier I tried something which I think is similar to what you do, see http://talkchess.com/forum/viewtopic.ph ... 77&t=39678, but I also had the problem that my estimates were always far too low. I see you were able to overcome that problem by "wasting" 30% of the simulations. It would be interesting to see if that can be improved somehow.

How large do you allow your tree to grow? I assume your method gets better with a larger tree. I used 4 million tree nodes in my test.

I let my simulation run longer (600M walks) and the standard deviation over time looks like this:
[Image: standard deviation per 1M-walk chunk over 600M walks]
It looks like it stabilizes after 200M walks, with a standard deviation of around 3.5e13 for each 1M walks. It would of course be nice if I could somehow learn those weights much faster, but maybe it is not possible. After 100M walks there have only been, on average, 25 visits to each leaf node in the tree.
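(For scale, assuming the 1M-walk chunks are roughly independent, a per-chunk standard deviation of 3.5e13 corresponds to an overall standard deviation after 600 chunks of about $3.5\times10^{13}/\sqrt{600} \approx 1.4\times10^{12}$.)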

Regarding time based comparisons, I will optimize my Java implementation so that it at least doesn't compute Zobrist hash keys or update incremental evaluation terms. (I currently use unmodified code from my Java chess program.)
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Defect of a node?

Post by Daniel Shawul »

Actually I gave up on the hash table and now instead grow a tree from the root position. When I reach a leaf position, I first check if this is a good split candidate, and if it is, I split. When I reach a leaf node that I don't want to split, I do a random walk from there. Each tree node contains information which makes it possible to estimate "optimal weights", as described here http://talkchess.com/forum/viewtopic.ph ... 71&t=39678. When there is not enough information in the tree, I just use uniform sampling. This method has the advantage that it is unbiased even if I use the same walks to compute the sampling probabilities and to collect perft estimates.
Isn't what you describe exactly UCT? You first grow a tree from the root. Then you do a certain number of simulations, say 10, for each node (uniform sampling, that is) until you collect enough information. I do that too. Then you split based on the information you have, which is the key component of UCT again. I can only understand that you probably have a new formula to select which node to split. UCT has UCB by default, but I have already tested a dozen formulas.
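(For reference, the default UCB1 rule in UCT picks the child maximizing $\bar{x}_i + C\sqrt{\ln N / n_i}$, where $\bar{x}_i$ is the child's current estimate, $n_i$ its visit count, $N$ the parent's visit count and $C$ an exploration constant; for perft the exploitation term would rank children by estimated subtree size rather than win rate. This is the textbook formula, not necessarily the exact one discussed earlier in the thread.)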
You never split in your hash table implementations, but tried to optimize weights from the data you had already calculated and stored. What exactly do you do differently in the upper part of the tree?
Earlier I tried something which I think is similar to what you do, see http://talkchess.com/forum/viewtopic.ph ... 77&t=39678, but I also had the problem that my estimates were always far too low. I see you were able to overcome that problem by "wasting" 30% of the simulations. Would be interesting to see if that can be improved somehow.
Yes, this I remember. And you said you were going back to your dynamic weight method (the HT method) unless someone came up with the correct backup method.
My solution for it is not the wasted simulations but to use a different sequence of random numbers from the one used to compute the estimates. I don't think "wasting" any simulations is necessary now. Earlier it was biasing the results, but after my improvement of using two parallel perfts, I don't need it anymore. Like I said, I will be testing it next.
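A minimal sketch of what I mean by decoupling the two random sequences (illustrative only; all names here are made up, this is not my actual code): drive the weight learning with one generator and the estimate with an independent one, so the estimate never sees the randomness that shaped its own weights.

Code:

import java.util.Random;

// Sketch only: decouple weight learning from estimation by driving them
// with two independent random number streams. The walk methods are
// hypothetical stand-ins for the real simulation code.
abstract class TwoStreamPerft {
    private final Random learnRng = new Random(1);    // stream A: tunes the weights
    private final Random estimateRng = new Random(2); // stream B: produces the estimate

    // One walk that only updates the sampling weights (result discarded).
    abstract void walkAndUpdateWeights(Random rng);
    // One walk that returns a perft estimate using the current weights.
    abstract double walkAndScore(Random rng);

    double estimate(long walks) {
        double sum = 0;
        for (long i = 0; i < walks; i++) {
            walkAndUpdateWeights(learnRng);   // learning never sees stream B
            sum += walkAndScore(estimateRng); // estimate never sees stream A
        }
        return sum / walks;
    }
}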
Here is what I described as a UCT-like perft:
Here is the idea for this kickass algorithm. You can say it is UCT for perft, but I haven't actually implemented it, so it can be all for nothing if I made some mistake in planning somewhere.

Code:

a) Start from the root and do 10 Monte Carlo simulations for each move
b) Pick the best node to expand using the UCB formula I provided above
c) Expand the selected node, which will be one of the biggest nodes with the right formula
d) Do this recursively for the children of that big node
So this algorithm explores the bigger nodes more often, while the smaller nodes get their estimates from a small number of simulations. No need for heuristics, no need for fixed-depth expansion, unlike what we are doing now, i.e. a fixed ply 3-6 tree with uniform sampling.

If I have to write out my imagination, this algorithm will be so good as to be the best. But I could be wrong somewhere. Opinions?
How large do you allow your tree to grow? I assume your method gets better with a larger tree. I used 4 million tree nodes in my test.

I let my simulation run longer (600M walks) and the standard deviation over time looks like this:
It can grow to any depth. If I reduce the number of simulations before splitting, say from 1000 to 100, it will be more selective and go deeper for some of the moves.
petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Defect of a node?

Post by petero2 »

My algorithm is quite similar to UCT, but it is not exactly like UCT. Here is what I do (a rough code sketch follows the list):

1. Start with an in-memory tree with only one node representing the root position.

2. Do a random walk starting from the root, until a leaf node is reached. The probabilities controlling which child to visit are calculated from data in the child nodes. This probably causes some cache misses, but as compensation I store moves in the tree nodes too, so I don't have to call the move generator until I reach a leaf node.

3. When a leaf node is reached, I decide if it should be split. A node is split if there are fewer than 4M nodes in the tree, the node has been visited at least 20 times, the remaining depth is > 1, and the "cost" for this node is at least 90% of the largest cost for a leaf node seen during the last 1000 random walks. "cost" is defined as sqrt(mean^2 + variance).

4. If a leaf node is split, child nodes are created for all legal moves.

5. If a leaf node is not split, I continue with a uniform probability random walk until the "perft" depth is reached. No hashing or tree node lookup is performed in this step. Just move generation and make/unmake moves.

6. As the recursion unwinds, the mean and variance estimates are updated for all visited in-memory tree nodes.

7. A single perft estimate computed from the just completed random walk is logged. If 1000 such values have been accumulated, print the average to standard output.

8. Go back to step 2.
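In rough, illustrative Java (every name below is hypothetical; the real code is in the linked source file), one iteration of steps 2-8 looks like this:

Code:

// Rough sketch of one iteration of steps 2-8 above. Every name here is
// hypothetical; the real implementation is in the linked source file.
abstract class TreeWalkSketch {
    // Hypothetical tree API:
    abstract boolean isLeaf(Object node);
    abstract Object sampleChild(Object node);            // draw a child with its probability
    abstract double probOf(Object parent, Object child); // the probability used above
    abstract void expand(Object node);                   // create one child per legal move
    abstract long visits(Object node);
    abstract double mean(Object node);
    abstract double variance(Object node);
    abstract void propagateUp(Object node, double estimate); // step 6
    abstract double uniformWalk(Object node, int depthLeft);  // step 5, no hashing

    long treeSize;        // current number of nodes in the in-memory tree
    double recentMaxCost; // largest leaf cost seen in the last 1000 walks

    // One tree walk; the caller accumulates the returned estimates and
    // prints the average of every 1000 of them (step 7), then repeats (step 8).
    double oneWalk(Object root, int perftDepth) {
        Object node = root;
        double weight = 1.0;                        // product of 1/p along the path
        int depth = 0;
        while (!isLeaf(node)) {                     // step 2: walk down the tree
            Object child = sampleChild(node);
            weight /= probOf(node, child);
            node = child;
            depth++;
        }
        if (treeSize < 4_000_000                    // step 3: split decision
                && visits(node) >= 20
                && perftDepth - depth > 1
                && cost(node) >= 0.9 * recentMaxCost)
            expand(node);                           // step 4 (simplified: the real code
                                                    // may continue down the new children)
        double estimate = weight * uniformWalk(node, perftDepth - depth); // step 5
        propagateUp(node, estimate);                // step 6
        return estimate;
    }

    double cost(Object n) {                         // "cost" from step 3
        return Math.sqrt(mean(n) * mean(n) + variance(n));
    }
}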


I optimized my implementation a bit, and now it takes around 57 minutes on a 2.4 GHz Core2 CPU (using only one core) to generate 100e6 tree walks for perft(12). The code currently spends over 60% of the CPU time in a function that removes illegal moves from a pseudo-legal move list. I think this function can be optimized quite a lot though. Currently it just plays all moves and tests which moves leave the king in check.
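The current filter is essentially the following (a sketch of the approach; the Board interface is a hypothetical stand-in, not the CuckooChess API):

Code:

import java.util.ArrayList;
import java.util.List;

// Sketch of the naive legality filter: make each pseudo-legal move and keep
// the ones that do not leave the mover's king in check.
final class LegalFilter {
    interface Board {
        void make(int move);
        void unmake(int move);
        boolean moverKingAttacked(); // after make(): is the mover's king attacked?
    }

    static List<Integer> filterLegal(Board b, List<Integer> pseudoLegal) {
        List<Integer> legal = new ArrayList<>();
        for (int m : pseudoLegal) {
            b.make(m);
            if (!b.moverKingAttacked())
                legal.add(m);
            b.unmake(m);
        }
        return legal;
    }
}

A common optimization is to verify only the moves that can actually be illegal, i.e. king moves, en passant captures, and moves of pieces that are (potentially) pinned; everything else is legal by construction.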

The source code is here: http://web.comhem.se/petero2home/PerfTE ... 4.java.txt. If you actually want to run it, you also need BitBoard.java from CuckooChess.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Defect of a node?

Post by Daniel Shawul »

Well, what you describe below is exactly UCT with Michel's formula, AFAIAC. I had difficulty testing his formula because it lacked the "visits" count for a node. Your step (3) decides whether to split a node without comparing it to its sibling nodes. To achieve that you need more heuristics, since you do not compare with the other siblings to pick the best; you just decide if a node is worth splitting by itself.
The basic idea of UCT tree search over plain MC (without a tree) is to investigate (split) the nodes worth splitting, i.e. the bigger nodes for perft. When you do that you gain much of what UCT is meant for. The tree will dynamically grow towards the bigger nodes; in a tree search (for instance with LMR), it will grow towards the best looking move. There are many methods to control how aggressively the tree grows.
I will go through your steps one by one:

My algorithm is quite similar to UCT, but it is not exactly like UCT. Here is what I do:

1. Start with an in-memory tree with only one node representing the root position.
I start with the root node expanded (20 moves), but that makes no difference. For a UCT search you eventually have to return a move to play, so it is usual to start with a split root node.
2. Do a random walk starting from the root, until a leaf node is reached. The probabilities controlling which child to visit are calculated from data in the child nodes. This probably causes some cache misses, but as compensation I store moves in the tree nodes too, so I don't have to call the move generator until I reach a leaf node.
This is the Monte Carlo part. A light MC visits all children uniformly, while a heavy MC uses different heuristics, which usually make it slower but could be better overall.
BTW, "heavy" and "light" are not my terminology; they are accepted terms in the Go programming field.
3. When a leaf node is reached, I decide if it should be split. A node is split if there are fewer than 4M nodes in the tree, the node has been visited at least 20 times, the remaining depth is > 1, and the "cost" for this node is at least 90% of the largest cost for a leaf node seen during the last 1000 random walks. "cost" is defined as sqrt(mean^2 + variance).
This is the formula part that you do differently: you do not compare to the other siblings.
-> Your first criterion, 4M nodes, is there to not run out of memory. There are two solutions for this when used for tree search: (a) store the less visited parts of the tree on disk, or (b) prune parts of the tree. The latter is not applicable to perft, because we need exact estimates.
-> Your second criterion UCT has too. Mine is, I think, at 1000 right now. Lowering it should expand the tree more.
-> Your third criterion is Michel's formula, a heuristic that estimates the cost of a node as sqrt(mean^2 + variance).
4. If a leaf node is split, child nodes are created for all legal moves.
Same as UCT, and a major component, because this is where you decide to visit the bigger nodes. The tree will be guided towards the bigger nodes.
5. If a leaf node is not split, I continue with a uniform probability random walk until the "perft" depth is reached. No hashing or tree node lookup is performed in this step. Just move generation and make/unmake moves.
Same. A light MC search. If you start using your previous static heuristic for captures or the HT method, it will become a heavy MC.
6. As the recursion unwinds, the mean and variance estimates are updated for all visited in-memory tree nodes.
Same. I do exactly that, and it is easy for perft. But this back-propagation step is more difficult for tree search; Rémi Coulom had a complex backup operator before UCT was invented.
7. A single perft estimate computed from the just completed random walk is logged. If 1000 such values have been accumulated, print the average to standard output.

8. Go back to step 2.
I optimized my implementation a bit, and now it takes around 57 minutes on a 2.4 GHz Core2 CPU (using only one core) to generate 100e6 tree walks for perft(12). The code currently spends over 60% of the CPU time in a function that removes illegal moves from a pseudo-legal move list. I think this function can be optimized quite a lot though. Currently it just plays all moves and tests which moves leave the king in check.

The source code is here: http://web.comhem.se/petero2home/PerfTE ... 4.java.txt. If you actually want to run it, you also need BitBoard.java from CuckooChess.
I think a comparison by the number of simulations should be fine. I was under the impression that it was the HT version you tested, in which case time tests would have been more appropriate.