hgm wrote:I never really studied MC-UCT search, but WinBoard already implements something very similar with -mcBookMode true:
It defines a relation between move performance (% score) and relative playing frequency (i.e. relative to other moves from the same position). Currently I use a 4th power for that (so that moves with a 1% worse performance will be chosen 4% less often). Move choice in the book is based on these desired playing frequencies, and will pick the move for which the actual number of plays lags behind most compared to the desired number of plays. If all moves are played as much as they should (which I defined as that the most under-played move is played more than N - 0.5*sqrt(N) where N is the number of times it deserved to be played), a book miss is generated, so the engine can freely explore a move of its choice (to be added to the book if it wasn't already there).
This is a probability matching variant in which the frequency of moves is based on a function of mean reward (4th power for your case). UCB is better because it also takes into consideration the uncertainties we have in the rewards. Also its total regret is known to converge with a rate established as a lower bound, O(log n), so it solves the MAB problem. Epsilon-greedy and softmax selection also achieve that but UCB has its success for Computer Go on its side.
One problem is that both algorithms are deterministic. So if one good move happens to be played much less than it should have, it will constantly get picked up until the distribution is established! This is bound to happen with a PGN of games that will certainly not have a distribution in accordance with the selection procedure. Once the distribution is established, in every game a different move will be picked so deterministic maybe Ok after this. I think we can turn at-least the probability matching variants into non-deterministic using roulette-wheel random selection, but UCB can't be converted in such a way. Note that your algorithm also suffers from this problem as well since it picks the lowest N - 0.5*sqrt(N).
My first impression about code required is wrong. I think most books, including scorpios, can be turned to use UCB quite easily. Disk based look up of children moves, that also handles transpositions well, is a nice way as it is so there is no need for tree data structure.
When used with a randomizing engine, or a group of equally strong deterministic engines, this will build an opening tree.
To prevent rash decisions for rejecting moves, based on insufficient statistics, I add 5 wins and 5 losses to each move before calculating its desired number of plays.
UCB visits all children once before it uses its formula, which is not a good idea especially for book selection, because the first thing it does is play non-book moves. So initializing all of them and using the UCB formula right-away is a good idea.