A Simple Alpha(Go) Zero Tutorial


Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: A Simple Alpha(Go) Zero Tutorial

Post by Daniel Shawul »

If you are not training a policy network (P=1), then C=1 should be fine.
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: A Simple Alpha(Go) Zero Tutorial

Post by Henk »

Daniel Shawul wrote:If you are not training a policy network (P=1), then C=1 should be fine.
And if you are training a policy network?
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: A Simple Alpha(Go) Zero Tutorial

Post by Daniel Shawul »

Well, P(s,a) depends on the branching factor of the particular game. Say in chess you have 20 moves on average; then a uniform P(s,a) would be 1/20, and your C would have to be adjusted accordingly for the same level of exploration. There is a theoretically optimal C=sqrt(2) for pure UCT (i.e. no biases for moves), which uses the sqrt(log(n)/n_i) formula instead of sqrt(n)/(1+n_i). A0 actually tunes C along with its other hyperparameters. I found that a very low exploration coefficient with minimax-style updates works better for chess, which is more tactical than Go.
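
To make the scaling concrete, here is a rough sketch of the two exploration terms mentioned above. The function names and the sample visit counts are made up for illustration, and rescaling C by the branching factor is just one reading of "adjusted accordingly" for a uniform prior, not a tuned recommendation.

```python
import math

def uct_bonus(c, n_parent, n_child):
    """Pure UCT exploration term: c * sqrt(log(n) / n_i). Needs n_child > 0."""
    return c * math.sqrt(math.log(n_parent) / n_child)

def puct_bonus(c, prior, n_parent, n_child):
    """A0-style exploration term: c * P(s,a) * sqrt(n) / (1 + n_i)."""
    return c * prior * math.sqrt(n_parent) / (1 + n_child)

# Hypothetical node: parent visited 100 times, this child visited 4 times.
n_parent, n_child = 100, 4
branching = 20  # the "20 moves on average" assumption from the post

# No policy network: P = 1 and C = 1.
flat = puct_bonus(1.0, 1.0, n_parent, n_child)

# Uniform prior P = 1/20: C must grow by the same factor of 20
# to give the same exploration bonus as the P = 1 case.
uniform = puct_bonus(1.0 * branching, 1.0 / branching, n_parent, n_child)

# Pure UCT with C = sqrt(2), for comparison of the two formulas.
pure_uct = uct_bonus(math.sqrt(2), n_parent, n_child)

print(flat, uniform, pure_uct)
```

With a real (non-uniform) policy the prior differs per move, so there is no single rescaling factor and C ends up being tuned like any other hyperparameter.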
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: A Simple Alpha(Go) Zero Tutorial

Post by trulses »

Henk wrote:
Daniel Shawul wrote:If you are not training a policy network (P=1), then C=1 should be fine.
And if you are training a policy network?
This number should also depend on the number of simulations you're running: you're only going to get asymptotic behavior with a significant number of simulations. To extract useful information from a very low number of simulations you might need to lower the exploration constant quite a bit. If your searches are extremely shallow they might as well be greedy; if you don't do this, the policy labels that come out will just look like uniform noise.

However, if you are using "some significant number" of simulations, just ignore everything I just said. In any event, monitor the policy labels (this is N(s, a)/N(s) if you're following the paper) and see if you're happy with them; that is the easiest way to tune this constant.
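
A minimal sketch of that monitoring step, assuming you can read the root visit counts out of your search as a plain list. The helper names and the sample counts are hypothetical, and the normalized entropy is only one possible way to spot labels that look like uniform noise.

```python
import math

def policy_labels(visit_counts):
    """Policy target N(s, a) / N(s) built from the root visit counts."""
    total = sum(visit_counts)
    if total == 0:
        raise ValueError("no simulations were run")
    return [n / total for n in visit_counts]

def normalized_entropy(probs):
    """Entropy divided by log(number of moves).

    About 1.0 means the labels are uniform; 0.0 means all visits
    went to a single move.
    """
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

# Hypothetical visit counts from two searches over the same four moves.
too_exploratory = policy_labels([5, 5, 5, 5])   # looks like uniform noise
more_greedy = policy_labels([1, 2, 3, 14])      # one move clearly preferred

print(normalized_entropy(too_exploratory))
print(normalized_entropy(more_greedy))
```

If the first number stays near 1.0 across many positions at your simulation budget, that is a hint to try a lower exploration constant.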