A Simple Alpha(Go) Zero Tutorial
Re: A Simple Alpha(Go) Zero Tutorial
If you are not training a policy network (P=1), then C=1 should be fine.
Re: A Simple Alpha(Go) Zero Tutorial
Daniel Shawul wrote: If you are not training a policy network (P=1), then C=1 should be fine.
And if you are training a policy network?
Re: A Simple Alpha(Go) Zero Tutorial
Well, P(s,a) depends on the branching factor of the particular game. Say in chess you have 20 moves on average; then a uniform P(s,a) would be 1/20, and your C would have to be adjusted accordingly for the same level of exploration. There is a theoretically optimal C=sqrt(2) for pure UCT (i.e. no biases for moves), which uses the sqrt(log(n)/n_i) formula instead of sqrt(n)/(1+n_i). A0 actually tunes C along with its other hyperparameters. I found that a very low exploration coefficient with minimax-style updates works better for chess, which is more tactical than Go.
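To make the scaling concrete, here is a minimal sketch of the two exploration terms mentioned above. The function names and the example visit counts are made up for illustration, and scaling C by the number of moves is just one reading of "adjusted accordingly", not something taken from the paper.

```python
import math

def puct_bonus(c, prior, parent_visits, child_visits):
    """AlphaZero-style exploration term: C * P(s,a) * sqrt(n) / (1 + n_i)."""
    return c * prior * math.sqrt(parent_visits) / (1 + child_visits)

def uct_bonus(c, parent_visits, child_visits):
    """Pure UCT exploration term: C * sqrt(log(n) / n_i); requires n_i >= 1."""
    return c * math.sqrt(math.log(parent_visits) / child_visits)

def c_for_uniform_prior(c_at_p1, num_moves):
    """Assumption: scale C so a uniform prior 1/num_moves gives the same bonus as P=1."""
    return c_at_p1 * num_moves

# Hypothetical visit counts: parent visited 800 times, this child 10 times.
n, n_i = 800, 10
bonus_p1 = puct_bonus(1.0, 1.0, n, n_i)
bonus_uniform = puct_bonus(c_for_uniform_prior(1.0, 20), 1.0 / 20, n, n_i)
bonus_uct = uct_bonus(math.sqrt(2), n, n_i)
```

With these numbers, C=20 with a uniform prior of 1/20 gives the same bonus as C=1 with P=1, which is the sense in which C has to track the branching factor.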
Re: A Simple Alpha(Go) Zero Tutorial
Henk wrote:
Daniel Shawul wrote: If you are not training a policy network (P=1), then C=1 should be fine.
And if you are training a policy network?
This number should also depend on the number of simulations you're running; you're only going to get asymptotic behavior with a significant number of simulations. To extract useful information from a very low number of simulations you might need to lower the exploration constant quite a bit. If your searches are extremely shallow they might as well be greedy; if you don't do this, the policy labels that come out will just look like uniform noise.
However, if you are using "some significant number" of simulations, just ignore everything I just said. In any event, monitor the policy labels (this is N(s, a)/N(s) if you're following the paper) and see if you're happy with them; that is the easiest way to tune this constant.
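A rough sketch of what monitoring the labels could look like, assuming you can read the root's child visit counts as a plain list. The entropy check is an extra suggested here as one way to spot "uniform noise"; it is not from the paper, and the function names and example counts are hypothetical.

```python
import math

def policy_labels(visit_counts):
    """Policy target N(s,a)/N(s) computed from the root's child visit counts."""
    total = sum(visit_counts)
    if total == 0:
        raise ValueError("no simulations recorded at this node")
    return [n / total for n in visit_counts]

def normalized_entropy(labels):
    """Entropy divided by log(number of moves): 1.0 is uniform, 0.0 is one-hot."""
    if len(labels) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in labels if p > 0)
    return h / math.log(len(labels))

# Hypothetical visit counts from two searches over four legal moves.
flat = policy_labels([5, 5, 5, 5])     # search spread evenly across moves
sharp = policy_labels([37, 1, 1, 1])   # search concentrated on one move
```

If the normalized entropy stays close to 1.0 across most positions, the labels carry little information, and lowering the exploration constant (or running more simulations) is worth trying.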