CheckersGuy wrote:
As for tuning the "regular" evaluation parameters, I don't quite understand what you mean by "using mcts to train eval parameters".

syzygy wrote:
Run random simulations to get some sort of "averaged" estimation of the evaluation, the idea being that the tactics of the position are washed out sufficiently by the nature of mcts. Then tune the eval weights to better predict that washed-out, tactic-free estimation.

Milos wrote:
Alpha0 NN has only tactics of depth 8 encoded in weights and it works.

syzygy wrote:
I think this is more or less what Stéphane is now working on.

The article on Wikipedia actually describes this fairly well: https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
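Syzygy's idea above can be sketched in a few lines: average many random playouts to get a "washed-out", tactic-free target value, then nudge linear eval weights so the static eval predicts that target. Everything here is a toy illustration, not any engine's actual API; `playout`, the feature vector, and the learning rate are all assumptions.

```python
import random  # a real playout function would use this for random move choice

def rollout_value(position, playout, n=100):
    """Average the outcomes of n random playouts from `position`.

    `playout` is a hypothetical callable: it plays random moves from the
    position to the end and returns the game result as a number.
    """
    return sum(playout(position) for _ in range(n)) / n

def tune_step(weights, features, target, lr=0.001):
    """One gradient step pulling the linear eval w.f toward the rollout target."""
    prediction = sum(w * f for w, f in zip(weights, features))
    error = target - prediction
    # Move each weight in the direction that reduces the prediction error.
    return [w + lr * error * f for w, f in zip(weights, features)]
```

Repeating `tune_step` over many positions, each labelled by `rollout_value`, is the "tune the eval weights to better predict that washed-out estimation" step.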
I don't really expect it to work, but then I wouldn't have expected AlphaZero Chess to work, either.
MCTS should be used for playing, not training.
Here is what could actually work.
Instead of alpha-beta, use the same MCTS as in Alpha0.
When you reach a leaf node, instead of only calling the eval, perform an optimized MultiPV depth-8 search + QS (over the moves that wouldn't be cut by LMR) and return the score. Transform the best move's score into v, and the ratio of the move scores into a probability for each move to be played, all using some linear function.
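The score-to-(v, policy) mapping is only described as "some linear function" above, so the concrete choice here is an assumption: a sigmoid on the best centipawn score for the value and a softmax over the MultiPV scores for the move probabilities, with made-up scale constants.

```python
import math

def value_from_cp(best_cp, scale=400.0):
    """Map the best move's centipawn score to v in (-1, 1).

    Uses the familiar logistic pawn-to-winning-chances curve; the scale of
    400 cp is an assumption, not anything from Alpha0 or SF.
    """
    return 2.0 / (1.0 + 10.0 ** (-best_cp / scale)) - 1.0

def policy_from_cp(move_cps, temp=100.0):
    """Turn a list of MultiPV centipawn scores into move probabilities.

    Softmax with an assumed temperature: better-scoring moves get higher
    probability, and the probabilities sum to 1.
    """
    exps = [math.exp(cp / temp) for cp in move_cps]
    total = sum(exps)
    return [e / total for e in exps]
```

With these, the MultiPV+QS probe plays exactly the role of the NN in Alpha0's MCTS: one call returns both a value and a move prior for the leaf.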
If you could, for example, perform this MultiPV depth-8 search + QS in 10 ms, then on a 64-core machine you'd have 6,400 MCTS evaluations per second. That is still quite shy of the 80k performed by Alpha0, but with a more powerful machine you'd get there.
And despite all the hype about Alpha0's NN eval, I really doubt it is better than a depth-8 MultiPV search + QS of SF.
If I now understand the AlphaZero MCTS correctly, it is actually a sort of book-builder search. Build a tree by expanding leaf nodes and back up the results to the root, using some rule to select the next leaf node to be expanded.
If this works well with SF's alpha-beta search and evaluation, that would be very funny.
Basically, MCTS consists of four steps:
1. Selection
2. Expansion
3. Simulation
4. Backpropagation (backing the result up to the root, as described above)
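The selection/expansion/simulation steps listed above, plus the backing-up of results to the root that syzygy describes, fit in one loop. This is a toy sketch on an abstract game interface — `legal_moves`, `play`, `is_terminal`, and `result` are assumed methods, not from any real engine — with UCT selection and a random playout as the simulation.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.wins = 0.0

def uct(child, parent_visits, c=1.4):
    """Classic UCT: exploitation term plus exploration bonus."""
    if child.visits == 0:
        return float("inf")   # always try unvisited children first
    return (child.wins / child.visits
            + c * math.sqrt(math.log(parent_visits) / child.visits))

def mcts_iteration(root):
    # 1. Selection: descend by UCT until we reach a leaf of the tree.
    node = root
    while node.children and not node.state.is_terminal():
        node = max(node.children, key=lambda ch: uct(ch, node.visits))
    # 2. Expansion: add all children of a non-terminal leaf, pick one.
    if not node.state.is_terminal():
        for mv in node.state.legal_moves():
            node.children.append(Node(node.state.play(mv), parent=node))
        node = random.choice(node.children)
    # 3. Simulation: random playout from the chosen node to the end.
    state = node.state
    while not state.is_terminal():
        state = state.play(random.choice(state.legal_moves()))
    value = state.result()
    # 4. Backpropagation: carry the playout result up to the root.
    while node is not None:
        node.visits += 1
        node.wins += value
        node = node.parent
```

Milos's variant above replaces step 3 with the MultiPV depth-8 search + QS probe, so the playout is swapped for a shallow alpha-beta evaluation while the rest of the loop stays the same.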