Article:"How Alpha Zero Sees/Wins"

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

supersharp77
Posts: 1242
Joined: Sat Jul 05, 2014 7:54 am
Location: Southwest USA

Article:"How Alpha Zero Sees/Wins"

Post by supersharp77 »

A new article discusses how AlphaZero thinks and calculates variations:

http://www.danamackenzie.com/blog/?p=5072


"So far I have looked at three games from the AlphaZero-Stockfish match: #5, #9, and #10 from the ten games provided in the arXiv preprint. All three are amazingly similar, and at the same time they are amazingly unlike almost any other game I’ve ever seen. In each case AlphaZero won by sacrificing a piece for compensation that didn’t fully emerge until at least 15 or 20 moves later."

"How does AlphaZero avoid the horizon effect? To evaluate a position, it simply plays hundreds of random games from that position. To you or me this may seem like a crazy idea, but actually it makes a certain amount of sense."

"Are you sure AlphaZero plays many random games out to the end? That is how Monte Carlo Tree Search Go bots used to work before AlphaGo, but I was under the impression AlphaZero doesn’t calculate all the way to the end of the game any more. Doesn’t it explore a range of plausible moves to a certain depth, and then evaluate the resulting position using its network?"

"Where the first AlphaGo indeed did play out the entire game – this is something that makes more sense in Go than in chess – the next instance of AlphaGo, namely AlphaGoZero, did not. Instead AlphaGoZero has the ability to judge positions (the value network) and does not need to play out the entire game. AlphaZero in turn is based on AlphaGoZero and also has a value network.

Besides the value network, the neural net has another output: the policy network, which decides which moves are most likely good. The “random playouts” AlphaZero uses are not completely random; they are based on the scores given by the policy network. Over a single playout game it does indeed play randomly: if the policy net says “this move is 1% good”, it might just randomly play that 1% move. But over the 80,000 moves it played, 79,000 of them will be the move a grandmaster (well, AlphaZero) would prefer. So instead of writing “If we see that White usually wins the position if it’s played by weaker players”, it actually is “White usually wins the position if it’s played by grandmasters”. "

"Thanks for the in-depth explanation! In particular I appreciate the explanation of the terms “value network” and “policy network,” which I didn’t fully understand from the AlphaZero and AlphaGo papers."

Is this correct? Thx AR :D :wink:
shrapnel
Posts: 1339
Joined: Fri Nov 02, 2012 9:43 am
Location: New Delhi, India

Re: Article:"How Alpha Zero Sees/Wins"

Post by shrapnel »

Great article.
It really explains things very well.
Indirectly it also indicates that the old Alpha-Beta Engines like Stockfish, Komodo and Houdini can never hope to match AlphaZero's approach to computer chess.
i7 5960X @ 4.1 Ghz, 64 GB G.Skill RipJaws RAM, Twin Asus ROG Strix OC 11 GB Geforce 2080 Tis
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Article:"How Alpha Zero Sees/Wins"

Post by Milos »

supersharp77 wrote:Over a single playout game it does indeed play randomly: if the policy net says “this move is 1% good”, it might just randomly play that 1% move. But over the 80,000 moves it played, 79,000 of them will be the move a grandmaster (well, AlphaZero) would prefer.
The other points are correct, but this one is entirely wrong.
It does not play the 1% move at random. It plays the 1% move once the more probable moves have been explored a sufficient number of times. The algorithm is completely deterministic.
And the 80,000 moves that are played are leaf expansions. Root moves are explored (played, if you like) in proportion to their respective probabilities given by the policy network.
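The deterministic behaviour described above follows from a PUCT-style selection rule of the kind used in the AlphaZero preprint. Here is a minimal sketch; the exploration constant and the toy numbers are illustrative, not DeepMind's actual values:

```python
import math

def select_child(children, c_puct=1.5):
    """PUCT selection: deterministically pick the child maximizing Q + U.
    Each child is a dict with prior p, visit count n, total value w."""
    total_n = sum(ch["n"] for ch in children)
    def score(ch):
        q = ch["w"] / ch["n"] if ch["n"] else 0.0           # mean backed-up value
        u = c_puct * ch["p"] * math.sqrt(total_n) / (1 + ch["n"])
        return q + u
    return max(children, key=score)

# Two root moves: one with a 99% prior, one with a 1% prior.
a = {"p": 0.99, "n": 0, "w": 0.0}   # the move the policy net prefers
b = {"p": 0.01, "n": 0, "w": 0.0}   # the "1% good" move
children = [a, b]
for _ in range(2000):
    ch = select_child(children)
    ch["n"] += 1
    ch["w"] += 0.5   # pretend every expansion backs up the same value
print(a["n"], b["n"])
```

Run long enough, the visit counts settle roughly in proportion to the priors: the 1% move does get explored, but only after the high-prior move has been visited many times, with no randomness involved.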
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Article:"How Alpha Zero Sees/Wins"

Post by syzygy »

supersharp77 wrote:"How does AlphaZero avoid the horizon effect? To evaluate a position, it simply plays hundreds of random games from that position. To you or me this may seem like a crazy idea, but actually it makes a certain amount of sense."
Wrong.
"Are you sure AlphaZero plays many random games out to the end? That is how Monte Carlo Tree Search Go bots used to work before AlphaGo, but I was under the impression AlphaZero doesn’t calculate all the way to the end of the game any more. Doesn’t it explore a range of plausible moves to a certain depth, and then evaluate the resulting position using its network?"
Correct.
"Where the first AlphaGo indeed did play out the entire game – this is something that makes more sense in Go than in chess – the next instance of AlphaGo, namely AlphaGoZero, did not. Instead AlphaGoZero has the ability to judge positions (the value network) and does not need to play out the entire game. AlphaZero in turn is based on AlphaGoZero and also has a value network.
Correct (but in my understanding AlphaGo also had a "value network", it just somehow combined the value network with random playouts).
Besides the value network, the neural net has another output: the policy network, which decides which moves are most likely good. The “random playouts” AlphaZero uses are not completely random; they are based on the scores given by the policy network.
This is correct for the selection of the "plausible moves to a certain depth". The "certain depth" is not fixed, though. AlphaZero builds a tree of "plausible moves", then keeps traversing that tree from the root node (= board position) until it reaches a leaf node. That leaf node is expanded with the help of the neural network.

In AlphaZero, the value network and the policy network are combined in a single neural network. AlphaGo had two separate networks.
Over a single playout game it does indeed play randomly: if the policy net says “this move is 1% good”, it might just randomly play that 1% move. But over the 80,000 moves it played, 79,000 of them will be the move a grandmaster (well, AlphaZero) would prefer.
The 80,000 nps number from the preprint refers to the number of tree expansions (= number of NN evaluations) per second. Each tree expansion is preceded by a series of move selections from the current position (root node of the tree) to a leaf node. Those move selections are not random in my understanding, although some random noise might be used there.

If I understand things correctly, the term "move probability" is completely misleading. (Just like "Monte Carlo Tree Search" is completely misleading if AlphaZero is described.)
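The search loop described in this post, descending the tree of plausible moves from the root to a leaf and expanding that leaf with a single network evaluation, might look roughly like the sketch below. `policy_value` and `select_child` are hypothetical stand-ins for the neural-network call and the selection rule, not DeepMind's API:

```python
# Sketch of one AlphaZero-style simulation: no random playouts to the
# end of the game; each simulation walks to a leaf and calls the
# (hypothetical) network function policy_value exactly once.

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(move) from the policy head
        self.visits = 0
        self.value_sum = 0.0      # sum of backed-up value-head scores
        self.children = {}        # move -> Node

def simulate(root, position, policy_value, select_child):
    node, path = root, [root]
    # 1. Selection: descend through already-expanded nodes.
    while node.children:
        move, node = select_child(node)
        position = position.play(move)
        path.append(node)
    # 2. Expansion + evaluation: one network call, no playout to the end.
    priors, value = policy_value(position)
    for move, p in priors.items():
        node.children[move] = Node(p)
    # 3. Backup: propagate the value-head score, flipping sides each ply.
    for n in reversed(path):
        n.visits += 1
        n.value_sum += value
        value = -value
```

Each call to `simulate` is one of the ~80,000 expansions per second mentioned in the preprint; the "depth" is simply however far the selection phase happens to descend before hitting an unexpanded node.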
Uri Blass
Posts: 10281
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Article:"How Alpha Zero Sees/Wins"

Post by Uri Blass »

shrapnel wrote:Great Article.
Really explains things very well.
Indirectly it also indicates that the old Alpha-Beta Engines like Stockfish, Komodo and Houdini can never hope to match AlphaZero's approach to computer chess.
I do not see it.

There has been no fair comparison between Stockfish and AlphaZero, and
I guess it may be possible to improve Stockfish by adding some element of playing games against itself, without giving up alpha-beta.

You would only need to change the static evaluation: the static evaluation of a leaf position could be some combination of Stockfish's static evaluation and the results of games that Stockfish plays against itself. Those games would not be exactly random games, since good moves would have a bigger probability of being chosen.
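As a toy sketch of this blended leaf evaluation: `static_eval` and `quick_game` below are hypothetical stand-ins for Stockfish's evaluation and a short, policy-biased self-play game, and the 50/50 weight is arbitrary, not taken from any engine:

```python
def blended_eval(position, static_eval, quick_game, n_games=8, weight=0.5):
    """Toy blend of a hand-crafted static evaluation with the average
    result of short self-play games started from the position.
    static_eval and quick_game are hypothetical stand-ins; the default
    50/50 weight is arbitrary."""
    avg = sum(quick_game(position) for _ in range(n_games)) / n_games
    return (1 - weight) * static_eval(position) + weight * avg
```

With `weight=0.0` this degenerates to the ordinary static evaluation, so the self-play component could be phased in gradually and tested against the baseline at the longer time controls Uri suggests.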


I believe this idea should be productive when the time control is long enough, but I also suspect it is not good enough to pass the testing framework, which tests only at bullet time controls.
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Article:"How Alpha Zero Sees/Wins"

Post by corres »

[quote="supersharp77"]

A new article discusses how AlphaZero thinks and calculates variations:

[/quote]

I am waiting for a really technical, scientific description of A0 from the DeepMind team.
The air is full of speculation.
I think it is high time for this....