Anyone heard of MuZero?

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Damir
Posts: 2801
Joined: Mon Feb 11, 2008 3:53 pm
Location: Denmark
Full name: Damir Desevac

Anyone heard of MuZero?

Post by Damir »

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

1 Introduction

Planning algorithms based on lookahead search have achieved remarkable successes in artificial intelligence. Human world champions have been defeated in classic games such as checkers [34], chess [5], Go [38] and poker [3, 26], and planning algorithms have had real-world impact in applications from logistics [47] to chemical synthesis [37]. However, these planning algorithms all rely on knowledge of the environment's dynamics, such as the rules of the game or an accurate simulator, preventing their direct application to real-world domains like robotics, industrial control, or intelligent assistants.

Model-based reinforcement learning (RL) [42] aims to address this issue by first learning a model of the environment's dynamics, and then planning with respect to the learned model. Typically, these models have either focused on reconstructing the true environmental state [8, 16, 24], or the sequence of full observations [14, 20]. However, prior work [4, 14, 20] remains far from the state of the art in visually rich domains, such as Atari 2600 games [2]. Instead, the most successful methods are based on model-free RL [9, 21, 18] – i.e. they estimate the optimal policy and/or value function directly from interactions with the environment. However, model-free algorithms are in turn far from the state of the art in domains that require precise and sophisticated lookahead, such as chess and Go.

In this paper, we introduce MuZero, a new approach to model-based RL that achieves state-of-the-art performance in Atari 2600, a visually complex set of domains, while maintaining superhuman performance in precision planning tasks such as chess, shogi and Go. MuZero builds upon AlphaZero's [39] powerful search and search-based policy iteration algorithms, but incorporates a learned model into the training procedure. MuZero also extends AlphaZero to a broader set of environments including single agent domains and non-zero rewards at intermediate time-steps.

The main idea of the algorithm (summarized in Figure 1) is to predict those aspects of the future that are directly relevant for planning. The model receives the observation (e.g. an image of the Go board or the Atari screen) as an input and transforms it into a hidden state. The hidden state is then updated iteratively by a recurrent process that receives the previous hidden state and a hypothetical next action. At every one of these steps the model predicts the policy (e.g. the move to play), value function (e.g. the predicted winner), and immediate reward (e.g. the points scored by playing a move). The model is trained end-to-end, with the sole objective of accurately estimating these three important quantities, so as to match the improved estimates of policy and value generated by search as well as the observed reward. There is no direct constraint or requirement for the hidden state to capture all information necessary to reconstruct the original observation, drastically reducing the amount of information the model has to maintain and predict; nor is there any requirement for the hidden state to match the unknown, true state of the environment; nor any other constraints on the semantics of state. Instead, the hidden states are free to represent state in whatever way is relevant to predicting current and future values and policies. Intuitively, the agent can invent, internally, the rules or dynamics that lead to most accurate planning.
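
For concreteness, here is a minimal toy sketch (in Python) of the three learned functions the excerpt describes (representation, dynamics, prediction) and how they are unrolled over hypothetical actions. The class, the toy stand-in functions and the numbers are illustrative assumptions only, not DeepMind's networks or training code; only the roles of the three functions follow the paper.

# Minimal sketch of MuZero's three learned functions and a K-step unroll.
# Toy stand-ins only; in the real system each function is a deep network.
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[float, ...]   # hidden state: the paper imposes no semantics on it

@dataclass
class MuZeroModel:
    representation: Callable[[List[float]], State]             # h: observation -> s_0
    dynamics: Callable[[State, int], Tuple[State, float]]      # g: (s_k, a) -> (s_{k+1}, reward)
    prediction: Callable[[State], Tuple[List[float], float]]   # f: s_k -> (policy, value)

    def unroll(self, observation: List[float], actions: List[int]):
        """Encode an observation, then imagine `actions` with the learned model,
        predicting policy, value and reward at every hypothetical step."""
        s = self.representation(observation)
        outputs = []
        for a in actions:
            policy, value = self.prediction(s)
            s, reward = self.dynamics(s, a)
            outputs.append((policy, value, reward))
        return outputs

# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_representation(obs):
    return (sum(obs),)

def toy_dynamics(s, a):
    return ((s[0] + a,), float(a))

def toy_prediction(s):
    return ([0.5, 0.5], s[0])

model = MuZeroModel(toy_representation, toy_dynamics, toy_prediction)
print(model.unroll([1.0, 2.0], actions=[0, 1, 1]))

In the actual system the unrolled predictions are trained end-to-end against the search-improved policy, the observed rewards and the value targets, exactly as the paragraph above describes.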

* Assuming the "steps" they refer to are games, this is incredibly impressive, as MuZero was able to match AlphaZero in only about a million games, whereas AlphaZero took 44 million... Wow!

https://arxiv.org/pdf/1911.08265.pdf

PS: I am reposting this from a post by a user on the new mzchessforum. He calls himself ZamChess.
shrapnel
Posts: 1339
Joined: Fri Nov 02, 2012 9:43 am
Location: New Delhi, India

Re: Anyone heard of MuZero?

Post by shrapnel »

Stale News, Damir bro.
http://talkchess.com/forum3/viewtopic.php?f=2&t=72381
Yep, it's the final Nail in Stockfish's Coffin.
Stockfish... Rest In Peace (RIP)... (or is it Pieces?) :lol:
i7 5960X @ 4.1 Ghz, 64 GB G.Skill RipJaws RAM, Twin Asus ROG Strix OC 11 GB Geforce 2080 Tis
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Anyone heard of MuZero?

Post by Gian-Carlo Pascutto »

>Assuming the "steps" they refer to are games

They're training steps, each on a minibatch of 2048 positions. I don't see any mention of games - they run the processes asynchronously. But it seems they generated about 20 billion positions. So, probably quite a bit more than a million games.
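
As a rough sanity check (the average game length used below is an assumed figure for illustration only, not a number from the paper):

# Back-of-envelope only: positions_per_game is an assumed figure, not from the paper.
total_positions = 20e9        # "about 20 billion positions", as mentioned above
positions_per_game = 200      # assumed average game length (illustrative guess)
print(f"~{total_positions / positions_per_game:,.0f} games")  # roughly 100,000,000 games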
mclane
Posts: 18749
Joined: Thu Mar 09, 2006 6:40 pm
Location: US of Europe, germany
Full name: Thorsten Czub

Re: Anyone heard of MuZero?

Post by mclane »

Sounds interesting.
What seems like a fairy tale today may be reality tomorrow.
Here we have a fairy tale of the day after tomorrow....