AlphaGo Zero And AlphaZero, RomiChess done better

Michael Sherwin · Post by **Michael Sherwin** » Thu Dec 07, 2017 12:45 pm

In January of 2006, IIRC (not exactly sure), I released RomiChess ver P2a. The new version had learning. It had two types of learning: monkey see, monkey do, and learning adapted from Pavlov's dog experiments. I did not know it at the time, but the second type of learning was called reinforcement learning. I just found out very recently that reinforcement learning was invented for robotics control in 1957, the year that I was born. Strange. Anyway, as far as I know I reinvented it and was the first to put reinforcement learning into a chess program.

The reason I'm apparently patting myself on the back is rather to let people know that I recognise certain aspects of this AlphaZero phenom. For example, using Glaurung 2.x as a test opponent, Romi played 20 matches against Glaurung using the ten Nunn positions. On pass one Romi scored 5% against Glaurung. On the 20th pass Romi scored 95%. That is how powerful the learning is! The moves that Romi learned to beat Glaurung were very distinctive looking. They are learned moves, so they are not determined by a natural chess-playing evaluation but rather by an evaluation tweaked by learned rewards and penalties. Looking at the games between AlphaZero and Stockfish I see the same kind of learned moves.

In RomiChess one can start with a new learn.dat file, put millionbase.pgn in the same directory as Romi, type merge millionbase.pgn, and Romi will learn from all those games.

When reading about AlphaZero there is mostly made-up reporting. That is what reporters do. They take one or two known facts and make up a many-page article that is mostly bunk. The AlphaZero team has released very little actual info. They released that it uses reinforcement learning and that a database of games was loaded in. Beyond that not much is known. But looking at the games against Stockfish it looks as though AlphaZero either trained against Stockfish before the recorded match or entered a pgn of Stockfish games. Stockfish does have some type of randomness to its moves, so it can't be totally dominated like Romi dominated Glaurung, which had no randomness. So basically take an engine about as strong as Stockfish and give it reinforcement learning and the result is exactly as expected!

abulmo2 · Post by **abulmo2** » Thu Dec 07, 2017 1:16 pm

Michael Sherwin wrote:But looking at the games against Stockfish it looks as though AlphaZero either trained against Stockfish before the recorded match or entered a pgn of Stockfish games.

The article does not say so. AlphaZero used a "tabula rasa reinforcement learning algorithm. [...] AlphaZero learns these move probabilities and value estimates entirely from self-play; these are then used to guide its search. [...] The parameters θ of the deep neural network in AlphaZero are trained by self-play reinforcement learning, starting from randomly initialised parameters θ"

Michael Sherwin · Post by **Michael Sherwin** » Thu Dec 07, 2017 1:29 pm

They can say or omit anything. One article definitely stated that a database of human games was loaded. I don't know which article is correct. I just know from years of experience with that type of learning that the moves in the games by AlphaZero against Stockfish have the same type and feel. Take what I said with a grain of salt, but take what is said in those articles with a grain of salt as well.

Edit: Think about it in these terms. It took 4 hours for AlphaZero to amass all human knowledge about the game of chess. At one minute per move maybe 4 games of self play could have been accomplished. However, if 6 million games were loaded in and analysed then 4 hours is about the time it would take. Self play would take years to get to that level. 'Their' story does not add up!

kranium · Post by **kranium** » Thu Dec 07, 2017 3:26 pm

Michael Sherwin wrote:They can say or omit anything. One article definitely stated that a database of human games was loaded. I don't know which article is correct. I just know from years of experience with that type of learning that the moves in the games by AlphaZero against Stockfish have the same type and feel. Take what I said with a grain of salt, but take what is said in those articles with a grain of salt as well.

Edit: Think about it in these terms. It took 4 hours for AlphaZero to amass all human knowledge about the game of chess. At one minute per move maybe 4 games of self play could have been accomplished. However, if 6 million games were loaded in and analysed then 4 hours is about the time it would take. Self play would take years to get to that level. 'Their' story does not add up!

No human games were loaded. Learning was accomplished through millions of self-play games.
The Monte Carlo search algorithm simply chose the move in each position with the highest win probability.

                  Chess        Shogi        Go
Mini-batches      700k         700k         700k
Training Time     9h           12h          34h
Training Games    44 million   24 million   21 million
Thinking Time     800 sims     800 sims     800 sims
                  40 ms        80 ms        200 ms

Table S3: Selected statistics of AlphaZero training in Chess, Shogi and Go

They used 5,000 first-generation TPUs to generate self-play games
and 64 second-generation TPUs to train the neural networks.

They ended up with 44 million training games.
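
As a rough sanity check on these figures, the sketch below works out the self-play throughput they imply for chess. It assumes (this is not stated in the thread) that all 5,000 TPUs generated games for the full 9 hours and that each TPU played one game at a time; `implied_throughput` is a hypothetical helper name.

```python
def implied_throughput(games, hours, devices, seconds_per_move):
    """Back-of-envelope self-play throughput from the quoted Table S3 figures.

    Assumes every device is busy for the whole run and plays one game at a
    time; both are simplifying assumptions, not facts from the paper.
    """
    device_seconds = hours * 3600 * devices
    seconds_per_game = device_seconds / games
    moves_per_game = seconds_per_game / seconds_per_move
    return {
        "games_per_device": games / devices,
        "seconds_per_game": seconds_per_game,
        "moves_per_game_at_quoted_think_time": moves_per_game,
    }


chess = implied_throughput(
    games=44_000_000, hours=9, devices=5_000, seconds_per_move=0.040
)
for name, value in chess.items():
    print(f"{name}: {value:.1f}")
```

Under those assumptions each device finishes a game in a few seconds, which is a very different regime from one minute per move.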

kranium · Post by **kranium** » Thu Dec 07, 2017 3:34 pm

There's a version of AlphaZero available here:
https://github.com/Zeta36/chess-alpha-zero

for anyone interested

schack · Post by **schack** » Thu Dec 07, 2017 3:40 pm

It's a completely different project that uses some of the techniques from Alpha Zero. Not the same thing.

corres · Post by **corres** » Thu Dec 07, 2017 3:58 pm

kranium wrote:There's a version of AlphaZero available here:
https://github.com/Zeta36/chess-alpha-zero
for anyone interested

Please read the paragraph about the "New Supervised Learning Pipeline" in "readme.md".
You can read there that human games were used before starting the learning process, in the case of Zeta36 and AlphaZero too.
"...maybe chess is too complicated for a self training alone.."

kranium · Post by **kranium** » Thu Dec 07, 2017 4:00 pm

schack wrote:It's a completely different project that uses some of the techniques from Alpha Zero. Not the same thing.

yes that's why I said 'a version'

Steve Maughan · Post by **Steve Maughan** » Thu Dec 07, 2017 4:43 pm

I remember the experiments at the time. Could you briefly explain what you did? From memory I recall you did the following:

At the end of the game you parsed the list of moves and adjusted the score up or down a certain number of centipawns based on the outcome. You then hashed each position and stored it in a learning file. I assume this is then loaded into the hash table at the start of each game. Is this broadly correct?

Thanks,

Steve
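
The scheme Steve recalls can be sketched roughly as follows. This is a minimal illustration of outcome-based score adjustment, not RomiChess's actual code: the `LearnTable` class, the 5-centipawn step, and the use of plain string keys in place of a real position hash are all assumptions made for the example.

```python
class LearnTable:
    """Toy learning file: maps a position key to a learned centipawn adjustment."""

    def __init__(self, step_cp=5):
        self.step_cp = step_cp   # hypothetical adjustment applied per game
        self.bonus = {}          # position key -> learned centipawns

    def update(self, positions, result):
        """Adjust every position from one finished game.

        positions: list of (position_key, side_to_move), side is "w" or "b"
        result: +1 if White won, -1 if Black won, 0 for a draw
        """
        for key, side in positions:
            # Score is kept from the side to move's point of view:
            # reward the eventual winner's positions, penalise the loser's.
            sign = result if side == "w" else -result
            self.bonus[key] = self.bonus.get(key, 0) + sign * self.step_cp

    def adjusted_eval(self, key, static_eval_cp):
        """Static evaluation tweaked by whatever was learned for this position."""
        return static_eval_cp + self.bonus.get(key, 0)


table = LearnTable()
game = [("start", "w"), ("after_e4", "b"), ("after_e4_e5", "w")]
table.update(game, result=+1)   # White won this game
table.update(game, result=+1)   # and won again along the same line
print(table.adjusted_eval("after_e4_e5", 20))   # 20 + 2 * 5
print(table.adjusted_eval("after_e4", 0))       # 0 - 2 * 5
```

Loading such a table into the hash table at the start of a game, as Steve suggests, would then bias the search toward lines that have won before.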

Milos · Post by **Milos** » Thu Dec 07, 2017 5:50 pm

kranium wrote:No human games were loaded. Learning was accomplished through millions of self-play games.
The Monte Carlo search algorithm simply chose the move in each position with the highest win probability.

How do you explain these paragraphs from the paper:

"Training proceeded for 700,000 steps (mini-batches of size 4,096) starting from randomly initialised parameters"

"We represent the policy π(a|s) by a 8 × 8 × 73 stack of planes encoding a probability distribution over 4,672 possible moves. Each of the 8×8 positions identifies the square from which to “pick up” a piece."

"The number of games, positions, and thinking time varied per game due largely to different board sizes and game lengths, and are shown in Table S3."

So in the self-play games, the positions used for training are taken from the games at random (since the position is part of the set of training parameters). So what about the starting positions of those 44 million training games? You think they were all random, or all the initial starting position, with no chess knowledge in them????
Give me a break. You think those people at Google are so stupid as to train their network in such a lousy way, instead of sorting those 100'000 openings from the same chessbase they quote in the paper by probability of occurrence and using those statistics as starting positions for the self-play games?
Of course, in Table 2 they nicely show just percentages, not actual numbers, so you can't judge how many training games in total were from the starting position, because someone could be smart and sum up all those games from Table 2 and figure out that the number doesn't match 44 million...

Btw. 700'000 training iterations times 800 MCTS simulations is already 56 million, not 44, so where did 12 million games disappear?
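
The figures quoted in this post can be checked with a few lines of arithmetic. The sketch below only multiplies the numbers as quoted; the variable names are illustrative, and it makes no claim about how the paper's training pipeline relates these quantities to each other.

```python
# Figures as quoted from the paper in the post above.
training_steps = 700_000       # mini-batch steps
batch_size = 4_096             # positions per mini-batch
sims_per_move = 800            # MCTS simulations per move
training_games = 44_000_000    # chess games listed in Table S3

# Policy output: an 8 x 8 x 73 stack of planes.
policy_size = 8 * 8 * 73
print(policy_size)                       # 4672, matching the quoted 4,672

# Positions sampled over the whole training run.
positions_sampled = training_steps * batch_size
print(f"{positions_sampled:,}")          # 2,867,200,000

# Training steps multiplied by simulations per move.
steps_times_sims = training_steps * sims_per_move
print(f"{steps_times_sims:,}")           # 560,000,000

# Sampled positions per training game, if samples were spread evenly.
print(round(positions_sampled / training_games, 1))
```

Note that 700,000 × 800 comes to 560 million rather than 56 million, and the quoted passages do not say that this product corresponds to a number of games.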
