AlphaGo Zero

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

asanjuan
Posts: 214
Joined: Thu Sep 01, 2011 5:38 pm
Location: Seville, Spain

AlphaGo Zero

Post by asanjuan »

Looks like a milestone in AI.

https://deepmind.com/blog/alphago-zero- ... g-scratch/

They say:
Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed human level of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0.
Still learning how to play chess...
knights move in "L" shape ¿right?
asanjuan
Posts: 214
Joined: Thu Sep 01, 2011 5:38 pm
Location: Seville, Spain

Re: AlphaGo Zero

Post by asanjuan »

Sorry, this was already posted in another thread yesterday:

http://www.talkchess.com/forum/viewtopic.php?t=65481
Still learning how to play chess...
knights move in "L" shape ¿right?
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: AlphaGo Zero

Post by Michael Sherwin »

The learning in AlphaGo Zero is said to be reinforcement learning. That is one of the two types of learning that are in RomiChess (see the chess programming wiki entry for RomiChess). The learning in RomiChess was inspired by the reinforcement learning experiments Pavlov conducted on dogs: a reward for correct actions and a penalty for incorrect actions. Adapting it for computer chess meant giving the moves of the winning side a slight reward and the moves of the losing side a slight penalty, and remembering the moves with their accumulated rewards and penalties in a 'game' tree stored on disk. It works very well!
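
A minimal sketch of that end-of-game update, with made-up names (LearnNode, REWARD, PENALTY) rather than Romi's actual code:

Code:
// Sketch only: accumulate a small reward/penalty on every move of a
// finished game, keyed by the move path, and persist the tree to disk.
#include <map>
#include <string>
#include <vector>

struct LearnNode {
    int wins = 0, losses = 0, draws = 0;
    int score = 0;               // accumulated reward/penalty
};

// One node per move path; in a real engine this lives in a file on disk.
std::map<std::string, LearnNode> learn_tree;

void update_after_game(const std::vector<std::string>& moves,
                       int result)  // +1 white won, 0 draw, -1 black won
{
    const int REWARD = 3, PENALTY = 3;   // the "slight" adjustments
    std::string path;
    for (size_t i = 0; i < moves.size(); ++i) {
        path += moves[i] + " ";
        LearnNode& node = learn_tree[path];
        const bool white_moved = (i % 2 == 0);
        if (result == 0)                      { node.draws++; }
        else if ((result > 0) == white_moved) { node.wins++;   node.score += REWARD;  }
        else                                  { node.losses++; node.score -= PENALTY; }
    }
    // ...write learn_tree back to disk here so the next game can reuse it.
}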

If that type of reinforcement learning existed before it was released in RomiChess in early 2006, I have never heard of it. I read plenty of books on AI before 2006 and never read about it. Can anyone demonstrate that reinforcement learning in AI existed before 2006? If so, is it similar to what is done in RomiChess or dissimilar?

Maybe I should just be a good little communist (I'm not) and not worry about getting my fair share of credit. At least the chess programming wiki recorded my contribution! :D I had hoped to see my name in an AI book in the chapter on learning before I die but that is not looking very likely. :lol:
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: AlphaGo Zero

Post by Michael Sherwin »

Okay, I looked again and found MDPs (Markov decision processes) first mentioned in 1957, the year that I was born, hmm. They mention a reward given after a very complicated formulation is worked through.

There is no mention that the idea originated with Pavlov's dog experiments, so whether it is similar or dissimilar to RomiChess I cannot tell. However, I guess it is similar in the fact that it is also called reinforcement learning. As far as I can tell it was first developed for robots, and in that respect it is not very similar to what is done in RomiChess. AlphaGo Zero sounds a lot like RomiChess, though. Regardless, what I do in RomiChess is extremely simple compared to what is done in MDPs.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
PK
Posts: 893
Joined: Mon Jan 15, 2007 11:23 am
Location: Warsza

Re: AlphaGo Zero

Post by PK »

As far as I understand, Romi learns things about good and bad moves relevant to the one game it is playing at the moment, using history tables to modify piece/square tables. It is nice, as it resembles human thinking in a way, binding search and evaluation together. But can you transfer this kind of knowledge to the next game?

I see this idea as at once brilliant and limited. In more general terms, I think that with a couple of exceptions, we are hopelessly outdated on this forum, or at least we are becoming outdated really quickly.

Take this example: Texel tuning is great (BTW, Gaviota used something similar a bit earlier), but as far as I understand it, having no mathematical background, it is equivalent to setting the weights of a single neuron (that is, the entire evaluation function is treated as one unit and trained in the same manner).
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: AlphaGo Zero

Post by AlvaroBegue »

PK wrote:[...]

Take this example: Texel tuning is great (BTW, Gaviota used something similar a bit earlier), but as far as I understand it, having no mathematical background, it is equivalent to setting the weights of a single neuron (that is, the entire evaluation function is treated as one unit and trained in the same manner).
Yes. It's also called logistic regression, and it dates back to 1958.
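
In code, the connection is easy to see: the evaluation is a weighted sum of features (one "neuron"), the sum is squashed into a win probability with a sigmoid, and the weights are adjusted to reduce the error against actual game results. A toy sketch follows; the feature layout, scaling constant K and learning rate are illustrative only, and a gradient step is shown for brevity where the original Texel description used a local search over integer parameters:

Code:
// Sketch of Texel-style tuning viewed as logistic regression.
#include <cmath>
#include <vector>

struct Sample {
    std::vector<double> features;  // e.g. material counts, PST terms, ...
    double result;                 // 1.0 win, 0.5 draw, 0.0 loss (white's view)
};

// The evaluation: a dot product of weights and features -- one neuron.
double evaluate(const std::vector<double>& w, const Sample& s) {
    double e = 0.0;
    for (size_t i = 0; i < w.size(); ++i) e += w[i] * s.features[i];
    return e;
}

// Squash a centipawn score into a predicted win probability.
double win_prob(double eval, double K = 1.13) {
    return 1.0 / (1.0 + std::pow(10.0, -K * eval / 400.0));
}

// One pass of gradient descent on the squared error between the
// predicted win probability and the game result (constant factors
// from the sigmoid's derivative are folded into the learning rate).
void tune(std::vector<double>& w, const std::vector<Sample>& data,
          double lr = 1e-4) {
    for (const Sample& s : data) {
        double p   = win_prob(evaluate(w, s));
        double err = p - s.result;
        for (size_t i = 0; i < w.size(); ++i)
            w[i] -= lr * err * p * (1.0 - p) * s.features[i];
    }
}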
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: AlphaGo Zero

Post by Michael Sherwin »

What you mentioned about the history tables and the eval is, I guess, a type of learning on the fly, and it does help a small amount. However, RomiChess has had end-of-game learning since January 2006. In computer vs computer games the GUI sends a result command to the engine: win, loss or draw. On receiving that command Romi overlays the moves played onto a tree structure on disk and updates each move in the tree with the win, loss or draw as well as a value for the accumulated reward or penalty. When a new game is played Romi follows along with the moves stored in the tree (like an opening book). Before each move Romi first loads the reward/penalty values from the subtree into the hash. That affects which move Romi will play. It has been at least ten years since I looked at that code, so pardon me if I'm not being as precise as I should be. I guess that I should review Romi's learning code.

In computer vs human games ten years ago, Arena would not send a result command to the engine; however, WinBoard did. Romi's end-of-game learning works like this: if the human player loses, Romi will play the exact same moves back, up to 180 ply, until the human varies and wins. Then the reward/penalty will kick in and cause Romi to vary its play. From the opposite side, Romi will play the human's winning line back until the human wins from the other side, at which time, once again, the reward/penalty values will cause Romi to pick different moves. This is what Romi's learning was all about, but I have never gotten any feedback from anyone actually using Romi as designed. It just happens to be very effective in computer vs computer matches, and people have tested that quite a bit in the past.
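
A rough, self-contained sketch of the hash-seeding step described above, with invented names rather than Romi's actual routines:

Code:
// Sketch only: before each search, copy learned scores for the current
// subtree into the engine's hash table so they bias move selection.
#include <cstdio>
#include <unordered_map>
#include <vector>

struct LearnEntry {
    unsigned long long zobrist;  // hash of the position after a learned move
    int score;                   // accumulated reward/penalty from past games
};

// Stand-in for the engine's transposition table.
std::unordered_map<unsigned long long, int> hash_table;

void preload_learning(const std::vector<LearnEntry>& subtree) {
    // Rewarded moves look better and penalised moves look worse to the
    // search, which is what makes the engine vary its play after a loss.
    for (const LearnEntry& e : subtree)
        hash_table[e.zobrist] = e.score;
}

int main() {
    std::vector<LearnEntry> subtree = { {0x1234ULL, +12}, {0x5678ULL, -9} };
    preload_learning(subtree);
    std::printf("seeded %zu learned entries\n", hash_table.size());
}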
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: AlphaGo Zero

Post by Michael Sherwin »

As a simple demonstration I have started a 1000 game match between RomiChess and Stockfish 8 at 4 seconds with a 500 millisecond increment using the first 5 positions of Silver50.pgn. I will report the results when I get up in the morning! :D

Edit: It is morning. :lol: I'll make a report when I get up this afternoon. :D
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: AlphaGo Zero

Post by jdart »

Arasan stores hash values for positions that were mis-predicted during the game and reloads these on program start. This is sometimes called "permanent brain." It is a form of learning, but it does not work in UCI mode currently.
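
A hedged sketch of that save/reload cycle, with an invented file layout and names, not Arasan's actual code:

Code:
// Sketch only: write selected hash entries to a file at game end and
// read them back at startup to re-seed the transposition table.
#include <cstdio>
#include <vector>

struct StoredEntry {
    unsigned long long key;  // Zobrist hash of the mis-predicted position
    int score;               // score found once the position was searched
    int depth;               // depth at which that score was obtained
};

void save_entries(const std::vector<StoredEntry>& entries, const char* path) {
    if (FILE* f = std::fopen(path, "wb")) {
        std::fwrite(entries.data(), sizeof(StoredEntry), entries.size(), f);
        std::fclose(f);
    }
}

std::vector<StoredEntry> load_entries(const char* path) {
    std::vector<StoredEntry> entries;
    if (FILE* f = std::fopen(path, "rb")) {
        StoredEntry e;
        while (std::fread(&e, sizeof e, 1, f) == 1) entries.push_back(e);
        std::fclose(f);
    }
    return entries;  // feed these back into the hash table on program start
}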

I also used to have book learning based on game results. That also didn't work with UCI, because you don't get the game result. I took it out because it required a writeable opening book, which was complex, and it also created issues in a multi-user environment (for example, especially on Linux, it is possible Arasan is installed in a directory where the current user does not have write permission).

--Jon
PK
Posts: 893
Joined: Mon Jan 15, 2007 11:23 am
Location: Warsza

Re: AlphaGo Zero

Post by PK »

A dan player whom I happen to know told me why AlphaGo Zero is a ground-breaking achievement. It's all about generalization. The number of games it played while learning was several orders of magnitude smaller than the number of possible Go positions after just 5 moves. And yet, starting from a blank slate, this program managed to weed out all the wrong opening moves, learned the local sequences established by human theory and went past them, learned the stratagems and found some new ones, etc.

I also looked at a few games of AlphaGo Zero. Don't take my opinion too seriously, because I am only an inactive 6 kyu player, but its playing style is strikingly unique. In the early game most of its groups are unsettled; it keeps the options of strengthening them or abandoning them open for a really long time. It looks a bit like a high dan player playing White in a handicap game, except that the game started out even and the moves that look risky do not disturb the equilibrium, because there is no need for that.