Google's AlphaGo team has been working on chess

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Google's AlphaGo team has been working on chess

Post by Rein Halbersma »

hgm wrote: Now we know that neural networks are a quite inefficient way to solve problems. Their only virtue is that they can be trained to do the job, which is very helpful when the programmer doesn't know how to do the job himself. Usually you need a large overkill in terms of neurons to make it reasonably sure that the connections you turn out to need will happen to be present in the network. So it is very likely that the neural network that came out can eventually be simplified by a few orders of magnitude, through explicitly programmed calculation of the intermediate results it seems to use, which it now derives only at great cost. (Imagine how many neurons you would need to recognize that a Knight attacks a Queen somewhere on the board and to suggest that making the capture is a good idea, and how many operations it would take to 'run' such a network, compared to how long a program written to do just that would need.)
The AlphaGoZero network was huge, with 46 million parameters and 80 layers. But if you want to model whole-board interactions on a chess board, you need at least a handful of layers (they're using 3x3 convolutions, so you need at least 4 of them for every square to be connected to every other square in the top layer). This still requires 2.4 million parameters. And, at least for Go, even 20 vs 40 layers made a big difference in the strength that could be obtained. I'm not saying big savings can't be found, but it's not gonna be easy.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Google's AlphaGo team has been working on chess

Post by AlvaroBegue »

Rein Halbersma wrote:
hgm wrote: Now we know that neural networks are a quite inefficient way to solve problems. Their only virtue is that they can be trained to do the job, which is very helpful when the programmer doesn't know how to do the job himself. Usually you need a large overkill in terms of neurons to make it reasonably sure that the connections you turn out to need will happen to be present in the network. So it is very likely that the neural network that came out can eventually be simplified by a few orders of magnitude, through explicitly programmed calculation of the intermediate results it seems to use, which it now derives only at great cost. (Imagine how many neurons you would need to recognize that a Knight attacks a Queen somewhere on the board and to suggest that making the capture is a good idea, and how many operations it would take to 'run' such a network, compared to how long a program written to do just that would need.)
The AlphaGoZero network was huge, with 46 million parameters and 80 layers. But if you want to model whole-board interactions on a chess board, you need at least a handful of layers (they're using 3x3 convolutions, so you need at least 4 of them for every square to be connected to every other square in the top layer). This still requires 2.4 million parameters. And, at least for Go, even 20 vs 40 layers made a big difference in the strength that could be obtained. I'm not saying big savings can't be found, but it's not gonna be easy.
You can use dilation to make the receptive field much larger. WaveNet is a good example of this, and recently some people have been using dilation instead of pooling for image recognition too.

By using dilation, you can cover the board in logarithmically many layers instead of linearly many, as a function of board size. It's probably worth exploring.
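To make the receptive-field arithmetic concrete, here is a minimal sketch (assuming PyTorch; the channel count and input planes are illustrative, not AlphaZero's actual architecture). A 3x3 convolution with dilation d widens the receptive field by 2d, so dilations 1, 2, 4 give 1 + 2 + 4 + 8 = 15, enough for corner-to-corner interactions on an 8x8 board in just three layers:

```python
import torch
import torch.nn as nn

# Three 3x3 convolutions with dilations 1, 2, 4. The receptive field
# grows by 2*dilation per layer: 1 -> 3 -> 7 -> 15. A 15x15 receptive
# field lets a corner square see the opposite corner (7 squares away),
# so every square is connected to every other square after 3 layers.
class DilatedBoardNet(nn.Module):
    def __init__(self, channels=64):  # channel count is illustrative
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(12, channels, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=4, dilation=4), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)

# 12 input planes, e.g. one per piece type and colour.
net = DilatedBoardNet()
out = net(torch.zeros(1, 12, 8, 8))  # shape: (1, 64, 8, 8)
```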
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Google's AlphaGo team has been working on chess

Post by hgm »

Rein Halbersma wrote:The AlphaGoZero network was huge, with 46 million parameters and 80 layers. But if you want to model whole-board interactions on a chess board, you need at least a handful of layers (they're using 3x3 convolutions, so you need at least 4 of them for every square to be connected to every other square in the top layer). This still requires 2.4 million parameters. And, at least for Go, even 20 vs 40 layers made a big difference in the strength that could be obtained. I'm not saying big savings can't be found, but it's not gonna be easy.
But the point is that I want to eliminate the NN altogether. It was nice to have it for training. But most people don't want an engine to train it, they want an engine to use it. So it would be fine to replace the NN by dedicated, immutable code that calculates numbers sufficiently similar to the outputs of the fully trained NN. Whole-board interactions (e.g. detecting (soft-)pins, X-ray attacks, skewers, discovered threats etc.) are not particularly difficult to calculate with dedicated, purposeful code. A NN is pretty dumb, and would basically have to examine every ray on the board for the sought pattern of piece types and empty squares, which can occur in many different positions along each ray. Meanwhile the pattern might be possible in only one location at a time, due to the limited number of pieces (say a K and opposing Q with an N in between), and that location is trivial to find when you know where your pieces are located. A routine designed to detect pins would be orders of magnitude faster than the simulation of a neural net designed for that purpose on a normal CPU, let alone a random neural net large enough to be able to learn it through training.
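To illustrate how cheap such dedicated code is, here is a minimal sketch of a pin detector (the board representation is made up for the example; a real engine would use bitboards and be faster still):

```python
# Scan rays outward from the king: first own piece on a ray becomes a
# pin candidate; if the next piece is an enemy slider of the right
# kind, the candidate is pinned. A few loops, no network needed.
DIRS_ROOK   = [(1, 0), (-1, 0), (0, 1), (0, -1)]
DIRS_BISHOP = [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def pinned_pieces(board, king_sq, own, enemy_sliders):
    """board: dict {(file, rank): piece code}, missing = empty.
    own: set of own piece codes. enemy_sliders: codes per ray class."""
    pins = []
    for dirs, sliders in ((DIRS_ROOK, enemy_sliders['rook']),
                          (DIRS_BISHOP, enemy_sliders['bishop'])):
        for dx, dy in dirs:
            blocker = None
            x, y = king_sq
            while True:
                x, y = x + dx, y + dy
                if not (0 <= x < 8 and 0 <= y < 8):
                    break
                piece = board.get((x, y))
                if piece is None:
                    continue
                if piece in own and blocker is None:
                    blocker = (x, y)      # first own piece on the ray
                elif blocker and piece in sliders:
                    pins.append(blocker)  # enemy slider behind it: a pin
                    break
                else:
                    break                 # second own piece, or harmless enemy
    return pins

# White K on e1, white N on e4, black Q on e8: the knight is pinned.
board = {(4, 0): 'K', (4, 3): 'N', (4, 7): 'q'}
print(pinned_pieces(board, (4, 0), {'N', 'P', 'B', 'R', 'Q'},
                    {'rook': {'r', 'q'}, 'bishop': {'b', 'q'}}))  # [(4, 3)]
```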

And we can assume that detecting pins, and a host of other concepts from classical Chess strategy and tactics, is important for reaching the level of play AlphaZero does. As a good start we could try feeding a NN with efficiently extracted information that we suspect might be useful, such as Pawn structure, passers, squares attacked near the King, mobility, game phase, LxH capture threats, and possible checking moves. That should allow a vastly smaller NN to compute the same as what the AlphaZero NN calculates from just the piece locations.
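A rough sketch of that architecture (assuming PyTorch; the feature list, the hypothetical `pos` object, and the layer sizes are all made up for illustration): fast dedicated routines do the perception, and a tiny network only has to weigh the results.

```python
import torch
import torch.nn as nn

# Hypothetical hand-extracted features of the kind listed above; in a
# real engine these would come from fast dedicated routines, not from
# the network itself. `pos` and its attributes are illustrative.
def extract_features(pos):
    return torch.tensor([
        pos.material_balance,
        pos.passed_pawns_us - pos.passed_pawns_them,
        pos.king_ring_attacks,    # squares attacked near the King
        pos.mobility_us - pos.mobility_them,
        pos.game_phase,
        pos.lxh_capture_threats,  # low-takes-high threats
        pos.checking_moves,
    ], dtype=torch.float32)

# A tiny MLP, orders of magnitude smaller than a deep ResNet, because
# the hard perceptual work is done before it ever runs.
value_head = nn.Sequential(
    nn.Linear(7, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Tanh(),  # win expectation in [-1, 1]
)

# Usage: value = value_head(extract_features(pos))
```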
CheckersGuy
Posts: 273
Joined: Wed Aug 24, 2016 9:49 pm

Re: Google's AlphaGo team has been working on chess

Post by CheckersGuy »

hgm wrote:
Rein Halbersma wrote:The AlphaGoZero network was huge, with 46 million parameters and 80 layers. But if you want to model whole-board interactions on a chess board, you need at least a handful of layers (they're using 3x3 convolutions, so you need at least 4 of them for every square to be connected to every other square in the top layer). This still requires 2.4 million parameters. And, at least for Go, even 20 vs 40 layers made a big difference in the strength that could be obtained. I'm not saying big savings can't be found, but it's not gonna be easy.
But the point is that I want to eliminate the NN altogether. It was nice to have it for training. But most people don't want an engine to train it, they want an engine to use it. So it would be fine to replace the NN by dedicated, immutable code that calculates numbers sufficiently similar to the outputs of the fully trained NN. Whole-board interactions (e.g. detecting (soft-)pins, X-ray attacks, skewers, discovered threats etc.) are not particularly difficult to calculate with dedicated, purposeful code. A NN is pretty dumb, and would basically have to examine every ray on the board for the sought pattern of piece types and empty squares, which can occur in many different positions along each ray. Meanwhile the pattern might be possible in only one location at a time, due to the limited number of pieces (say a K and opposing Q with an N in between), and that location is trivial to find when you know where your pieces are located. A routine designed to detect pins would be orders of magnitude faster than the simulation of a neural net designed for that purpose on a normal CPU, let alone a random neural net large enough to be able to learn it through training.
It would obviously be nice to deduce what the neural network has actually learned, but how would you go about that? :lol: The reason why deep NNs are used is that they can learn complex/abstract functions. What you describe isn't very far from a linear evaluation function, which is not very complex at all. The NN from AlphaZero probably learned more abstract concepts than we could explicitly code. For example, Stockfish may need a 20-ply search to see that it is dead lost, while A0 needs only one call to the neural network, because it has learned that those positions are lost.
It would be very hard to explicitly encode this concept into the evaluation function (without doing a search).

Let me give another example. People tried to detect objects in pictures by explicitly coding a routine that does so. (They relied on edge detection and some other things, but that's beside the point.) No matter what they tried, there were always cases where the object detection simply didn't work.

To get back to chess: you may think that you have covered all X-ray patterns, or whatever evaluation term you have explicitly coded, only to find out some time later that you did a horrible job because there were many cases you did not cover.
hgm wrote: But the point is that I want to eliminate the NN altogether. It was nice to have it for training. But most people don't want an engine to train it, they want an engine to use it.
It's not like everyone has to train their engine :lol: If DeepMind decided to publish the AlphaZero engine, they would supply the weights for the neural network and no one would have to train it anymore. This obviously assumes that we have the hardware to run it on :cry:
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Google's AlphaGo team has been working on chess

Post by syzygy »

syzygy wrote:It is MCTS without MC...
I'm not the first to complain about the naming:
http://talkchess.com/forum/viewtopic.ph ... 672#453672
CheckersGuy
Posts: 273
Joined: Wed Aug 24, 2016 9:49 pm

Re: Google's AlphaGo team has been working on chess

Post by CheckersGuy »

syzygy wrote:
syzygy wrote:It is MCTS without MC...
I'm not the first to complain about the naming:
http://talkchess.com/forum/viewtopic.ph ... 672#453672
True. But for lack of a better term I still say MCTS (most of the time). It should be quite clear to anyone who has had a look at the AlphaGoZero/AlphaZero papers that this is basically MCTS without MC.
Maybe we should call it AlphaZero-Tree-Search :lol:
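For anyone who wants the "MCTS without MC" loop spelled out, here is a rough sketch of the PUCT rule from the papers (the Node fields are a toy version of what a real implementation stores; Dirichlet noise, virtual loss, etc. are omitted):

```python
import math
from dataclasses import dataclass, field

C_PUCT = 1.5  # exploration constant; tuned per game in practice

@dataclass
class Node:
    prior: float                 # policy-head probability for the move
    visits: int = 0
    value_sum: float = 0.0
    children: list = field(default_factory=list)

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node):
    """PUCT selection as in the AlphaGoZero/AlphaZero papers: pick the
    child maximizing Q + U, where U is driven by the policy prior
    rather than by random playouts (hence 'MCTS without MC')."""
    total = sum(ch.visits for ch in node.children)
    return max(node.children,
               key=lambda ch: ch.q() + C_PUCT * ch.prior
                              * math.sqrt(total) / (1 + ch.visits))

def backup(path, value):
    # The leaf value comes from one network call instead of a rollout;
    # it is propagated up the visited path with a sign flip per ply.
    for node in reversed(path):
        node.visits += 1
        node.value_sum += value
        value = -value
```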
User avatar
Luis Babboni
Posts: 464
Joined: Sat Feb 28, 2015 4:37 pm
Location: Argentina

Re: Google's AlphaGo team has been working on chess

Post by Luis Babboni »

Sorry if someone talked about this before; I tried to read all the AlphaZero comments, but I'm not sure I managed to.

I understand that from 100 games, AlphaZero won 25 of 50 with White and 3 of 50 with Black, with no losses with either colour.

Could it be that, learning by playing only itself, A0 concluded that going for a draw is the logical goal when playing Black?
I mean, that it did not win more with Black because it did not try to.
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Google's AlphaGo team has been working on chess

Post by hgm »

CheckersGuy wrote:It would obviously be nice to deduce what the neural network has actually learned, but how would you go about that? :lol:
An obvious way would be to measure the correlation of cell activations with known, simple domain-specific features.
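As a rough sketch of such probing (numpy; `activations` and `has_pin` are assumed to have been collected beforehand by running the net and a hand-coded detector over the same batch of positions):

```python
import numpy as np

def probe(activations, has_pin):
    """activations: (n_positions, n_cells) hidden-layer outputs.
    has_pin: (n_positions,) 0/1 labels from a hand-coded pin detector.
    Returns cell indices sorted by how strongly they track the feature."""
    a = (activations - activations.mean(0)) / (activations.std(0) + 1e-9)
    f = (has_pin - has_pin.mean()) / (has_pin.std() + 1e-9)
    corr = a.T @ f / len(f)           # Pearson r per cell
    return np.argsort(-np.abs(corr))  # most 'pin-like' cells first
```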
The reason why deep NNs are used is that they can learn complex/abstract functions.
Exactly. Because they can learn them. Not because they are very efficient for evaluating them.
What you describe isn't very far from a linear evaluation function, which is not very complex at all.
Not at all. There is absolutely nothing linear in detecting features like pins and X-rays. They are either there or they aren't, which is a step function.
The NN from AlphaZero probably learned more abstract concepts than we could explicitly code.
Than we would explicitly code. Anything can be coded, once you know what you need. But this is no problem at all, because you don't withhold any information from the NN that it gets now (which I understand is just the board position). You just help it on the way by making extra information available on its inputs that could be enormously useful to it, and that it then doesn't have to learn to recognize in a comparatively very cumbersome way.
For example, Stockfish may need a 20-ply search to see that it is dead lost, while A0 needs only one call to the neural network, because it has learned that those positions are lost.
This is just a wild guess, and I strongly doubt that there is any truth in it. For one, Stockfish usually has a very good idea of whether it is dead lost; a quiescence search will do the trick. In positions where Stockfish would need 20 ply before it can see it is lost, e.g. whether an attack on its King fortress will be decisive or not, the NN will almost certainly not be able to recognize that either. Only when the loss is due to obvious strategic features that Stockfish is blind to, like trapping its own Bishop on g1/h2 behind blocked Pawns on f2/g3, would the NN (and human players) immediately spot it. Things that can go either way depending on deep tactical calculation will be outside the scope of a NN. Its win-probability prediction might not even be better than Stockfish's QS.

But that is no problem, because it knows very well what it doesn't know and what can only be resolved by search. It then directs the search there, the node gets expanded, and the precise tactics come into view.
It would be very hard to explicitly encode this concept into the evaluation function (without doing a search).

The difference between AlphaZero and Stockfish is not that one sees at zero ply what the other takes 20 ply to see. The difference is that to search the 20 ply needed to see it, Stockfish needs a million nodes, while AlphaZero needs only a couple of hundred.
To get back to chess: you may think that you have covered all X-ray patterns, or whatever evaluation term you have explicitly coded, only to find out some time later that you did a horrible job because there were many cases you did not cover.
Well, then the NN will have to learn to recognize these features from the board position itself, which it has to do now anyway.
It's not like everyone has to train their engine :lol: If DeepMind decided to publish the AlphaZero engine, they would supply the weights for the neural network and no one would have to train it anymore. This obviously assumes that we have the hardware to run it on :cry:
And that is exactly the point. The trained NN is large and cumbersome, containing large parts that are practically unused, but would have been useful for Go, Shogi, Monopoly, Tennis... You would need a whole lot less expensive hardware if you could cull all that out, and condense the parts that do trivial things in a cumbersome way into dedicated code. To show that a completely general NN, not using any domain-specific knowledge, can be trained to do the job is wonderful and shocking. But it doesn't mean that it is a smart way to achieve the goal.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Google's AlphaGo team has been working on chess

Post by Michel »

There is absolutely nothing linear in detecting features like pins and X-rays. They are either there or they aren't, which is a step function.
This is a misinterpretation of the meaning of "linear" in this context. Linear means that the evaluation function is linear in the bonus/penalty for such a feature. This is the case in almost all evaluation functions used by chess engines.

In contrast, NNs use a cascade of linear functions and nonlinear functions (the latter are logistic functions or some approximation of those).

Note that I do not want to claim that NNs are necessarily better (I don't know). But they are definitely more flexible than the evaluation functions typically used by chess engines.
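A toy illustration of the difference (the features and weights are made up): the classical evaluation is a plain weighted sum, while the NN nests a nonlinearity between two linear maps, which lets it express interactions between features that no single weighted sum can.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

features = np.array([1.0, 0.0, 1.0])  # e.g. passed pawn, pin, open file

# Classical engine style: linear in the bonus/penalty for each feature.
w = np.array([0.3, 0.5, 0.1])
linear_eval = w @ features

# NN style: a cascade of linear maps and logistic nonlinearities.
W1 = np.random.randn(4, 3)  # random weights, purely for illustration
W2 = np.random.randn(1, 4)
nn_eval = W2 @ sigmoid(W1 @ features)
```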
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Google's AlphaGo team has been working on chess

Post by Rein Halbersma »

hgm wrote:The trained NN is large and cumbersome, containing large parts that are practically unused
Compressing NNs is an active topic of research: https://arxiv.org/abs/1710.09282
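One of the simplest methods covered there is magnitude pruning; a toy sketch of the idea (real compression pipelines add retraining, quantization, or distillation on top):

```python
import numpy as np

def magnitude_prune(weights, keep=0.1):
    """Zero all but the largest-magnitude `keep` fraction of weights.
    On its own this rarely preserves strength; in practice pruning is
    interleaved with retraining."""
    threshold = np.quantile(np.abs(weights), 1.0 - keep)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)
```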

The main question is whether it is realistic to assume that one can hand-code (or even automate) a series of chess-knowledge-intensive patterns and have a linear combination of those patterns cover 99% of the information in the NN. The same observation about large amounts of unused knowledge holds for endgame databases, yet it has proven very difficult to find a small, low-dimensional Boolean expression in terms of the piece locations that efficiently encodes all that knowledge (even at, say, 99% accuracy it is very hard to find a lossy compression that is fast to decode).

However, in draughts Fabien Letouzey made a very nice contribution by scanning the board (hence his program's name, Scan) in 4x4 regions (and some larger patterns), computing the index of each pattern into a large lookup table of weights. These weights were trained using logistic regression, so apart from the logistic mapping at the end it is a linear combination of patterns and weights. The question is whether something like that works in chess.
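A rough sketch of that pattern scheme (the region shape and square encoding here are simplified; Scan's real patterns and training differ in detail): each region's contents are packed into a base-3 index into a table of trained weights, and the evaluation is the logistic of the summed lookups.

```python
import math

N_STATES = 3  # empty / white man / black man (kings left out for brevity)

def region_index(board, squares):
    """Pack the contents of one pattern region into a base-3 index."""
    idx = 0
    for sq in squares:
        idx = idx * N_STATES + board[sq]  # board[sq] in {0, 1, 2}
    return idx

def evaluate(board, regions, tables):
    # Linear in the trained table entries, exactly like a classical eval...
    score = sum(tables[i][region_index(board, squares)]
                for i, squares in enumerate(regions))
    # ...with a single logistic mapping at the end, as fit by regression.
    return 1.0 / (1.0 + math.exp(-score))
```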