This is one of the best summaries of the AGZ paper, assuming the same DCNN is used for chess. However, there is no indication that the DCNN for chess is organized in the same way as for Go, since the paper does not mention this. I guess they left it for the next Nature publication.

Rein Halbersma wrote:
The deep neural network connects the pieces on different squares to each other. They use 3x3 convolutions. This means that each cell of the next 8x8 layer is connected to a 3x3 region (called the "receptive field") in the previous layer, and to a 5x5 region in the layer before that, etc. After only 4 layers, each cell is connected to every other cell in the original input layer. For AlphaGo Zero they used no less than 80 layers. They also have many "feature maps" in parallel, so that they can learn different concepts related to piece-square combinations. Finally, they use the last 8 positions as input as well, so the network also has a sense of ongoing maneuvers. All this is then trained on the game result and the best move from the MC tree search.
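To make the receptive-field claim concrete, here is a minimal sketch (mine, not from the paper) of how stacked 3x3 convolutions grow the field by 2 squares per layer:

[code]
# Receptive-field growth of stacked 3x3 convolutions (stride 1).
# Illustrative only; the layer counts match the quote above.
def receptive_field(num_layers, kernel=3):
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1  # each 3x3 layer adds one square on every side
    return rf

for n in range(1, 5):
    print(f"{n} layers -> {receptive_field(n)}x{receptive_field(n)}")
# 4 layers -> 9x9, which already covers the whole 8x8 board
[/code]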
Although the amount of resources required to train the millions of weights of these networks is enormous, it is conceptually not surprising that pawn structure, king safety, mobility and even deep tactics can be detected from the last 8 positions.
We know how the input features are organized and we know the policies, but that really doesn't tell us much about the actual network implementation, especially since both the inputs and the policies are totally different and much more complex for chess than for Go.
The only thing we can guess from the paper is the total number/size of the NN's weights.
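For reference, here is a minimal sketch of the chess input and policy shapes the preprint does spell out (the network body in between is the unknown part); the plane counts are from the AlphaZero preprint, the variable names are mine:

[code]
import numpy as np

# AlphaZero chess I/O shapes as described in the preprint:
T, M, L = 8, 14, 7             # 8 history steps, 14 piece/repetition planes, 7 constant planes
input_planes = M * T + L       # = 119 planes of 8x8
policy_planes = 73             # move-type planes per from-square
policy_size = 8 * 8 * policy_planes  # = 4672 encodable moves

x = np.zeros((input_planes, 8, 8), dtype=np.float32)   # one network input
p = np.zeros((policy_planes, 8, 8), dtype=np.float32)  # one policy output
print(input_planes, policy_size)                       # 119 4672
[/code]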
We have 80k searches per second (not just evaluations but complete MCTS iterations, each performing one leaf evaluation on the 4 TPUs, with leaves taken from an 8-deep evaluation queue). There is also no mention of the hardware running the actual MCTS, but that is most probably a general-purpose CPU no weaker than the one that was running SF.
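A minimal sketch of how such an 8-deep leaf-evaluation queue could be wired up (all names and structure are my illustration, not from the paper):

[code]
import queue
from concurrent.futures import Future

BATCH = 8                  # leaf evaluations batched per accelerator call
pending = queue.Queue()

def nn_forward(positions):
    # stand-in for one batched TPU forward pass: (value, policy) per leaf
    return [(0.0, {}) for _ in positions]

def submit_leaf(position):
    """Called by an MCTS worker at a leaf; returns a Future for (value, policy)."""
    fut = Future()
    pending.put((position, fut))
    return fut

def evaluator_loop():
    while True:
        jobs = [pending.get() for _ in range(BATCH)]  # block until 8 leaves queue up
        for (_, fut), res in zip(jobs, nn_forward([pos for pos, _ in jobs])):
            fut.set_result(res)
[/code]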
Since the TPUs' eval speed is the same as in training, they were most probably the same, i.e. first-generation ones with a peak of 92 T int8 ops per second.
So, assuming roughly one int8 op per weight per evaluation, 4*92T/80k = 4.6G weights, i.e. about 4.6 GB of weights at one byte per int8 weight.
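The same back-of-the-envelope, spelled out (the one-op-per-weight-per-eval assumption and full peak utilization are of course optimistic, so this is an upper bound):

[code]
tpu_ops = 92e12            # first-gen TPU peak, int8 ops per second
num_tpus = 4
searches_per_sec = 80e3    # one leaf evaluation per search

ops_per_eval = num_tpus * tpu_ops / searches_per_sec
print(f"{ops_per_eval:.2e}")  # 4.60e+09 -> ~4.6G weights, ~4.6 GB at 1 byte/weight
[/code]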