AlphaGo and Stockfish played on similar hardware


vvarkey
Posts: 88
Joined: Fri Mar 10, 2006 11:20 am
Location: Bangalore India

AlphaGo and Stockfish played on similar hardware

Post by vvarkey »

Ignore all the training that went into AlphaZero for a second.

Per the paper, for the 100 games:
AlphaZero and the previous AlphaGo Zero used a single machine with 4 TPUs. Stockfish and Elmo played at their strongest skill level using 64 threads and a hash size of 1GB.
According to https://cloud.google.com/blog/big-data/ ... g-unit-tpu:
We announced the TPU last year and recently followed up with a detailed study of its performance and architecture. In short, we found that the TPU delivered 15–30X higher performance and 30–80X higher performance-per-watt than contemporary CPUs and GPUs.
So, a single machine with 4 TPUs (15x4 = 60) is somewhat comparable to 64 CPU threads.

Now, for training AlphaGo, DeepMind really did use tons of hardware: 5,000 Gen 1 TPUs to generate the games for training and 64 Gen 2 TPUs for training the neural nets.

But for comparing playing strengths, these numbers are as relevant as counting how many man-hours went into the development of Stockfish.
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: AlphaGo and Stockfish played on similar hardware

Post by hgm »

I am not sure that with 'contemporary CPUs' they mean 'single cores'. It is also unclear whether Stockfish was using 64 cores, or just 64 hyperthreads on 32 cores.

However, your basic claim seems to be correct. TPUs and (multi-core) CPUs are comparable hardware. They just do very different things, and what one is good at, the other can do only very poorly, or not at all. It is probably quite easy to find tasks that TPUs would do much slower than a CPU. Of course you would seldom see those mentioned in promotional material for TPUs.

One could argue that the TPUs are specifically adapted to run neural networks, and that Stockfish had to run on hardware not specially designed to run Stockfish, but on a general-purpose CPU equally suitable for many tasks. OTOH, the TPUs are not specifically designed for running the AlphaZero network; 'neural network' is still a pretty general application as well.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: AlphaGo and Stockfish played on similar hardware

Post by syzygy »

AlphaZero probably ran on the same PC as Stockfish: a PC with 32 or 64 general-purpose cores and 4 TPUs.

So the hardware was identical. Stockfish just chose not to make use of the TPUs. Or at least, that is one way of looking at it.

If the TPUs are indeed first-generation TPUs, they apparently consume 28-40 Watt each. 160 Watt is less than what the 32/64 cores will use. And these first generation TPUs are manufactured using 28nm technology from 2010/2011.
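A quick back-of-envelope on that (just a sketch; the CPU-side TDP is an assumed typical figure, not the actual machine's):

Code:
# Rough power tally from the figures above (first-gen TPUs at 28-40 W each).
# The CPU-side number is an assumed TDP for a typical dual-socket server,
# not a measured value for the machine DeepMind actually used.
tpu_watts = (28, 40)
tpus = 4
print(f"4 TPUs: {tpus * tpu_watts[0]}-{tpus * tpu_watts[1]} W")    # 112-160 W

assumed_tdp_per_socket = 145   # hypothetical high-core-count Xeon figure
print(f"two CPU sockets: ~{2 * assumed_tdp_per_socket} W")         # ~290 W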
vvarkey
Posts: 88
Joined: Fri Mar 10, 2006 11:20 am
Location: Bangalore India

Re: AlphaGo and Stockfish played on similar hardware

Post by vvarkey »

hgm wrote:I am not sure that with 'contemporary CPUs' they mean 'single cores'.
Oops. In the actual TPU paper https://arxiv.org/pdf/1704.04760.pdf:
The traditional CPU server is represented by an 18-core, dual-socket Haswell processor from Intel. The GPU accelerator is the Nvidia K80.
So 4 TPUs = 18x15x4 = at least 1080 cores (threads)
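As a quick sketch, taking the blog's 15x lower bound and the paper's 18-core chip at face value (the same reading as the 18x15x4 above):

Code:
# Back-of-envelope only: treats "15x" as one TPU versus one 18-core Haswell chip.
speedup_lower_bound = 15   # low end of Google's 15-30x claim
baseline_cores = 18        # cores per chip in the TPU paper's Haswell baseline
tpus = 4

core_equivalents = tpus * speedup_lower_bound * baseline_cores
print(core_equivalents, "Haswell-core equivalents")                  # 1080
print(round(core_equivalents / 64, 1), "x Stockfish's 64 threads")   # ~16.9x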

I think Stockfish was using 64 actual cores, since Google's Compute Engine offers 64-core CPUs.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: AlphaGo and Stockfish played on similar hardware

Post by Milos »

syzygy wrote:AlphaZero probably ran on the same PC as Stockfish: a PC with 32 or 64 general-purpose cores and 4 TPUs.

So the hardware was identical. Stockfish just chose not to make use of the TPUs. Or at least, that is one way of looking at it.

If the TPUs are indeed first-generation TPUs, they apparently consume 28-40 Watt each. 160 Watt is less than what the 32/64 cores will use. And these first generation TPUs are manufactured using 28nm technology from 2010/2011.
TSMC 28nm process (the first 28nm process ever) from late 2011. But the actual TPUs were fabricated in 2015.
P.S. Stockfish didn't have any choice, nor did the SF authors for that matter. DeepMind used it in the way they liked. We don't even know which compile was used: the official one, a BMI-capable one, or whether they compiled it themselves.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: AlphaGo and Stockfish played on similar hardware

Post by Michael Sherwin »

vvarkey wrote:Ignore all the training that went into AlphaZero for a second.
You really cannot do that, not even for a millisecond. The training (did I read 44 million games?) is worth more than 1,000 Elo and probably much more. The learning was guided by the NN to focus on the most promising lines, thus narrowing the field. A0 could have gotten winning positions against SF without ever leaving its learn file. The rest of the positions were so good that the chess-playing algorithm of A0 could then get a win or at least a draw. Believe me, I know, as I've seen RomiChess play entire games from its learn file. Even if the learn file does not produce a move to play immediately, the fact that the whole subtree of the current position, with its learned values, is loaded into the hash causes the search to return much stronger moves on average. You can't ignore the training; it is 90% of the strength of A0.
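For what it's worth, here is a minimal sketch of that hash-preloading idea (not RomiChess's actual code; every name and number below is made up for illustration):

Code:
# Sketch of the idea only: learned scores for positions below the current root
# are copied into the transposition table before the search starts, so plain
# alpha-beta sees them as already-searched entries and is steered toward the
# lines the learning liked.

transposition_table = {}   # zobrist key -> (score_cp, depth, bound)

learn_file = {             # zobrist key -> (score_cp, depth) from earlier games
    0x9D39247E33776D41: (+35, 12),
    0x2AF7398005AAA5C7: (-20, 10),
}

def preload_learned_subtree(reachable_keys):
    """Seed the hash with learned values for positions reachable from the root,
    so the search returns stronger moves even when there is no book move."""
    for key in reachable_keys:
        if key in learn_file:
            score, depth = learn_file[key]
            transposition_table[key] = (score, depth, "exact")

# Before searching the current position:
preload_learned_subtree(learn_file.keys())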
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: AlphaGo and Stockfish played on similar hardware

Post by Dirt »

Michael Sherwin wrote:You really cannot do that, not even for a millisecond. The training (did I read 44 million games?) is worth more than 1,000 Elo and probably much more. The learning was guided by the NN to focus on the most promising lines, thus narrowing the field. A0 could have gotten winning positions against SF without ever leaving its learn file. The rest of the positions were so good that the chess-playing algorithm of A0 could then get a win or at least a draw. Believe me, I know, as I've seen RomiChess play entire games from its learn file. Even if the learn file does not produce a move to play immediately, the fact that the whole subtree of the current position, with its learned values, is loaded into the hash causes the search to return much stronger moves on average. You can't ignore the training; it is 90% of the strength of A0.
You could train AlphaZero on chess and then make it play Fischer Random. To eliminate the development time you could even limit it to those FRC positions (5?) where castling doesn't change.

I'm not sure what that would tell us but I'd find it interesting.
Deasil is the right way to go.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: AlphaGo and Stockfish played on similar hardware

Post by Michael Sherwin »

Dirt wrote:
Michael Sherwin wrote:You really cannot do that, not even for a millisecond. The training (did I read 44 million games?) is worth more than 1,000 Elo and probably much more. The learning was guided by the NN to focus on the most promising lines, thus narrowing the field. A0 could have gotten winning positions against SF without ever leaving its learn file. The rest of the positions were so good that the chess-playing algorithm of A0 could then get a win or at least a draw. Believe me, I know, as I've seen RomiChess play entire games from its learn file. Even if the learn file does not produce a move to play immediately, the fact that the whole subtree of the current position, with its learned values, is loaded into the hash causes the search to return much stronger moves on average. You can't ignore the training; it is 90% of the strength of A0.
You could train AlphaZero on chess and then make it play Fischer Random. To eliminate the development time you could even limit it to those FRC positions (5?) where castling doesn't change.

I'm not sure what that would tell us but I'd find it interesting.
If I understand A0's learning approach, and I think that I do, then all the pretraining at classic chess would be useless against Fischer Random, unless it transposes somehow and the A0 learned tree can handle transpositions. However, A0 could train 44 million games on all FR positions with the same effect.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: AlphaGo and Stockfish played on similar hardware

Post by mjlef »

I do not think so. Looking at the Tensor Processing Unit (second gen) specs here:

https://en.wikipedia.org/wiki/Tensor_processing_unit

it says "45 TFLOPS".

Intel, in typical literature for a 72-core machine:

https://www.intel.com/content/www/us/en ... ssors.html

says "With up to 72 out-of-order cores, the new Intel® Xeon Phi™ processor delivers over 3 teraFLOPS "

It is a bit unclear how many of the chips they cite go into one second-generation TPU.

The quotes about power per TFLOP do not really tell us much, but the above confirms that TPUs are much faster at neural nets than a standard Intel chip.
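Putting the two quoted numbers side by side (peak figures only, so take it as a rough indication):

Code:
# Peak-throughput ratio from the figures quoted above. Both are peak/marketing
# numbers for dense arithmetic, so this is only indicative for NN workloads.
tpu_gen2_tflops = 45.0    # per second-gen TPU, per the Wikipedia figure above
xeon_phi_tflops = 3.0     # 72-core Xeon Phi, per the Intel figure above

print(f"one TPU ~ {tpu_gen2_tflops / xeon_phi_tflops:.0f}x a 72-core Xeon Phi")   # ~15x
print(f"four TPUs ~ {4 * tpu_gen2_tflops / xeon_phi_tflops:.0f}x")                # ~60x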
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: AlphaGo and Stockfish played on similar hardware

Post by Dirt »

Michael Sherwin wrote:If I understand A0's learning approach and I think that I do then all the pretraining at classic chess would be useless against Fischer Random unless it transposes somehow and the A0 learned tree can handle transpositions. However, A0 could train 44 million games on all FR positions with the same effect.
How well AlphaZero handles FRC positions without specific training for them is the question I was getting at. We disagree on how well it would do, and without a way to do the actual test I see no way to know for sure which of us is correct.
Deasil is the right way to go.