EvgeniyZh wrote: The info on TPUs is vague, but each is said to deliver ~45 TFLOPS (half precision, probably). For example see
here. That would mean AlphaZero ran on a ~180 TFLOPS system. It's believed the 1080 Ti is kinda cost-optimal for DL, and you'd need 16-18 of them to match that performance (you may round up to 20). That's not what you'd put at home, but many DL researchers have that amount of resources. I'd roughly estimate it at around $60k for the whole thing, give or take. With the next generation of GPUs you could probably fit the whole thing in one node.
lkaufman wrote:
The other conditions were of course not "fair", but reasonable given that AlphaZero only trained for a few hours. I suppose if Stockfish had used a good book, been allowed to manage its time as if the limit were pure increment, and used the latest dev version, the match would have been much closer, but judging by the infinite win-to-loss ratio and the actual games, SF would probably still have lost. The games were amazing.
Bottom line, assuming the comparable-cost claim is accurate: if Google wants to spend a few weeks optimizing the software and then sell it, rent it, or give it away, we have a revolution in computer chess. But my guess is that they won't, in which case the revolution may be delayed by a couple of years or so.
First, you have to understand what a TPU is. There is plenty of material on that, published by none other than Google:
https://arxiv.org/abs/1704.04760
Second, it is not 45 TFLOPS but 92 TOPS, and that is the first-generation TPU. They don't say explicitly in the paper which TPU generation they used for inference (they state it only for training), but logic suggests the second generation is more probable. Second-generation TPU performance is 180 TOPS.
Those are int8 multiplications, not the single- or double-precision floating-point operations you are used to from common GPUs and NVIDIA in general, and it is certainly not "tensor FLOPS" (a stupid marketing term by NVIDIA that has zero meaning in reality).
The V100 has 15 TFLOPS single precision; that is the most you can get if you use single-precision floating point as a replacement for integer multiplication. So you would need 6 V100s to match one first-generation TPU (92 / 15 ≈ 6), and 12 for a second-generation one (180 / 15 = 12).
AlphaZero used 4 TPUs for playing games, so at best 24 V100s, at worst 48.
A V100 costs at best $10k, so roughly a quarter million to half a million bucks just to run AlphaZero, and you think there are chess enthusiasts who could afford that???
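To make the arithmetic explicit, here's a quick back-of-envelope sketch in Python. The TOPS numbers and the $10k V100 price are the rough estimates from this thread, not official benchmarks:

Code:

# Back-of-envelope: how many V100s would it take to match AlphaZero's
# 4 TPUs, using the rough figures from this thread (not official specs).
TPU_GEN1_TOPS = 92      # first-generation TPU, int8
TPU_GEN2_TOPS = 180     # second-generation TPU (as claimed above)
V100_TFLOPS = 15        # V100 single-precision peak
V100_PRICE = 10_000     # rough price per card, USD

for gen, tops in (("gen-1", TPU_GEN1_TOPS), ("gen-2", TPU_GEN2_TOPS)):
    v100s = 4 * tops / V100_TFLOPS   # AlphaZero played on 4 TPUs
    cost = v100s * V100_PRICE
    print(f"{gen}: ~{v100s:.0f} V100s, ~${cost:,.0f}")
# gen-1: ~25 V100s, ~$245,333
# gen-2: ~48 V100s, ~$480,000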
And give me a break with the theoretical GP102 (1080 Ti) performance figures. I work with them for ML and that is pure BS, so much BS that NVIDIA never actually published the figure; instead people compute it as num_cores x frequency x 2, which is totally detached from reality.
In reality, if you run integer multiplications on it, you'll see the performance is not even 1 TOPS.
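For the record, this is the naive formula people use. The core count and boost clock below are the published 1080 Ti specs; the result is exactly the theoretical peak being criticized here, not a measured number:

Code:

# Naive theoretical fp32 peak people quote for the 1080 Ti (GP102):
# cores * clock * 2, counting a fused multiply-add as 2 FLOPs per cycle.
cores = 3584            # CUDA cores on the 1080 Ti
clock_hz = 1.582e9      # boost clock, ~1.58 GHz
peak_tflops = cores * clock_hz * 2 / 1e12
print(f"theoretical peak: {peak_tflops:.1f} TFLOPS")  # ~11.3 TFLOPS
# Sustained integer-multiply throughput in practice is far lower,
# which is the point being made above.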
Do you think NVIDIA is so stupid as to sell the V100 for >$10k if it offered almost the same performance as a 1080 Ti that costs $600???