Why GPUs are well-suited to deep learning

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

duncan
Posts: 12038
Joined: Mon Jul 07, 2008 10:50 pm

Why GPUs are well-suited to deep learning

Post by duncan »

https://www.quora.com/Why-are-GPUs-well ... p-learning


As many have said GPUs are so fast because they are so efficient for matrix multiplication and convolution, but nobody gave a real explanation why this is so. The real reason for this is memory bandwidth and not necessarily parallelism.

First of all you have to understand that CPUs are latency optimized while GPUs are bandwidth optimized. You can visualize this as a CPU being a Ferrari and a GPU being a big truck. The task of both is to pick up packages from a random location A and to transport those packages to another random location B. The CPU (Ferrari) can fetch some memory (packages) in your RAM quickly while the GPU (big truck) is slower in doing that (much higher latency). However, the CPU (Ferrari) needs to go back and forth many times to do its job (location A -> pick up 2 packages -> location B ... repeat) while the GPU can fetch much more memory at once (location A -> pick up 100 packages -> location B ... repeat).

So in other words the CPU is good at fetching small amounts of memory quickly (5 * 3 * 7) while the GPU is good at fetching large amounts of memory (Matrix multiplication: (A*B)*C). The best CPUs have about 50GB/s while the best GPUs have 750GB/s memory bandwidth. So the larger your computational operations are in terms of memory, the larger the advantage of GPUs over CPUs. But there is still the latency that may hurt performance in the case of the GPU. A big truck may be able to pick up a lot of packages with each tour, but the problem is that you are waiting a long time until the next set of packages arrives. Without solving this problem GPUs would be very slow even for large amounts of data. So how is this solved?
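To put rough numbers on the bandwidth gap, here is a back-of-envelope sketch in Python. It uses only the 50 GB/s and 750 GB/s figures quoted above; the matrix size is illustrative, and it counts a single pass over the operands while ignoring caches.

```python
# Back-of-envelope: time to move the operands of a matrix multiplication
# C = A @ B with N x N float32 matrices, at the bandwidths quoted above.

def transfer_time_s(n, bandwidth_gb_s):
    bytes_moved = 3 * n * n * 4          # read A, read B, write C (float32)
    return bytes_moved / (bandwidth_gb_s * 1e9)

n = 8192
cpu = transfer_time_s(n, 50)    # ~50 GB/s CPU memory bandwidth
gpu = transfer_time_s(n, 750)   # ~750 GB/s GPU memory bandwidth (HBM)

print(f"CPU transfer: {cpu * 1e3:.1f} ms, GPU transfer: {gpu * 1e3:.1f} ms")
print(f"GPU advantage: {cpu / gpu:.0f}x")  # ratio equals the bandwidth ratio, 15x
```

The same 15x ratio holds for any operation whose cost is dominated by moving data rather than by arithmetic.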

If you ask a big truck to make a number of tours to fetch packages, you will always wait a long time for the next load once the truck has departed on its next tour; the truck is just slow. However, if you use a fleet of big trucks (thread parallelism) on a big job with many packages (large chunks of memory, such as matrices), then you will wait a bit for the first truck, but after that you will have no waiting time at all: unloading takes so long that the trucks queue up at location B, so you always have direct access to your packages (memory). This effectively hides latency: under thread parallelism, GPUs deliver their high bandwidth on large chunks of memory with almost no latency penalty. This is the second reason why GPUs are faster than CPUs for deep learning. As a side note, you can also see why more threads do not make sense for CPUs: a fleet of Ferraris has no real benefit in any scenario.
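The fleet-of-trucks argument can be restated with Little's law: the amount of data that must be in flight to saturate a memory system equals latency times bandwidth. A small sketch, where the ~500 ns access latency and 128-byte transaction size are illustrative assumptions rather than figures from the post:

```python
# Little's law sketch: bytes in flight = latency * bandwidth.
# With enough concurrent requests, latency is hidden and the full
# bandwidth is delivered.

def required_bytes_in_flight(latency_s, bandwidth_bytes_s):
    return latency_s * bandwidth_bytes_s

latency = 500e-9      # ~500 ns memory access latency (illustrative)
bandwidth = 750e9     # 750 GB/s, as in the figures above
cache_line = 128      # bytes per memory transaction (illustrative)

in_flight = required_bytes_in_flight(latency, bandwidth)
print(f"Bytes in flight to saturate bandwidth: {in_flight:.0f}")
print(f"Outstanding {cache_line}-byte requests needed: {in_flight / cache_line:.0f}")
# A GPU runs tens of thousands of threads precisely to keep this many
# requests outstanding; a CPU with a handful of threads cannot.
```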
Werewolf
Posts: 1796
Joined: Thu Sep 18, 2008 10:24 pm

Re: Why GPUs are well-suited to deep learning

Post by Werewolf »

duncan wrote:https://www.quora.com/Why-are-GPUs-well ... p-learning


As many have said GPUs are so fast because they are so efficient for matrix multiplication and convolution, but nobody gave a real explanation why this is so. The real reason for this is memory bandwidth and not necessarily parallelism.
This could be a strong argument for the Titan V. It uses different memory to the other cards.
cma6
Posts: 219
Joined: Thu May 29, 2014 5:58 pm

GPUs in deep learning world

Post by cma6 »

Carl,
You are just the man to do a video on how the chess enthusiast (living in the old CPU world) can get started in the new deep learning world:
1) How to buy and use a GPU setup plus ancillary hardware issues;
2) How to set up and run LCZero or look-alikes as they come along.
Werewolf
Posts: 1796
Joined: Thu Sep 18, 2008 10:24 pm

Re: GPUs in deep learning world

Post by Werewolf »

If I get time, I'll try. I'm on holiday next week so something has to occupy me by the beach

8-)
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Why GPUs are well-suited to deep learning

Post by Gian-Carlo Pascutto »

Werewolf wrote: This could be a strong argument for the Titan V. It uses different memory to the other cards.
This is a bit of a misunderstanding, but admittedly the original linked explanation is very unclear.

When he talks about "faster memory", he is referring to the large register files on the GPU chips, not their RAM. Much of the optimization (for example, if you run Leela's tuner) is aimed exactly at avoiding accesses to that RAM as much as possible, because even GDDR5 or HBM2 is orders of magnitude slower than the raw processing power of all the ALU units on a GPU.

The algorithm exploits all the caches, first loading the data into local (shared) memory, and then from there into the register file, trying to maximize reuse.

You can see these different "local" memories in a graph of the logical GPU layout:
http://cdn.wccftech.com/wp-content/uplo ... iagram.png

There is a very good explanation of many of the used optimizations here:
https://cnugteren.github.io/tutorial/pages/page1.html
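The payoff of that staging can be sketched with a simple counting model in Python (this is not GPU code, and the matrix and tile sizes are illustrative): each element staged into a T x T tile in local memory is reused T times, cutting global-memory traffic by a factor of T.

```python
# Count global-memory loads for a naive versus a tiled N x N matrix
# multiplication with tile size T. The tiled version stages T x T blocks
# in fast local memory, so each loaded element is reused T times.

def naive_loads(n):
    # every multiply-add reads one element of A and one of B from global memory
    return 2 * n ** 3

def tiled_loads(n, t):
    # each of the (n/t)^2 output tiles walks n/t steps, loading one
    # t x t tile of A and one of B (2 * t * t elements) per step
    tiles = (n // t) ** 2
    return tiles * (n // t) * 2 * t * t

n, t = 1024, 32
print(f"naive: {naive_loads(n):,} loads")
print(f"tiled: {tiled_loads(n, t):,} loads")
print(f"reduction: {naive_loads(n) // tiled_loads(n, t)}x")  # factor of t
```

The register-file stage in the linked tutorial repeats the same trick one level down, which is why the tuner's best kernels touch RAM so rarely.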
Werewolf
Posts: 1796
Joined: Thu Sep 18, 2008 10:24 pm

Re: Why GPUs are well-suited to deep learning

Post by Werewolf »

Gian-Carlo Pascutto wrote:
Werewolf wrote: This could be a strong argument for the Titan V. It uses different memory to the other cards.
This is a bit of a misunderstanding, but admittedly the original linked explanation is very unclear.

When he talks about "faster memory", he is referring to the large register files on the GPU chips, not their RAM. Much of the optimization (for example, if you run Leela's tuner) is aimed exactly at avoiding accesses to that RAM as much as possible, because even GDDR5 or HBM2 is orders of magnitude slower than the raw processing power of all the ALU units on a GPU.

The algorithm exploits all the caches, first loading the data into local (shared) memory, and then from there into the register file, trying to maximize reuse.

You can see these different "local" memories in a graph of the logical GPU layout:
http://cdn.wccftech.com/wp-content/uplo ... iagram.png

There is a very good explanation of many of the used optimizations here:
https://cnugteren.github.io/tutorial/pages/page1.html
Helpful.

If the on-card memory is not too important, what about the GFLOPS of the card?

Would an 8 GFLOP card be twice as fast for LCZero as a 4 GFLOP card?
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Why GPUs are well-suited to deep learning

Post by Gian-Carlo Pascutto »

Werewolf wrote: If the on-card memory is not too important, what about the GFLOPS of the card?

Would an 8 GFLOP card be twice as fast for LCZero as a 4 GFLOP card?
Roughly, yes.

But not when comparing between different vendors (AMD vs NVIDIA) - the GFLOPS that are advertised on the cards are basically marketing numbers for unrealistic, ideal circumstances.

If you run the tuner, you will see that even the fastest configuration won't quite reach the same GFLOPS as the marketing number.
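The arithmetic behind such a measurement is simple: time a matrix multiplication and divide the 2 * n^3 operation count by the elapsed time. A small NumPy sketch (this times whatever BLAS the CPU uses rather than a GPU kernel, but a GPU tuner computes achieved GFLOPS the same way):

```python
import time
import numpy as np

# Estimate achieved GFLOPS from a timed matrix multiplication.
# An (n x n) @ (n x n) matmul performs 2 * n^3 floating-point operations.

def achieved_gflops(n=1024, repeats=5):
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b                                   # warm-up run
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = (time.perf_counter() - start) / repeats
    return 2 * n ** 3 / elapsed / 1e9

print(f"achieved: {achieved_gflops():.1f} GFLOPS")
# Compare against the chip's advertised peak: the measured number falls
# short, for the same reason a tuned GPU kernel trails the marketing figure.
```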
Werewolf
Posts: 1796
Joined: Thu Sep 18, 2008 10:24 pm

Re: Why GPUs are well-suited to deep learning

Post by Werewolf »

Gian-Carlo Pascutto wrote:
Werewolf wrote: If the on-card memory is not too important, what about the GFLOPS of the card?

Would an 8 GFLOP card be twice as fast for LCZero as a 4 GFLOP card?
Roughly, yes.

But not when comparing between different vendors (AMD vs NVIDIA) - the GFLOPS that are advertised on the cards are basically marketing numbers for unrealistic, ideal circumstances.

If you run the tuner, you will see that even the fastest configuration won't quite reach the same GFLOPS as the marketing number.
Thanks.

This list here

https://docs.google.com/spreadsheets/d/ ... edit#gid=0

is therefore confusing, especially the poor showing of the fastest card, the 1080 Ti.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Why GPUs are well-suited to deep learning

Post by Gian-Carlo Pascutto »

Werewolf wrote: This list here

https://docs.google.com/spreadsheets/d/ ... edit#gid=0

is therefore confusing, especially the poor showing of the fastest card: the 1080 Ti
I'm going to guess LC0 has some CPU bottlenecks still? Some of the performance numbers look weird.