A question about speed difference between gpu and cpu

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: A question about speed difference between gpu and cpu

Post by Sesse »

So are GPUs:
  • AMD GCN has 16-element registers (SIMD16), but solves pipelining by running each instruction over four cycles, so it essentially has a wavefront of 64 elements. Still, only 16 elements are processed per cycle.
  • Intel's GPUs are predominantly SIMD16, although they can also run SIMD32 (rare) or SIMD8 (more common when there's register pressure).
  • NVIDIA is always SIMD32.
None of these can do something like “add this 1024-element array to this other 1024-element array” natively. They need loops just like a CPU, or you need to split the work across multiple cores (better).
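
To make that concrete, here is a rough CUDA-style sketch (the kernel name and launch sizes are mine, purely illustrative): the 1024-element add gets expressed as one element per thread, and the hardware then chews through it warp by warp rather than in one giant vector instruction.

Code: Select all

// Hypothetical example: the add is split one element per thread; the
// hardware then runs those threads 32 lanes at a time, warp by warp.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1024;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    // 1024 elements / 128 threads per block = 8 blocks.
    vec_add<<<8, 128>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}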

It is true that all the cores are typically running the same shader, though there are mobile GPUs where I believe this restriction can be lifted somewhat.
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: A question about speed difference between gpu and cpu

Post by Sesse »

By the way, note that CUDA (and graphics APIs as well) will hide some of this for you. If you launch a kernel with e.g. x=128 y=128 z=1, the driver will compile your shader down to SIMD32, and then launch 512 hardware threads (one 32-wide warp each) with that shader. They may all go onto the same SM or they may not (a GTX 1080 has 20 SMs, a GT 1030 has three); it depends on your local group size, what the driver thinks, and probably the phase of the moon.
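
As a sketch of what that looks like from the CUDA side (my own toy kernel and block split, not anything from the post; the body is a placeholder and only the launch geometry matters here):

Code: Select all

// Throwaway placeholder body; what matters is the launch geometry below.
__global__ void my_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    float *data;
    cudaMalloc(&data, 128 * 128 * sizeof(float));

    // 128 * 128 * 1 = 16384 logical threads in total.  Compiled to SIMD32,
    // that is 16384 / 32 = 512 warps, which the scheduler spreads over the
    // SMs however it sees fit (20 of them on a GTX 1080, 3 on a GT 1030).
    dim3 grid(128, 1, 1);    // 128 blocks ("local groups")
    dim3 block(128, 1, 1);   // 128 threads per block = 4 warps per block
    my_kernel<<<grid, block>>>(data);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}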

So from the outside, it may look like a super-wide vector processor, but it's not actually like that on the inside; it really is a SIMD machine with lots of slow cores (and a very different memory subsystem and threading implementation). For trivial examples, this doesn't matter, but if you want serious performance on anything nontrivial, you definitely need to get down into the actual implementation details.
oreopoulos
Posts: 110
Joined: Fri Apr 25, 2008 10:56 pm

Re: A question about speed difference between gpu and cpu

Post by oreopoulos »

Yes... I am aware of the stream processors, and that a GPU is like a multi-core machine of SMs.

For example, the Jetson has one such core. But again, the size of an SM is not comparable to that of a CPU core.
odomobo
Posts: 96
Joined: Fri Jul 06, 2018 1:09 am
Location: Chicago, IL
Full name: Josh Odom

Re: A question about speed difference between gpu and cpu

Post by odomobo »

Sesse wrote: Fri Aug 10, 2018 12:25 pm
I don't see what's misleading about it.
In my rough calculation, the per-core FLOPS performance of a 1080 Ti is about 25% of the per-core FLOPS performance of an i7-8700K, which would indicate that a GPU core is the equivalent of roughly a 900 MHz CPU core. Shouldn't this be pretty representative of a typical GPU workload? Maybe in your 80 MHz figure, you were referring to AMD GPUs (which I know nothing about)?

You seem to have more experience with this than me, so let me know if I've made some mistake.
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: A question about speed difference between gpu and cpu

Post by Sesse »

I don't know where your FLOPS number comes from, but again, you have to remember the context here. The 80 MHz figure is for serial code—any theoretical FLOPS number comes from a situation with maximum parallelism (which is of course never realizable in practice).

To deconstruct the numbers here a bit: A GTX 1080 has 2560 “CUDA cores”, which is a fancy name for 2560 ALUs. They are organized into 20 SMs, where each SM has four processing units that work in SIMD32. One such processing unit can issue a floating-point SIMD32 muladd per cycle and runs at 1607 MHz—which is counted here as 64 operations (32 mul, 32 add) for marketing purposes. However, the result doesn't come back on the next cycle, but something like 10–15 cycles later, so to get to this kind of speed, you'd need to have 80*64*15 = 76800 operations going at the same time! And no waiting for memory or the likes.

So for theoretically optimal parallel code: 20 SMs * 4 processing units/SM * 64 flops/cycle * 1607 MHz = 8227 GFLOPS. This is NVIDIA's marketing number.

However, if you have strictly serial code (i.e., a problem that's really poorly matched to the GPU), you only get to use one of those 80 processing units, and you don't get any use of the 32-wide SIMD either. And worse, once you issue an operation, you need to wait those 10–15 cycles before it comes back. 1607 MHz * (1/15 ops per cycle) ≈ 107 MHz. A typical GPU is a bit slower than the 1080, so that's where the ~80 MHz number comes from.
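
Spelling out the arithmetic behind both numbers (these are just the figures from above plugged into a few lines of host-side code; the 15-cycle latency is the rough estimate, not a measured value):

Code: Select all

#include <cstdio>

int main()
{
    const double sms             = 20;    // SMs on a GTX 1080
    const double units_per_sm    = 4;     // SIMD32 processing units per SM
    const double flops_per_cycle = 64;    // 32 mul + 32 add, marketing-style
    const double clock_mhz       = 1607;
    const double latency_cycles  = 15;    // rough muladd latency

    // Marketing number: every unit busy with a full SIMD32 muladd, every cycle.
    double peak_gflops = sms * units_per_sm * flops_per_cycle * clock_mhz / 1000.0;

    // Strictly serial code: one unit, one lane, one operation in flight at a time.
    double serial_mhz = clock_mhz / latency_cycles;

    printf("peak: %.2f GFLOPS, serial: ~%.0f MHz\n", peak_gflops, serial_mhz);
    return 0;
}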

Occasionally, parts of your algorithm may be very serial and parts may be very parallel. In those cases, it can actually be worth it to run the serial part on the GPU as well, if it finishes quickly enough (i.e., doesn't take too long at 80 MHz speed) and can hand off the results to the parallel part.