GPU rumors 2021

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

smatovic
Posts: 2937
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2021

Post by smatovic »

I wrote that it is IMO unknown whether NN-based SuperSampling methods will prevail on the market; looking at this demo from Intel, I guess upsampling, or something like it, is here to stay.



--
Srdja
smatovic
Posts: 2937
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2021

Post by smatovic »

AMD's CDNA2, shipped to the Frontier supercomputer, unveiled:

https://www.anandtech.com/show/17054/am ... le-servers

a further architecture split between the RDNA gamer arch and the CDNA server arch, the vector FPUs are now native 64-bit double precision, unified/coherent memory via Infinity Fabric, not a GPU anymore but an accelerator; curious what Nvidia and Intel will come up with to catch up in FP64 throughput for the HPC class....

--
Srdja
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: GPU rumors 2021

Post by Milos »

smatovic wrote: Wed Nov 10, 2021 8:52 am AMD's CDNA2, shipped to the Frontier supercomputer, unveiled:

https://www.anandtech.com/show/17054/am ... le-servers

a further architecture split between the RDNA gamer arch and the CDNA server arch, the vector FPUs are now native 64-bit double precision, unified/coherent memory via Infinity Fabric, not a GPU anymore but an accelerator; curious what Nvidia and Intel will come up with to catch up in FP64 throughput for the HPC class....

--
Srdja
While this might sound impressive as FP64, it is totally pointless. FP64 is just a gigantic waste of resources. There is efficient training in BFLOAT16 for most NN models that doesn't lose any accuracy, and for inference the tendency is INT8 or lower precision. So having only 383 TOPS for INT8 is quite mediocre performance for an HPC chip (with the same silicon efficiency one should expect roughly 64x larger throughput for INT8 compared to FP64, so with the amount of silicon used one could have made a chip with around 3000 TOPS INT8).
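A quick back-of-the-envelope of that argument, as a sketch only: the 47.9 TFLOPS FP64 vector figure is the MI250X spec-sheet number as I recall it, and the quadratic area-vs-operand-width scaling is just the usual rule of thumb, not a measured result.

Code: Select all

# Back-of-the-envelope for the FP64 vs INT8 silicon argument (Python).
# Assumption: multiplier cost grows roughly with the square of the operand
# width, so trading FP64 units for INT8 units buys about (64/8)^2 = 64x ops.

fp64_vector_tflops = 47.9   # MI250X peak FP64 vector throughput (spec-sheet figure)
int8_tops_quoted   = 383.0  # INT8 throughput quoted in the article

width_ratio = 64 / 8
hypothetical_int8_tops = fp64_vector_tflops * width_ratio ** 2  # ~3065 TOPS

print(f"actual INT8 / FP64-vector ratio: {int8_tops_quoted / fp64_vector_tflops:.1f}x")
print(f"hypothetical all-INT8 chip under the quadratic assumption: "
      f"~{hypothetical_int8_tops:.0f} TOPS")
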
smatovic
Posts: 2937
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2021

Post by smatovic »

Milos wrote: Wed Nov 10, 2021 10:08 pm
smatovic wrote: Wed Nov 10, 2021 8:52 am AMD's CDNA2, shipped to the Frontier supercomputer, unveiled:

https://www.anandtech.com/show/17054/am ... le-servers

a further architecture split between the RDNA gamer arch and the CDNA server arch, the vector FPUs are now native 64-bit double precision, unified/coherent memory via Infinity Fabric, not a GPU anymore but an accelerator; curious what Nvidia and Intel will come up with to catch up in FP64 throughput for the HPC class....

--
Srdja
While this might sound impressive as FP64, it is totally pointless. FP64 is just a gigantic waste of resources. There is efficient training in BFLOAT16 for most NN models that doesn't lose any accuracy, and for inference the tendency is INT8 or lower precision. So having only 383 TOPS for INT8 is quite mediocre performance for an HPC chip (with the same silicon efficiency one should expect roughly 64x larger throughput for INT8 compared to FP64, so with the amount of silicon used one could have made a chip with around 3000 TOPS INT8).
You do not expect me to explain the difference between 64-bit scientific computing and INT8 NN inference. As the article mentions, AMD invested the silicon in FP64 instead of widening its Matrix Cores. FP64 throughput might be pointless for chess, classic and NN, but my point is the architecture split between the gamer and server brand GPUs: server goes 64-bit, gamer stays 32-bit, or something along those lines.

--
Srdja
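For what it's worth, a toy Python illustration (my own example, not from the article) of the kind of long-running accumulation that 64-bit scientific computing cares about and that low-precision NN formats cannot handle:

Code: Select all

import numpy as np

# Repeatedly add a small update to a large running value. In float32 the
# update falls below the rounding granularity of the accumulator and is lost
# completely; in float64 (plain Python float) it is retained.

acc32 = np.float32(1.0e8)
acc64 = 1.0e8                 # Python float = IEEE double precision

for _ in range(1_000_000):
    acc32 += np.float32(1.0)  # ulp of float32 near 1e8 is 8, so +1.0 rounds away
    acc64 += 1.0              # ulp of float64 near 1e8 is ~1.5e-8, so +1.0 is kept

print(f"float32 accumulator: {float(acc32):,.0f}")  # stays at 100,000,000
print(f"float64 accumulator: {acc64:,.0f}")         # 101,000,000 as expected
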
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: GPU rumors 2021

Post by Milos »

smatovic wrote: Sun Nov 14, 2021 9:13 am
Milos wrote: Wed Nov 10, 2021 10:08 pm
smatovic wrote: Wed Nov 10, 2021 8:52 am AMD's CDNA2, shipped to the Frontier supercomputer, unveiled:

https://www.anandtech.com/show/17054/am ... le-servers

a further architecture split between the RDNA gamer arch and the CDNA server arch, the vector FPUs are now native 64-bit double precision, unified/coherent memory via Infinity Fabric, not a GPU anymore but an accelerator; curious what Nvidia and Intel will come up with to catch up in FP64 throughput for the HPC class....

--
Srdja
While this might sound impressive as FP64, it is totally pointless. FP64 is just a gigantic waste of resources. There is efficient training in BFLOAT16 for most NN models that doesn't lose any accuracy, and for inference the tendency is INT8 or lower precision. So having only 383 TOPS for INT8 is quite mediocre performance for an HPC chip (with the same silicon efficiency one should expect roughly 64x larger throughput for INT8 compared to FP64, so with the amount of silicon used one could have made a chip with around 3000 TOPS INT8).
You do not expect me to explain the difference between 64-bit scientific computing and INT8 NN inference. As the article mentions, AMD invested the silicon in FP64 instead of widening its Matrix Cores. FP64 throughput might be pointless for chess, classic and NN, but my point is the architecture split between the gamer and server brand GPUs: server goes 64-bit, gamer stays 32-bit, or something along those lines.

--
Srdja
You could, for example, find one NN, could be anything, CNN, RNN, Transformer, whatever you like, where FP32 training yields a net that performs worse than the FP64-trained net. Just one would be enough. ;)
One ofc needs to know things a bit more than on a hobby level for this ;).
64-bit "scientific computing" is nothing but arbitrary BS. Just pointless. It's a result when hardware decisions are made by stupid marketing ppl instead of scientist and engineers. Another reason why AMD will always be a joke in the filed of ML.
jhellis3
Posts: 546
Joined: Sat Aug 17, 2013 12:36 am

Re: GPU rumors 2021

Post by jhellis3 »

High-precision math is used in many fields, which is why it exists, I suppose.
smatovic
Posts: 2937
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2021

Post by smatovic »

Milos wrote: Mon Nov 15, 2021 1:02 am
smatovic wrote: Sun Nov 14, 2021 9:13 am
Milos wrote: Wed Nov 10, 2021 10:08 pm
smatovic wrote: Wed Nov 10, 2021 8:52 am AMD's CDNA2, shipped to the Frontier supercomputer, unveiled:

https://www.anandtech.com/show/17054/am ... le-servers

a further architecture split between the RDNA gamer arch and the CDNA server arch, the vector FPUs are now native 64-bit double precision, unified/coherent memory via Infinity Fabric, not a GPU anymore but an accelerator; curious what Nvidia and Intel will come up with to catch up in FP64 throughput for the HPC class....

--
Srdja
While this might sound impressive as FP64, it is totally pointless. FP64 is just a gigantic waste of resources. There is efficient training in BFLOAT16 for most NN models that doesn't lose any accuracy, and for inference the tendency is INT8 or lower precision. So having only 383 TOPS for INT8 is quite mediocre performance for an HPC chip (with the same silicon efficiency one should expect roughly 64x larger throughput for INT8 compared to FP64, so with the amount of silicon used one could have made a chip with around 3000 TOPS INT8).
You do not expect me to explain the difference between 64-bit scientific computing and INT8 NN inference. As the article mentions, AMD invested the silicon in FP64 instead of widening its Matrix Cores. FP64 throughput might be pointless for chess, classic and NN, but my point is the architecture split between the gamer and server brand GPUs: server goes 64-bit, gamer stays 32-bit, or something along those lines.

--
Srdja
You could, for example, find one NN, could be anything, CNN, RNN, Transformer, whatever you like, where FP32 training yields a net that performs worse than the FP64-trained net. Just one would be enough. ;)
One ofc needs to know things a bit more than on a hobby level for this ;).
...
INT8 is sufficient for inference and BF16 is sufficient for training? Milos, you are telling old news. Go look up when Google came up with its TPU gen1 and TPU gen2... well, those guys know their workloads ;)

--
Srdja
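To make the "INT8 is sufficient for inference" point concrete, here is a minimal Python sketch of symmetric per-tensor post-training quantization for one layer; a toy example in the general spirit of TPU-style INT8 inference, not anyone's actual implementation:

Code: Select all

import numpy as np

# Toy post-training quantization: map float32 weights and activations to int8
# with one per-tensor scale each, do the matmul with int32 accumulation, and
# compare the rescaled result against the plain float32 path.

rng = np.random.default_rng(42)
W = rng.normal(0, 0.2, size=(64, 64)).astype(np.float32)  # pretend layer weights
x = rng.normal(0, 1.0, size=(1, 64)).astype(np.float32)   # one input activation row

w_scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / w_scale), -127, 127).astype(np.int8)

x_scale = np.abs(x).max() / 127.0
x_q = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)

# Integer matmul with int32 accumulation, then rescale back to float.
acc = x_q.astype(np.int32) @ W_q.astype(np.int32)
y_int8 = acc.astype(np.float32) * (w_scale * x_scale)

y_fp32 = x @ W
rel_err = float(np.abs(y_int8 - y_fp32).max() / np.abs(y_fp32).max())
print(f"max relative error of the int8 path vs float32: {rel_err:.3%}")
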
smatovic
Posts: 2937
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2021

Post by smatovic »

Milos wrote: Mon Aug 30, 2021 12:07 pm
smatovic wrote: Sat Aug 28, 2021 5:32 pm
Milos wrote: Thu Apr 29, 2021 12:39 pm
smatovic wrote: Thu Apr 29, 2021 10:01 am One big player is missing in the above, IBM; will they come up with their own GPU arch or revive their PowerXCell? I doubt it. With POWER10 they went another path: an up-to-16-core SMT8 CPU design, basically 128 cores, each with its own ALU, FPU, branch prediction, load/store and 128b SIMD unit. Instead of offloading tasks to an external GPU they put the horsepower back into the CPU, with up to 1 TB/s IO per socket to an external memory controller and up to 16 sockets in total; also notable are the 4 MMA (matrix math assist) units per core for NN inference stuff.

https://en.wikipedia.org/wiki/POWER10

--
Srdja
Hehe, so wrong.
Can't disclose details but you can check this one:
https://www.ibm.com/blogs/research/2021 ... n-scaling/
It will be part of upcoming IBM products.
Power is yesterday's news; Z systems are the ones to look at (like the upcoming z16). ;)
Must be the IBM Telum processor:

https://www.anandtech.com/show/16901/ho ... ire-rapids

https://www.ibm.com/blogs/systems/ibm-t ... -linuxone/

6 TFLOPs per chip (which precision? vector or matrix?), with 2 chips per socket and 4-way sockets, up to 32-chip configurations.

--
Srdja
Yes, it's IBM Telum; this is now official PR.
Btw. my work was in one of those slides in the Hot Chips presentation ;).
All matmul operations are FP16 (IBM's format) with FP32 accumulation.
Milos, looking back now, with GPGPU, ML, NN and cloud computing, what's your personal opinion: was there a chance for IBM to continue the PowerXCell in 2009 and compete with Nvidia?

--
Srdja
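On the "FP16 matmul with FP32 accumulation" detail quoted above, a small numpy sketch of why the accumulator precision matters; plain IEEE half precision here, not IBM's own 16-bit format, which numpy does not provide:

Code: Select all

import numpy as np

# Dot product of two fairly long FP16 vectors, accumulated either in FP16 or
# in FP32. The inputs are identical; only the accumulator precision differs.

rng = np.random.default_rng(7)
a = rng.uniform(0.5, 1.0, size=4096).astype(np.float16)
b = rng.uniform(0.5, 1.0, size=4096).astype(np.float16)

acc16 = np.float16(0.0)
for ai, bi in zip(a, b):
    # FP16 accumulator: stalls once the running sum's spacing exceeds the
    # size of the individual products (all products here are below 1.0).
    acc16 = np.float16(acc16 + np.float16(ai * bi))

acc32 = np.float32(0.0)
for ai, bi in zip(a, b):
    acc32 += np.float32(ai) * np.float32(bi)  # FP32 accumulator keeps every term

ref = np.dot(a.astype(np.float64), b.astype(np.float64))  # exact-ish reference
print(f"FP16-accumulated: {float(acc16):9.2f}   abs error {abs(float(acc16) - ref):7.2f}")
print(f"FP32-accumulated: {float(acc32):9.2f}   abs error {abs(float(acc32) - ref):7.4f}")
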
Jouni
Posts: 3434
Joined: Wed Mar 08, 2006 8:15 pm
Full name: Jouni Uski

Re: GPU rumors 2021

Post by Jouni »

Fun fact: the LUMI supercomputer in Kajaani, Finland(!) will consist of 2560 nodes, each node with one 64-core AMD Trento CPU and four AMD MI250X GPUs, so 10240 GPUs in total. A single MI250X card is capable of delivering 42.2 TFLOP/s in the HPL benchmark.
Jouni
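Quick arithmetic on those figures (per-card HPL number as quoted in the post; the accepted system-level HPL result will be lower because of scaling losses):

Code: Select all

# Back-of-the-envelope aggregate from the figures in the post above.
nodes = 2560
gpus_per_node = 4
hpl_tflops_per_gpu = 42.2   # per-MI250X HPL figure quoted above

total_gpus = nodes * gpus_per_node
total_pflops = total_gpus * hpl_tflops_per_gpu / 1000.0

print(f"{total_gpus} GPUs, ~{total_pflops:.0f} PFLOP/s aggregate HPL "
      f"(ignoring scaling losses)")
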
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: GPU rumors 2021

Post by Milos »

Jouni wrote: Mon Mar 14, 2022 7:53 pm Fun fact: the LUMI supercomputer in Kajaani, Finland(!) will consist of 2560 nodes, each node with one 64-core AMD Trento CPU and four AMD MI250X GPUs, so 10240 GPUs in total. A single MI250X card is capable of delivering 42.2 TFLOP/s in the HPL benchmark.
43 TFLOPs in FP64 is impressive; however, it's AMD, it doesn't support CUDA, so one has to use OpenCL (or ROCm). Since we are talking about ML applications, you can cut these 43 TFLOPs to maybe 15-20 "CUDA-equivalent" TFLOPs, which is much less impressive. It's probably a heavily sponsored effort by AMD trying to gain some foothold in the ML field. But IMO totally futile and pointless.