GPU rumors 2021

smatovic · Post by **smatovic** » Fri Aug 30, 2024 4:52 am

Werewolf wrote: ↑Thu Aug 29, 2024 8:35 pm [...]
I'm hearing 5090 in early 2025 and Threadripper 9000 (Turin / Ryzen 5) sooner but with about 10% performance gain over last gen, which isn't that great.

Yes, the Nvidia Blackwell release was reportedly delayed, and there are benchmarks for Zen 5 with SF 16.1:

https://www.talkchess.com/forum/viewtopic.php?t=84131

According to Wikipedia, Blackwell and Zen 5 are fabricated on TSMC 4NP, an advanced 5nm process. Maybe Nvidia Rubin and Zen 5+/6 will be bigger steps for us end users with regard to core count and transistor count (3nm fab process, "tick-tock" cycle).

--
Srdja

Werewolf · Post by **Werewolf** » Fri Aug 30, 2024 12:16 pm

Thanks.

It seems like Intel is a long way behind at the high end, though I note there has been a (small) update to Sapphire Rapids.

Do you have any info on Emerald Rapids and Diamond Rapids?

smatovic · Post by **smatovic** » Fri Aug 30, 2024 3:04 pm

Werewolf wrote: ↑Fri Aug 30, 2024 12:16 pm [...]
Do you have any info on Emerald Rapids and Diamond Rapids?

There is some info on Wikipedia:

Emerald Rapids on "Intel 7" (formerly "10nm") process with up to 64 cores in current Xeons since 2023
https://en.wikipedia.org/wiki/Emerald_Rapids

Granite Rapids on "Intel 3" process with up to 128 (performance) cores in upcoming Xeon 6 (2024H2?)
https://en.wikipedia.org/wiki/Granite_Rapids

Sierra Forest on "Intel 3" process with up to 288 E-cores
https://en.wikipedia.org/wiki/Sierra_Forest

Diamond Rapids has no article yet; launch planned for 2025/2026?
https://wccftech.com/intel-next-gen-xeo ... n-e-cores/

Notably, Intel uses tiles (slices) rather than AMD-style chiplets. AFAIK up to four slices are coupled into a single chip, whereas AMD uses up to eight chiplets.

The Xeon performance cores support AVX-512 and also AMX (matrix multiplication), but the E-cores do not.

In contrast, AMD's "c" cores, which have less cache and a lower frequency, also support AVX-512.

According to Wikipedia, it is unknown how many transistors/mm² the "Intel 3" process packs:

https://en.wikipedia.org/wiki/3_nm_proc ... cess_nodes

so we cannot yet compare it with the others.

Intel has a tough roadmap for its fab processes; the question is whether they can catch up with the other silicon players, IMO.

--
Srdja

smatovic · Post by **smatovic** » Mon Sep 02, 2024 9:56 am

smatovic wrote: ↑Fri Aug 30, 2024 3:04 pm
Werewolf wrote: ↑Fri Aug 30, 2024 12:16 pm [...]
Do you have any info on Emerald Rapids and Diamond Rapids?
[...]
Sierra Forest on "Intel 3" process with up to 288 E-cores
https://en.wikipedia.org/wiki/Sierra_Forest
[...]
The Xeon performance cores have AVX-512 and also AMX (mat-mul) support but not its E-cores.
[...]

Ipman's list with SF 14.1* has a new single-node record holder: Sierra Forest, with up to 510M NPS on two sockets:

510.152.600	2x Intel Xeon 6 Sierra Forest 1TB DDR5	384threads	avx512	TataneSan	L
475.124.036	2x Intel Xeon 6 Sierra Forest 1TB DDR5	384threads	pop	TataneSan	L

https://ipmanchess.yolasite.com/amd--in ... ckfish.php

Note the difference between the pop compile and the avx512 compile; an AVX2 compile would be interesting to see.
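To put a number on the gap between the two compiles, here is a minimal sketch that just divides the two NPS figures from the list above. The function name is made up for illustration; the only inputs are the two quoted values.

```python
# NPS figures copied from the two rows above (dots are thousands separators).
nps_avx512 = 510_152_600
nps_pop = 475_124_036

def speedup_percent(fast: int, slow: int) -> float:
    """Relative gain of `fast` over `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0

gain = speedup_percent(nps_avx512, nps_pop)
print(f"avx512 vs. pop: {gain:.1f}% more NPS")  # prints about 7.4%
```

So on this machine, with SF 14.1, the avx512 build is roughly 7% faster than the pop build.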

*As noted in another thread, Stockfish 16.1 speedup via AVX-512 might be bigger than with SF 14.1

...will probably be overtaken again by dual AMD EPYC Zen 5 and Zen 5c.

--
Srdja

smatovic · Post by **smatovic** » Tue Sep 10, 2024 9:31 pm

Now Loongson is also in with GPUs, integrated and discrete:

Next-gen Chinese GPU touts RTX 2080-level performance — Loongson claims 9A2000 is up to 10x faster than the 9A1000
https://www.tomshardware.com/pc-compone ... the-9a1000

Chinese chipmaker teases “world-leading” performance of next-gen 7nm CPU — 3B6600 rocks eight LA864 cores clocked at 3 GHz
https://www.tomshardware.com/pc-compone ... d-at-3-ghz

--
Srdja

smatovic wrote: ↑Sun May 19, 2024 3:24 pm
smatovic wrote: ↑Sun Dec 03, 2023 10:33 am ....on Chinese GPUs:

Moore Threads
https://en.wikipedia.org/wiki/Moore_Threads

Biren Technology
https://en.wikipedia.org/wiki/Biren_Technology

Zhaoxin
https://en.wikipedia.org/wiki/Zhaoxin#Discrete_GPU

--
Srdja
Lingjiu GP201 is comparable by specs to Nvidia's lowest-end model, the GT 1030:

The GP201 boasts a base clock of 1.2 GHz, single precision floating point performance of 1.2 TFLOPS, 2GB DDR4 VRAM, and power consumption of up to 30W.
Chinese-made GPU beats performance of 10-year-old integrated AMD graphics — Lingjiu GP201 hits mass production
https://www.tomshardware.com/pc-compone ... production

I can imagine that this is just the beginning.

--
Srdja

smatovic · Post by **smatovic** » Tue Sep 10, 2024 9:35 pm

One for the critics:

We're in the brute force phase of AI – once it ends, demand for GPUs will too
https://www.theregister.com/2024/09/10/ ... /?td=rt-3a

Generative AI is, in short, being asked to solve problems it was not designed to solve.

--
Srdja

towforce · Post by **towforce** » Thu Sep 12, 2024 12:18 am

smatovic wrote: ↑Tue Sep 10, 2024 9:35 pm One for the critics:

We're in the brute force phase of AI – once it ends, demand for GPUs will too
https://www.theregister.com/2024/09/10/ ... /?td=rt-3a

Generative AI is, in short, being asked to solve problems it was not designed to solve.

Coincidentally, I've also read that, and also made a couple of comments as "NewThought" (one a simple reply answering a question, the other my actual thoughts on the article). Any thoughts about my thoughts?

As a general thought: graphics cards are not ideal for AI - TPUs and the like are better suited - they just happen to be relatively cheap and readily available high powered computing devices that can be put to uses other than drawing pictures.

smatovic · Post by **smatovic** » Thu Sep 12, 2024 7:22 am

towforce wrote: ↑Thu Sep 12, 2024 12:18 am [...]
As a general thought: graphics cards are not ideal for AI - TPUs and the like are better suited - they just happen to be relatively cheap and readily available high powered computing devices that can be put to uses other than drawing pictures.

Preamble: I am not qualified to write on low level implementation of neural networks for generative AIs, Transformers, Stable Diffusion, etc.

What we know is that image recognition with CNNs got a boost in the 2010s from GPUs, thanks to data-level parallelism, so-called "embarrassingly parallel" workloads:

https://en.wikipedia.org/wiki/Embarrassingly_parallel
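As a toy illustration of what "embarrassingly parallel" means: every output element depends only on its own input, so the data can be cut into chunks and processed without any coordination. This is only a sketch with made-up function names, and Python threads here show the structure of the split, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def brighten(pixel: int) -> int:
    """Per-pixel work: depends only on its own input, no shared state."""
    return min(pixel + 40, 255)

def brighten_chunk(chunk):
    """Process one independent slice of the data."""
    return [brighten(p) for p in chunk]

def parallel_brighten(image, workers=4):
    # Split the data into independent chunks, roughly one per worker.
    size = max(1, (len(image) + workers - 1) // workers)
    chunks = [image[i:i + size] for i in range(0, len(image), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        parts = list(pool.map(brighten_chunk, chunks))
    return [p for part in parts for p in part]

image = list(range(0, 256, 8))
# The parallel result matches the plain sequential loop.
assert parallel_brighten(image) == [brighten(p) for p in image]
```

A GPU applies the same idea with thousands of hardware lanes instead of a handful of worker threads.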

What we know is that we need scalar, vector and matrix operations on our computing devices for AI, hence by now all AI silicon offers these to various degrees and in different flavors.

What we know is that data-center GPUs (better: "AI accelerators") and AI models co-evolve; hardware and software develop together.

But what the article mentions is that we will see a shift from special purpose hardware to general purpose hardware in this regard.

Take Lc0 with CNNs and SF with NNUE in our computer chess domain for example.

~2018 Lc0 took off, and people thought they needed a high-end GPU for several thousand dollars to play top-notch computer chess. Then in 2020 NNUE took off, and a common CPU was sufficient.

With NNUE we have a division of labor: neural networks are trained on big data via GPUs, while network inference runs more efficiently on a CPU.

We already see the "AI PC" with a dedicated NPU chip, so the big players might have come to the conclusion that such a division makes sense for generative AI too.

I recently saw an article claiming that we might not need matrix multiplications for neural networks at all, we just need to rethink our AI models. Thus, there is for sure some hardware-software co-evolution in progress.

--
Srdja

Ras · Post by **Ras** » Thu Sep 12, 2024 8:29 am

towforce wrote: ↑Thu Sep 12, 2024 12:18 am As a general thought: graphics cards are not ideal for AI

Which is why GPUs are not actually used for that.
"By tradition, Nvidia still calls the H100 a graphics processing unit, but the term is clearly on its last legs: just two out of the 50+ texture processing clusters (TPCs) in the device are actually capable of running vertex, geometry, and pixel shader maths required to render 3D graphics."
https://www.datacenterknowledge.com/dat ... it-matters

If you compare the performance, you'll notice that the H100 rocks in FP16, relevant for AI, while the 4090 beats the H100 in FP32, used for gaming. The FP64 is interesting as well, used e.g. for scientific high precision simulations, where the H100 leaves the 4090 in the dust. Also, the H100 doesn't even support DirectX11/12. Performance in TFLOPS.
4090: FP16: 82.58, FP32: 82.58, FP64: 1.29.
H100: FP16: 248.3, FP32: 62.08, FP64: 31.04.
https://www.techpowerup.com/gpu-specs/g ... 4090.c3889
https://www.techpowerup.com/gpu-specs/h ... 6-gb.c4164
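For a quick side-by-side, the ratios can be computed directly from the TFLOPS figures quoted above. This is just arithmetic on those numbers; the dictionary layout and function name are my own.

```python
# Peak TFLOPS as quoted above (TechPowerUp figures).
tflops = {
    "4090": {"FP16": 82.58, "FP32": 82.58, "FP64": 1.29},
    "H100": {"FP16": 248.3, "FP32": 62.08, "FP64": 31.04},
}

def h100_over_4090(precision: str) -> float:
    """How many times faster (or slower, if < 1) the H100 is on paper."""
    return tflops["H100"][precision] / tflops["4090"][precision]

for p in ("FP16", "FP32", "FP64"):
    print(f"{p}: H100 / 4090 = {h100_over_4090(p):.2f}x")
```

That works out to roughly 3x in FP16, about 0.75x in FP32 and about 24x in FP64.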

Ras · Post by **Ras** » Thu Sep 12, 2024 8:41 am

smatovic wrote: ↑Thu Sep 12, 2024 7:22 am ~2018 Lc0 took off, and people thought they needed a high-end GPU for several thousand dollars to play top-notch computer chess. Then in 2020 NNUE took off, and a common CPU was sufficient.

This is due to very domain-specific optimisations in chess:

Two consecutive network evaluations have only minimal difference in the inputs, namely the move that has been made in the search tree, which enables differential updates instead of running the full net.
Most of the knowledge is in the very large first layer, which is the one where the optimisation from the previous point can be made.
Integer arithmetic, enabling SIMD instructions, can be used instead of floating point so that the rounding errors from making/unmaking moves with differential updates don't accumulate.

In other words: this optimisation cannot be generalised to other domains or applications, which is why new laptop CPUs come with a dedicated NPU for network inference.
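The three points above can be sketched in a few lines. This is a toy model, not Stockfish's actual code: the sizes, the random weights and the function names are all made up for illustration. It only shows that a first-layer accumulator with integer weights can be updated per move and restored exactly on unmake.

```python
import random

NUM_FEATURES = 64   # hypothetical number of input features
HIDDEN = 8          # hypothetical first-layer width; real nets are far larger

random.seed(0)
# Integer first-layer weights: WEIGHTS[f] is the column added when feature f is active.
WEIGHTS = [[random.randint(-127, 127) for _ in range(HIDDEN)]
           for _ in range(NUM_FEATURES)]

def full_refresh(active):
    """Recompute the first-layer accumulator from scratch."""
    acc = [0] * HIDDEN
    for f in active:
        for i, w in enumerate(WEIGHTS[f]):
            acc[i] += w
    return acc

def apply_move(acc, removed, added):
    """Differential update: touch only the features the move changed."""
    new = acc[:]
    for f in removed:
        for i, w in enumerate(WEIGHTS[f]):
            new[i] -= w
    for f in added:
        for i, w in enumerate(WEIGHTS[f]):
            new[i] += w
    return new

position = {3, 17, 42}
acc = full_refresh(position)

# "Make" a move: feature 17 disappears, feature 25 appears.
after = apply_move(acc, removed=[17], added=[25])
assert after == full_refresh({3, 25, 42})

# "Unmake" it: with integers we are back at exactly the original accumulator.
restored = apply_move(after, removed=[25], added=[17])
assert restored == acc

# Floats give no such guarantee in general:
print((0.1 + 0.2) - 0.2 == 0.1)  # prints False
```

The update touches two weight columns instead of all active ones, which is where the saving comes from when the first layer is very large.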
