GPU rumors 2021

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, chrisw, Rebel

smatovic
Posts: 2986
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2021

Post by smatovic »

Microsoft is pushing Windows on ARM for several reasons. The exclusive MS+Qualcomm deal ends in 2025, so we can expect other players to join that game.

x86-optimized code is an issue; there are translation layers like Apple's Rosetta 2 and MS Prism. Prism seems to work with SSE code, but not (yet) with AVX2.

Snapdragon X Elite & Stockfish 16.1 benchmark?
viewtopic.php?p=965541#p965541

I myself am keen to see RISC-V chips from an African silicon fab in my future laptops/workstations/servers ;)

--
Srdja
smatovic
Posts: 2986
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2021 - layers of latency

Post by smatovic »

On layers of latency, as posted on
https://luddite.app26.de/post/layers-of-latency/

Layers of Latency, Friday, 08 November 2024

A tech buddy asked me why it is so important for China to catch up in the chip fabrication process: can't they just put more servers into a data center? In short, it is not that easy.

By shrinking the fab process you can add more transistors onto one chip, and/or run at a higher frequency, and/or lower power consumption.

The fab process is measured in "nm", nanometers. Nowadays these numbers no longer reflect real physical dimensions, but rather the transistor density and efficiency of the fab process.

Simplified: planar MOSFET technology, a 2D transistor design, was used down to 22nm; from 14 to 7nm, FinFET 3D structures; and below 7nm, GAAFET 3D structures.

Take a look at the 7nm and 3nm fab process for example:
https://en.wikipedia.org/wiki/7_nm_proc ... _offerings
https://en.wikipedia.org/wiki/3_nm_proc ... cess_nodes

Roughly speaking, the 7nm process packs ~100M transistors per mm2, the 3nm process ~200M transistors per mm2.
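To get a feel for what those densities mean in absolute numbers, here is a quick back-of-the-envelope sketch; the 600 mm2 die size is my own assumption for a large, reticle-limit-class GPU die, not a figure from any specific product:

```python
# Back-of-the-envelope transistor counts from the rough densities above:
# ~100M transistors/mm2 at 7nm, ~200M transistors/mm2 at 3nm.
densities = {"7nm": 100e6, "3nm": 200e6}  # transistors per mm2
die_area_mm2 = 600  # assumed large-GPU die size (hypothetical)

for node, density in densities.items():
    total = density * die_area_mm2
    print(f"{node}: ~{total / 1e9:.0f} billion transistors on {die_area_mm2} mm2")
```

So a full node shrink roughly doubles the transistor budget at the same die size, which is exactly the headroom that gets spent on more cores, more cache, or lower clocks at the same performance.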

And here latency steps in. As soon as you, as a programmer, leave the CPU, latency increases: it starts with the different levels of cache, goes to RAM, to the PCIe bus, to the network...

Code: Select all

Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
L2 cache reference                           7   ns
Main memory reference                      100   ns
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms
Source:
Latency Numbers Every Programmer Should Know
https://gist.github.com/jboner/2841832
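The jumps between those layers are easier to see when everything is expressed in units of one L1 cache reference. A small sketch, using a subset of the values copied from the table above:

```python
# Express the ~2012 latency numbers in terms of L1 cache references,
# to make the jump between each layer visible.
L1_NS = 0.5  # one L1 cache reference, in nanoseconds

latencies_ns = {
    "L1 cache reference":           0.5,
    "L2 cache reference":           7,
    "Main memory reference":        100,
    "Send 1K over 1 Gbps network":  10_000,
    "Round trip within datacenter": 500_000,
    "Packet CA->NL->CA":            150_000_000,
}

for name, ns in latencies_ns.items():
    print(f"{name:30s} {ns:>13,.1f} ns = {ns / L1_NS:>13,.0f} L1 refs")
```

A cross-continent round trip costs on the order of 300 million L1 references; that is the price of leaving the chip, the node, and the data center.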

As a low-level programmer you want to stay on the CPU and preferably work from cache. As a GPU programmer there are several layers of parallelism, e.g.:

1. across shader-cores of a single GPU chip (with >10K shader-cores)
2. across multiple chiplets of a single GPU (with currently up to 2 chiplets)
3. across a server node (with up to 8 GPUs)
4. across a pod of nodes (with 256 to 2048 GPUs resp. TPUs)
5. across a cluster of server nodes/pods (with up to 100K GPUs in a single data center)
6. across a grid of clusters/nodes

Each layer adds an increasing amount of latency.

So as a GPU programmer you ideally want to hold your problem in the memory of, and run your algorithm on, a single but thick GPU.

Neural networks, for example, are a natural fit for GPUs, so-called embarrassingly parallel workloads,

https://en.wikipedia.org/wiki/Embarrassingly_parallel

but you need to hold the neural network weights in RAM, and therefore couple multiple GPUs together to infer or train networks with billions or trillions of weights (parameters). Meanwhile, LLMs use techniques like MoE (mixture of experts) to distribute the load further; inference runs, for example, on a single node with 8 GPUs hosting up to 16 MoE experts. Training LLMs is yet another topic, with further parallelism techniques to distribute the work over thousands of GPUs in a cluster:

1. data parallelism
2. tensor parallelism
3. pipeline parallelism
4. sequence parallelism
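The first two schemes can be illustrated with a toy example, independent of any framework: for a linear layer y = x @ W, data parallelism splits the input batch across devices (each holding a full copy of W), while tensor parallelism splits the weight matrix itself. The "devices" here are just sublists; this is a sketch of the idea, not of any real distributed runtime:

```python
# Toy illustration of data vs. tensor parallelism for one linear
# layer y = x @ W, using plain Python lists as "devices".

def matmul(x, w):
    # x: (n, k), w: (k, m) -> result: (n, m)
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # batch of 4 inputs
w = [[1, 0, 2], [0, 1, 3]]             # 2x3 weight matrix

# Data parallelism: each device gets half the batch and a full copy
# of W; per-device results are concatenated along the batch axis.
y_dp = matmul(x[:2], w) + matmul(x[2:], w)

# Tensor parallelism: each device gets the full batch but only a
# slice of W's columns; results are concatenated feature-wise.
w_a = [row[:2] for row in w]           # columns 0..1 on "device A"
w_b = [row[2:] for row in w]           # column 2 on "device B"
y_tp = [ra + rb for ra, rb in zip(matmul(x, w_a), matmul(x, w_b))]

# Both schemes reproduce the single-device result.
assert y_dp == y_tp == matmul(x, w)
print(y_dp)  # [[1, 2, 8], [3, 4, 18], [5, 6, 28], [7, 8, 38]]
```

The trade-off the schemes make differs: data parallelism must synchronize gradients for the whole of W across devices, tensor parallelism must exchange activations at every layer, which is why the latter is usually kept within one node with fast interconnect.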

And then there is power consumption, of course. xAI's Colossus supercomputer, which trains Grok, consumes with its 100K GPUs an estimated 100MW of power, so it does make a difference if the next fab process delivers the same performance at half the wattage.

Therefore it is important to invest in smaller chip fabrication processes: to increase the size of the neural networks we are able to infer and train, to lower power consumption, and to increase efficiency.

--
Srdja
smatovic
Posts: 2986
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2021

Post by smatovic »

Werewolf wrote: Wed Jun 05, 2024 4:50 pm [...]
Slight shame there's no raise from 16 cores, given that Turin Threadripper won't be out in 2024.
Rumor has it that Zen 6 Ryzen (TSMC 3nm or 2nm) will have a 12-core CCD chiplet; past Ryzens topped out at 2 chiplets with 8 cores each.

Zen 6 Medusa Ridge processor leaked
https://fudzilla.com/news/graphics/6016 ... sor-leaked

AMD’s Next Gen Zen 6 Ryzen “Medusa” Chips To Support AM5 Socket, Release In 2026-2027
https://computercity.com/hardware/proce ... am5-socket

--
Srdja
User avatar
towforce
Posts: 11988
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK
Full name: . .

Re: GPU rumors 2021 - layers of latency

Post by towforce »

smatovic wrote: Sun Dec 01, 2024 9:53 amA tech buddy asked me why it is so important for China to catch up in chip fabrication process, can't they just put more servers into a data center? In short, it is not that easy.

Seymour Cray, the legendary supercomputer builder, knew about this: one of his design parameters was to get all the components involved in high speed computing as close together as he could - and he was doing this in the early 1960s!
The simple reveals itself after the complex has been exhausted.
smatovic
Posts: 2986
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2021 - layers of latency

Post by smatovic »

Maybe worth adding:
Cray had always resisted the massively parallel solution to high-speed computing, offering a variety of reasons that it would never work as well as one very fast processor. He famously quipped "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" By the mid-1990s, this argument was becoming increasingly difficult to justify, and modern compiler technology made developing programs on such machines not much more difficult than their simpler counterparts.[26]

Cray set up a new company, SRC Computers, and started the design of his own massively parallel machine. The new design concentrated on communications and memory performance, the bottleneck that hampered many parallel designs. Design had just started when Cray was killed in a car accident. SRC Computers carried on development and specialized in reconfigurable computing.
https://en.wikipedia.org/wiki/Seymour_C ... _Computers

In the 90s there were the Cray T3D and T3E lines with DEC Alpha processors, with up to 2048 and 2176 CPUs respectively:

https://en.wikipedia.org/wiki/Cray_T3D
https://en.wikipedia.org/wiki/Cray_T3E

https://www.chessprogramming.org/ABDADA#Frenchess

--
Srdja
Werewolf
Posts: 1909
Joined: Thu Sep 18, 2008 10:24 pm

Re: GPU rumors 2021

Post by Werewolf »

Well, having just browsed the Lc0 Discord I am sad to see BT5 has been cancelled for the time being. This makes the latest 5090 much less interesting now :cry:
Hai
Posts: 626
Joined: Sun Aug 04, 2013 1:19 pm

Re: GPU rumors 2021

Post by Hai »

Werewolf wrote: Mon Dec 02, 2024 3:32 pm Well, having just browsed the Lc0 Discord I am sad to see BT5 has been cancelled for the time being. This makes the latest 5090 much less interesting now :cry:
This makes the MacBook Pro 16-inch M4 MAX and the Mac Studio M4 ULTRA much more interesting.
smatovic
Posts: 2986
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2021

Post by smatovic »

Werewolf wrote: Mon Dec 02, 2024 3:32 pm Well, having just browsed the Lc0 Discord I am sad to see BT5 has been cancelled for the time being. This makes the latest 5090 much less interesting now :cry:
As far as I understand it, the BT series is based on the attention/transformer network architecture and the T series networks on a CNN architecture. Do you know if the CNN line is still in development, or did they switch completely to attention?

--
Srdja