Microsoft is pushing Windows on ARM for several reasons. The exclusive MS+Qualcomm deal ends in 2025, so we can expect other players to join that game.
x86-optimized code is an issue; there are translation layers like Apple's Rosetta 2 and Microsoft's Prism. Prism seems to work with SSE code, but not (yet) with AVX2.
Snapdragon X Elite & Stockfish 16.1 benchmark?
viewtopic.php?p=965541#p965541
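As a hedged illustration of how engines usually stay usable under such emulation layers: detect AVX2 at runtime and fall back to an SSE code path. __builtin_cpu_supports is a real GCC/Clang builtin; the two eval functions are just hypothetical stand-ins, not tied to any specific engine:

Code: Select all

#include <cstdio>

// Hypothetical stand-ins for real SSE and AVX2 code paths.
static void eval_sse()  { std::puts("running SSE path");  }
static void eval_avx2() { std::puts("running AVX2 path"); }

int main() {
    // GCC/Clang builtin that queries CPUID at runtime. Under an x86
    // emulation layer that does not expose AVX2, this check fails and
    // we fall back to the SSE path instead of hitting illegal
    // instructions.
    if (__builtin_cpu_supports("avx2"))
        eval_avx2();
    else
        eval_sse();
    return 0;
}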
I myself am keen to see RISC-V chips from an African silicon fab in my future laptops/workstations/servers.
--
Srdja
GPU rumors 2021
-
- Posts: 2986
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: GPU rumors 2021 - layers of latency
On layers of latency, as posted on
https://luddite.app26.de/post/layers-of-latency/
Layers of Latency, Friday, 08 November 2024
A tech buddy asked me why it is so important for China to catch up in chip fabrication processes; can't they just put more servers into a data center? In short, it is not that easy.
By shrinking the fab process you can add more transistors onto one chip, and/or run at a higher frequency, and/or lower the power consumption.
The fab process is measured in "nm", nanometers. These numbers no longer reflect real physical dimensions; they are labels for the transistor density and efficiency of the fab process.
Simplified: planar 2D MOSFET transistor designs were used down to 22nm, FinFET 3D structures from 14 to 7nm, and GAAFET 3D structures on the newest nodes below that.
Take a look at the 7nm and 3nm fab process for example:
https://en.wikipedia.org/wiki/7_nm_proc ... _offerings
https://en.wikipedia.org/wiki/3_nm_proc ... cess_nodes
Roughly speaking, the 7nm process packs ~100M transistors per mm2, the 3nm process ~200M transistors per mm2.
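A quick back-of-the-envelope calculation of what that doubling means for a fixed die size; the figures are just the rough densities from above, not vendor data:

Code: Select all

#include <cstdio>

int main() {
    const double die_mm2 = 100.0;  // assumed die area in mm2
    const double d7 = 100e6;       // ~100M transistors per mm2 (7nm, rough)
    const double d3 = 200e6;       // ~200M transistors per mm2 (3nm, rough)

    // Same die area -> roughly twice the transistor budget at 3nm.
    std::printf("7nm: ~%.0f billion transistors\n", die_mm2 * d7 / 1e9);
    std::printf("3nm: ~%.0f billion transistors\n", die_mm2 * d3 / 1e9);
    return 0;
}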
And here latency steps in. As soon as you, as a programmer, leave the CPU, latency increases: it starts with the different levels of cache, then RAM, then the PCIe bus, then the network...
Code: Select all

Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
L2 cache reference                           7   ns
Main memory reference                      100   ns
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms
Latency Numbers Every Programmer Should Know
https://gist.github.com/jboner/2841832
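A minimal sketch of how to make the cache vs. RAM gap from that table visible from user space: a pointer-chasing loop over a working set that either fits into cache or spills to main memory. Sizes and results are machine dependent; this is a toy, not a rigorous benchmark:

Code: Select all

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Each load depends on the previous one, so out-of-order execution
// cannot hide the latency of the memory level being exercised.
static double ns_per_load(std::size_t slots, std::size_t steps) {
    // Sattolo's algorithm builds one full random cycle over all slots,
    // so the walk really touches the whole working set.
    std::vector<std::size_t> next(slots);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng{42};
    for (std::size_t k = slots - 1; k > 0; --k)
        std::swap(next[k], next[rng() % k]);

    std::size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) i = next[i];
    auto t1 = std::chrono::steady_clock::now();

    volatile std::size_t sink = i;  // keep the loop from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    // 4096 slots * 8 bytes = 32 KiB, fits a typical L1 data cache;
    // 64M slots * 8 bytes = 512 MiB spills to main memory.
    std::printf("L1-sized walk : ~%.1f ns/load\n", ns_per_load(4096, 1u << 24));
    std::printf("RAM-sized walk: ~%.1f ns/load\n", ns_per_load(64u << 20, 1u << 24));
    return 0;
}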
As a low-level programmer you want to stay on the CPU and preferably work out of the caches. As a GPU programmer there are several layers of parallelism, e.g.:
1. across shader-cores of a single GPU chip (with >10K shader-cores)
2. across multiple chiplets of a single GPU (with currently up to 2 chiplets)
3. across a server node (with up to 8 GPUs)
4. across a pod of nodes (with 256 to 2048 GPUs resp. TPUs)
5. across a cluster of server nodes/pods (with up to 100K GPUs in a single data center)
6. across a grid of clusters/nodes
Each layer adds an increasing amount of latency.
So as a GPU programmer you ideally want to hold your problem space in the memory of, and run your algorithm on, a single but thick GPU.
Neural networks, for example, are a natural fit for GPUs, a so-called embarrassingly parallel workload,
https://en.wikipedia.org/wiki/Embarrassingly_parallel
but you need to hold the neural network weights in RAM, and therefore have to couple multiple GPUs together to be able to infer or train networks with billions or trillions of weights or parameters. Meanwhile LLMs use techniques like MoE, mixture of experts, to distribute the load further: inference runs, for example, on a single node with 8 GPUs hosting up to 16 experts (a toy sketch of this expert routing follows the list below). The training of LLMs is yet another topic, with further parallelization techniques so the work can be distributed over thousands of GPUs in a cluster:
1. data parallelism
2. tensor parallelism
3. pipeline parallelism
4. sequence parallelism
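Here is the toy sketch of the expert-routing idea mentioned above: a gate scores all experts per token and only the top-k actually run, so the full parameter count never has to execute at once. Dimensions and scores are made up; in a real LLM the experts are large MLP blocks spread across GPUs:

Code: Select all

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int num_experts = 16;  // e.g. the 16 experts mentioned above
    const int top_k = 2;         // common choice: route each token to 2 experts

    // Pretend gate scores for one token (in reality: a learned router).
    std::vector<double> score(num_experts);
    for (int e = 0; e < num_experts; ++e)
        score[e] = (e * 37 % 11) / 10.0;

    // Select the top-k scoring experts; only these run for this token,
    // which is what lets MoE models spread their weights across GPUs.
    std::vector<int> idx(num_experts);
    for (int e = 0; e < num_experts; ++e) idx[e] = e;
    std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
                      [&](int a, int b) { return score[a] > score[b]; });

    for (int r = 0; r < top_k; ++r)
        std::printf("token -> expert %2d (gate score %.1f)\n",
                    idx[r], score[idx[r]]);
    return 0;
}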
And then there is power consumption, of course. The Colossus supercomputer behind the Grok AI, with 100K GPUs, consumes an estimated 100MW of power, so it makes a difference if the next fab process delivers the same performance at half the wattage.
Therefore it is important to invest in smaller chip fabrication processes: to increase the size of the neural networks we are able to infer and train, to lower power consumption, and to increase efficiency.
--
Srdja
-
- Posts: 2986
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: GPU rumors 2021
Rumor has it that Zen 6 Ryzen (TSMC 3nm or 2nm) will have 12-core CCD chiplets; past Ryzens topped out at two chiplets with 8 cores each.
Zen 6 Medusa Ridge processor leaked
https://fudzilla.com/news/graphics/6016 ... sor-leaked
AMD’s Next Gen Zen 6 Ryzen “Medusa” Chips To Support AM5 Socket, Release In 2026-2027
https://computercity.com/hardware/proce ... am5-socket
--
Srdja
-
- Posts: 11988
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: . .
Re: GPU rumors 2021 - layers of latency
Seymour Cray, the legendary supercomputer builder, knew about this: one of his design parameters was to get all the components involved in high-speed computing as close together as he could - and he was doing this in the early 1960s!
The simple reveals itself after the complex has been exhausted.
-
- Posts: 2986
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: GPU rumors 2021 - layers of latency
Maybe worth adding:
https://en.wikipedia.org/wiki/Seymour_C ... _Computers
Cray had always resisted the massively parallel solution to high-speed computing, offering a variety of reasons that it would never work as well as one very fast processor. He famously quipped "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" By the mid-1990s, this argument was becoming increasingly difficult to justify, and modern compiler technology made developing programs on such machines not much more difficult than their simpler counterparts.
Cray set up a new company, SRC Computers, and started the design of his own massively parallel machine. The new design concentrated on communications and memory performance, the bottleneck that hampered many parallel designs. Design had just started when Cray was killed in a car accident. SRC Computers carried on development and specialized in reconfigurable computing.
In the 90s there was the Cray T3D and T3E line with DEC Alpha processors, with up to 2048 and 2176 CPUs respectively:
https://en.wikipedia.org/wiki/Cray_T3D
https://en.wikipedia.org/wiki/Cray_T3E
https://www.chessprogramming.org/ABDADA#Frenchess
--
Srdja
-
- Posts: 1909
- Joined: Thu Sep 18, 2008 10:24 pm
Re: GPU rumors 2021
Well, having just browsed the Lc0 Discord, I am sad to see that BT5 has been cancelled for the time being. This makes the latest 5090 much less interesting now.
-
- Posts: 2986
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: GPU rumors 2021
As far as I understand it, the BT series is based on the attention/transformer network architecture and the T series networks on a CNN architecture. Do you know if the CNN line is still in development, or did they switch completely to attention?
--
Srdja