Blackwell GPU

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Leo
Posts: 1112
Joined: Fri Sep 16, 2016 6:55 pm
Location: USA/Minnesota
Full name: Leo Anger

Blackwell GPU

Post by Leo »

Any reports on Blackwell GPUs for chess engine training?
Advanced Micro Devices fan.
towforce
Posts: 13120
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK
Full name: Graham Laight

Re: Blackwell GPU

Post by towforce »

Given that nobody has given an answer, I am going to give the Gemini 3.5 Flash answer on the grounds that "it's better than nothing".

My prompt: How good is the Blackwell GPU likely to be at training neural networks in comparison to other GPUs in a similar price range?

This is Gemini's response, unedited because I have low expertise in GPUs. Summary: expect it to be good!


The short answer is that **NVIDIA's Blackwell architecture (specifically the B200 and the upgraded B300 "Blackwell Ultra") completely redefines the price-to-performance curve for training neural networks.** In the enterprise and data center market, "price range" is typically analyzed through two lenses: **hardware purchase cost** (capex) and **cloud rental pricing** per hour (opex). When evaluated against other hardware in its immediate cost bracket, Blackwell consistently outperforms the competition.

---

## 1. Blackwell vs. The Prior Generation (Hopper H100 / H200)

When Blackwell launched, its cloud rental and purchase prices commanded a premium over the older Hopper architecture. However, because its architectural leap is so massive, **the cost-per-experiment dropped dramatically.**

* **Training Speed Uplift:** Microbenchmarks and MLPerf v5.1 training results show that a B200 cluster delivers **1.55× to 3× faster raw training throughput** for dense Transformer models (like GPT and Llama) compared to the H200, and up to a **3× speedup** over the base H100 at scale.
* **The Math Behind the Price:** If renting an H100 costs ~$2.00–$3.00/hour on spot/specialist clouds, and a B200 rents for roughly $2.00–$4.00/hour, the math tilts in Blackwell's favor. At the midpoints of those ranges (about $2.50 versus $3.00), the B200 costs roughly 20% more per hour, while a 1.55× to 3× speedup is a 55% to 200% increase in training speed. Whenever the speedup exceeds the price ratio, the "wall-clock time" to train a model shrinks faster than the hourly bill grows, resulting in a **lower net dollar cost to finish the training run.**
* **Memory & Communication Bottlenecks:** The Blackwell B200 features 192GB of HBM3e memory with 8 TB/s of bandwidth, while the B300 Ultra bumps capacity to 288GB. This allows larger LLMs (like 70B parameter models) to fit onto fewer GPUs during fine-tuning (e.g., using QLoRA), which reduces the multi-GPU tensor parallelism communication overhead that older 80GB H100s incur.
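
The arithmetic in the bullets above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the hourly rates are the midpoints of the ranges quoted above, the 1000 GPU-hour baseline is a made-up number for illustration, and the memory figure counts model weights only (no optimizer state, activations, or adapter overhead). The function names are hypothetical, not from any library.

```python
def cost_per_run(baseline_gpu_hours, hourly_rate, speedup=1.0):
    """Dollar cost of a run that needs baseline_gpu_hours at speedup 1.0."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_gpu_hours / speedup * hourly_rate


def break_even_speedup(old_rate, new_rate):
    """Speedup the pricier GPU needs just to match the cheaper GPU's cost per run."""
    return new_rate / old_rate


def weight_memory_gb(params_billions, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


if __name__ == "__main__":
    H100_RATE = 2.50   # assumed: midpoint of the $2.00-$3.00 range quoted above
    B200_RATE = 3.00   # assumed: midpoint of the $2.00-$4.00 range quoted above
    BASELINE_HOURS = 1000  # hypothetical job size on the older GPU

    print("H100 cost:", cost_per_run(BASELINE_HOURS, H100_RATE))
    for speedup in (1.55, 3.0):  # the speedup range quoted above
        cost = cost_per_run(BASELINE_HOURS, B200_RATE, speedup)
        print(f"B200 cost at {speedup}x: {cost:.2f}")
    print("break-even speedup:", break_even_speedup(H100_RATE, B200_RATE))

    # Weights only: a 70B model at 16 bits versus 4 bits per parameter.
    print("70B @ 16-bit:", weight_memory_gb(70, 16), "GB")
    print("70B @ 4-bit: ", weight_memory_gb(70, 4), "GB")
```

The point of `break_even_speedup` is that the newer GPU wins on cost per run as soon as its speedup exceeds the ratio of the hourly prices, whatever the actual job size is.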

---

## 2. Blackwell vs. Competitors (AMD Instinct & Google TPUs)

When looking at external competitors in a similar enterprise price bracket, the narrative shifts slightly from raw hardware capability to ecosystem maturation.

### AMD Instinct (MI300X / MI325X)

AMD traditionally competes by offering **more raw memory for less money**. The MI300X series launched with 192GB of HBM3 memory, directly beating the H100's capacity at a lower price point.

* **The Comparison:** While AMD offers incredibly fierce competition on a pure "dollar-per-gigabyte-of-VRAM" basis—making them excellent for cost-effective inference and mid-scale training—Blackwell counters this with architectural innovations.
* Blackwell features **5th-generation Tensor Cores** that decouple from traditional warp-synchronous scheduling, utilizing independent thread execution. Combined with a significantly more mature **CUDA** software stack, Blackwell generally achieves vastly higher hardware utilization rates in PyTorch during training compared to AMD's ROCm ecosystem, though AMD continues to close that gap.

### Google TPUs (v5p / v6)

Google TPUs are incredibly price-competitive if you operate strictly within Google Cloud (GCP).

* **The Comparison:** For native TensorFlow or highly optimized JAX/PyTorch pipelines, Google TPUs offer an outstanding performance-to-price ratio. However, Blackwell retains the crown for **flexibility**. Blackwell chips can be deployed across any cloud provider, on-premise "AI Factories", or hybrid architectures. Furthermore, Blackwell's **NVLink 5 interconnect** (offering 1.8 TB/s bidirectional bandwidth per GPU) scales up to massive 576-GPU non-blocking clusters, giving it a scaling efficiency edge when training trillion-parameter frontier models.

---

## 3. The Secret Weapon: NVFP4 and the Transformer Engine

The primary reason Blackwell outperforms anything else in its price tier is its **Second-Generation Transformer Engine** supporting **native 4-bit floating-point (FP4)** precision.

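Historically, neural networks were trained in FP32, then in mixed precision with FP16 or BF16, and more recently in FP8 on Hopper; each step down in precision needed extra care (such as loss scaling or per-tensor scaling) to keep training stable. Blackwell introduces an explicit hardware-accelerated 4-bit format (`NVFP4`).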

* **The Impact:** Blackwell can execute training recipes that leverage FP4 compute without sacrificing model accuracy.
* Peak FP4 throughput per clock on Blackwell is **twice that of FP8** (and up to three times on Blackwell Ultra).
* By dropping the precision down to 4-bit during math-bound GEMM (General Matrix Multiply) operations, Blackwell roughly doubles peak matrix throughput relative to FP8 (per the figures above), which is why it leads on performance for teams training massive foundation models.
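
To give a feel for what 4-bit floating point means numerically, here is a rough pure-Python simulation of block-scaled FP4 quantization. It is a sketch under assumptions, not NVIDIA's implementation: it uses the E2M1 value grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) and a block size of 16, which matches public descriptions of `NVFP4` as far as I know, but it keeps each block's scale as an ordinary Python float, whereas the real format reportedly stores an FP8 scale per block plus a per-tensor FP32 scale. All function names are made up for illustration.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    """Map one block onto the signed E2M1 grid with a shared scale.

    Returns (scale, quantized), where each quantized entry is a grid value
    and scale * quantized[i] approximates values[i].
    """
    amax = max(abs(v) for v in values)
    # Scale so the largest magnitude lands on the top grid value (6.0).
    # Simplification: the scale stays a full-precision Python float.
    scale = amax / 6.0 if amax > 0 else 1.0
    quantized = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        quantized.append(nearest if v >= 0 else -nearest)
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate real values from a quantized block."""
    return [scale * q for q in quantized]


def fake_quantize(values, block_size=16):
    """Quantize then dequantize a flat list, one block at a time."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale, quantized = quantize_block(block)
        out.extend(dequantize_block(scale, quantized))
    return out


if __name__ == "__main__":
    weights = [0.82, -0.11, 0.03, 0.47, -0.95, 0.20, 0.00, -0.33]
    approx = fake_quantize(weights)
    for w, a in zip(weights, approx):
        print(f"{w:+.2f} -> {a:+.4f}")
```

Each value has only 15 distinct signed levels to land on, so the per-block scale is what keeps the rounding error tolerable: small blocks mean one outlier only coarsens the 15 values that share its scale.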

---

## Summary Verdict

If you are training **small-to-medium models (7B to 13B parameters)** or running simple fine-tuning, renting an older **H100** remains the absolute "value-tier" cost leader.

However, if you are looking at the mid-to-high enterprise price range for **large-scale pre-training or massive fine-tuning (70B+ models)**, the Blackwell B200/B300 series is practically unrivaled. The raw architectural throughput, the massive memory bandwidth, and the adoption of FP4 mean that **every dollar spent on Blackwell translates to significantly more training progress** than the equivalent dollar spent on any other GPU architecture on the market.
Human chess is partly about tactics and strategy, but mostly about memory
Leo
Posts: 1112
Joined: Fri Sep 16, 2016 6:55 pm
Location: USA/Minnesota
Full name: Leo Anger

Re: Blackwell GPU

Post by Leo »

Looks great. Thanks.
Advanced Micro Devices fan.
Werewolf
Posts: 2089
Joined: Thu Sep 18, 2008 10:24 pm
Full name: Carl Bicknell

Re: Blackwell GPU

Post by Werewolf »

Leo wrote: Wed May 27, 2026 6:58 pm
Any reports on Blackwell GPUs for chess engine training?

I've used a 5090 for several months for training with good results. Are you building your own pipeline? If so, that took me way longer than the training itself, but now that it's set up... one will always wish for more speed. Why not rent online so you can test for yourself first?
Leo
Posts: 1112
Joined: Fri Sep 16, 2016 6:55 pm
Location: USA/Minnesota
Full name: Leo Anger

Re: Blackwell GPU

Post by Leo »

I don't own GPUs. I was only wondering how the new Blackwell is doing.
Advanced Micro Devices fan.