When a company announces it has trained a new large language model (LLM), the headline number that grabs attention is often parameter count — 70 billion, 400 billion, 1 trillion. What rarely makes the headline is the staggering amount of GPU coordination required to get there. Training modern LLMs is a feat of distributed computing unlike anything else in software engineering.

A single NVIDIA H100 GPU has 16,896 CUDA cores and 80 GB of HBM3 memory. It is extraordinarily powerful. But it isn't nearly enough. To train GPT-5-scale models, researchers connect thousands of these cards together and teach them to collaborate — splitting, sharing, and synchronising work at microsecond timescales across miles of fiber optic cable.

"Training Grok 4 used xAI's Colossus — a 200,000 GPU cluster running reinforcement learning at a scale an order of magnitude larger than any prior training run."
— xAI, Grok 4 technical release (July 2025)

This article walks you through exactly how that coordination works: the hardware, the parallelism strategies, the communication protocols, and what it all looks like in modern AI clusters running models like GPT-5 (August 2025), Gemini 3 (November 2025), Grok 4 (July 2025), and Claude Opus 4.6 (February 2026).

Why One GPU Isn't Enough

The core problem is memory. A model parameter stored in FP16 takes 2 bytes, so a 70-billion-parameter model like Llama 3 70B requires 140 GB of GPU memory just for the weights — before you account for gradients, optimizer states, and activations, which multiply that figure four- to eightfold. In practice, training Llama 3 70B requires roughly 1.1 TB of GPU memory spread across many cards.
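The arithmetic behind these figures is easy to reproduce. A minimal sketch, assuming the usual mixed-precision accounting (2-byte FP16 weights and gradients, plus 12 bytes per parameter of Adam state: an FP32 master copy and two FP32 moments):

```python
def training_memory_gb(n_params: float, bytes_per_param: float = 16.0) -> float:
    """Rough training-state footprint, excluding activations:
    2 B FP16 weights + 2 B FP16 gradients + 12 B Adam state
    (FP32 master copy + two FP32 moments) = ~16 B per parameter."""
    return n_params * bytes_per_param / 1e9

# 70B parameters -> 1,120 GB of training state, before activations
print(training_memory_gb(70e9))   # prints 1120.0
```

Activations come on top of this and scale with batch size and sequence length, which is why the ~1.1 TB figure is a floor rather than a ceiling.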

Visual 01

GPU Memory Requirements for Training vs. Inference

Llama 3 8B: ~128 GB
Llama 3 70B: ~1.1 TB
GPT-5 (est.): ~10–16 TB
Gemini 3 (est.): ~15–28 TB
Grok 4 (est.): ~10–14 TB

Memory shown is for full training state (weights + gradients + optimizer). Inference uses 4–8× less.

Even ignoring memory, compute throughput is a bottleneck. By the common ≈6ND rule of thumb, training a 70B model on 1 trillion tokens costs roughly 4.2×10²³ FLOPs, which would take a single H100 on the order of 30 years at realistic utilisation. Scaling to 8,192 GPUs brings even a full multi-trillion-token run down to weeks.
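A quick way to sanity-check such numbers is the standard ≈6·N·D FLOPs approximation for transformer training. A rough sketch; the peak-throughput figure (H100 dense BF16) and the 40% utilisation are assumptions, not measured values:

```python
def train_days(n_params: float, n_tokens: float, n_gpus: int,
               peak_flops: float = 989e12, mfu: float = 0.40) -> float:
    """Wall-clock days via the ~6*N*D FLOPs rule of thumb.
    peak_flops: assumed H100 dense BF16 peak; mfu: assumed utilisation."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / (n_gpus * peak_flops * mfu) / 86_400

single_gpu = train_days(70e9, 1e12, 1)        # roughly 12,000 days (~34 years)
cluster    = train_days(70e9, 15e12, 8192)    # ~3 weeks for 15T tokens
```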

Inside a GPU: CUDA Cores, Tensor Cores & SMs

Before understanding how GPUs are shared across a cluster, it helps to understand what's happening inside one GPU. NVIDIA's H100 isn't a monolithic chip — it's an intricate hierarchy of compute units.

The fundamental unit is the Streaming Multiprocessor (SM). The H100 has 132 SMs, each containing 128 CUDA cores and 4 Tensor Core units, for a total of 16,896 CUDA cores. CUDA cores handle general floating-point and integer math. Tensor Cores, introduced with Volta and dramatically improved since, are specifically designed for matrix multiply-accumulate (MMA) operations — the exact operation that dominates transformer training.

When a training step begins, the framework (PyTorch or JAX) dispatches thousands of thread blocks to SMs in parallel. Each SM maintains its own instruction scheduler, L1 cache, and shared memory — so thousands of matrix operations run simultaneously across the chip without coordination overhead.

The Three Core Parallelism Strategies

When training demands exceed a single GPU's capacity, frameworks like Megatron-LM, DeepSpeed, and FSDP (Fully Sharded Data Parallel) orchestrate three fundamental forms of parallelism. Most production training runs combine all three — this is called 3D parallelism.

1. Data Parallelism (DP)

The simplest approach. Every GPU has a complete copy of the model. Each GPU processes a different mini-batch of training data, computes gradients independently, and then those gradients are averaged across all GPUs using an AllReduce operation before weights are updated. At 8,192 GPUs, this communication is non-trivial — NVIDIA's NVLink fabric can move 900 GB/s between cards within a node, while InfiniBand NDR links nodes at 400 Gbps across the cluster.
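The mechanics are easy to see in a toy simulation, with plain NumPy arrays standing in for GPUs. Averaging per-shard gradients reproduces the full-batch gradient exactly, because the gradient of a mean loss is linear in the examples. All names and shapes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 4)), rng.normal(size=64)
w = np.zeros(4)

def grad(Xb, yb, w):
    # Mean-squared-error gradient for a linear model on one batch shard.
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Data parallelism: 4 simulated "GPUs" each hold the full weights w
# but see a different, equally sized shard of the batch.
shards = [(X[i::4], y[i::4]) for i in range(4)]
local_grads = [grad(Xb, yb, w) for Xb, yb in shards]

# AllReduce (here simply an average) yields the same gradient every
# GPU would have computed from the full batch.
synced = np.mean(local_grads, axis=0)
assert np.allclose(synced, grad(X, y, w))
```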

Data parallelism alone doesn't help with the memory problem — each GPU still needs the full model. That's where tensor and pipeline parallelism come in.

2. Tensor Parallelism (TP)

Introduced by Megatron-LM (NVIDIA, 2019), tensor parallelism splits individual weight matrices across GPUs. In a transformer's attention layer, the Q, K, V projection matrices can be partitioned column-wise across GPUs. Each GPU computes part of the attention, then results are gathered with an AllGather. On N GPUs, each GPU stores only 1/N of the partitioned weights.

Tensor parallelism is typically applied within a single node (8 GPUs) because it requires very high bandwidth — NVLink's 900 GB/s per GPU is necessary to keep the communication from stalling compute.
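The column-wise split is simple to verify in NumPy: partial products over column shards, concatenated (the AllGather step), equal the full matrix multiply. A toy sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 8))    # activations: (batch, hidden)
W = rng.normal(size=(8, 16))   # one projection weight, e.g. the Q matrix

# Column-parallel split across 4 simulated GPUs:
# each holds 4 of W's 16 columns.
shards = np.split(W, 4, axis=1)
partials = [x @ Ws for Ws in shards]   # each GPU computes only its slice

# AllGather: concatenating the partial results reproduces the full matmul.
gathered = np.concatenate(partials, axis=1)
assert np.allclose(gathered, x @ W)
```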

3. Pipeline Parallelism (PP)

In pipeline parallelism, the model's layers are divided into stages, each stage hosted on a different GPU (or group of GPUs). A batch of input data flows through Stage 1, producing activations that are passed to Stage 2, and so on.

Visual 04

Pipeline Parallelism — Transformer Layers Across 4 GPU Stages

GPU A: Layers 1–8 (Embedding + Early Attention)
GPU B: Layers 9–16 (Mid Attention + FFN)
GPU C: Layers 17–24 (Deep Attention)
GPU D: Layers 25–32 (Output + LM Head)

Micro-batching (GPipe / PipeDream schedule) keeps all stages busy simultaneously, minimizing idle "pipeline bubbles".
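The cost of an imperfect schedule can be quantified. For a GPipe-style schedule with p stages and m micro-batches, the idle "bubble" fraction is (p − 1)/(m + p − 1), a standard result, sketched here:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style pipeline: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (microbatches + stages - 1)

print(bubble_fraction(4, 1))              # one micro-batch: 0.75 (75% idle)
print(round(bubble_fraction(4, 32), 3))   # 32 micro-batches: 0.086
```

This is why training loops split each batch into many micro-batches: more micro-batches per pipeline flush shrink the bubble toward zero.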

Modern GPU Hardware for LLM Training

Visual 05

Leading GPUs for LLM Training (2024–2025)

NVIDIA H100 SXM5: 16,896 CUDA cores, 80 GB HBM3, 3.35 TB/s, 3,958 FP8 TFLOPS, NVLink 4.0 (900 GB/s)
NVIDIA H200: 16,896 CUDA cores, 141 GB HBM3e, 4.8 TB/s, 3,958 FP8 TFLOPS, NVLink 4.0 (900 GB/s)
NVIDIA B200 (Blackwell): ~21,000+ CUDA cores, 192 GB HBM3e, 8.0 TB/s, 9,000+ FP8 TFLOPS, NVLink 5.0 (1,800 GB/s)
AMD MI300X: 14,080 cores, 192 GB HBM3, 5.3 TB/s, 5,220 FP8 TFLOPS, Infinity Fabric
Google TPU v5p: (no CUDA cores), 96 GB HBM2e, 2.76 TB/s, 918 bfloat16 TFLOPS, ICI (1.6 Tbps)

The NVIDIA B200 (Blackwell architecture), introduced in 2024, represents the current frontier. Each B200 pairs two reticle-limited dies in a single package; the GB200 NVL72 rack then connects 72 of these GPUs over NVLink 5.0 into one coherent domain with 13.5 TB of pooled HBM3e, enough to hold most frontier models without complex sharding.

The Communication Layer: AllReduce, NVLink & InfiniBand

Parallelism only works if GPUs can communicate efficiently. Three layers of interconnect matter:

NVLink (within a node): Connects 8 GPUs on a DGX H100 with 900 GB/s of bidirectional bandwidth per GPU. Critical for tensor parallelism which requires constant, high-bandwidth partial-result aggregation.

InfiniBand HDR/NDR (between nodes): Connects nodes across a data center at 200 Gbps (HDR) or 400 Gbps (NDR). Used for data parallel AllReduce across thousands of nodes. Latency is microseconds, but at 10,000 GPUs, even tiny delays compound.

RDMA (Remote Direct Memory Access): Allows GPUs to write directly into another GPU's memory on a different server without CPU involvement — critical for keeping gradient synchronization from stalling the training pipeline.

The AllReduce Operation

After each forward+backward pass, every GPU in the data-parallel group has its own gradient tensor. To synchronize, they run Ring-AllReduce: each GPU sends a gradient chunk to its ring neighbour while receiving from the other side, over 2(N−1) communication steps (a reduce-scatter phase followed by an all-gather), after which every GPU holds the averaged gradient. NCCL (NVIDIA Collective Communications Library) handles this automatically and efficiently, overlapping communication with the backward pass computation.
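A toy, single-process simulation makes the two phases concrete. This is an illustrative sketch of the ring algorithm, not NCCL's implementation:

```python
import numpy as np

def ring_allreduce(local_grads):
    """Toy simulation of Ring-AllReduce over N 'GPUs'.

    Phase 1 (reduce-scatter): in N-1 steps, each GPU passes one gradient
    chunk to its ring neighbour and adds the chunk it receives; afterwards
    GPU i owns the fully summed chunk (i+1) % N.
    Phase 2 (all-gather): N-1 more steps circulate the reduced chunks so
    every GPU ends up with the complete result.
    In total each GPU sends and receives ~2*(N-1)/N of the buffer.
    """
    n = len(local_grads)
    chunks = [list(np.array_split(g, n)) for g in local_grads]

    for step in range(n - 1):                      # reduce-scatter
        for i in range(n):
            src, c = (i - 1) % n, (i - 1 - step) % n
            chunks[i][c] = chunks[i][c] + chunks[src][c]

    for step in range(n - 1):                      # all-gather
        for i in range(n):
            src, c = (i - 1) % n, (i - step) % n
            chunks[i][c] = chunks[src][c]

    return [np.concatenate(c) / n for c in chunks]  # averaged, per GPU

rng = np.random.default_rng(0)
grads = [rng.normal(size=12) for _ in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, np.mean(grads, axis=0)) for r in reduced)
```

The bandwidth-optimal property is visible in the accounting: per-GPU traffic is roughly twice the gradient buffer, regardless of how many GPUs join the ring.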

At 10,000 GPUs exchanging gradients for a 70B-parameter model (a 140 GB FP16 gradient buffer), Ring-AllReduce has each GPU send and receive roughly twice the buffer size, about 280 GB, every step — every few seconds. This is why inter-node bandwidth is as critical as raw compute.

In practice, well-optimised training frameworks overlap AllReduce communication with the backward pass computation. As soon as gradients for the earliest layers are computed, AllReduce starts for those layers while the backward pass continues through later layers. By the time backward is done, a significant portion of the AllReduce is already complete. This overlap — measured as the "communication-computation overlap ratio" — is one of the key metrics separating efficient training runs from wasteful ones. NVIDIA's Megatron-LM and Microsoft's DeepSpeed both implement this overlap automatically, but it requires careful tuning of bucket sizes and communication schedules to achieve high efficiency at scale.

Real-World Training Clusters: 2025's Frontier Models

What do these principles look like in practice? Here are the known or estimated configurations behind the frontier model training runs of 2025:

Claude Opus 4 / 4.5 / 4.6 (Anthropic, 2025): Anthropic has not disclosed precise cluster sizes, but the Claude 4 series (Opus 4 released May 2025, Opus 4.5 November 2025, Opus 4.6 February 2026) is known to use a mix of AWS Trainium and NVIDIA H100 clusters. Opus 4 was the first model Anthropic deployed under its ASL-3 safety standard (the most stringent it had applied to date), reflecting both its capabilities and the scale of compute required. Claude Opus 4.6 achieves 65.4% on Terminal-Bench 2.0 and 72.7% on OSWorld, suggesting a training run of tens of thousands of GPUs over several weeks.

OpenAI GPT-5 (August 7, 2025): GPT-5 was trained on Microsoft Azure AI supercomputers. Trained as a unified system combining fast and deep-reasoning modes, GPT-5 represents OpenAI's largest training run to date. Its reported performance, approaching expert level on roughly half of a set of knowledge-work tasks spanning more than 40 occupations, suggests a compute budget far exceeding GPT-4's estimated 25,000–30,000 A100 GPUs. GPT-5.2, released December 2025, extended this further with additional post-training.

Google Gemini 3 (November 18, 2025): Gemini 3 Pro scored 1501 Elo on LMArena and achieved 91.9% on GPQA Diamond. Trained on Google's TPU v5p pods connected via the ICI (Inter-Chip Interconnect) fabric — a proprietary high-speed network that bakes the interconnect directly into the hardware, giving Google a significant communication advantage over GPU clusters relying on external InfiniBand. Since releasing Gemini 3, Google has processed over 1 trillion tokens per day on its API.

xAI Grok 4 (July 9, 2025): The most publicly detailed of 2025's training runs. For Grok 4, xAI utilized Colossus, their 200,000 GPU cluster, to run reinforcement learning training that refines Grok's reasoning abilities at pretraining scale — a training run over an order of magnitude larger than anything previously attempted. Grok 4 Heavy additionally uses a multi-agent ensemble approach, where multiple model instances collaborate and compare answers — adding another layer of GPU coordination on top of standard 3D parallelism. Grok 4.1, released November 2025, followed with improvements in reasoning and multimodal understanding.

Memory Optimization: How Engineers Squeeze More Into Less

Raw hardware is only part of the picture. Software techniques dramatically reduce memory requirements:

Gradient Checkpointing (Activation Recomputation): Instead of storing all intermediate activations during the forward pass (needed for backprop), the training loop stores only checkpoints and recomputes activations on-demand during backpropagation. This trades compute for memory — typically increasing compute by ~30% while halving activation memory.

Mixed Precision Training (FP16/BF16/FP8): Weights and activations are stored in 16-bit or 8-bit formats during forward and backward passes, with a 32-bit master copy of weights kept for the optimizer update. BF16 (Brain Float 16) has become standard for LLM training because it preserves the dynamic range of FP32 while halving memory.
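Why the FP32 master copy matters is easy to demonstrate with NumPy: near 1.0, adjacent FP16 values are about 0.001 apart, so a 1e-4 weight update rounds away to nothing unless it is accumulated at higher precision.

```python
import numpy as np

# Near 1.0, FP16 spacing is ~0.00098, so a 1e-4 update is simply lost:
w16 = np.float16(1.0)
assert w16 + np.float16(1e-4) == np.float16(1.0)   # update rounds away

# The FP32 master copy used for the optimizer step preserves it:
master = np.float32(1.0)
for _ in range(10):
    master += np.float32(1e-4)      # ten tiny updates accumulate in FP32
assert master > np.float32(1.0)

# The master is cast back down to 16-bit for the next forward pass.
w16 = np.float16(master)
```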

ZeRO (Zero Redundancy Optimizer): Developed by Microsoft for DeepSpeed, ZeRO partitions the optimizer states, gradients, and parameters across data-parallel GPUs — eliminating the memory redundancy that standard DP incurs. ZeRO Stage 3 (fully sharded, equivalent to PyTorch FSDP) reduces per-GPU training-state memory roughly in proportion to the data-parallel degree.
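The memory effect of the three ZeRO stages can be sketched with the same 16-bytes-per-parameter accounting used earlier (2 B weights + 2 B gradients + 12 B Adam state). The function below is an illustration of the idea, not DeepSpeed's own accounting:

```python
def zero_per_gpu_gb(n_params: float, dp: int, stage: int) -> float:
    """Per-GPU training-state memory (GB) under ZeRO with FP16 + Adam:
    2 B weights + 2 B gradients + 12 B optimizer state per parameter.
    Stage 1 shards optimizer state across the dp group, stage 2 also
    shards gradients, stage 3 also shards the weights themselves."""
    weight, grad, opt = 2.0, 2.0, 12.0
    if stage >= 1: opt /= dp
    if stage >= 2: grad /= dp
    if stage >= 3: weight /= dp
    return n_params * (weight + grad + opt) / 1e9

# 70B parameters across 64 data-parallel GPUs:
print(zero_per_gpu_gb(70e9, 64, 0))   # 1120.0 GB per GPU: no sharding
print(zero_per_gpu_gb(70e9, 64, 3))   # 17.5 GB per GPU: fully sharded
```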

Flash Attention: A memory-efficient attention algorithm (Dao et al., 2022, updated in 2023) that fuses the attention computation into a single GPU kernel, dramatically reducing the O(n²) memory cost of attention while also being faster, because intermediate results stay in on-chip SRAM instead of round-tripping through HBM. Virtually every production LLM training run today uses Flash Attention 2 or 3.
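The kernel-fusion details are hardware-specific, but the core numerical trick, an online softmax that needs only a running maximum and running sum rather than the full score row, can be sketched in a few lines. An illustrative toy, not the Flash Attention kernel:

```python
import numpy as np

def online_softmax(scores, block=4):
    """Blockwise softmax using only a running max (m) and running sum (s),
    never holding all the exponentials at once: the numerical core of
    Flash Attention's tiled computation."""
    m, s = -np.inf, 0.0
    for i in range(0, len(scores), block):
        b = scores[i:i + block]
        new_m = max(m, float(b.max()))
        # Rescale the old partial sum whenever a new maximum appears.
        s = s * np.exp(m - new_m) + np.exp(b - new_m).sum()
        m = new_m
    return np.exp(scores - m) / s

x = np.array([1.0, 3.0, -2.0, 0.5, 4.0, 2.0])
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
```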

Scheduling: Who Decides What Each GPU Does?

Above the hardware sits the orchestration layer. In large clusters, this is a complex job. A typical stack looks like: Kubernetes or Slurm for cluster-level scheduling (assigning jobs to nodes), NCCL for collective communication, Megatron-LM or DeepSpeed for model-level parallelism strategy, and PyTorch or JAX as the deep learning framework.

When a training job launches, the framework determines the parallelism configuration (e.g., TP=8, PP=4, DP=2048 for a 65,536-GPU run), assigns each GPU its role and its slice of the model, and begins the training loop. Each GPU knows exactly which layers it owns, which GPUs are its tensor-parallel siblings, and which are its pipeline predecessors and successors.
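The rank-to-role assignment is just integer arithmetic. A sketch with tensor parallelism innermost, so TP siblings land on adjacent ranks and can share NVLink; this ordering is an assumption for illustration, and real frameworks make it configurable:

```python
def rank_to_coords(rank: int, tp: int = 8, pp: int = 4, dp: int = 2048):
    """Map a flat GPU rank to (dp, pp, tp) coordinates, with tensor
    parallelism innermost so the 8 TP siblings occupy adjacent ranks
    (i.e. the same NVLink-connected node)."""
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank

# 65,536 GPUs = TP 8 x PP 4 x DP 2048; ranks 0-7 share tensor shards
assert rank_to_coords(0) == (0, 0, 0)
assert rank_to_coords(7) == (0, 0, 7)       # same node, same pipeline stage
assert rank_to_coords(8) == (0, 1, 0)       # next pipeline stage
assert rank_to_coords(65535) == (2047, 3, 7)
```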

What's Next: GB200, Optical Interconnects & Beyond

The trajectory is clear: more GPUs, faster interconnects, and smarter orchestration. NVIDIA's GB200 NVL72 rack — 72 Blackwell GPUs in a single rack, connected by NVLink 5.0 at 1,800 GB/s per GPU — blurs the line between "cluster" and "single computer." The 72 GPUs share a unified memory address space, making tensor parallelism within the rack as fast as intra-chip communication.

Beyond NVLink, research is advancing on optical chip-to-chip interconnects that could eventually provide terabit-per-second bandwidth between nodes with femtojoule-level energy per bit — potentially eliminating InfiniBand as a bottleneck entirely.

Mixture of Experts (MoE) architectures, used in Mixtral and (reportedly) in GPT-5's routing system, add another dimension to GPU sharing: not all parameters are activated for each token. Expert routing dynamically selects a subset of the model's weight matrices per token, meaning GPUs hosting different experts experience different loads — a new scheduling challenge that frameworks are still learning to handle efficiently. (Grok 4 Heavy's multi-agent ensemble is a related but distinct form of sparsity, operating at the level of whole model instances rather than expert layers.)

Conclusion

Training a frontier LLM is one of the most complex distributed computing challenges humanity has undertaken. It requires careful orchestration of thousands of GPUs — each with tens of thousands of cores — across multiple levels of parallelism, connected by custom networking fabric, and managed by sophisticated software stacks. The interplay of data parallelism, tensor parallelism, pipeline parallelism, memory optimization, and collective communication is what makes training at scale possible.

The 2025 frontier — GPT-5, Gemini 3, Grok 4, and Claude Opus 4.6 — represents the current state of the art. Grok 4's Colossus cluster (200,000 GPUs), GPT-5 trained on Azure supercomputers, and Gemini 3's proprietary TPU fabric all represent distinct architectural bets on how to coordinate hundreds of thousands of chips. As models continue to grow and clusters push toward millions of chips, the engineering of distributed training will remain one of the most consequential disciplines in modern AI development.

Practical Implications for Engineers

Understanding distributed training at this level has immediate practical consequences for anyone building or evaluating AI systems. When a company claims a model was trained on "X GPU-hours," that number hides enormous complexity. The same GPU-hours on a poorly optimised cluster with high pipeline bubbles and slow AllReduce operations will produce a worse-trained model than the same compute on a well-orchestrated 3D-parallel setup with overlapped communication. Compute efficiency — what fraction of theoretical peak FLOPS is actually used productively — varies from under 30% on naive setups to over 60% on well-tuned frontier clusters.

For engineers evaluating whether to fine-tune a large model or train a smaller one from scratch: the memory arithmetic of distributed training is your primary constraint. A 70B model requires roughly 1.1 TB of GPU memory for training state, more than 8 H100s (640 GB total) can hold; even with ZeRO Stage 3 and gradient checkpointing you would need CPU offloading, and batch size and sequence length will be tightly constrained. Understanding tensor parallelism degree (how many GPUs split each weight matrix) and pipeline stages (how many GPUs handle sequential layer groups) determines not just whether training fits in memory, but how fast communication overhead grows.

For those tracking the economics of frontier AI: the shift from A100 to H100 clusters doubled effective training throughput per dollar partly through better compute (2× on transformer workloads) and partly through better interconnect (NVLink 4.0 at 900 GB/s vs 600 GB/s). The upcoming GB200 NVL72 rack — 72 Blackwell GPUs with 1,800 GB/s NVLink 5.0 — is expected to double throughput again, potentially enabling models that would take years on today's hardware to be trained in months. The training race is fundamentally a hardware race, and the hardware is moving faster than ever.