A Practitioner's Guide to Distributed Training Parallelism

Training a large model isn’t hard because the math is complicated. It’s hard because the model doesn’t fit. A 70B transformer needs ~140 GB just to store its fp16 weights — add gradients and optimizer state and you’re at 1.12 TB. An 80 GB A100 can’t hold it. So you split the work across GPUs.

There are five ways to do that split: DDP, FSDP, Pipeline, Tensor, and 3D parallelism. Each makes a different tradeoff between memory, communication, and utilization — and choosing wrong means half your GPUs idle while the other half wait.

The explorer below lets you feel those tradeoffs directly. Pick any model from GPT-2 to LLaMA 405B, choose a GPU cluster, and switch between strategies — watch memory bars shift, see the OOM indicator flip, run the training simulation to observe GPU temperatures through each phase. Spend a minute here before reading the deep dives; the concepts will land much faster once you’ve seen the numbers move.

Loading Parallelism Explorer…

Below, I go deep on each strategy: what it actually does to your GPUs, where the hidden costs are, and when to reach for it.

1. Distributed Data Parallel (DDP)

DDP is the simplest strategy and where almost everyone should start. Every GPU gets a complete copy of the model. The training data is split so each GPU processes a different mini-batch. After the backward pass, gradients are averaged across all GPUs via an AllReduce, then every GPU applies the same optimizer step with the same averaged gradients — keeping the replicas in sync.

Figure 1 — Distributed Data Parallel (DDP)

DDP replicates the full model on every GPU. Each GPU processes a different data shard. Gradients are averaged via a ring-AllReduce after the backward pass.

Practical nuances

Why it’s fast. PyTorch DDP overlaps the gradient AllReduce with the backward pass. The moment a layer’s gradient is ready, it’s bucketed and communication starts — while the next layer is still computing. This overlap means the communication cost is largely hidden behind computation for reasonably-sized models.

Where it breaks. DDP doesn’t reduce memory at all. Every GPU stores the full model parameters, full gradients, and full optimizer states. For a 7B model with mixed-precision Adam, that’s 6.7B × 16 bytes = ~107 GB per GPU — already over the 80 GB limit of an A100/H100. The moment the model exceeds single-GPU memory, DDP is out.

The scaling ceiling. DDP throughput scales nearly linearly up to 256–512 GPUs for large batch sizes. Beyond that, the AllReduce communication volume (proportional to model size) starts to dominate, especially across multi-node InfiniBand links. Small models on many GPUs hit this ceiling faster because the compute-to-communication ratio is lower.

Gotcha: uneven batch sizes. If your dataset doesn’t divide evenly by the number of GPUs, the last GPU gets a smaller batch. DDP still AllReduces the gradients, so the last GPU’s gradients are averaged with full-batch GPUs, introducing a slight bias. Use drop_last=True or pad to avoid this.

2. Fully Sharded Data Parallel (FSDP)

When DDP runs out of memory, FSDP is the next step. Based on Microsoft’s ZeRO (Zero Redundancy Optimizer) paper, FSDP shards the model parameters, gradients, and optimizer states across all GPUs. Each GPU holds only 1/N of each tensor.

The trick: before computing a layer’s forward pass, FSDP runs an AllGather to temporarily reconstruct the full layer’s parameters on every GPU. After the backward pass, a ReduceScatter distributes each GPU’s gradient shard back to its owner. The full-layer buffer is freed immediately — so only one layer’s worth of temporary memory is used at any time.

Figure 2 — Fully Sharded Data Parallel (FSDP / ZeRO-3)

FSDP shards parameters, gradients, and optimizer states across GPUs. Before each layer's forward pass, an All-Gather reconstructs the full parameters temporarily. After backward, a Reduce-Scatter returns each GPU's gradient shard.

Practical nuances

The three ZeRO stages. FSDP implements ZeRO-3 (full sharding), but the concept has three levels. ZeRO-1 shards only optimizer states (saving ~4× on Adam overhead). ZeRO-2 also shards gradients. ZeRO-3 shards everything including parameters. Each level adds communication but reduces memory. In practice, most people jump straight to ZeRO-3/FSDP because the memory savings compound dramatically — a 70B model that needs 1.12 TB with DDP fits in 70 GB per GPU with FSDP across 16 devices.

Communication is 1.5× DDP, not 3×. People sometimes assume FSDP must be much slower because it does AllGather + ReduceScatter instead of just AllReduce. In practice, the total communication volume is about 1.5× DDP’s AllReduce, and modern NCCL implementations overlap the next layer’s AllGather with the current layer’s computation. The real overhead is typically 5–15%, not 50%.

The temporary buffer spike. During the AllGather, each GPU briefly holds a full layer’s parameters in memory. For a model with very large layers (e.g., an 18432-dim Gemini layer), this temporary buffer can be significant. If you’re right at the memory edge, this spike can cause OOM even though the steady-state usage fits. Watch torch.cuda.max_memory_allocated(), not just memory_allocated().

FSDP vs DeepSpeed ZeRO. Both implement the same algorithm. PyTorch FSDP is native and tightly integrated with torch.compile. DeepSpeed ZeRO has been around longer and supports CPU offloading (ZeRO-Offload) and NVMe offloading (ZeRO-Infinity) for extreme memory savings at the cost of throughput. For most GPU-only training, native FSDP is the cleaner choice.

3. Pipeline Parallelism (PP)

Pipeline parallelism splits the model vertically — by layers. GPU 0 runs layers 0–7, GPU 1 runs layers 8–15, and so on. Each GPU passes its output activations to the next stage, like an assembly line.

Figure 3 — Pipeline Parallelism

Pipeline parallelism splits the model by layers into stages. Activations flow stage-to-stage. The "bubble" slots show idle time — later stages must wait for earlier ones to finish. Micro-batching fills these gaps by overlapping multiple mini-batches.

Practical nuances

The bubble tax is steep. With 4 pipeline stages and 4 micro-batches, the bubble fraction is (4-1)/(4+4-1) = 43% — nearly half your compute is wasted. With 8 micro-batches it’s 30%, and with 16 it’s 19%. You need at least 4× as many micro-batches as stages to get the bubble below 20%. This is why pipeline parallelism is rarely used alone — it’s almost always combined with data parallelism (to increase the micro-batch count) or tensor parallelism (to reduce the stage count).

Memory isn’t balanced. The first and last pipeline stages often use more memory than middle stages. The first stage stores the embedding layer’s activations for all micro-batches in flight. The last stage stores the loss computation and the start of the backward pass. If you’re OOM on stage 0 but have headroom on stage 2, you’re pipeline-imbalanced and losing efficiency.

Point-to-point communication is cheap. Unlike DDP’s AllReduce (which moves the entire model’s gradients), PP only sends activation tensors between adjacent stages. For a 4096-dimensional transformer, that’s just batch_size × seq_len × 4096 × 2 bytes per micro-batch — orders of magnitude less than a full gradient AllReduce. This makes PP work well even over relatively slow inter-node InfiniBand links.

1F1B schedule. The naïve GPipe schedule shown above runs all forward passes before all backward passes, maximizing the activation memory (all micro-batches’ activations stored simultaneously). The PipeDream-Flush/1F1B schedule interleaves forward and backward passes: after the pipeline fills, each stage alternates one-forward-one-backward, freeing activations earlier and reducing peak memory by ~(stages-1)/stages.

4. Tensor Parallelism (TP)

Tensor parallelism goes inside individual layers. Instead of assigning whole layers to different GPUs, it splits each weight matrix across GPUs. For a linear layer Y = X·W, the weight W is split column-wise so GPU i holds W[:,i] and computes X·W[:,i]. An AllReduce then sums the partial results across GPUs to produce the full output.

Figure 4 — Tensor Parallelism

Tensor parallelism splits weight matrices column-wise across GPUs. Each GPU computes a partial product X·W[:,i], then an AllReduce sums the results to produce the full output Y. This happens for every layer, demanding NVLink-class interconnect bandwidth.

Practical nuances

NVLink or nothing. Every layer in the transformer triggers two AllReduce operations (one in the attention block, one in the MLP). For a 96-layer GPT-3, that’s 192 AllReduces per iteration. Over NVLink (600–900 GB/s), this is fast. Over InfiniBand (50–100 GB/s), it’s a disaster. This is why TP is almost exclusively used within a single node (8 GPUs on NVLink) and almost never across nodes.

TP > 8 rarely makes sense. Since DGX/HGX nodes have 8 GPUs connected by NVLink, and cross-node links are 6–10× slower, TP degree is almost always capped at 8 (or the number of GPUs per node). Going beyond 8 means crossing the node boundary, which tanks throughput.

Attention heads map cleanly to TP. In multi-head attention, each head is an independent linear transform. With 32 heads and TP=8, each GPU computes 4 heads — a clean split with no cross-GPU dependencies until the final output projection. This is why TP maps naturally to transformers.

Sequence parallelism (SP). Standard TP splits weight matrices but replicates activation tensors — the input X is the same on every GPU. Sequence parallelism (introduced in Megatron-LM v3) extends TP to also split activations along the sequence dimension in the non-tensor-parallel regions (LayerNorm, dropout). This reduces activation memory by TP degree and is essentially free to implement if you already have TP. Always turn it on.

5. 3D Parallelism: Putting It Together

For models that are both wide and deep and need hundreds of GPUs, no single strategy works alone. 3D parallelism combines all three: TP within a node, PP across node groups, and DP (or FSDP) across the remaining dimension.

Figure 5 — 3D Parallelism (TP + PP + DP)

3D parallelism combines all three strategies: Tensor Parallel within a node (NVLink), Pipeline Parallel across node groups (InfiniBand), and Data Parallel for throughput scaling. The product of all three degrees equals the total GPU count.

Practical nuances

The decomposition recipe. The standard approach for clusters with 8-GPU nodes: TP = 8 (fill the node with tensor parallelism over NVLink), PP = total_nodes / DP (split the model depth across groups of nodes), DP = whatever’s left (for data-parallel throughput and gradient averaging). For a 128-GPU cluster (16 nodes) training a 70B model: TP=8, PP=4, DP=4 gives 8×4×4 = 128 GPUs.

Each dimension has a different cost. TP communication is frequent (every layer) but fast (NVLink). PP communication is rare (once per micro-batch per stage) but serializes the pipeline. DP communication is once per step but moves the full gradient volume. The goal is to minimize the bottleneck dimension. If your interconnect is slow, maximize PP (least bandwidth needed) and minimize TP.

Debugging is hard. A bug in 3D parallelism training could be caused by incorrect sharding (TP), activation mismatch between stages (PP), or gradient desync (DP). Each dimension introduces its own category of correctness issues. Start with TP only (single-node), add PP (multi-node), then add DP — don’t try to debug all three simultaneously.

When to Use What

Your situation	Start with	Why
Model fits on one GPU	DDP	Maximum simplicity, near-linear scaling
Model is 2–5× GPU memory	FSDP (ZeRO-3)	Shards everything, minimal code changes from DDP
Model is 10–50× GPU memory	FSDP + TP	TP reduces per-layer memory, FSDP handles the rest
Very deep model, slow interconnect	PP + DP	PP needs minimal bandwidth
Frontier model (100B+), fast cluster	3D (TP+PP+FSDP)	The only way to fit, the only way to be efficient
Inference at scale	TP (or TP+PP)	No gradients/optimizer — just split the weights

The heuristic: start with the simplest strategy that fits your model in memory. Only add complexity (more parallelism dimensions) when forced to by memory or throughput constraints. Every additional dimension introduces communication overhead, debugging surface area, and configuration tuning.

Memory Cheat Sheet

Training memory per parameter with mixed-precision Adam:

Component	Bytes/param	Sharded by
FP16 parameters	2	TP, PP
FP16 gradients	2	TP, PP, DP (FSDP)
FP32 master weights	4	DP (FSDP)
FP32 Adam momentum	4	DP (FSDP)
FP32 Adam variance	4	DP (FSDP)
Total	16	—

So a 70B model needs 70 × 16 = 1.12 TB of GPU memory for training (before activations). On 8× 80 GB GPUs (640 GB total), DDP can’t even fit the model. FSDP brings it to 1.12 TB / 8 = 140 GB total, which is 17.5 GB per GPU for model state, leaving plenty of room for activations.

Further reading: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM, ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism.

Written on June 14, 2026