A Practitioner's Guide to Distributed Training Parallelism
Training a large model isn’t hard because the math is complicated. It’s hard because the model doesn’t fit. A 70B-parameter transformer needs ~140 GB just to store its fp16 weights. Add gradients and optimizer state (under the usual mixed-precision Adam accounting, about 16 bytes per parameter in total) and you’re at roughly 1.12 TB, before counting activations. A single 80 GB A100 can’t hold it. So you split the work across GPUs.
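The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate, not a profiler: it assumes mixed-precision training with Adam, where each parameter costs 2 bytes for the fp16 weight, 2 bytes for the fp16 gradient, and 12 bytes of fp32 optimizer state (master weight, momentum, variance). That 16-bytes-per-parameter breakdown is an assumption the text does not spell out, though it is the one that reproduces 1.12 TB. The function name `training_memory_gb` and its parameters are hypothetical, and activations, buffers, and fragmentation are ignored.

```python
import math

GB = 10**9  # decimal gigabytes, matching the "~140 GB" figure in the text


def training_memory_gb(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Estimate model-state memory in GB (hypothetical helper).

    Assumed defaults: fp16 weights (2 B), fp16 gradients (2 B), and
    fp32 Adam state (4 B master weight + 4 B momentum + 4 B variance = 12 B).
    Activation memory is deliberately left out.
    """
    weights = n_params * weight_bytes / GB
    grads = n_params * grad_bytes / GB
    optim = n_params * optim_bytes / GB
    return {
        "weights": weights,
        "gradients": grads,
        "optimizer": optim,
        "total": weights + grads + optim,
    }


def min_gpus(total_gb, gpu_gb=80):
    """Lower bound on GPU count if model state were split perfectly evenly."""
    return math.ceil(total_gb / gpu_gb)


if __name__ == "__main__":
    mem = training_memory_gb(70 * 10**9)
    for name, gb in mem.items():
        print(f"{name:>10}: {gb:8.1f} GB")
    print(f"80 GB GPUs needed (model state only): {min_gpus(mem['total'])}")
```

Under these assumptions the total is 1,120 GB, so even a perfect split of model state alone needs at least 14 GPUs with 80 GB each. Real jobs need more once activations and communication buffers are included.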
