A Practitioner's Guide to Distributed Training Parallelism
Training a large model isn’t hard because the math is complicated. It’s hard because the model doesn’t fit. A 70-billion-parameter transformer needs roughly 140 GB just to store its fp16 weights — and training requires 4× that for gradients and optimizer states. An 80 GB GPU can’t hold it. So you split the work.
