From Next-Token Prediction to Iterative Unmasking: A Visual Guide to Text Diffusion
GPT writes left to right, one token at a time, no going back. Gemma Diffusion starts from a fully masked canvas and thinks its way to fluency all at once. Same transformer backbone, completely different bet about what generation should look like. This post visualizes every step of the idea, grounded in a 200-line PyTorch implementation you can run yourself.
