Figure: Continuous vs. discrete diffusion (what the model actually predicts)

Image diffusion (continuous)

Clean signal

↓ add Gaussian noise ε
x_t = x₀ + ε

Corrupted (model input)

↓ model forward pass

Model outputs ε̂ — noise estimate

continuous correction, same shape as input

↓ subtract the estimated noise

x_{t-1} = x_t − α · ε̂

Predicts what was ADDED

Text diffusion (discrete)

Clean tokens

↓ mask each token w/ prob t/T
x_t = x₀ masked at rate t/T

Masked (model input)

↓ model forward pass

Model outputs p(token) per [MASK]

probability over vocabulary, per masked position

↓ reveal highest-confidence tokens

reveal = argmax p(x₀ | x_t)

Predicts what SHOULD BE THERE

Standard (image) diffusion operates in continuous space and predicts a correction vector: the noise to subtract from its input. Text diffusion operates in discrete space, where there is nothing to subtract from a token ID. Instead, the model predicts a probability distribution over the vocabulary at each masked position, then reveals the tokens it is most confident about. The high-level intuition is the same (iteratively undo a corruption), but the mathematical objective is different: one model regresses a noise vector, the other classifies the missing token.
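The contrast between the two panels can be sketched in a few lines of plain Python. This is a toy illustration rather than a real sampler. The helper names (`continuous_step`, `mask_tokens`, `reveal_step`), the reveal budget `k`, and the hand-written probabilities are all assumptions made for the example. The continuous update uses the figure's simplified form, not a full noise schedule.

```python
import random

MASK = "[MASK]"

def continuous_step(x_t, eps_hat, alpha):
    """Schematic continuous update: x_{t-1} = x_t - alpha * eps_hat."""
    return [x - alpha * e for x, e in zip(x_t, eps_hat)]

def mask_tokens(tokens, t, T, rng):
    """Forward corruption: mask each token independently with probability t/T."""
    return [MASK if rng.random() < t / T else tok for tok in tokens]

def reveal_step(x_t, probs, k):
    """Reveal the k masked positions whose top prediction is most confident.

    probs maps each masked position to a dict {token: probability}
    (a stand-in for the model's per-position output; hypothetical format).
    """
    candidates = []
    for pos, tok in enumerate(x_t):
        if tok == MASK:
            best_tok = max(probs[pos], key=probs[pos].get)  # argmax over vocabulary
            candidates.append((probs[pos][best_tok], pos, best_tok))
    candidates.sort(reverse=True)  # highest confidence first
    out = list(x_t)
    for _, pos, best_tok in candidates[:k]:
        out[pos] = best_tok
    return out

# Continuous: the model output has the same shape as the input and is subtracted.
x0 = [0.5, -1.0, 2.0]
eps = [0.1, -0.2, 0.3]
x_t = [a + b for a, b in zip(x0, eps)]
x_prev = continuous_step(x_t, eps, alpha=1.0)  # pretend eps_hat equals eps

# Discrete: the model output is a distribution per masked position; nothing is subtracted.
x_t_tokens = ["the", MASK, "sat", MASK]
probs = {1: {"cat": 0.9, "dog": 0.1}, 3: {"down": 0.6, "up": 0.4}}
revealed = reveal_step(x_t_tokens, probs, k=1)
print(revealed)
```

With `k=1`, only the more confident of the two masked positions is filled in and the other stays masked for a later step, which is the "reveal by confidence" behaviour the figure describes.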