Figure: Continuous vs. discrete diffusion: what the model actually predicts

Image diffusion (continuous): predicts what was ADDED
Clean signal x₀
  -> add Gaussian noise ε: x_t = x₀ + ε
Corrupted x_t (model input)
  -> model forward pass
Model outputs ε̂, a noise estimate (continuous correction, same shape as input)
  -> subtract the estimated noise: x_{t-1} = x_t − α · ε̂

Text diffusion (discrete): predicts what SHOULD BE THERE
Clean tokens x₀
  -> mask each token with probability t/T: x_t = x₀ masked at rate t/T
Masked x_t (model input)
  -> model forward pass
Model outputs p(token) per [MASK] (probability over vocabulary, per masked position)
  -> reveal highest-confidence tokens: reveal = argmax p(x₀ | x_t)

Standard image diffusion operates in continuous space and predicts a correction vector: the noise to subtract. Text diffusion operates in discrete space, where there is nothing to subtract from a token ID. Instead, the model predicts a probability distribution over the vocabulary for the clean token at each masked position, then reveals the tokens it is most confident about. The two share the same high-level intuition (corrupt the input, then restore it step by step) but optimize different objectives: regressing a noise vector versus classifying over tokens.
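
To make the contrast concrete, here is a minimal sketch of both corruption steps and both update steps in plain Python. It follows the simplified formulas in the figure rather than any particular library or paper; the function names (`add_noise`, `mask_tokens`, `continuous_step`, `discrete_step`), the scalar step size `alpha`, and the reveal budget `k` are all assumptions made for illustration. The model outputs (`eps_hat`, `probs`) are passed in as plain data, since no real model is involved.

```python
import random

MASK = "[MASK]"  # hypothetical sentinel for a masked position


def add_noise(x0, sigma, rng):
    """Continuous corruption: x_t = x0 + eps, with eps ~ N(0, sigma^2)."""
    return [x + rng.gauss(0.0, sigma) for x in x0]


def mask_tokens(x0, t, T, rng):
    """Discrete corruption: mask each token independently with probability t/T."""
    return [MASK if rng.random() < t / T else tok for tok in x0]


def continuous_step(x_t, eps_hat, alpha):
    """Continuous update: subtract the scaled noise estimate elementwise.

    eps_hat has the same shape as x_t; it says what was ADDED.
    """
    return [x - alpha * e for x, e in zip(x_t, eps_hat)]


def discrete_step(x_t, probs, k):
    """Discrete update: reveal the k masked positions with the most confident guess.

    probs[i] maps token -> probability for position i and is only read
    where x_t[i] is MASK; it says what SHOULD BE THERE.
    """
    candidates = []
    for i, tok in enumerate(x_t):
        if tok == MASK:
            best = max(probs[i], key=probs[i].get)  # argmax over the vocabulary
            candidates.append((probs[i][best], i, best))
    candidates.sort(reverse=True)  # highest confidence first
    out = list(x_t)
    for _, i, best in candidates[:k]:
        out[i] = best
    return out


# Usage: one update of each kind, with hand-written "model outputs".
print(continuous_step([1.5, -0.25], [0.5, -0.25], alpha=1.0))
print(discrete_step(
    ["the", MASK, MASK],
    [None, {"cat": 0.9, "dog": 0.1}, {"sat": 0.6, "ran": 0.4}],
    k=1,
))
```

Note the difference in the update signatures: `continuous_step` does arithmetic on the input itself, while `discrete_step` never modifies a token numerically and only decides which masked slots to fill and with what.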