Figure 2 — Model Architecture

Figure 2 — Denoising transformer input signals

Token, position, and timestep embeddings are summed element-wise. The transformer uses bidirectional attention — no causal mask — so every position can attend to every other. The output head never predicts [MASK], only the 27 clean vocabulary tokens.