Figure 2 — Denoising transformer input signals
[Figure: three embeddings are summed element-wise and fed into a transformer encoder. Token embedding tok_emb(x_t): "What token is here?" Position embedding pos_emb(pos): "Where in sequence?" Timestep embedding t_emb(t): "How noisy is it?" Transformer encoder: 2 layers, 4 heads, d=64, bidirectional (no causal mask). Linear head: logits of shape (B, L, 27) over the clean tokens.]

Token, position, and timestep embeddings are summed element-wise. The transformer uses bidirectional attention — no causal mask — so every position can attend to every other. The output head never predicts [MASK], only the 27 clean vocabulary tokens.
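
The architecture in the figure could be sketched in PyTorch roughly as follows. This is a minimal illustration under stated assumptions, not the original implementation. The names tok_emb, pos_emb, and t_emb and the sizes (2 layers, 4 heads, d=64, 27 output classes) come from the figure. The class name `Denoiser`, the constants `MASK_ID`, `MAX_LEN`, and `NUM_STEPS`, the feed-forward width, and the choice of a learned lookup table for the timestep embedding are assumptions made for the example.

```python
import torch
import torch.nn as nn

VOCAB_CLEAN = 27   # clean tokens the head can predict (from the figure)
MASK_ID = 27       # assumption: [MASK] is given the next id after the clean tokens
D_MODEL, N_HEADS, N_LAYERS = 64, 4, 2   # from the figure
MAX_LEN = 32       # assumption: maximum sequence length
NUM_STEPS = 100    # assumption: number of diffusion timesteps


class Denoiser(nn.Module):
    """Hypothetical sketch of the denoising transformer shown in the figure."""

    def __init__(self):
        super().__init__()
        # The input vocabulary includes [MASK]; the output vocabulary does not.
        self.tok_emb = nn.Embedding(VOCAB_CLEAN + 1, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)
        # Assumption: timesteps are integers in [0, NUM_STEPS], embedded by lookup.
        self.t_emb = nn.Embedding(NUM_STEPS + 1, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL,
            nhead=N_HEADS,
            dim_feedforward=4 * D_MODEL,  # assumption: the source gives no width
            dropout=0.0,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB_CLEAN)  # never predicts [MASK]

    def forward(self, x_t, t):
        # x_t: (B, L) noisy token ids, t: (B,) integer timesteps
        _, seq_len = x_t.shape
        pos = torch.arange(seq_len, device=x_t.device)
        h = (
            self.tok_emb(x_t)                  # (B, L, D): what token is here?
            + self.pos_emb(pos)[None, :, :]    # (1, L, D): where in the sequence?
            + self.t_emb(t)[:, None, :]        # (B, 1, D): how noisy is it?
        )
        # No mask is passed, so attention is bidirectional.
        h = self.encoder(h)
        return self.head(h)                    # (B, L, 27) logits


model = Denoiser()
x_t = torch.full((2, 10), MASK_ID)             # a fully masked batch
t = torch.tensor([NUM_STEPS, NUM_STEPS])
print(model(x_t, t).shape)                     # torch.Size([2, 10, 27])
```

Broadcasting handles the sum: the position embedding is shared across the batch, and the timestep embedding is shared across all positions of a sequence. Because the head has 27 outputs while the token embedding has 28 rows, [MASK] can appear as an input but can never be produced as a prediction.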