Token, position, and timestep embeddings are summed element-wise. The transformer uses
bidirectional attention (no causal mask), so every position can attend to every other.
The output head never predicts [MASK], only the 27 clean vocabulary tokens.
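
The three points above can be illustrated with a minimal pure-Python sketch. It is not the actual implementation: the sizes `D_MODEL`, `MAX_LEN`, and `NUM_STEPS`, the choice of id 27 for [MASK], the random weights, and all function names are assumptions made for illustration. The attention is reduced to a single head with no learned projections, and the head is assumed to exclude [MASK] simply by having 27 output columns (masking a 28th logit would be another way to get the same effect).

```python
import math
import random

VOCAB_CLEAN = 27        # clean vocabulary size, from the text
MASK_ID = VOCAB_CLEAN   # assumption: [MASK] takes the next free id
D_MODEL = 8             # hypothetical toy embedding width
MAX_LEN = 16            # hypothetical maximum sequence length
NUM_STEPS = 10          # hypothetical number of diffusion timesteps

rng = random.Random(0)


def table(rows, cols):
    """Random weight table standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


tok_emb = table(VOCAB_CLEAN + 1, D_MODEL)  # input side includes [MASK]
pos_emb = table(MAX_LEN, D_MODEL)
time_emb = table(NUM_STEPS, D_MODEL)
head_w = table(D_MODEL, VOCAB_CLEAN)       # output side: 27 columns only


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def embed(tokens, t):
    """Sum token, position, and timestep embeddings element-wise."""
    return [
        [tok_emb[tok][d] + pos_emb[i][d] + time_emb[t][d] for d in range(D_MODEL)]
        for i, tok in enumerate(tokens)
    ]


def bidirectional_attention(x):
    """Single-head attention with no causal mask: each query scores every key."""
    scale = 1.0 / math.sqrt(D_MODEL)
    out = []
    for q in x:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in x])
        out.append(
            [sum(w * v[d] for w, v in zip(weights, x)) for d in range(D_MODEL)]
        )
    return out


def predict(tokens, t):
    """Per-position distribution over the 27 clean tokens."""
    hidden = bidirectional_attention(embed(tokens, t))
    return [
        softmax([sum(h[d] * head_w[d][c] for d in range(D_MODEL))
                 for c in range(VOCAB_CLEAN)])
        for h in hidden
    ]


tokens = [3, MASK_ID, 7, MASK_ID]
probs = predict(tokens, t=2)
```

Because `head_w` has only 27 columns, there is no output slot for [MASK] at all, so no probability mass can land on it. Because the attention loop scores every key for every query, changing a later token changes the prediction at an earlier position, which a causal mask would prevent.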