Why Muon Wins: An Interactive Demo

Muon is a recently proposed optimizer for neural networks that orthogonalizes gradient matrices using Newton-Schulz iterations before applying updates. The intuition: instead of following the raw gradient (which is dominated by the directions with the largest singular values), Muon normalizes all singular values to 1, giving equal update strength to every direction in weight space.
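The orthogonalization step can be sketched as follows. This is a minimal illustration, not the demo's code; the quintic coefficients below are the ones commonly used in Muon implementations, and the demo may use different constants or step counts:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately replace G's singular values with 1 (illustrative sketch).

    Runs the quintic iteration X <- a*X + (b*A + c*A@A)@X with A = X@X^T,
    which drives every singular value toward 1 without computing an SVD.
    Assumes a square (or wide) matrix, like the demo's 8x8 W.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients commonly used with Muon
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

# A deliberately skewed test matrix with singular values 3, 1, and 0.3:
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Vt, _ = np.linalg.qr(rng.normal(size=(3, 3)))
G = U @ np.diag([3.0, 1.0, 0.3]) @ Vt
O = newton_schulz_orthogonalize(G)
```

After a few iterations the singular values of `O` cluster near 1 (these particular coefficients trade exact convergence for speed, so they land in a band around 1 rather than hitting it exactly), while the input's 10x spread between largest and smallest is gone.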

The demo below lets you watch four optimizers (SGD, Adam, AdamW, and Muon) compete on a simple matrix associative memory task. An 8×8 weight matrix W must learn to map 8 key vectors to value vectors (think: a tiny transformer attention layer storing facts). The catch: training data follows a Zipf distribution. "Cat" appears ~45% of the time; "quokka" appears less than 1%.
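The task setup can be sketched like this. It is a hypothetical reconstruction, not the demo's actual code, and the pair frequencies here follow a plain 1/rank Zipf law rather than the demo's exact ~45% head weight:

```python
import numpy as np

# Hypothetical reconstruction of the demo's task: an 8x8 matrix W must
# map 8 orthonormal keys to 8 orthonormal values, with pairs sampled
# at Zipf-like frequencies (1/rank weights assumed, not the demo's exact ones).
rng = np.random.default_rng(1)
n = 8
keys, _ = np.linalg.qr(rng.normal(size=(n, n)))    # columns are key vectors
values, _ = np.linalg.qr(rng.normal(size=(n, n)))  # columns are value vectors
probs = 1.0 / np.arange(1, n + 1)                  # assumed Zipf weights
probs /= probs.sum()

def batch_gradient(W, batch=64):
    """Gradient of the mean squared error ||W k - v||^2 over a Zipf-sampled batch."""
    idx = rng.choice(n, size=batch, p=probs)
    K, V = keys[:, idx], values[:, idx]
    return 2.0 * (W @ K - V) @ K.T / batch
```

With orthonormal keys, the perfect memory is `W = values @ keys.T`, at which the gradient vanishes; each optimizer differs only in how it turns `batch_gradient(W)` into an update.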

This imbalance creates a skewed gradient dominated by a few large singular values. SGD and Adam chase that skew: they learn common pairs well but fail on rare ones. Muon's orthogonalization flattens the singular spectrum, so rare pairs get just as much gradient signal as common ones.
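To see the skew concretely: with orthonormal keys and values, the expected gradient at W = 0 is a weighted sum of rank-one outer products, and its singular values are exactly the pair probabilities. A small sketch, again assuming hypothetical 1/rank Zipf weights:

```python
import numpy as np

# With orthonormal keys K and values V, the expected gradient of
# 0.5 * ||W k - v||^2 at W = 0 is G = -V @ diag(p) @ K.T, so its
# singular values are exactly the pair probabilities p: heavily skewed.
rng = np.random.default_rng(2)
n = 8
K, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
p = 1.0 / np.arange(1, n + 1)   # assumed 1/rank Zipf weights
p /= p.sum()

G = -V @ np.diag(p) @ K.T
sv = np.linalg.svd(G, compute_uv=False)
# sv equals p: the most common pair's direction carries 8x more gradient
# than the rarest pair's. Orthogonalizing G (e.g. via Newton-Schulz)
# gives all eight directions equal strength.
```

This is the whole story in miniature: raw-gradient methods allocate update strength in proportion to `p`, while Muon allocates it uniformly.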

Hit Run and watch the per-pair error bars at the bottom. The rare pairs (quokka, sushi, mars) are where Muon's advantage shows most clearly.

Written on April 6, 2026