Why Muon Wins: An Interactive Demo
Muon is a recently proposed optimizer for neural networks that orthogonalizes gradients using Newton-Schulz iterations before applying updates. The intuition: instead of following the raw gradient (which is dominated by the directions with the largest singular values), Muon normalizes all singular values to 1, giving equal update strength to every direction in weight space.
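The core operation can be sketched in a few lines of NumPy. This is an illustrative implementation, not the demo's code: the quintic coefficients below follow the published Muon implementation, and the function name is my own.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximate the orthogonal polar factor of G (the U @ V^T of its SVD).

    Coefficients follow the quintic iteration from the published Muon
    implementation; treat the exact values as an illustrative choice.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Dividing by the Frobenius norm bounds the spectral norm by 1,
    # which the iteration needs in order to converge.
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 8))
O = newton_schulz_orthogonalize(G)
print(np.linalg.svd(O, compute_uv=False))  # all singular values pushed toward 1
```

A few iterations are enough to push every singular value into a narrow band around 1, which is why Muon can afford this at every step instead of an exact (and much slower) SVD.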
The demo below lets you watch four optimizers compete on a simple matrix associative memory task: SGD, Adam, AdamW, and Muon. An 8×8 weight matrix W must learn to map 8 key vectors to value vectors (think: a tiny transformer attention layer storing facts). The catch: training data follows a Zipf distribution. "Cat" appears ~45% of the time; "quokka" appears less than 1%.
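A toy version of this task fits in a few lines. This is my own reconstruction, not the demo's code (the key/value construction, learning rate, and step count are assumptions), trained here with plain SGD to show the effect of the imbalance:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
# Orthonormal keys (one per column) and random unit-ish value vectors.
keys = np.linalg.qr(rng.normal(size=(n, n)))[0]
values = rng.normal(size=(n, n)) / np.sqrt(n)

# Zipf weights: pair i is sampled with probability proportional to 1/(i+1).
p = 1.0 / np.arange(1, n + 1)
p /= p.sum()

W = np.zeros((n, n))
lr = 0.1
for _ in range(300):
    i = rng.choice(n, p=p)
    k, v = keys[:, i], values[:, i]
    err = W @ k - v
    W -= lr * np.outer(err, k)  # SGD on the squared error for this one pair

# Per-pair error, analogous to the bars at the bottom of the demo.
per_pair_err = np.linalg.norm(W @ keys - values, axis=0)
print(per_pair_err)  # rare pairs (high index) tend to retain larger error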
This imbalance creates a skewed gradient dominated by a few large singular values. SGD and Adam chase that skew: they learn common pairs well but fail on rare ones. Muon's orthogonalization flattens the singular spectrum, so rare pairs get just as much gradient signal as common ones.
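To see the spectral claim concretely, here is an illustrative calculation (my own construction, not the demo's code): the full-batch gradient of the Zipf-weighted squared error at W = 0, orthogonalized exactly via SVD, which is what Newton-Schulz approximates.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
keys = np.linalg.qr(rng.normal(size=(n, n)))[0]  # orthonormal keys as columns
values = rng.normal(size=(n, n)) / np.sqrt(n)
p = 1.0 / np.arange(1, n + 1)
p /= p.sum()

W = np.zeros((n, n))
# Full-batch gradient of the weighted squared error:
# sum_i p_i (W k_i - v_i) k_i^T, written as one matrix product.
G = (W @ keys - values) * p @ keys.T

s = np.linalg.svd(G, compute_uv=False)
U, _, Vt = np.linalg.svd(G)
s_orth = np.linalg.svd(U @ Vt, compute_uv=False)

print(s / s.max())  # skewed: rare pairs contribute tiny singular values
print(s_orth)       # flat: exactly 1 everywhere after orthogonalization
```

The raw gradient's singular values inherit the Zipf weights, so an optimizer following it mostly moves along the common-pair directions; the orthogonalized update treats every direction equally.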
Hit Run and watch the per-pair error bars at the bottom. The rare pairs (quokka, sushi, mars) are where Muon's advantage shows most clearly.
