Optimizers 101: Visualizing SGD, Adam, and Muon — and Building the Toy Problem Muon Actually Needs
I recently went down an optimizer rabbit hole. It started innocently: I wanted to build one of those classic Alec Radford-style contour visualizations where you watch SGD, Adam, and friends race toward a minimum. By the end, I’d built two completely different demos, read a dozen papers, and learned that Muon, the most important new optimizer in deep learning, cannot show its advantage on the toy problems we’ve been using for a decade.
