Early in training, random weight initialization causes some experts to produce slightly better outputs than others. The gating network, following gradient descent, learns to route more tokens to those experts. More tokens → more gradient signal → those experts improve further. It's a self-reinforcing cycle that quickly leads to routing collapse: a handful of experts receive nearly all tokens while the rest sit idle and barely train.
Without intervention, a model with 64 experts might end up with 2–3 experts handling 90%+ of all tokens.
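The feedback loop described above can be illustrated with a toy simulation. This is not a real training loop: the update rule (an expert's router logit grows in proportion to how far its token share sits above the uniform share) is an assumption made purely for illustration, and `simulate_routing_collapse`, `lr`, and `steps` are hypothetical names and values.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability.
    z = np.exp(x - x.max())
    return z / z.sum()

def simulate_routing_collapse(num_experts=8, steps=500, lr=0.5, seed=0):
    """Toy model of the rich-get-richer loop; returns (initial, final) token shares."""
    rng = np.random.default_rng(seed)
    # Tiny random logits stand in for random weight initialization.
    logits = rng.normal(scale=0.01, size=num_experts)
    initial_share = softmax(logits)
    for _ in range(steps):
        share = softmax(logits)
        # Assumed rule: experts with an above-average share of tokens get a
        # boost, experts with a below-average share lose ground.
        logits += lr * (share - 1.0 / num_experts)
    return initial_share, softmax(logits)

initial, final = simulate_routing_collapse()
print("largest share at start:", initial.max())
print("largest share at end:  ", final.max())
```

Under this assumed rule, any small initial advantage compounds, so the shares start close to uniform and end with one expert taking almost everything. Real routers collapse less cleanly, but the direction of the dynamics is the same.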
The load-balancing auxiliary loss adds a penalty term proportional to the sum, over experts, of the product of two quantities: the fraction of tokens routed to each expert (its frequency) and the average routing probability that expert receives. When popular experts also receive high probabilities, the two quantities rise together and the penalty is large; it is smallest when both tokens and probability mass are spread evenly across experts.
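A minimal NumPy sketch of such a penalty follows, assuming top-1 routing and a scaling of `alpha * num_experts`, which is one common formulation rather than the only one. The function name `load_balancing_loss` and the default `alpha=0.01` are illustrative assumptions, not taken from the text above.

```python
import numpy as np

def load_balancing_loss(router_logits, alpha=0.01):
    """Auxiliary loss for a (num_tokens, num_experts) array of router logits."""
    num_tokens, num_experts = router_logits.shape

    # Row-wise softmax gives each token's routing probabilities.
    shifted = router_logits - router_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    # Frequency: fraction of tokens whose top-1 choice is each expert.
    chosen = probs.argmax(axis=1)
    frequency = np.bincount(chosen, minlength=num_experts) / num_tokens

    # Average routing probability each expert receives across tokens.
    mean_prob = probs.mean(axis=0)

    # Sum over experts of frequency * mean probability, scaled (by assumption)
    # so that perfectly uniform routing yields exactly alpha.
    return alpha * num_experts * float(np.sum(frequency * mean_prob))

# Balanced routing: 8 tokens spread evenly over 4 experts.
balanced = np.tile(10.0 * np.eye(4), (2, 1))
# Collapsed routing: every token strongly prefers expert 0.
collapsed = np.zeros((8, 4))
collapsed[:, 0] = 10.0

print("balanced: ", load_balancing_loss(balanced))
print("collapsed:", load_balancing_loss(collapsed))
```

With this scaling the collapsed case scores close to `num_experts` times higher than the balanced one. In an autodiff framework the frequency term comes from a hard argmax and carries no gradient, so the learning signal would flow through the average-probability term.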