
Load Balancing Simulator

[Interactive simulator. Training config: toggles for aux loss (load-balancing penalty) and noisy gating (Gaussian noise for exploration). Readouts: training steps, tokens routed, aux loss, collapsed. Charts: token routing distribution per expert (with balance score), load imbalance over training steps, and expert quality report (training effectiveness per expert).]

Why expert collapse happens

Early in training, random weight initialization causes some experts to produce slightly better outputs than others. The gating network, following gradient descent, learns to route more tokens to those experts. More tokens → more gradient signal → those experts improve further. It's a self-reinforcing cycle that quickly leads to collapse.

Without intervention, a model with 64 experts might end up with 2–3 experts handling 90%+ of all tokens.
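
The feedback loop can be seen in a few lines of code. The sketch below is a toy, not real gradient descent: the function name `simulate_collapse`, the update rule, and all parameter values are assumptions made for illustration. Each expert's gate logit is nudged up when it receives more than its fair share of tokens and down otherwise, which stands in for "more tokens, more gradient signal".

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    z = np.exp(x - x.max())
    return z / z.sum()

def simulate_collapse(num_experts=8, steps=300, tokens_per_step=512,
                      lr=0.5, seed=0):
    """Toy rich-get-richer loop (hypothetical update rule, no real model)."""
    rng = np.random.default_rng(seed)
    # Small random differences stand in for random weight initialization.
    logits = rng.normal(0.0, 0.1, size=num_experts)
    for _ in range(steps):
        probs = softmax(logits)
        # Route each token to one expert, sampled from the gate's distribution.
        counts = rng.multinomial(tokens_per_step, probs)
        share = counts / tokens_per_step
        # Experts above their fair share (1/N) gain logit mass; the rest lose it.
        logits += lr * (share - 1.0 / num_experts)
    # Final routing probabilities after the loop has run.
    return softmax(logits)
```

Calling `simulate_collapse()` and printing the result should show most of the probability mass concentrated on very few experts, even though the starting logits were nearly identical. In this simplified version a single expert tends to win outright; real models are messier, but the direction of the drift is the same.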

The auxiliary loss fix

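The load-balancing auxiliary loss adds a penalty term proportional to the sum, over experts, of the product of how often each expert is selected (frequency) and the average routing probability it receives. When these two quantities are highly correlated (i.e., popular experts also receive the highest probabilities), the penalty is large. Under perfectly uniform routing across N experts, fᵢ = pᵢ = 1/N for every expert, so the sum equals 1/N and the penalty reduces to α; if one expert takes every token with probability near 1, the penalty approaches α × N.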

L_aux = α × N × Σᵢ (fᵢ × pᵢ)
N = number of experts
fᵢ = fraction of tokens routed to expert i
pᵢ = mean routing probability for expert i
α = balance coefficient (try 0.01)
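
As a rough illustration, the formula can be computed directly from a batch of router logits. This is a minimal NumPy sketch, assuming top-1 routing (each token goes to its highest-probability expert) and a softmax gate; the function name `load_balancing_loss` and the input layout are hypothetical, not taken from the simulator.

```python
import numpy as np

def load_balancing_loss(router_logits, alpha=0.01):
    """Compute alpha * N * sum_i f_i * p_i, assuming top-1 routing.

    router_logits: array of shape (num_tokens, num_experts).
    """
    num_tokens, num_experts = router_logits.shape
    # Softmax over experts gives each token's routing probabilities.
    shifted = router_logits - router_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    # f_i: fraction of tokens whose top-1 expert is i.
    assignments = probs.argmax(axis=1)
    f = np.bincount(assignments, minlength=num_experts) / num_tokens
    # p_i: mean routing probability given to expert i across the batch.
    p = probs.mean(axis=0)
    return alpha * num_experts * float(np.dot(f, p))

# Balanced batch: token t prefers expert t mod 4, so every expert gets 1/4.
balanced = np.tile(5.0 * np.eye(4), (2, 1))
# Collapsed batch: every token strongly prefers expert 0.
collapsed = np.zeros((8, 4))
collapsed[:, 0] = 10.0
```

With 4 experts and α = 0.01, `load_balancing_loss(balanced)` comes out at α (0.01), while `load_balancing_loss(collapsed)` comes out just under α × N (0.04). In an actual training setup this would be written in an autodiff framework; fᵢ comes from an argmax and carries no gradient, so the gradient would flow only through pᵢ.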