Load Balancing Simulator

Training config
Aux loss enabled (load-balancing penalty): Toggle this off, then hit Reset and re-run to see expert collapse in action. Without the auxiliary loss, one or two experts absorb nearly all traffic within a few hundred steps. Toggle it back on to see the correction.
Noisy gating: Gaussian noise for exploration.
Token routing distribution per expert: Each bar shows the fraction of total tokens routed to that expert. Ideal = all bars at equal height (1/N each). Teal = healthy range, coral = overloaded (dominant expert), grey = undertrained (barely used).
Load imbalance over training steps: The coefficient of variation of the per-expert routing fractions (standard deviation ÷ mean). Zero means perfectly balanced. Values above ~0.3 indicate significant collapse. Watch how quickly this rises without the auxiliary loss.
Expert quality report (training effectiveness per expert): Tokens processed per expert since training started. "Overtrained" experts saw a disproportionate share of the data and may overfit common patterns. "Undertrained" experts barely improved, so they contribute little to model quality and waste parameter budget.
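
The load-imbalance metric described above can be sketched in a few lines. This is a minimal illustration, not the simulator's actual code: the function names are hypothetical, and it assumes the population standard deviation, since the text does not say which variant is used.

```python
from statistics import fmean, pstdev


def routing_fractions(token_counts):
    """Convert per-expert token counts into fractions that sum to 1."""
    total = sum(token_counts)
    if total == 0:
        raise ValueError("no tokens have been routed yet")
    return [count / total for count in token_counts]


def load_imbalance(token_counts):
    """Coefficient of variation of the routing fractions: std dev / mean.

    Assumption: population standard deviation (pstdev), not the sample one.
    """
    fractions = routing_fractions(token_counts)
    return pstdev(fractions) / fmean(fractions)


# Four experts sharing tokens equally versus one expert taking 70%.
balanced = load_imbalance([250, 250, 250, 250])
skewed = load_imbalance([700, 100, 100, 100])
print(f"balanced: {balanced:.3f}, skewed: {skewed:.3f}")
```

Because the fractions always sum to 1, their mean is 1/N, so the metric is simply the standard deviation scaled by the number of experts.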

Why expert collapse happens

Early in training, random weight initialization causes some experts to produce slightly better outputs than others. The gating network, following gradient descent, learns to route more tokens to those experts. More tokens → more gradient signal → those experts improve further. It's a self-reinforcing cycle that quickly leads to collapse.

Without intervention, a model with 64 experts might end up with 2–3 experts handling 90%+ of all tokens.
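
The feedback loop can be illustrated with a deliberately simplified toy model. It is a sketch under strong assumptions, not real gradient descent: each expert has a single router logit, and the hypothetical update rule simply raises each logit in proportion to the share of tokens that expert currently receives. The function names and default values are illustrative only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_collapse(num_experts=8, steps=500, lr=0.5, seed=0):
    """Track the top expert's routing share under a rich-get-richer rule."""
    rng = random.Random(seed)
    # Small random preferences stand in for random weight initialization.
    logits = [rng.gauss(0.0, 0.1) for _ in range(num_experts)]
    top_share_history = []
    for _ in range(steps):
        probs = softmax(logits)
        top_share_history.append(max(probs))
        # Toy feedback rule (an assumption): more traffic means more
        # improvement, which makes the router prefer that expert further.
        logits = [logit + lr * p for logit, p in zip(logits, probs)]
    top_share_history.append(max(softmax(logits)))
    return top_share_history


history = simulate_collapse()
print(f"top expert share: start={history[0]:.3f}, end={history[-1]:.3f}")
```

In this toy rule the gap between any two logits grows by `lr` times the difference in their current shares, so whichever expert starts ahead only pulls further ahead. Nothing in the dynamics pushes back toward balance, which is the gap the auxiliary loss is meant to fill.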

The auxiliary loss fix

The load-balancing auxiliary loss adds a penalty term proportional to the sum, over all experts, of the product of how often each expert is selected (its frequency) and the average routing probability it receives. When these two quantities line up, that is, when popular experts also receive the highest probabilities, the penalty is large. It is smallest when tokens are spread evenly across experts.

L_aux = α × N × Σᵢ fᵢ × pᵢ
N = number of experts
fᵢ = fraction of tokens routed to expert i
pᵢ = mean routing probability for expert i
α = balance coefficient (try 0.01)
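
As a worked example of the formula, the following sketch computes L_aux for one batch. It assumes top-1 routing (each token goes to exactly one expert) and plain Python lists. The function and argument names are hypothetical and not taken from any particular library.

```python
def aux_loss(assignments, router_probs, alpha=0.01):
    """Compute L_aux = alpha * N * sum_i(f_i * p_i) for one batch.

    assignments:  expert index chosen for each token (top-1 routing assumed)
    router_probs: one list of N routing probabilities per token
    """
    num_tokens = len(assignments)
    num_experts = len(router_probs[0])

    # f_i: fraction of tokens routed to expert i.
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    f = [count / num_tokens for count in counts]

    # p_i: mean routing probability assigned to expert i.
    p = [
        sum(row[i] for row in router_probs) / num_tokens
        for i in range(num_experts)
    ]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced batch: four tokens, four experts, uniform probabilities.
balanced = aux_loss([0, 1, 2, 3], [[0.25] * 4 for _ in range(4)])

# Collapsed batch: every token goes to expert 0 with high probability.
collapsed = aux_loss([0, 0, 0, 0], [[0.97, 0.01, 0.01, 0.01] for _ in range(4)])

print(f"balanced: {balanced:.4f}, collapsed: {collapsed:.4f}")
```

With perfectly uniform routing every fᵢ and pᵢ equals 1/N, so the sum is 1/N and the loss reduces to α. In the collapsed batch the single dominant term f₀ × p₀ pushes the loss to nearly N × α.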