Demo 3 — MoE Load Balancing Simulator

Training config

Number of experts: 8
Top-K routing: 2
Tokens per step: 32
Bias (expert 0 preference): 2.0
Aux loss α: 0.01
Aux loss enabled (toggle): load-balancing penalty
Noisy gating (toggle): Gaussian noise for exploration

Readouts: training steps, tokens routed, aux loss, collapsed

Chart: Token routing distribution per expert (with balance score)

Chart: Load imbalance over training steps

Why expert collapse happens

Early in training, random weight initialization causes some experts to produce slightly better outputs than others. The gating network, following gradient descent, learns to route more tokens to those experts. More tokens → more gradient signal → those experts improve further. It's a self-reinforcing cycle that quickly leads to collapse.

Without intervention, a model with 64 experts might end up with 2–3 experts handling 90%+ of all tokens.
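
The feedback loop described above can be sketched as a toy simulation. This is a minimal sketch, not the demo's actual code: the function name `simulate_collapse`, the `preference` vector, the learning rate `lr`, and the update rule (boost experts that receive more than a uniform share) are all illustrative assumptions standing in for real gradient descent. The defaults mirror the demo's config (8 experts, top-2 routing, 32 tokens per step, a bias of 2.0 on expert 0).

```python
import random


def simulate_collapse(num_experts=8, top_k=2, tokens_per_step=32,
                      bias=2.0, steps=50, lr=0.5, seed=0):
    """Toy rich-get-richer routing loop (hypothetical, for illustration only).

    Returns a list with one entry per step; each entry is the fraction of
    routing slots each expert received in that step.
    """
    rng = random.Random(seed)
    # Shared routing preference per expert; expert 0 starts with a head start.
    preference = [0.0] * num_experts
    preference[0] = bias
    uniform_share = 1.0 / num_experts
    history = []
    for _ in range(steps):
        counts = [0] * num_experts
        for _ in range(tokens_per_step):
            # Per-token logits: shared preference plus token-specific noise.
            logits = [p + rng.gauss(0.0, 1.0) for p in preference]
            ranked = sorted(range(num_experts),
                            key=lambda i: logits[i], reverse=True)
            for i in ranked[:top_k]:
                counts[i] += 1
        total = sum(counts)  # tokens_per_step * top_k routing slots
        load = [c / total for c in counts]
        history.append(load)
        # Assumed stand-in for "more tokens -> more gradient signal":
        # experts above a uniform share become more preferred, others less.
        for i in range(num_experts):
            preference[i] += lr * (load[i] - uniform_share)
    return history


history = simulate_collapse()
print("first step load:", [round(x, 2) for x in history[0]])
print("last step load: ", [round(x, 2) for x in history[-1]])
```

Comparing the first and last rows shows how an initial head start is amplified. With top-2 routing a single expert can hold at most half of the routing slots, so collapse here means the load concentrating on very few experts rather than on exactly one.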

The auxiliary loss fix

The load-balancing auxiliary loss adds a penalty proportional to the sum, over all experts, of the product of two quantities: how often the expert is selected (its frequency) and the average routing probability it receives. When the two line up, i.e., popular experts also receive the highest probabilities, the penalty is large. Because selection frequency follows routing probability, the penalty is lowest when tokens are spread evenly across experts.

L_aux = α × N × Σᵢ fᵢ × pᵢ
fᵢ = fraction of tokens routed to i
pᵢ = mean routing prob for i
α = balance coefficient (try 0.01)
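
The formula above can be computed directly from a batch of routing decisions. The following is a minimal sketch under stated assumptions: the function name `aux_loss` and its argument layout (one probability list and one list of chosen expert indices per token) are hypothetical, and fᵢ is taken literally as the fraction of tokens routed to expert i, so with top-K routing the fᵢ values sum to K rather than 1.

```python
def aux_loss(router_probs, assignments, num_experts, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * p_i (hypothetical helper).

    router_probs: per-token list of routing probabilities, one per expert.
    assignments:  per-token list of the expert indices the token was sent to.
    """
    num_tokens = len(router_probs)
    # f_i: fraction of tokens routed to expert i.
    counts = [0] * num_experts
    for chosen in assignments:
        for i in chosen:
            counts[i] += 1
    f = [c / num_tokens for c in counts]
    # p_i: mean routing probability assigned to expert i.
    p = [sum(tok[i] for tok in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Top-1 routing, 4 experts, 4 tokens.
balanced = aux_loss([[0.25] * 4 for _ in range(4)],
                    [[0], [1], [2], [3]], num_experts=4)
collapsed = aux_loss([[1.0, 0.0, 0.0, 0.0] for _ in range(4)],
                     [[0], [0], [0], [0]], num_experts=4)
print(balanced)   # about 0.01, i.e. alpha
print(collapsed)  # about 0.04, i.e. alpha * N
```

In this top-1 example, perfectly uniform routing gives a loss equal to α, while sending every token to one expert with probability 1 gives α × N, so the penalty grows with the number of experts being ignored.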