Demo 3 — MoE Load Balancing Simulator

Training config

Number of experts 8

Top-K routing 2

Tokens per step 32

Bias (expert 0 preference) 2.0

Simulates the initial advantage one expert gets from random weight initialization. In real training, this imbalance self-amplifies: more tokens → more gradient signal → that expert improves more → even more tokens route to it.

Aux loss α 0.01

The balancing coefficient from the Switch Transformer paper: L_aux = α × N × Σ(f_i × p_i). A higher α corrects imbalance more aggressively but risks interfering with the main loss. The Switch Transformer paper uses α = 0.01, which is a common starting point.

Aux loss enabled

Toggle this off, then hit Reset and re-run to see expert collapse in action. Without the auxiliary loss, one or two experts absorb nearly all traffic within a few hundred steps. Toggle it back on to see the correction.

Load-balancing penalty

Noisy gating

Gaussian noise for exploration

Training steps

Tokens routed

—

Aux loss

Collapsed

Token routing distribution per expert

Each bar shows the fraction of total tokens routed to that expert. Ideal = all bars at equal height (1/N each). Teal = healthy range, coral = overloaded (dominant expert), grey = undertrained (barely used).

Load imbalance over training steps

Coefficient of variation of per-expert routing fractions: std_dev ÷ mean. Zero means perfectly balanced. Values above ~0.3 indicate significant collapse. Watch how quickly this rises without the auxiliary loss.
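The imbalance metric described above can be sketched in a few lines. This is a minimal illustration, not the simulator's own code: the function name `load_imbalance` is hypothetical, and the use of the population standard deviation (rather than the sample version) is an assumption.

```python
from statistics import mean, pstdev


def load_imbalance(token_counts):
    """Coefficient of variation (std_dev / mean) of per-expert routing fractions.

    token_counts: number of tokens routed to each expert so far.
    Assumption: population standard deviation; the demo may use another variant.
    """
    total = sum(token_counts)
    if total == 0:
        return 0.0  # nothing routed yet, so report no imbalance
    fractions = [count / total for count in token_counts]
    return pstdev(fractions) / mean(fractions)


if __name__ == "__main__":
    print(load_imbalance([8, 8, 8, 8, 8, 8, 8, 8]))   # evenly spread tokens
    print(load_imbalance([50, 2, 2, 2, 2, 2, 2, 2]))  # one dominant expert
```

Because the fractions always sum to 1, their mean is 1/N, so the metric is just the standard deviation scaled by the number of experts.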

Expert quality report: training effectiveness per expert

Tokens processed per expert since training started. "Overtrained" experts saw a disproportionate share of data and may overfit common patterns. "Undertrained" experts barely improved; they contribute little to model quality and waste parameter budget.

Why expert collapse happens

Early in training, random weight initialization causes some experts to produce slightly better outputs than others. The gating network, following gradient descent, learns to route more tokens to those experts. More tokens → more gradient signal → those experts improve further. It's a self-reinforcing cycle that quickly leads to collapse.

Without intervention, a model with 64 experts might end up with 2–3 experts handling 90%+ of all tokens.
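The feedback loop can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of real gradient dynamics: the function `simulate_collapse`, the `reinforce` step size, and the rule "each expert's gate logit grows in proportion to the tokens it just received" are all hypothetical stand-ins for "more tokens → more gradient signal → more routing". Defaults mirror the demo config (8 experts, top-2, 32 tokens per step, bias 2.0).

```python
import random


def simulate_collapse(num_experts=8, top_k=2, tokens_per_step=32,
                      bias=2.0, steps=200, reinforce=0.05, seed=0):
    """Return each expert's share of all routing assignments after `steps` steps."""
    rng = random.Random(seed)
    logits = [0.0] * num_experts
    logits[0] = bias  # expert 0 starts with an advantage
    counts = [0] * num_experts

    for _ in range(steps):
        step_counts = [0] * num_experts
        for _ in range(tokens_per_step):
            # Per-token variation in gate scores (assumed unit Gaussian).
            noisy = [logit + rng.gauss(0.0, 1.0) for logit in logits]
            ranked = sorted(range(num_experts), key=lambda i: noisy[i], reverse=True)
            for i in ranked[:top_k]:
                step_counts[i] += 1
        for i in range(num_experts):
            # Toy reinforcement: experts that saw more tokens gain gate preference.
            logits[i] += reinforce * step_counts[i] / tokens_per_step
            counts[i] += step_counts[i]

    total = sum(counts)
    return [count / total for count in counts]


if __name__ == "__main__":
    shares = simulate_collapse()
    print([round(share, 3) for share in shares])
```

With top-2 routing, a single expert can hold at most half of all assignments, so a share near 0.5 for expert 0 means it is selected for almost every token. Setting `bias=0.0` and `reinforce=0.0` gives a baseline where shares stay close to 1/N.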

The auxiliary loss fix

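The load-balancing auxiliary loss adds a penalty term proportional to the sum, over all experts, of the product of how often each expert is selected (frequency) and the average routing probability it receives. When popular experts also receive high routing probabilities, both factors are large for the same experts and the penalty grows. Under uniform routing, every fᵢ and pᵢ equals 1/N and the loss reduces to α, which the Switch Transformer paper identifies as its minimum.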

L_aux = α × N × Σᵢ fᵢ × pᵢ
fᵢ = fraction of tokens routed to i
pᵢ = mean routing prob for i
α = balance coefficient (try 0.01)
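The formula above can be computed directly from a batch of router logits. The following is a minimal sketch, assuming top-1 routing as in the Switch Transformer formulation (the demo itself uses top-2, which would change how fᵢ is counted); the function names `softmax` and `aux_loss` and the list-of-lists input format are illustrative choices, not the simulator's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aux_loss(router_logits, alpha=0.01):
    """L_aux = alpha * N * sum_i(f_i * p_i), assuming top-1 routing.

    router_logits: one list of N logits per token.
    f_i: fraction of tokens whose highest-probability expert is i.
    p_i: mean routing probability assigned to expert i.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    f = [0.0] * num_experts
    p = [0.0] * num_experts
    for row in router_logits:
        probs = softmax(row)
        chosen = max(range(num_experts), key=lambda i: probs[i])
        f[chosen] += 1.0 / num_tokens
        for i, prob in enumerate(probs):
            p[i] += prob / num_tokens
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


if __name__ == "__main__":
    # Four tokens, four experts: each token prefers a different expert.
    balanced = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    # Every token strongly prefers expert 0.
    collapsed = [[5.0, 0.0, 0.0, 0.0]] * 4
    print(aux_loss(balanced), aux_loss(collapsed))
```

In the balanced case every fᵢ and pᵢ is 1/4, so the loss equals α; in the collapsed case f₀ is 1 and p₀ is close to 1, so the loss approaches α × N.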