
Gating Network Visualizer

Token: Select an input type to see how routing changes. Code tokens strongly activate the syntax expert; math tokens pivot to the math expert. The gating network learns these preferences from patterns in the training data.
Top-K (2): How many experts activate per token. k=1 (Switch Transformer) is maximally efficient but can be unstable. k=2 (Mixtral) is a common production choice: a good balance of compute savings and routing stability.
Noise (0.0): Adds Gaussian noise to gate logits before TopK selection. This "noisy top-k" trick (from Shazeer et al. 2017) encourages the gate to explore different experts during training, preventing it from locking onto the same favorites every step.
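A minimal sketch of the noise step, assuming a single fixed standard deviation like the slider in this demo (Shazeer et al. learn a per-expert noise scale instead). The names `add_gate_noise` and `top_k_indices` and the example logits are illustrative, not taken from the demo's source.

```python
import random

def add_gate_noise(logits, noise_std, rng):
    """Return a copy of the logits with independent Gaussian noise added to each."""
    return [h + rng.gauss(0.0, noise_std) for h in logits]

def top_k_indices(logits, k):
    """Indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]

# Hypothetical gate scores: experts 0 and 1 are nearly tied.
logits = [2.0, 1.9, 0.3, -1.0]
rng = random.Random(0)

# Noise 0.0: the same expert wins every time.
clean_choice = top_k_indices(add_gate_noise(logits, 0.0, rng), k=1)

# Noise 0.5: the winner varies from draw to draw, so more experts get tried.
noisy_choices = [
    top_k_indices(add_gate_noise(logits, 0.5, rng), k=1)[0] for _ in range(200)
]
print(clean_choice, sorted(set(noisy_choices)))
```

With the noise at 0.0 the gate is deterministic; raising it lets near-tied experts trade places, which is the exploration effect described above.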
Input: The token's embedding vector, a dense numerical representation encoding its meaning and context. These same values are fed to both the gating network (to decide routing) and the active experts (to compute on).
Token embedding
def sort(arr):
768-dim vector (simplified)
W_g · x
Gating network: A single weight matrix W_g multiplied by the embedding produces one score (logit) per expert. TopK masks out all but the top k scores; softmax converts the survivors into probabilities that sum to 1. The entire gate adds little compute overhead relative to the experts themselves.
Gate
Linear → TopK → Softmax
H = W_g · x + noise
G = softmax(topK(H, 2))
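The two formulas above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the function name `gate`, the 3-dimensional toy embedding (instead of 768), and the values in `W_g` are made up for the example, and the noise term is left at zero.

```python
import math

def gate(x, W_g, k):
    """Linear -> TopK -> Softmax. Returns (logits, weights) with one entry per expert."""
    # Linear: one logit per expert (one row of W_g per expert).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W_g]
    # TopK: keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the survivors only (shifted by the max for numerical stability).
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    weights = [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]
    return logits, weights

x = [1.0, 0.5, -0.5]          # toy embedding
W_g = [                       # 4 experts x 3 dims, hypothetical values
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]
logits, weights = gate(x, W_g, k=2)
print(logits)   # raw scores H
print(weights)  # routing weights G
```

Only two entries of `weights` are non-zero, and they sum to 1.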
weights
Experts: Only the top-k experts receive the token and run their feed-forward computation. The rest stay completely dormant: no matrix multiplications, no memory bandwidth used. Active experts run in parallel and their outputs are blended by the routing weights.
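To make "dormant" concrete, here is a small sketch in which each expert is a plain function that logs when it is called. The experts are hypothetical stand-ins (they just scale the input) rather than real feed-forward layers, and `make_expert` and `run_active_experts` are names invented for this example. It runs the experts one after another; a real implementation would batch them in parallel.

```python
def make_expert(scale, calls, name):
    """Stand-in for a feed-forward expert: scales the input and records the call."""
    def expert(x):
        calls.append(name)
        return [scale * xi for xi in x]
    return expert

def run_active_experts(x, experts, weights):
    """Call only the experts with a non-zero routing weight; return {index: output}."""
    return {i: experts[i](x) for i, w in enumerate(weights) if w > 0.0}

calls = []
experts = [
    make_expert(scale, calls, f"expert_{i}")
    for i, scale in enumerate([1.0, 2.0, 3.0, 4.0])
]
weights = [0.6, 0.0, 0.4, 0.0]   # hypothetical gate output with k=2

outputs = run_active_experts([1.0, -1.0], experts, weights)
print(calls)    # only the selected experts appear here
print(outputs)
```

Experts 1 and 3 never appear in `calls`: for this token they do no work at all.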

Routing probabilities (raw logits vs. softmax weights): Raw logits are the gate scores before normalization (higher = stronger routing preference). After TopK masking, softmax converts surviving logits to probabilities. Non-selected experts drop to exactly 0% because they're excluded before softmax, not merely down-weighted.
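The difference between "excluded" and "down-weighted" is easy to check numerically. The sketch below assumes masking is done by setting non-selected logits to negative infinity, one common way to implement it; the logit values and the helper name `mask_to_top_k` are made up for illustration.

```python
import math

def softmax(values):
    """Numerically stable softmax; exp(-inf) evaluates to exactly 0.0."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mask_to_top_k(logits, k):
    """Keep the k largest logits and replace the rest with -inf."""
    keep = set(sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k])
    return [h if i in keep else float("-inf") for i, h in enumerate(logits)]

logits = [2.0, 1.0, 0.5, -1.0]               # hypothetical raw gate scores
dense = softmax(logits)                      # every expert gets some weight
sparse = softmax(mask_to_top_k(logits, 2))   # losers get exactly 0.0

for i, (h, d, s) in enumerate(zip(logits, dense, sparse)):
    print(f"expert {i}: logit={h:+.1f}  dense={d:.3f}  top-2={s:.3f}")
```

In the dense column the two losing experts still receive a small probability; in the top-2 column they are exactly zero and the two winners share all of the weight.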

Expert
Weight (softmax)
Logit
Prob
Output formula: The final token representation is a weighted sum, with each active expert's output vector multiplied by its routing probability and then summed. The result is a single vector with the same shape as the input, passed to the next transformer layer.
Output = w₁·E₁(x) + w₂·E₂(x)
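As a worked example of the formula, the sketch below blends two expert output vectors by their routing weights. The vectors and weights are invented numbers (3 dimensions rather than 768), and `combine` is a hypothetical helper name.

```python
def combine(expert_outputs, weights):
    """Weighted sum of expert output vectors: sum over active experts of w_i * E_i(x)."""
    dim = len(next(iter(expert_outputs.values())))
    out = [0.0] * dim
    for i, vec in expert_outputs.items():
        for d, v in enumerate(vec):
            out[d] += weights[i] * v
    return out

# Outputs of the two active experts for one token (hypothetical values).
expert_outputs = {0: [1.0, 2.0, 3.0], 2: [3.0, 0.0, -1.0]}
weights = {0: 0.75, 2: 0.25}

output = combine(expert_outputs, weights)
print(output)  # [1.5, 1.5, 2.0]
```

The result has the same length as each expert's output, so it can be passed on to the next layer unchanged in shape.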