Mixture of Experts: How AI Models Scale Without Going Broke

A visual deep-dive into sparse activation, gating networks, and the architecture powering modern large language models

The core idea

A standard dense neural network activates all of its parameters for every single input. That works fine at small scale, but once you’re dealing with hundreds of billions of parameters, it becomes ruinously expensive — every forward pass touches everything.

Mixture of Experts (MoE) breaks this constraint. Instead of one monolithic network, you build N specialist sub-networks (the “experts”), plus a lightweight gating network that decides which 1 or 2 experts to actually use for each token. The rest stay dormant, saving compute.

The key insight: You can have a model with 256 billion total parameters but only activate ~8 billion for any given token. You get the knowledge capacity of a huge model at the inference cost of a much smaller one.

A single MoE layer forward pass. The gating network scores all experts but only the top-k=2 fire. Dormant experts incur zero compute while still contributing to total model capacity.

Token x: The input token's embedding vector, a dense numerical representation of its meaning. This same vector is fed to both the gating network (to decide which experts activate) and the active experts themselves (to compute on).
Gating Network: A single learned weight matrix W_g maps the token embedding to N scores (logits), one per expert. TopK keeps only the k highest; softmax converts them to routing probabilities that sum to 1. The gate adds negligible compute overhead.
Active experts: Only the top-k experts (teal border) receive the token and run their full feed-forward computation. The routing weights (0.72, 0.28) show each expert's contribution to the final output. Dormant experts contribute exactly zero for this token.
Output y: A weighted sum of the active experts' output vectors: y = 0.72·E1(x) + 0.28·E4(x). Same shape as the input, so the next transformer layer receives this as if it came from a single network, with no knowledge of the routing that produced it.

Demo 1: The Gating Network

To make routing concrete, we built a minimal MoE layer that runs the actual gating math live in the browser. It implements a weight matrix W_g that maps five token types to per-expert logit scores, applies TopK masking, and normalizes with softmax — recomputed on every interaction with no precomputed states. The five token types (code, math, language, logic, factual) have distinct logit profiles that reflect what a trained gate would learn from real data. The noise slider adds Gaussian jitter before TopK selection, simulating the noisy top-k trick used during training to prevent the gate from locking onto the same experts every step.

Feed different token types (code, math, language, logic, factual) into a simplified MoE layer and watch the routing weights update live. Notice how def sort(arr): reliably activates the code expert with ~72% weight, while ∫ f(x) dx pivots strongly toward the math expert.

How it works, step by step

1. Token arrives as an embedding

Each input token is converted to a high-dimensional vector. This vector carries the token’s meaning and context from previous layers — it’s the same embedding used in any transformer. What comes next is what makes MoE different.

2. The gating network scores each expert

A small linear layer W_g is multiplied by the token embedding to produce one logit per expert. If you have 64 experts, you get a vector of 64 scores. The gating network is tiny — just one weight matrix — so it adds negligible overhead.
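The scoring step is a single matrix-vector product. Here is a minimal sketch in plain Python, assuming a hypothetical 4-expert gate over a 3-dimensional embedding; the values in `W_G` and `x` are made up for illustration, whereas a real gate learns them during training.

```python
def gate_logits(w_g, x):
    """Return one logit per expert: row i of w_g dotted with the embedding x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_g]

# Hypothetical toy sizes: 4 experts (rows), embedding dimension 3 (columns).
W_G = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
x = [2.0, -1.0, 0.5]  # made-up token embedding

logits = gate_logits(W_G, x)
print(logits)  # [2.0, -1.0, 0.5, 0.5], one score per expert
```

The cost is N × d multiply-adds for N experts and embedding size d, which is small next to the feed-forward computation inside even one expert.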

3. TopK selects the winners

Only the top-k logit positions are kept (typically k=1 or k=2). All other logits are set to −∞ before the softmax step. This is the sparsity mechanism: a hard selection that routes the token to just a few experts and skips the rest entirely.

4. Softmax converts logits to weights

The remaining k logits pass through softmax to produce routing probabilities that sum to 1.0. If k=2 and the resulting weights are 0.78 and 0.22, the token will be processed by both experts with those proportional contributions.
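Steps 3 and 4 together amount to a masked softmax. The sketch below assumes four hypothetical gate scores, chosen so that the two survivors land near the 0.78 / 0.22 split used in this walkthrough; `topk_softmax` is an illustrative helper name, not a library function.

```python
import math

def topk_softmax(logits, k):
    """Mask all but the k largest logits to -inf, then apply softmax."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    masked = [logits[i] if i in keep else -math.inf for i in range(len(logits))]
    peak = max(masked)                            # subtract the max for stability
    exps = [math.exp(v - peak) for v in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, 0.5, 0.734]  # hypothetical scores for 4 experts
weights = topk_softmax(logits, k=2)
print([round(w, 2) for w in weights])  # [0.78, 0.0, 0.0, 0.22]
```

Because the masked positions become exactly zero after the exponential, the dropped experts get a routing weight of zero rather than a merely small one.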

5. Experts compute in parallel

The selected experts — each a full feed-forward network, typically a large MLP — process the token representation independently and in parallel. On real hardware this dispatch is done via all-to-all communication across GPU nodes.

6. Weighted sum produces the output

Expert outputs are multiplied by their routing weights and summed:

Output = 0.78 · E₁(x) + 0.22 · E₂(x)

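The result is a single vector with the same shape as the input, ready for the next transformer layer. The hard TopK selection is not itself differentiable, but everything downstream of it is: gradients flow through the routing weights of the selected experts (and so into W_g) and through both active experts. Experts that were not selected receive no gradient for that token.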
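The combination step can be sketched as follows. The experts here are hypothetical stand-ins that just scale their input (real experts are feed-forward networks), and the routing weights are passed in as a given rather than computed, so the example isolates the weighted sum.

```python
def combine(x, experts, routing):
    """Weighted sum of the active experts' outputs.

    routing maps expert index -> routing weight. Experts that are
    absent from routing are never called, so they cost nothing.
    """
    y = [0.0] * len(x)
    for idx, weight in routing.items():
        out = experts[idx](x)
        y = [yi + weight * oi for yi, oi in zip(y, out)]
    return y

calls = []  # records which experts actually ran

def make_expert(scale, name):
    """Hypothetical toy expert: multiplies every component by scale."""
    def expert(x):
        calls.append(name)
        return [scale * v for v in x]
    return expert

experts = [make_expert(s, f"E{i + 1}") for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
x = [1.0, -2.0]
y = combine(x, experts, {0: 0.78, 1: 0.22})  # 0.78·E1(x) + 0.22·E2(x)
print(y, calls)
```

Only `E1` and `E2` appear in `calls`; the other two experts exist in the list but are never invoked for this token.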

The math

The gating function is elegantly simple. For an input x and expert networks E₁…Eₙ:

\[G(x) = \text{Softmax}\!\left(\text{TopK}(W_g \cdot x,\ k)\right)\]

\[\text{Output} = \sum_i G(x)_i \cdot E_i(x)\]

Where W_g is the learned gate weight matrix, TopK selects only the k highest logits (setting the rest to −∞), and the output is a weighted sum of only the active experts.

Some implementations add Gaussian noise before TopK to encourage exploration during training:

\[H(x)_i = (W_g \cdot x)_i + \mathcal{N}(0,1) \cdot \text{Softplus}\!\left((W_\text{noise} \cdot x)_i\right)\]

This noise prevents the gating network from converging too quickly on a fixed set of favorites.
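A minimal sketch of the noisy logits H(x) from the formula above, in plain Python. The matrices and the embedding are made-up toy values, and `noisy_logits` is an illustrative name; the point is only that each expert gets its own learned noise scale through Softplus.

```python
import math
import random

def softplus(v):
    """log(1 + e^v): always positive, so it can act as a noise scale."""
    return math.log1p(math.exp(v))

def noisy_logits(w_g, w_noise, x, rng):
    """H(x)_i = (W_g·x)_i + N(0,1) · Softplus((W_noise·x)_i)."""
    def matvec(w):
        return [sum(a * b for a, b in zip(row, x)) for row in w]
    clean = matvec(w_g)
    scale = [softplus(v) for v in matvec(w_noise)]
    return [c + rng.gauss(0.0, 1.0) * s for c, s in zip(clean, scale)]

# Hypothetical toy values: 3 experts, embedding dimension 2.
W_G = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
W_NOISE = [[0.0, 0.0]] * 3     # Softplus(0) = ln 2, so a moderate noise scale
W_QUIET = [[-50.0, 0.0]] * 3   # Softplus(-50) is about 2e-22, so almost no noise
x = [1.0, 0.5]

noisy = noisy_logits(W_G, W_NOISE, x, random.Random(0))
quiet = noisy_logits(W_G, W_QUIET, x, random.Random(0))
print(noisy, quiet)
```

With `W_QUIET` the result is numerically indistinguishable from the clean logits [1.0, 1.0, 1.5], which illustrates how a trained gate could turn the noise down for inputs where it is confident.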

Demo 2: Sparse Activation Explorer

To show how the efficiency math plays out at scale, we built a configurable expert pool that models any combination of N total experts and k active per token. It computes the exact activation rate, FLOP savings, and capacity-to-compute ratio from first principles — no approximations. The expert grid renders the full pool as a cell array and lights up whichever experts fire for each token. Hit “Animate token stream” to watch a sequence of tokens route through the layer: you’ll see a different sparse subset activate each time, distributed across the pool rather than concentrating on the same cells.

Drag the top-k slider from 1 to 16 across a 64-expert pool. Watch how the activation rate, compute savings, and effective capacity ratio change. At Top-2 you’re activating just 3.1% of experts (2 of 64), which means 96.9% less expert compute than running all 64 experts for every token.
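The numbers behind the slider are simple ratios. This sketch assumes all experts are the same size and counts only expert compute, ignoring the gate and the attention layers; `sparsity_stats` is an illustrative helper name.

```python
def sparsity_stats(n_experts, k):
    """Activation rate, expert compute saved, and capacity-to-compute ratio."""
    rate = k / n_experts
    return {
        "activation_rate": rate,
        "expert_compute_saved": 1.0 - rate,
        "capacity_to_compute": n_experts / k,
    }

for k in (1, 2, 4, 8, 16):
    s = sparsity_stats(64, k)
    print(f"top-{k:>2}: active {s['activation_rate']:.1%}, "
          f"saved {s['expert_compute_saved']:.1%}, "
          f"capacity/compute {s['capacity_to_compute']:.0f}x")
```

For Top-2 over 64 experts this gives an activation rate of 0.03125, which the text rounds to 3.1%.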

The load balancing problem

Here’s the catch: if the gating network learns to always route to the same 2–3 “popular” experts, the others never get trained. You end up with a model that’s effectively much smaller than its parameter count suggests. This is called expert collapse.

The fix is an auxiliary load-balancing loss added to the training objective. It penalizes routing distributions where some experts receive many more tokens than others, nudging the gating network toward even coverage.

Google’s Switch Transformer defines the load balancing loss as:

\[\mathcal{L}_\text{aux} = \alpha \cdot N \cdot \sum_i f_i \cdot p_i\]

Where $f_i$ is the fraction of tokens dispatched to expert $i$, $p_i$ is the average routing probability for expert $i$, $N$ is the number of experts, and $\alpha$ is a small hyperparameter (typically 0.01).
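To see how the loss responds to imbalance, here is a small sketch that evaluates the formula on two hand-made batches, assuming Top-1 routing and 4 experts. The batches are invented for illustration, and `switch_aux_loss` is a hypothetical helper that follows the definition above rather than any library's API.

```python
def switch_aux_loss(assignments, probs, alpha=0.01):
    """alpha * N * sum_i f_i * p_i.

    assignments: the expert index chosen for each token (Top-1 routing)
    probs: per-token routing probabilities over all N experts
    """
    n = len(probs[0])
    t = len(assignments)
    f = [assignments.count(i) / t for i in range(n)]       # token fractions
    p = [sum(row[i] for row in probs) / t for i in range(n)]  # mean probabilities
    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))

# Made-up balanced batch: one token per expert, uniform probabilities.
balanced = switch_aux_loss([0, 1, 2, 3], [[0.25] * 4] * 4)

# Made-up collapsed batch: every token goes to expert 0 with high confidence.
collapsed = switch_aux_loss([0, 0, 0, 0], [[0.97, 0.01, 0.01, 0.01]] * 4)

print(balanced, collapsed)  # about 0.01 versus about 0.0388
```

Under perfectly uniform routing the sum equals 1/N, so the loss reduces to α. Any concentration of both tokens and probability on the same expert pushes it higher, which is the pressure the gate feels during training.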

Demo 3: Load Balancing Simulator

To make collapse tangible, we built a training simulator that runs the actual feedback loop. At each step, a batch of tokens is routed to experts via softmax(logits); experts that receive more tokens accumulate stronger gradient signal, nudging their logit scores upward and attracting even more tokens next step. The auxiliary loss counters this by penalizing the product of routing frequency and routing probability per expert — exactly the Switch Transformer formulation. The simulation runs at roughly 8 steps per second, fast enough to watch collapse develop within seconds of disabling the loss, and to see the correction kick in when you re-enable it. The expert quality cards at the bottom track cumulative token counts, so the training history is visible even after you pause.

Toggle the auxiliary loss on and off during a simulated training run. Without it, two or three experts quickly absorb nearly all the routing traffic. With it, the distribution stays healthy. You can also adjust the loss coefficient α to see how aggressively it corrects imbalance.

Sparse vs. dense: the tradeoffs

Dimension	Dense model	MoE model
Compute
Training cost	✓ Lower for same active parameter count	↑ Higher: all experts need gradient signal
Inference compute	Full params touched every token	✓ Only active experts — 2–8× cheaper
Memory
VRAM footprint	✓ Matches active parameter count	↑ Must load all N experts into VRAM
Routing
Routing overhead	— None	≈ Tiny — gating matrix only
Expert collapse	— Not applicable	⚠ Real risk without load-balancing loss
Hardware
Distributed complexity	✓ Simple all-reduce	↑ Expert parallelism + all-to-all comms

The practical upshot: MoE models are memory-hungry but compute-efficient. They’re best suited for high-throughput inference environments where you have lots of VRAM but want fast per-token latency.
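A back-of-the-envelope sketch of that memory versus compute gap, using the 256B total / 8B active example from earlier in this article. It assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations, KV cache, and optimizer state.

```python
BYTES_PER_PARAM = 2  # assumption: fp16 or bf16 weights

def weight_memory_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Memory needed to hold the weights, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

total_params = 256e9   # every expert must be resident in memory
active_params = 8e9    # parameters actually used for one token

resident = weight_memory_gb(total_params)
touched = weight_memory_gb(active_params)
print(f"resident: {resident:.0f} GB, touched per token: {touched:.0f} GB")
```

Under these assumptions the model needs 512 GB just to hold its weights while each token exercises about 16 GB of them, a 32× gap between what you pay in memory and what you pay in compute.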

Expert parallelism: the hardware reality

The “expert parallelism + all-to-all comms” row in the table above is where MoE earns its reputation for being hard to serve. In practice, a model with 64 experts can’t fit all of them on a single GPU, so experts are sharded across a group of GPUs — each device owns a subset. When a batch of tokens arrives, the gating network decides which expert each token goes to, then the system must physically move each token to the GPU that owns its assigned expert. This is the all-to-all step: every GPU sends some tokens to every other GPU before computation can begin, and then the results come back the same way.
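The bookkeeping behind the all-to-all step can be sketched without any GPUs. This toy planner assumes contiguous sharding (device d owns experts d·E through (d+1)·E − 1) and Top-1 routing, and it only builds one device's send lists; a real system would hand those lists to a collective communication primitive. The function name and the token assignments are hypothetical.

```python
from collections import defaultdict

def plan_dispatch(token_experts, experts_per_device):
    """Group token indices by the device that owns their assigned expert."""
    outbox = defaultdict(list)
    for token, expert in enumerate(token_experts):
        outbox[expert // experts_per_device].append(token)
    return dict(outbox)

# Made-up Top-1 assignments for 6 tokens over 64 experts on 4 devices.
token_experts = [5, 0, 63, 17, 5, 40]
plan = plan_dispatch(token_experts, experts_per_device=16)

for device in sorted(plan):
    print(f"device {device} receives tokens {plan[device]}")
```

Even in this tiny batch, tokens fan out to all four devices, and the split is uneven (three tokens for device 0, one each for the others). That unevenness is where routing imbalance turns directly into communication and compute imbalance.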

That communication cost is the main reason you don’t just swap a dense model for a MoE of the same active-parameter count and call it free. At scale, all-to-all across hundreds of GPUs becomes a meaningful fraction of step time. That is why DeepSeek-V3 describes minimizing it as a core design constraint, and why most MoE deployments rely on specialized parallelism libraries (MegaBlocks, DeepSpeed-MoE, or custom kernels) rather than off-the-shelf training loops. The routing problem and the communication problem are inseparable once you leave a single device.

Where you’ll find MoE in the wild

Switch Transformer (Google, 2021): The paper that showed MoE could scale past a trillion parameters (1.6T in its largest configuration). Used Top-1 routing for simplicity and showed that even with the instability it introduces, the efficiency gains were worth it.

Mixtral 8×7B (Mistral AI, 2023): The most prominent open-weight MoE model. 8 experts per layer, Top-2 routing, ~13B active parameters out of ~47B total. Performance competitive with Llama 2 70B, a dense model with roughly 5× as many active parameters.

GPT-4 (OpenAI, 2023): Widely believed to use MoE architecture based on leaks and inference from its efficiency characteristics. OpenAI has never confirmed the architecture details.

DeepSeek-V3 (DeepSeek, 2024): Uses fine-grained expert segmentation and an auxiliary-loss-free load balancing strategy. 671B total parameters, 37B active per token, trained at remarkably low cost (a reported 2.788M H800 GPU hours).

Key takeaways

MoE decouples capacity from compute. Total parameters determine what the model knows; active parameters determine how much it thinks per token.
The gating network is the heart of the system. A small linear layer + TopK + softmax is all it takes to route tokens to specialists. The elegance is in the simplicity.
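Load balancing is the central training challenge. Expert collapse is real and subtle, and some balancing mechanism is not optional in practice, whether that is the auxiliary loss or an alternative such as DeepSeek-V3’s auxiliary-loss-free strategy.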
Memory is the bottleneck, not compute. All experts must be loaded into VRAM even though most are idle at any moment. This shapes what hardware MoE models run well on.
MoE is increasingly the default at scale. The efficiency gains are too significant to ignore once model size crosses ~10B parameters. Expect it everywhere.

Harsh Agarwal