Demo 2 — MoE Sparse Activation Explorer

Configuration

Total experts (N): 64
The total number of expert FFNs in this layer. All N experts must be loaded into GPU memory, but only k run per token. Real models range from 8 experts per layer (e.g., Mixtral 8x7B) to 256 or more.

Active per token (k): 2
How many experts compute per token. This directly sets the per-token FLOPs: k expert FFNs run regardless of N. The ratio k/N is the activation rate; a lower rate means more compute saved relative to running all N experts, but a smaller fraction of the layer's parameters applied to each token.

Live metrics
These four numbers capture the core MoE efficiency tradeoff. Activation rate = k/N, and FLOPs saved = 1 - k/N (relative to running all N experts). Capacity/compute ratio = N/k. Memory footprint stays fixed at N experts regardless of k.
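The four metrics follow directly from N and k, so they can be sketched in a few lines. The function name `moe_metrics` and the dictionary keys are illustrative, not taken from the demo, and "FLOPs saved" is assumed here to be measured against running all N experts for every token.

```python
def moe_metrics(n_experts: int, k_active: int) -> dict:
    """Compute the four headline MoE metrics from N and k."""
    if not 1 <= k_active <= n_experts:
        raise ValueError("require 1 <= k <= N")
    activation_rate = k_active / n_experts
    return {
        # Fraction of experts that run for each token.
        "activation_rate": activation_rate,
        # Assumption: savings are relative to running all N experts.
        "flops_saved": 1.0 - activation_rate,
        # Parameters held per unit of per-token compute.
        "capacity_compute_ratio": n_experts / k_active,
        # Every expert stays resident, whatever k is.
        "experts_in_memory": n_experts,
    }


metrics = moe_metrics(64, 2)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

With the default configuration (N = 64, k = 2), the activation rate is 2/64 = 3.125%, so 96.875% of the expert FLOPs are skipped while all 64 experts remain in memory.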

Expert pool: active experts highlighted per token
Each cell is one expert. Purple = active for this token; grey = idle. Every token independently routes to its own top-k subset, so the grid updates with each step of the animation. Click any cell to see its index.

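Token routing simulation
Each incoming token is independently routed to k experts by the gating network. Different tokens typically activate different experts, so no single expert sees every token. When experts are sharded across GPUs (expert parallelism), each token must be sent to whichever devices hold its selected experts and the outputs sent back, which is why all-to-all communication is needed on real multi-GPU hardware.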
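A minimal sketch of per-token top-k routing, using only the standard library. The random logits are a stand-in for a learned gating network, and normalizing with a softmax over just the selected logits is one common variant, assumed here rather than taken from the demo. The names `route_token` and `simulate` are hypothetical.

```python
import math
import random


def route_token(logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    # Indices of the k largest gate logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (assumed variant),
    # shifted by the max for numerical stability.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def simulate(num_tokens, n_experts, k, seed=0):
    """Route each token independently; logits are random placeholders."""
    rng = random.Random(seed)
    routes = []
    for _ in range(num_tokens):
        logits = [rng.gauss(0.0, 1.0) for _ in range(n_experts)]
        routes.append(route_token(logits, k))
    return routes


for token_id, route in enumerate(simulate(num_tokens=5, n_experts=64, k=2)):
    experts = [index for index, _ in route]
    print(f"token {token_id} -> experts {experts}")
```

Because each token draws its own logits, consecutive tokens usually land on different expert pairs, which is the behavior the animation shows.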

Compute & memory comparison vs. equivalent dense model
Compute (FLOPs) scales with k, because only active experts run. Memory stays fixed at N, because all experts must be loaded into VRAM. This asymmetry is the central MoE hardware constraint: you pay the full memory cost to get a partial compute cost.
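The asymmetry can be made concrete with a back-of-the-envelope calculation. This sketch rests on several assumptions that are not from the demo: each expert is a two-matrix FFN with biases ignored, one multiply-add counts as 2 FLOPs, router cost is ignored, and the "equivalent dense model" is taken to be a dense layer with the same total FFN parameters. The dimensions `d_model` and `d_ff` are hypothetical placeholders.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters in one expert: up- and down-projection, biases ignored."""
    return 2 * d_model * d_ff


def compare(n_experts: int, k: int, d_model: int = 1024, d_ff: int = 4096,
            bytes_per_param: int = 2) -> dict:
    """Per-layer memory and per-token FLOPs for MoE vs. a same-size dense FFN."""
    per_expert = ffn_params(d_model, d_ff)
    flops_per_expert = 2 * per_expert  # one multiply and one add per weight
    return {
        # All N experts are resident, so memory does not depend on k.
        "memory_bytes": n_experts * per_expert * bytes_per_param,
        # Only the k selected experts run for a given token.
        "moe_flops_per_token": k * flops_per_expert,
        # Assumed dense baseline: every parameter is used for every token.
        "dense_flops_per_token": n_experts * flops_per_expert,
    }


result = compare(n_experts=64, k=2)
ratio = result["dense_flops_per_token"] / result["moe_flops_per_token"]
print(f"memory: {result['memory_bytes'] / 2**30:.2f} GiB")
print(f"dense/MoE FLOPs ratio: {ratio:.0f}x")
```

Changing `k` moves only the MoE FLOPs figure; `memory_bytes` changes only when `n_experts` or the expert size does.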