Demo 1 — MoE Gating Network Visualizer

Input token

💻

def sort(arr):

Code / syntax

∫

∫ f(x) dx

Mathematical

🌐

Bonjour

Multilingual

🔗

If A then B

Logical

📖

The capital of

Factual / entity

Routing config

Top-K experts activated: 2

Noise temperature: 0.0

What am I seeing?
The gating network is a single linear layer that maps the token embedding to N logits, one per expert. TopK masking keeps only the K largest logits, and softmax over those K turns them into weights that sum to 1. The token is routed to those K experts, and their outputs are combined using these weights.
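A small worked example may make the TopK-then-softmax step concrete. The four logit values below are hypothetical, chosen only for illustration, with K = 2:

```latex
\begin{aligned}
H &= (2.0,\ 1.0,\ 0.5,\ -1.0) && \text{hypothetical logits for 4 experts} \\
\operatorname{topK}(H, 2) &= \{H_1 = 2.0,\ H_2 = 1.0\} && \text{experts 3 and 4 are masked out} \\
w_1 &= \frac{e^{2.0}}{e^{2.0} + e^{1.0}} = \frac{1}{1 + e^{-1}} \approx 0.731 \\
w_2 &= \frac{e^{1.0}}{e^{2.0} + e^{1.0}} \approx 0.269
\end{aligned}
```

Only the gap between the two surviving logits matters here: the masked experts get weight 0, and the two weights sum to 1.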

Input

Token embedding

def sort(arr):

768-dim vector (simplified)

W_g · x

Gating network

Gate

Linear → TopK → Softmax

H = W_g · x + noise
G = softmax(topK(H, 2))
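
The two formulas above can be sketched in NumPy as follows. This is a minimal illustration, not the demo's actual code: the function name `top_k_gate`, the 8-expert count, the random `W_g`, and the noise model (Gaussian noise scaled by the noise temperature) are all assumptions.

```python
import numpy as np

def top_k_gate(x, W_g, k=2, noise_temperature=0.0, rng=None):
    """Return (expert indices, gate weights) for one token embedding x."""
    h = W_g @ x  # H = W_g · x: one logit per expert
    if noise_temperature > 0.0:
        rng = rng or np.random.default_rng()
        # Assumption: noise is standard Gaussian scaled by the temperature.
        h = h + noise_temperature * rng.standard_normal(h.shape)
    top = np.argsort(h)[-k:][::-1]   # indices of the k largest logits, descending
    z = h[top] - h[top].max()        # shift for numerical stability
    weights = np.exp(z) / np.exp(z).sum()  # softmax over the kept logits only
    return top, weights

# Hypothetical setup: 8 experts, 768-dim embedding, random gate matrix.
rng = np.random.default_rng(0)
num_experts, dim = 8, 768
W_g = rng.standard_normal((num_experts, dim)) / np.sqrt(dim)
x = rng.standard_normal(dim)

active, weights = top_k_gate(x, W_g, k=2)
print("active experts:", active, "weights:", weights)
```

With the noise temperature at 0.0 the routing is deterministic; raising it perturbs the logits before TopK, so a different pair of experts can be selected for the same token.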

weights

Experts

Routing probabilities — raw logits vs. softmax weights

Expert

Weight (softmax)

Logit

Prob

Output formula

Output = w₁·E₁(x) + w₂·E₂(x)
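
As a rough sketch of this weighted sum, the snippet below combines the outputs of two active experts. The helper `moe_output`, the toy experts (each simply scales its input), and the weights 0.75 and 0.25 are hypothetical stand-ins; real experts would typically be feed-forward networks.

```python
import numpy as np

def moe_output(x, expert_fns, active, weights):
    """Weighted sum of the active experts' outputs: sum_i w_i * E_i(x)."""
    return sum(w * expert_fns[i](x) for i, w in zip(active, weights))

# Hypothetical toy experts: expert i multiplies its input by a fixed scale.
expert_fns = [lambda v, s=s: s * v for s in (1.0, 2.0, 3.0, 4.0)]

x = np.array([1.0, -1.0])
# Assume the gate picked experts 3 and 1 with weights 0.75 and 0.25.
y = moe_output(x, expert_fns, active=[3, 1], weights=[0.75, 0.25])
print(y)  # 0.75 * (4 * x) + 0.25 * (2 * x) = 3.5 * x
```

Experts outside the active set are never evaluated, which is where the compute saving of sparse routing comes from.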

Active experts: