Input
Token embedding
def sort(arr):
768-dim vector (simplified)
W_g ยท x
Gating network
Gate
Linear โ TopK โ Softmax
H = W_g ยท x + noise
G = softmax(topK(H, 2))
G = softmax(topK(H, 2))
weights
Experts
Routing probabilities โ raw logits vs. softmax weights
Expert
Weight (softmax)
Logit
Prob
Output formula
Output = wโยทEโ(x) + wโยทEโ(x)
Active experts: