Architecture
Sparse Mixture of Experts
Quick Answer
A MoE variant where only a small number of experts activate per token.
Sparse MoE activates only k experts (e.g., 2-8) out of many (e.g., 64) per token, which dramatically reduces compute per token while keeping the total parameter count high. This lets a model carry far more parameters than could be trained densely, since each token pays only for the k active experts. A learned router assigns tokens to experts, and load balancing (keeping tokens distributed evenly across experts, typically via an auxiliary loss) is crucial to prevent a few experts from absorbing all the traffic. At inference, a sparse model requires less compute than a dense model of equivalent quality.
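A minimal sketch of the two mechanisms described above: top-k routing (selecting k experts per token and normalizing their router scores) and a Switch-Transformer-style load-balancing auxiliary loss. Function names, shapes, and the exact loss form are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def route_top_k(logits, k=2):
    """Select the k highest-scoring experts per token; softmax over only those k.

    logits: (n_tokens, n_experts) raw router scores (illustrative shapes).
    Returns (top_idx, weights), each of shape (n_tokens, k).
    """
    # indices of the k largest router logits per token (argsort is ascending)
    top_idx = np.argsort(logits, axis=-1)[:, -k:]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # softmax restricted to the selected experts, so weights sum to 1 per token
    weights = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_idx, weights

def load_balance_loss(logits, n_experts):
    """Auxiliary loss encouraging even expert usage (a common simplification).

    Multiplies the fraction of tokens routed to each expert (top-1) by the
    mean router probability for that expert; minimized when usage is uniform.
    """
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    top1 = probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=n_experts) / len(top1)  # token share per expert
    p = probs.mean(axis=0)                                  # mean prob per expert
    return n_experts * float(np.sum(f * p))

# toy example: 4 tokens routed over 8 experts, 2 active per token
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
idx, w = route_top_k(logits, k=2)
aux = load_balance_loss(logits, n_experts=8)
```

Here each token touches only 2 of 8 experts, so the forward pass computes 2 expert MLPs per token regardless of how many experts exist in total; the auxiliary loss is added (scaled by a small coefficient) to the training objective to keep routing balanced.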
Last verified: 2026-04-08