The proposed architecture replaces dense FFN layers with a top-2 gated mixture of 8 expert networks, each with dimensionality d_model/4. Token-to-expert assignment is computed via a softmax-normalized linear gate W_g, with a differentiable load-balancing term added to the routing logits.
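The routing step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `top2_route`, the tensor shapes, and the omission of the load-balancing logit term are all assumptions for clarity.

```python
import numpy as np

def top2_route(x, W_g, num_experts=8):
    """Top-2 gated routing: each token is sent to its two highest-scoring
    experts, with the pair's gate weights renormalized to sum to 1.
    x: (tokens, d_model), W_g: (d_model, num_experts). Illustrative only;
    the paper additionally adds a load-balancing term to these logits."""
    logits = x @ W_g                                   # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)              # softmax over experts
    top2 = np.argsort(-probs, axis=-1)[:, :2]          # indices of best 2 experts
    weights = np.take_along_axis(probs, top2, axis=-1)
    weights /= weights.sum(-1, keepdims=True)          # renormalize the pair
    return top2, weights
```

Each token's output is then the weighted sum of its two selected experts' outputs, so only 2 of 8 expert FFNs run per token.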
Unlike Switch Transformer and GShard, this approach eliminates auxiliary balancing losses entirely. Instead, a soft capacity factor C=1.25 clips expert buffers, and overflow tokens are routed to a shared residual expert, ensuring no token is dropped during training or inference.
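The overflow mechanism can be sketched like this. The helper name `assign_with_capacity` and the convention of using index `num_experts` for the shared residual expert are assumptions; the point is that tokens beyond an expert's buffer are redirected rather than dropped.

```python
import numpy as np

def assign_with_capacity(expert_ids, num_experts=8, capacity_factor=1.25):
    """Clip each expert's buffer at C * tokens / num_experts tokens.
    Overflow tokens are reassigned to a shared residual expert (here
    given index num_experts) instead of being dropped. Sketch only."""
    n = len(expert_ids)
    capacity = int(np.ceil(capacity_factor * n / num_experts))
    counts = np.zeros(num_experts, dtype=int)
    assignment = np.empty(n, dtype=int)
    for i, e in enumerate(expert_ids):
        if counts[e] < capacity:
            counts[e] += 1
            assignment[i] = e          # fits within the expert's buffer
        else:
            assignment[i] = num_experts  # overflow -> shared residual expert
    return assignment
```

Because every overflow token still receives an expert, the forward pass is lossless, which is what removes the need for an auxiliary balancing loss to keep buffers from overflowing.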
Benchmarked against a parameter-matched dense baseline, the model (6.7B total parameters, 1.8B active per token) achieves 73.2% on MMLU (vs. 73.5% dense), 82.1% on HellaSwag (vs. 82.3%), and 61.4% on ARC-C (vs. 61.2%), at 3.8x lower FLOPs per forward pass.