The proposed architecture replaces dense FFN layers with a top-2 gated mixture of 8 expert networks, each with dimensionality d_model/4. Token-to-expert assignment is computed via a softmax-normalized linear gate W_g, with a differentiable load-balancing term added to the routing logits.
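The routing step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `top2_route`, the tensor shapes, and the omission of the load-balancing logit term are all assumptions for clarity.

```python
import numpy as np

def top2_route(x, W_g, num_experts=8):
    """Top-2 gated routing: each token is sent to its two highest-scoring
    experts, with the pair's gate weights renormalized to sum to 1.
    x: (tokens, d_model), W_g: (d_model, num_experts). Illustrative only;
    the paper additionally adds a load-balancing term to these logits."""
    logits = x @ W_g                                   # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)              # softmax over experts
    top2 = np.argsort(-probs, axis=-1)[:, :2]          # indices of best 2 experts
    weights = np.take_along_axis(probs, top2, axis=-1)
    weights /= weights.sum(-1, keepdims=True)          # renormalize the pair
    return top2, weights
```

Each token's output is then the weighted sum of its two selected experts' outputs, so only 2 of 8 expert FFNs run per token.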
Unlike Switch Transformer and GShard, this approach eliminates auxiliary balancing losses entirely. Instead, a soft capacity factor C=1.25 clips expert buffers, and overflow tokens are routed to a shared residual expert, ensuring no token is dropped during training or inference.
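The overflow mechanism can be sketched like this. The helper name `assign_with_capacity` and the convention of using index `num_experts` for the shared residual expert are assumptions; the point is that tokens beyond an expert's buffer are redirected rather than dropped.

```python
import numpy as np

def assign_with_capacity(expert_ids, num_experts=8, capacity_factor=1.25):
    """Clip each expert's buffer at C * tokens / num_experts tokens.
    Overflow tokens are reassigned to a shared residual expert (here
    given index num_experts) instead of being dropped. Sketch only."""
    n = len(expert_ids)
    capacity = int(np.ceil(capacity_factor * n / num_experts))
    counts = np.zeros(num_experts, dtype=int)
    assignment = np.empty(n, dtype=int)
    for i, e in enumerate(expert_ids):
        if counts[e] < capacity:
            counts[e] += 1
            assignment[i] = e          # fits within the expert's buffer
        else:
            assignment[i] = num_experts  # overflow -> shared residual expert
    return assignment
```

Because every overflow token still receives an expert, the forward pass is lossless, which is what removes the need for an auxiliary balancing loss to keep buffers from overflowing.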
Benchmarked against a parameter-matched dense baseline, the model (6.7B total parameters, 1.8B active per token) achieves 73.2% on MMLU (vs. 73.5% dense), 82.1% on HellaSwag (vs. 82.3%), and 61.4% on ARC-C (vs. 61.2%), at 3.8x lower FLOPs per forward pass.