Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models

Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat


Abstract
We present MoE-MLA-RoPE, a novel architecture that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient small language models. Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations: (1) fine-grained expert routing with 64 micro-experts and top-k selection, enabling flexible specialization through \binom{62}{6} ≈ 6.1 × 10^7 possible expert combinations; (2) shared expert isolation, which dedicates 2 always-active experts to common patterns while routing each token to 6 of the 62 specialized experts; and (3) gradient-conflict-free load balancing that maintains expert utilization without interfering with primary loss optimization. Extensive experiments on models ranging from 17M to 202M parameters demonstrate that MoE-MLA-RoPE with compression ratio r = d/2 achieves a 68% reduction in KV cache memory and a 3.2× inference speedup while maintaining competitive perplexity (0.8% degradation). Compared to parameter-matched baselines, MoE-MLA-RoPE with 53.9M parameters improves validation loss by 6.9% over vanilla transformers while using 42% fewer active parameters per forward pass. FLOP-matched experiments reveal even larger gains: an 11.1% improvement with 3.2× inference acceleration. Automated evaluation using GPT-4 as a judge confirms quality improvements in generation, with higher scores on coherence (8.1/10), creativity (7.9/10), and grammatical correctness (8.2/10). Our results establish that architectural synergy, not parameter scaling, defines the efficiency frontier for resource-constrained language model deployment.
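As a concrete illustration of the routing scheme described in the abstract (2 always-active shared experts plus top-6 selection over 62 specialized micro-experts, 64 experts in total), here is a minimal PyTorch sketch. It assumes a standard token-level router with a softmax over the selected top-k logits; all module names, hidden sizes, and gating details are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MicroExpert(nn.Module):
    """A small feed-forward expert block."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.net(x)

class SharedExpertMoE(nn.Module):
    """64 micro-experts: 2 shared (always active) + top-6 of 62 routed experts."""
    def __init__(self, d_model=256, d_hidden=512, n_routed=62, n_shared=2, top_k=6):
        super().__init__()
        self.routed = nn.ModuleList(MicroExpert(d_model, d_hidden) for _ in range(n_routed))
        self.shared = nn.ModuleList(MicroExpert(d_model, d_hidden) for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, seq, d_model)
        # Shared experts handle common patterns and bypass the router entirely.
        out = sum(expert(x) for expert in self.shared)
        # Token-level routing: keep the top-k logits and renormalize them.
        logits = self.router(x)                        # (batch, seq, n_routed)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # gates over the selected experts
        for e_id, expert in enumerate(self.routed):
            slot_mask = (idx == e_id)                  # (batch, seq, top_k)
            if slot_mask.any():
                token_mask = slot_mask.any(dim=-1)     # tokens routed to this expert
                gate = (weights * slot_mask).sum(dim=-1)
                out[token_mask] += gate[token_mask].unsqueeze(-1) * expert(x[token_mask])
        return out

if __name__ == "__main__":
    layer = SharedExpertMoE(d_model=256)
    y = layer(torch.randn(2, 16, 256))
    print(y.shape)                                     # torch.Size([2, 16, 256])

The key design point this sketch reflects is that the shared experts never compete for routed capacity, while each token's remaining compute is spread over 6 of the 62 specialized experts, which is where the large number of possible expert combinations comes from.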
Anthology ID:
2025.babylm-main.3
Volume:
Proceedings of the First BabyLM Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue:
BabyLM
Publisher:
Association for Computational Linguistics
Pages:
42–51
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.3/
Cite (ACL):
Sushant Mehta, Raj Dandekar, Rajat Dandekar, and Sreedath Panat. 2025. Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models. In Proceedings of the First BabyLM Workshop, pages 42–51, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models (Mehta et al., BabyLM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.3.pdf