@inproceedings{mehta-etal-2025-unifying,
title = "Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models",
author = "Mehta, Sushant and
Dandekar, Raj and
Dandekar, Rajat and
Panat, Sreedath",
editor = "Charpentier, Lucas and
Choshen, Leshem and
Cotterell, Ryan and
Gul, Mustafa Omer and
Hu, Michael Y. and
Liu, Jing and
Jumelet, Jaap and
Linzen, Tal and
Mueller, Aaron and
Ross, Candace and
Shah, Raj Sanjay and
Warstadt, Alex and
Wilcox, Ethan Gotlieb and
Williams, Adina",
booktitle = "Proceedings of the First BabyLM Workshop",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.3/",
pages = "42--51",
ISBN = "TODO",
    abstract = "We present MoE-MLA-RoPE, a novel architecture that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient small language models. Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations: (1) fine-grained expert routing with 64 micro-experts and top-$k$ selection, enabling flexible specialization through $\binom{62}{6} \approx 6.1 \times 10^{7}$ possible expert combinations; (2) shared expert isolation that dedicates 2 always-active experts to common patterns while routing to 6 of 62 specialized experts; and (3) gradient-conflict-free load balancing that maintains expert utilization without interfering with primary loss optimization. Extensive experiments on models ranging from 17M to 202M parameters demonstrate that MLA with compression ratio $r = d/2$ achieves 68{\%} KV cache memory reduction and 3.2{\texttimes} inference speedup while maintaining competitive perplexity (0.8{\%} degradation). At a matched budget of 53.9M parameters, MoE-MLA-RoPE improves validation loss by 6.9{\%} over vanilla transformers while using 42{\%} fewer active parameters per forward pass. FLOP-matched experiments reveal even larger gains: 11.1{\%} improvement with 3.2{\texttimes} inference acceleration. Automated evaluation using GPT-4 as a judge confirms quality improvements in generation, with higher scores on coherence (8.1/10), creativity (7.9/10), and grammatical correctness (8.2/10). Our results establish that architectural synergy, not parameter scaling, defines the efficiency frontier for resource-constrained language model deployment."
}
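The routing scheme described in the abstract (64 micro-experts, 2 shared always-active experts, top-6 selection over the 62 routed experts) can be sketched as follows. This is a minimal illustrative PyTorch layer with assumed hidden sizes, not the authors' implementation; the `comb` check verifies the abstract's combinatorics claim.

```python
import torch
import torch.nn as nn
from math import comb

# Worked check of the routing combinatorics: with 2 of 64 experts shared
# (always active), the router picks top-6 of the remaining 62 experts.
assert comb(62, 6) == 61_474_519  # ~6.1e7 possible expert combinations

class SharedExpertMoE(nn.Module):
    """Hypothetical shared-expert-isolation MoE layer (illustrative sizes,
    not the paper's code)."""

    def __init__(self, d_model=256, d_ff=64, n_routed=62, n_shared=2, top_k=6):
        super().__init__()
        expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        out = sum(e(x) for e in self.shared)  # shared experts see every token
        weights, idx = torch.topk(
            self.gate(x).softmax(dim=-1), self.top_k, dim=-1)
        # Dense dispatch for clarity: each expert runs on all tokens, and
        # tokens not routed to it receive zero gate weight. A real
        # implementation gathers only the tokens assigned to each expert.
        for e_id, expert_net in enumerate(self.routed):
            w = (weights * (idx == e_id)).sum(dim=-1, keepdim=True)
            out = out + w * expert_net(x)
        return out

# Usage: SharedExpertMoE()(torch.randn(4, 256)) -> tensor of shape (4, 256)
```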
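The KV cache saving attributed to MLA with compression ratio $r = d/2$ can likewise be sketched. The module below is a simplified latent-KV cache in the spirit of MLA with assumed shapes and names (`down`, `up_k`, `up_v` are hypothetical); caching one r-dimensional latent per token instead of d-dimensional keys plus values stores r/(2d) of the vanilla footprint, i.e. 1/4 at r = d/2, roughly consistent with the 68% reduction the abstract reports once dimensions reserved for RoPE are kept uncompressed.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Hypothetical latent KV compression sketch (not the paper's code)."""

    def __init__(self, d_model=256):
        super().__init__()
        r = d_model // 2                     # compression ratio r = d/2
        self.down = nn.Linear(d_model, r)    # joint KV down-projection
        self.up_k = nn.Linear(r, d_model)    # decompress latent -> keys
        self.up_v = nn.Linear(r, d_model)    # decompress latent -> values

    def step(self, h_t, cache):              # h_t: (d_model,), one decode step
        cache.append(self.down(h_t))         # only the r-dim latent is cached
        c = torch.stack(cache)               # (seq_len, r)
        return self.up_k(c), self.up_v(c)    # full-width K, V for attention

# Usage: cache = []; k, v = LatentKV().step(torch.randn(256), cache)
```

In practice the up-projections can be folded into the query and output projections so that full-width keys and values are never materialized at decode time; this sketch keeps them explicit for readability.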