On the Benefits of Learning to Route in Mixture-of-Experts Models

Nishanth Dikkala, Nikhil Ghosh, Raghu Meka, Rina Panigrahy, Nikhil Vyas, Xin Wang


Abstract
Mixture-of-Experts (MoE) Transformer models, such as the Switch Transformer, allow us to successfully scale up model sizes while keeping the amount of compute time fixed. Prior work has established the computational efficiency benefits of using these models. A core component of these models is a router that routes input tokens to different experts in a layer. We present theoretical and empirical evidence that the router's ability to route tokens intelligently confers a significant advantage on MoE models. We study synthetic settings where the input data is distributed in clusters and show, both theoretically and empirically, that the router learns to route the inputs according to these clusters. We then perform experiments on real data using the T5X library, where we observe that a trainable router confers a non-trivial benefit over a non-trainable router.
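To make the routing idea in the abstract concrete, here is a minimal NumPy sketch of top-1 routing on clustered inputs. It is an illustration only and is not taken from the paper or from T5X: the function names (`top1_route`, `moe_layer`), the two-cluster data, and the hand-set `router_weights` are all assumptions. The router is set by hand to align with the cluster centers, which is roughly what a learned router would need to approximate in such a setting; no training is performed.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top1_route(tokens, router_weights):
    """Pick one expert per token; return (expert_index, gate_value)."""
    logits = tokens @ router_weights            # (n_tokens, n_experts)
    probs = softmax(logits)
    expert_index = probs.argmax(axis=-1)
    gate = probs[np.arange(tokens.shape[0]), expert_index]
    return expert_index, gate

def moe_layer(tokens, router_weights, expert_weights):
    """Apply each token's chosen (linear) expert, scaled by its gate value."""
    expert_index, gate = top1_route(tokens, router_weights)
    n_experts, _, d_out = expert_weights.shape
    out = np.zeros((tokens.shape[0], d_out))
    for e in range(n_experts):
        mask = expert_index == e
        out[mask] = gate[mask][:, None] * (tokens[mask] @ expert_weights[e])
    return out, expert_index

# Hypothetical synthetic data: two well-separated clusters in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [-5.0, 0.0]])
labels = rng.integers(0, 2, size=8)
tokens = centers[labels] + 0.1 * rng.normal(size=(8, 2))

# Assumed router: one column per expert, aligned with one cluster center.
router_weights = centers.T                      # (d_in, n_experts)
expert_weights = rng.normal(size=(2, 2, 3))     # (n_experts, d_in, d_out)

out, expert_index = moe_layer(tokens, router_weights, expert_weights)
```

With a router aligned to the clusters, every token is sent to the expert matching its cluster. A router with fixed random weights carries no such guarantee, which is the kind of gap between trainable and non-trainable routers that the abstract refers to.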
Anthology ID:
2023.emnlp-main.583
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
9376–9396
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2023.emnlp-main.583/
DOI:
10.18653/v1/2023.emnlp-main.583
Cite (ACL):
Nishanth Dikkala, Nikhil Ghosh, Raghu Meka, Rina Panigrahy, Nikhil Vyas, and Xin Wang. 2023. On the Benefits of Learning to Route in Mixture-of-Experts Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9376–9396, Singapore. Association for Computational Linguistics.
Cite (Informal):
On the Benefits of Learning to Route in Mixture-of-Experts Models (Dikkala et al., EMNLP 2023)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2023.emnlp-main.583.pdf