Jiamin Li
2025
StitchLLM: Serving LLMs, One Block at a Time
Bodun Hu | Shuozhe Li | Saurabh Agarwal | Myungjin Lee | Akshay Jajoo | Jiamin Li | Le Xu | Geon-Woo Kim | Donghyun Kim | Hong Xu | Amy Zhang | Aditya Akella
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid evolution of large language models (LLMs) has revolutionized natural language processing (NLP) tasks such as text generation, translation, and comprehension. However, the increasing computational demands and inference costs of these models present significant challenges. This study investigates the dynamic and efficient utilization of pre-trained weights from open-sourced LLMs of varying parameter sizes to achieve an optimal balance between computational efficiency and task performance. Drawing inspiration from the dual-process theory of human cognition, we introduce StitchLLM: a dynamic model routing framework that employs a powerful bottom model to process all queries, and uses a lightweight routing mechanism to allocate computational resources appropriately. Our novel framework optimizes efficiency and maintains performance, leveraging a trainable stitching layer for seamless integration of decoder layers across different LLMs. Experimental results demonstrate that StitchLLM improves system throughput while minimizing performance degradation, offering a flexible solution for deploying LLMs in resource-constrained settings.
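The abstract describes a trainable stitching layer that joins decoder blocks from different LLMs and a lightweight router that decides how much computation each query receives. Below is a minimal PyTorch sketch of that idea; the module names, dimensions, pooling, and threshold rule are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch only: illustrates stitching + routing as described in the abstract.
# All design details below are assumptions for illustration.
import torch
import torch.nn as nn


class StitchingLayer(nn.Module):
    """Trainable projection mapping hidden states from one LLM's decoder
    stack into the hidden dimension expected by another LLM's stack."""

    def __init__(self, src_dim: int, dst_dim: int):
        super().__init__()
        self.proj = nn.Linear(src_dim, dst_dim)
        self.norm = nn.LayerNorm(dst_dim)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.norm(self.proj(hidden))


class Router(nn.Module):
    """Lightweight scorer deciding whether a query should continue into the
    larger model's upper decoder layers or exit after the bottom model."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Mean-pool over the sequence; return an escalation probability per query.
        return torch.sigmoid(self.scorer(hidden.mean(dim=1))).squeeze(-1)


def stitched_forward(bottom_layers, top_layers, stitch, router, hidden, threshold=0.5):
    """Run shared bottom layers, then route each query either to an early exit
    or through the stitched upper layers of a larger model."""
    for layer in bottom_layers:
        hidden = layer(hidden)
    escalate = router(hidden) > threshold            # (batch,) boolean mask
    if escalate.any():
        upper = stitch(hidden[escalate])             # project into the top model's width
        for layer in top_layers:
            upper = layer(upper)
        # A real system would use separate output heads per path; here we just
        # return both partitions plus the routing mask for illustration.
        return hidden, upper, escalate
    return hidden, None, escalate
```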
2023
Adaptive Gating in Mixture-of-Experts based Language Models
Jiamin Li | Qiang Su | Yitao Yang | Yimin Jiang | Cong Wang | Hong Xu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Large language models have demonstrated exceptional language understanding capabilities in many NLP tasks. Sparsely activated mixture-of-experts (MoE) has emerged as a promising solution for scaling models while maintaining a constant number of computational operations. Existing MoE models adopt a fixed gating network where each token is computed by the same number of experts. This contradicts our intuition that the tokens in each sequence vary in linguistic complexity and, consequently, require different computational costs. Prior research has paid little attention to the trade-off between per-token computation and model performance. This paper introduces adaptive gating in MoE, a flexible training strategy that allows tokens to be processed by a variable number of experts based on the expert probability distribution. Adaptive gating preserves sparsity while improving training efficiency. We further draw upon curriculum learning to better align the order of training samples and maximize the training time savings. Extensive experiments on diverse NLP tasks show that adaptive gating reduces training time by up to 22.5% while maintaining inference quality. Moreover, we conduct a comprehensive analysis of the gating decisions and present our insights on which tokens are inherently difficult to process, depending on the specific language task.
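The abstract's core mechanism is routing each token to a variable number of experts based on the gate's probability distribution. The PyTorch sketch below illustrates one way such a rule could look; the threshold-based top-1/top-2 decision and the function interface are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch only: variable-expert dispatch in the spirit of adaptive gating.
# The threshold rule below is an illustrative assumption.
import torch
import torch.nn.functional as F


def adaptive_dispatch(gate_logits: torch.Tensor, threshold: float = 0.75):
    """Decide per token how many experts to use from the gate distribution:
    confident (peaked) tokens keep only the top-1 expert, ambiguous tokens
    also receive the second-best expert.

    gate_logits: (num_tokens, num_experts)
    Returns expert indices, renormalized combination weights, and the mask of
    tokens routed to two experts.
    """
    probs = F.softmax(gate_logits, dim=-1)
    top2_probs, top2_idx = probs.topk(2, dim=-1)      # (num_tokens, 2)
    use_two = top2_probs[:, 0] < threshold            # low confidence -> two experts

    # Drop the second expert's weight for confident tokens, then renormalize.
    weights = top2_probs.clone()
    weights[~use_two, 1] = 0.0
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return top2_idx, weights, use_two


# Example usage: 4 tokens routed over 8 experts.
logits = torch.randn(4, 8)
indices, weights, use_two = adaptive_dispatch(logits)
print(indices, weights, use_two)
```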