2025
EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
Yuanteng Chen | Yuantian Shao | Peisong Wang | Jian Cheng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Mixture-of-Experts (MoE) has demonstrated promising potential in scaling LLMs. However, it is hindered by two critical challenges: (1) substantial GPU memory consumption to load all experts; (2) the low number of activated parameters does not translate into an equivalent inference speedup. In this work, we propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which is closely aligned with the characteristics of MoE from the perspectives of quantization and pruning, and introduces two modules to address these two challenges respectively: (1) the expert-selection bias caused by low-bit quantization is a major factor contributing to performance degradation in MoE-LLMs; based on this, we propose Quantization with Expert-Selection Calibration (QESC), which mitigates the expert-selection bias by calibrating the routers within the MoE; (2) there are always certain experts that are not crucial for the corresponding task, yet they still add inference latency; therefore, we propose Pruning based on Expert-Selection Frequency (PESF), which significantly improves inference speed by pruning the experts used least frequently for the current task. Extensive experiments demonstrate that our approach significantly reduces memory usage and improves inference speed with minimal performance degradation.
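To make the frequency-based pruning idea concrete, here is a minimal sketch (hypothetical function and tensor names, not the authors' released implementation): count how often the router selects each expert on calibration data for the current task, then keep only the most frequently selected experts.

```python
import torch

def prune_by_selection_frequency(router_logits, keep_ratio=0.5):
    """Rank experts by how often the router selects them on calibration data
    and return the indices of the experts to keep (a sketch of
    expert-selection-frequency pruning)."""
    top1 = router_logits.argmax(dim=-1)                  # expert chosen per token
    num_experts = router_logits.shape[-1]
    freq = torch.bincount(top1, minlength=num_experts)   # selection frequency per expert
    num_keep = max(1, int(keep_ratio * num_experts))
    keep = torch.topk(freq, num_keep).indices            # most frequently used experts
    return torch.sort(keep).values

# Example: router logits collected over 10k calibration tokens, 8 experts, keep 4.
logits = torch.randn(10_000, 8)
print(prune_by_selection_frequency(logits, keep_ratio=0.5))
```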
Q-Mamba: Towards more efficient Mamba models via post-training quantization
Chen Tianqi | Yuanteng Chen | Peisong Wang | Weixiang Xu | Zeyu Zhu | Jian Cheng
Findings of the Association for Computational Linguistics: ACL 2025
State Space Models (SSMs), such as Mamba, have recently demonstrated potential in language understanding tasks, positioning them as competitors to transformer architectures. However, our investigations reveal that the Mamba architecture still has room for further optimization—not only in linear projections but also in state caches, which contribute significantly to memory consumption, particularly after quantizing the former into low bits. After a theoretical analysis of the causes of outliers in states, we propose Decoupled Scale Quantization (DSQ), which mitigates outliers in both the state and channel dimensions by applying separate quantization scales. To preserve the selective ability of quantized Mamba, we introduce Efficient Selectivity Reconstruction (ESR), a novel quantization simulation scheme in block-wise reconstruction that enables fast parallel scan algorithms with the non-linear quantization function. We demonstrate the effectiveness of Q-Mamba across various quantization settings, model sizes, and both generation and zero-shot tasks. In particular, for Mamba2-2.7B with W8A8H4 (8-bit weights and activations, 4-bit state caches) quantization, Q-Mamba achieves a 50% reduction in memory consumption with only a 2.13% average accuracy degradation on zero-shot tasks.
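As a rough illustration of the decoupled-scale idea (a toy sketch under assumed cache shapes, not the paper's exact DSQ algorithm), the quantization scale of the state cache can be factored into a per-channel scale and a per-state scale so that outliers in either dimension are absorbed separately:

```python
import torch

def decoupled_scale_quantize(h, bits=4):
    """Toy low-bit quantization of a state cache h with shape
    (batch, channels, state_dim), using separate scales for the
    channel and state dimensions."""
    qmax = 2 ** (bits - 1) - 1
    s_channel = h.abs().amax(dim=(0, 2), keepdim=True).clamp(min=1e-8)        # (1, C, 1)
    r = h / s_channel                                                          # channel-normalized
    s_state = r.abs().amax(dim=(0, 1), keepdim=True).clamp(min=1e-8) / qmax   # (1, 1, N)
    q = torch.clamp(torch.round(r / s_state), -qmax - 1, qmax)
    return q, s_channel, s_state

def dequantize(q, s_channel, s_state):
    return q * s_state * s_channel

# Synthetic state cache with channel-wise outliers.
h = torch.randn(2, 64, 16) * (1.0 + 5.0 * torch.rand(1, 64, 1))
q, sc, ss = decoupled_scale_quantize(h)
print((dequantize(q, sc, ss) - h).abs().mean())
```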
RQT: Hierarchical Residual Quantization for Multi-Model Compression
Chen Tianqi | Peisong Wang | Weixiang Xu | Zeyu Zhu | Jian Cheng
Findings of the Association for Computational Linguistics: ACL 2025
Delta compression methods focus on efficiently serving multiple uniquely fine-tuned models, each tailored to specific tasks and user requirements. These approaches decompose a fine-tuned LLM into a base model and corresponding delta weights, which are compressed using low-rank or low-bit representations to reduce storage costs. However, their effectiveness is highly sensitive to the magnitude of the model deltas—a factor directly influenced by the scale of the training data. We propose the Residual Quantization Tree (RQT), a hierarchical quantization framework that automatically shares low-bit integer weights across similar fine-tuned models. The RQT construction employs a two-phase greedy algorithm: a bottom-up aggregation of models based on weight-matrix similarity, followed by a top-down residual quantization pass in which each node optimizes the quantization parameters and then delegates residual errors to child nodes. We evaluate RQT on fine-tuned models across mathematics, coding, chatbot, and Chinese LLMs. The results show that RQT achieves an average accuracy degradation of approximately 3% (comparable to previous 4-bit post-training quantization) while maintaining an effective bitwidth of around 2 bits.
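A toy two-level sketch of the parent/child residual idea (synthetic weights and simplified uniform quantization; not the paper's full RQT construction): the parent node quantizes weights shared by a group of similar models, and each child node quantizes only its model's residual error.

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization; returns integer codes and a scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8), scale

def residual_tree_encode(models, parent_bits=2, child_bits=2):
    """Quantize the mean of similar fine-tuned weight matrices at the parent
    node, then delegate each model's residual error to a child node."""
    parent = np.mean(models, axis=0)
    q_parent, s_parent = quantize(parent, parent_bits)
    children = [quantize(w - q_parent * s_parent, child_bits) for w in models]
    return (q_parent, s_parent), children

def residual_tree_decode(parent, child):
    (q_p, s_p), (q_c, s_c) = parent, child
    return q_p * s_p + q_c * s_c

# Three "fine-tuned" variants of the same 4x4 weight matrix.
base = np.random.randn(4, 4)
models = [base + 0.05 * np.random.randn(4, 4) for _ in range(3)]
parent, children = residual_tree_encode(models)
print(np.abs(residual_tree_decode(parent, children[0]) - models[0]).mean())
```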
2021
EBERT: Efficient BERT Inference with Dynamic Structured Pruning
Zejian Liu | Fanrong Li | Gang Li | Jian Cheng
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
2019
Overcoming the bottleneck in traditional assessments of verbal memory: Modeling human ratings and classifying clinical group membership
Chelsea Chandler | Peter W. Foltz | Jian Cheng | Jared C. Bernstein | Elizabeth P. Rosenfeld | Alex S. Cohen | Terje B. Holmlund | Brita Elvevåg
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology
Verbal memory is affected by numerous clinical conditions and most neuropsychological and clinical examinations evaluate it. However, a bottleneck exists in such endeavors because traditional methods require expert human review, and usually only a couple of test versions exist, thus limiting the frequency of administration and clinical applications. The present study overcomes this bottleneck by automating the administration, transcription, analysis and scoring of story recall. A large group of healthy participants (n = 120) and patients with mental illness (n = 105) interacted with a mobile application that administered a wide range of assessments, including verbal memory. The speech produced by participants when retelling stories from the memory task was transcribed using automatic speech recognition tools, and the automatic transcripts were compared with human transcriptions (overall word error rate = 21%). An assortment of surface-level and semantic language-based features was extracted from the verbal recalls. A final set of three features was used both to predict expert human ratings with a ridge regression model (r = 0.88) and to differentiate patients from healthy individuals with an ensemble of logistic regression classifiers (accuracy = 76%). This is the first ‘outside of the laboratory’ study to showcase the viability of the complete pipeline of automated assessment of verbal memory in naturalistic settings.
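The scoring pipeline described above (feature extraction, then ridge regression for ratings and an ensemble of logistic regressions for group membership) can be sketched roughly as follows; the feature matrix, labels, and coefficients here are synthetic placeholders, not the study's actual features or results.

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for three language-based features per story recall.
X = rng.normal(size=(225, 3))                      # 120 healthy + 105 patients
ratings = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(0.0, 0.2, size=225)
labels = (rng.random(225) < 0.47).astype(int)      # 1 = patient, 0 = healthy

# Ridge regression predicts expert human ratings from the features.
rating_model = Ridge(alpha=1.0).fit(X, ratings)

# Ensemble of logistic regression classifiers trained on bootstrap resamples.
ensemble = []
for seed in range(10):
    idx = np.random.default_rng(seed).integers(0, len(X), len(X))
    ensemble.append(LogisticRegression(max_iter=1000).fit(X[idx], labels[idx]))

votes = np.mean([clf.predict(X) for clf in ensemble], axis=0)
predictions = (votes >= 0.5).astype(int)
print(rating_model.score(X, ratings), (predictions == labels).mean())
```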
2015
Identifying Patterns For Short Answer Scoring Using Graph-based Lexico-Semantic Text Matching
Lakshmi Ramachandran | Jian Cheng | Peter Foltz
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications
2014
Automatic Assessment of the Speech of Young English Learners
Jian Cheng | Yuan Zhao D’Antilio | Xin Chen | Jared Bernstein
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications
Syllable and language model based features for detecting non-scorable tests in spoken language proficiency assessment applications
Angeliki Metallinou | Jian Cheng
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications
2011
Performance of Automated Scoring for Children’s Oral Reading
Ryan Downey | David Rubin | Jian Cheng | Jared Bernstein
Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications
2009
Automated Assessment of Spoken Modern Standard Arabic
Jian Cheng | Jared Bernstein | Ulrike Pado | Masanori Suzuki
Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications