Jian Cheng

This is an internal, incomplete preview of a proposed change to the ACL Anthology. For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes. Do not treat this content as an official publication.

2025

pdf bib abs
EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
Yuanteng Chen | Yuantian Shao | Peisong Wang | Jian Cheng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Mixture-of-Experts (MoE) has demonstrated promising potential in scaling LLMs. However, it is hindered by two critical challenges: (1) substantial GPU memory consumption to load all experts; (2) low activated parameters cannot be equivalently translated into inference acceleration effects. In this work, we propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which deeply aligns with the characteristics of MoE from the perspectives of quantization and pruning, and introduces two modules to address these two challenges respectively: (1) The expert selection bias caused by low-bit quantization is a major factor contributing to the performance degradation in MoE-LLMs. Based on this, we propose Quantization with Expert-Selection Calibration (QESC), which mitigates the expert selection bias by calibrating the routers within the MoE; (2) There are always certain experts that are not crucial for the corresponding tasks, yet causing inference latency. Therefore, we propose Pruning based on Expert-Selection Frequency (PESF), which significantly improves inference speed by pruning less frequently used experts for current task. Extensive experiments demonstrate that our approach significantly reduces memory usage and improves inference speed with minimal performance degradation.

State Space Models (SSMs), such as Mamba, have recently demonstrated potential in language understanding tasks, positioning them as competitors to transformer architectures. However, our investigations reveal that the Mamba architecture still has room for further optimization—not only in linear projections but also in state caches, which contribute significantly to memory consumption, particularly after quantizing the former into low bits. After a theoretical analysis of the causes of outliers in states, we propose Decoupled Scale Quantization (DSQ), which mitigates outliers in both the state and channel dimensions by applying separate quantization scales. To preserve the selective ability of quantized Mamba, we introduce Efficient Selectivity Reconstruction (ESR), a novel quantization simulation scheme in block-wise reconstruction that enables fast parallel scan algorithms with the non-linear quantization function. We demonstrate the effectiveness of Q-Mamba across various quantization settings, model sizes, and both generation and zero-shot tasks. In particular, for Mamba2-2.7B with W8A8H4 (8-bit weights and activations, 4-bit state caches) quantization, Q-Mamba achieves a 50% reduction in memory consumption with only a 2.13% average accuracy degradation on zero-shot tasks.

pdf bib abs
RQT: Hierarchical Residual Quantization for Multi-Model Compression
Chen Tianqi | Peisong Wang | Weixiang Xu | Zeyu Zhu | Jian Cheng
Findings of the Association for Computational Linguistics: ACL 2025

Delta compression methods focus on efficiently serving multiple uniquely fine-tuned models, each tailored to specific tasks and user requirements. These approaches decompose a fine-tuned LLM into a base model and corresponding delta weights, which are compressed using low-rank or low-bit representations to reduce storage costs. However, their effectiveness is highly sensitive to the magnitude of the model deltas—a factor directly influenced by the scale of the training data. We propose the Residual Quantization Tree (RQT), a hierarchical quantization framework that automatically shares low-bit integer weights across similar fine-tuned models. The RQT construction employs a two-phase greedy algorithm: a bottom-up aggregation of models based on weight matrix similarity, and top-down residual quantization, in which each node optimizes the quantization parameters and then delegates residual errors to child nodes. We evaluate RQT on fine-tuned models across mathematics, coding, chatbot, and Chinese LLMs. The results show that RQT achieves an average accuracy degradation of approximately 3% (comparable to previous 4-bit post-training quantization) while maintaining an effective bitwidth of around 2 bits.

As the parameter size of language models becomes extremely large, fine-tuning them with limited resources has become a challenging task. Latest advancements in parameter-efficient fine-tuning (PEFT) techniques allow for adjustments to only a minor fraction of the parameters of these LLMs. Yet, most of PEFT methods may suffer from the following limitations: (1) As the rank decreases sharply, PEFT methods like LoRA and Adapter tuning will exhibit significant performance degradation in downstream tasks. (2) An accuracy gap between these methods and full fine-tuning (Full-FT) still exists. To tackle these problems, we propose a Low-Rank Direct Attention Adaptation (LoRaDA) method for efficient LLM fine-tuning. Specifically, we introduce a novel Low-rank Multi-head Attention Map Module (LMAM), which can bring negative attention to self-attention modules and learn low-rank attention weights directly, capturing the characteristics of downstream tasks. Furthermore, LMAM can serve as a plug-in to existing methods, such as LoRA and Adapter, providing state-of-the-art performance even with extreme low rank setting.Extensive experiments on various downstream tasks demonstrate the superior performance of our LoRaDA method. Specifically, LoRaDA even outperforms the full fine-tuning method by up to 2.1% on GLUE benchmark. As a plug-in, LMAM boosts the accuracy of LoRA by up to 27.7% with LLaMA-7B on Commonsense Reasoning benchmark.

Large Language Models (LLMs) have revolutionized various domains but encounter substantial challenges in tackling optimization modeling tasks for Operations Research (OR), particularly when dealing with complex problem. In this work, we propose Step-Opt-Instruct, a framework that augments existing datasets and generates high-quality fine-tuning data tailored to optimization modeling. Step-Opt-Instruct employs iterative problem generation to systematically increase problem complexity and stepwise validation to rigorously verify data, preventing error propagation and ensuring the quality of the generated dataset. Leveraging this framework, we fine-tune open-source LLMs, including LLaMA-3-8B and Mistral-7B, to develop Step-Opt—a model that achieves state-of-the-art performance on benchmarks such as NL4OPT, MAMO, and IndustryOR. Extensive experiments demonstrate the superior performance of Step-Opt, especially in addressing complex OR tasks, with a notable 17.01% improvement in micro average accuracy on difficult problems. These findings highlight the effectiveness of combining structured validation with gradual problem refinement to advance the automation of decision-making processes using LLMs. The code and dataset are available at https://github.com/samwu-learn/Step.

2021

pdf bib
EBERT: Efficient BERT Inference with Dynamic Structured Pruning
Zejian Liu | Fanrong Li | Gang Li | Jian Cheng
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2019

pdf bib abs
Overcoming the bottleneck in traditional assessments of verbal memory: Modeling human ratings and classifying clinical group membership
Chelsea Chandler | Peter W. Foltz | Jian Cheng | Jared C. Bernstein | Elizabeth P. Rosenfeld | Alex S. Cohen | Terje B. Holmlund | Brita Elvevåg
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology

Verbal memory is affected by numerous clinical conditions and most neuropsychological and clinical examinations evaluate it. However, a bottleneck exists in such endeavors because traditional methods require expert human review, and usually only a couple of test versions exist, thus limiting the frequency of administration and clinical applications. The present study overcomes this bottleneck by automating the administration, transcription, analysis and scoring of story recall. A large group of healthy participants (n = 120) and patients with mental illness (n = 105) interacted with a mobile application that administered a wide range of assessments, including verbal memory. The resulting speech generated by participants when retelling stories from the memory task was transcribed using automatic speech recognition tools, which was compared with human transcriptions (overall word error rate = 21%). An assortment of surface-level and semantic language-based features were extracted from the verbal recalls. A final set of three features were used to both predict expert human ratings with a ridge regression model (r = 0.88) and to differentiate patients from healthy individuals with an ensemble of logistic regression classifiers (accuracy = 76%). This is the first ‘outside of the laboratory’ study to showcase the viability of the complete pipeline of automated assessment of verbal memory in naturalistic settings.

Co-authors

Venues

Fix data

Jian Cheng

Fixing paper assignments

2025

2021

2019

2015

2014

2011

2009

Co-authors

Venues