RQT: Hierarchical Residual Quantization for Multi-Model Compression
Chen Tianqi, Peisong Wang, Weixiang Xu, Zeyu Zhu, Jian Cheng
Abstract
Delta compression methods focus on efficiently serving multiple uniquely fine-tuned models, each tailored to specific tasks and user requirements. These approaches decompose a fine-tuned LLM into a base model and corresponding delta weights, which are compressed using low-rank or low-bit representations to reduce storage costs. However, their effectiveness is highly sensitive to the magnitude of the model deltas—a factor directly influenced by the scale of the training data. We propose the Residual Quantization Tree (RQT), a hierarchical quantization framework that automatically shares low-bit integer weights across similar fine-tuned models. The RQT construction employs a two-phase greedy algorithm: a bottom-up aggregation of models based on weight matrix similarity, and top-down residual quantization, in which each node optimizes the quantization parameters and then delegates residual errors to child nodes. We evaluate RQT on fine-tuned models across mathematics, coding, chatbot, and Chinese LLMs. The results show that RQT achieves an average accuracy degradation of approximately 3% (comparable to previous 4-bit post-training quantization) while maintaining an effective bitwidth of around 2 bits.

- Anthology ID: 2025.findings-acl.554
- Volume: Findings of the Association for Computational Linguistics: ACL 2025
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 10644–10660
- URL: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.554/
- Cite (ACL): Chen Tianqi, Peisong Wang, Weixiang Xu, Zeyu Zhu, and Jian Cheng. 2025. RQT: Hierarchical Residual Quantization for Multi-Model Compression. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10644–10660, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): RQT: Hierarchical Residual Quantization for Multi-Model Compression (Tianqi et al., Findings 2025)
- PDF: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.554.pdf
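The top-down residual quantization described in the abstract can be illustrated with a minimal two-level sketch. This is not the paper's implementation: the function names are hypothetical, uniform symmetric per-tensor quantization is assumed, and the bottom-up similarity grouping phase is omitted. The idea shown is that a parent node quantizes the component shared by its children (here, the mean of the group's deltas), and each child then quantizes only the residual left after the parent's reconstruction, so most of the low-bit integer weights are shared across models.

```python
import numpy as np

def quantize(x, bits=2):
    """Uniform symmetric quantization: returns (int codes, scale) so that
    codes * scale approximates x. Assumed scheme, not the paper's exact one."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 1 for signed 2-bit codes {-1, 0, 1}
    amax = float(np.abs(x).max())
    scale = amax / qmax if amax > 0 else 1.0
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

def residual_quantize_tree(deltas, bits=2):
    """Two-level sketch of the top-down phase: the root node quantizes the
    shared component of a group of similar deltas; each leaf quantizes only
    its residual from the root's reconstruction."""
    shared = np.mean(deltas, axis=0)          # shared component of the group
    root = quantize(shared, bits)             # one set of codes for all models
    root_hat = dequantize(*root)
    leaves = [quantize(d - root_hat, bits) for d in deltas]  # per-model residuals
    return root, leaves

def reconstruct(root, leaf):
    """Recover one model's delta: shared reconstruction plus its residual."""
    return dequantize(*root) + dequantize(*leaf)
```

Because similar models share the root's integer codes, the amortized storage per model approaches the residual bitwidth plus a small shared overhead, which is the mechanism behind the ~2-bit effective bitwidth the abstract reports.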