RQT: Hierarchical Residual Quantization for Multi-Model Compression
Chen Tianqi, Peisong Wang, Weixiang Xu, Zeyu Zhu, Jian Cheng
Abstract
Delta compression methods focus on efficiently serving multiple uniquely fine-tuned models, each tailored to specific tasks and user requirements. These approaches decompose a fine-tuned LLM into a base model and corresponding delta weights, which are compressed using low-rank or low-bit representations to reduce storage costs. However, their effectiveness is highly sensitive to the magnitude of the model deltas—a factor directly influenced by the scale of the training data. We propose the Residual Quantization Tree (RQT), a hierarchical quantization framework that automatically shares low-bit integer weights across similar fine-tuned models. The RQT construction employs a two-phase greedy algorithm: a bottom-up aggregation of models based on weight matrix similarity, and top-down residual quantization, in which each node optimizes the quantization parameters and then delegates residual errors to child nodes. We evaluate RQT on fine-tuned models across mathematics, coding, chatbot, and Chinese LLMs. The results show that RQT achieves an average accuracy degradation of approximately 3% (comparable to previous 4-bit post-training quantization) while maintaining an effective bitwidth of around 2 bits.

- Anthology ID: 2025.findings-acl.554
- Volume: Findings of the Association for Computational Linguistics: ACL 2025
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 10644–10660
- URL: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.554/
- Cite (ACL): Chen Tianqi, Peisong Wang, Weixiang Xu, Zeyu Zhu, and Jian Cheng. 2025. RQT: Hierarchical Residual Quantization for Multi-Model Compression. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10644–10660, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): RQT: Hierarchical Residual Quantization for Multi-Model Compression (Tianqi et al., Findings 2025)
- PDF: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.554.pdf
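The top-down residual quantization described in the abstract can be illustrated with a minimal two-level sketch. This is not the paper's implementation: the function names are hypothetical, uniform symmetric per-tensor quantization is assumed, and the bottom-up similarity grouping phase is omitted. The idea shown is that a parent node quantizes the component shared by its children (here, the mean of the group's deltas), and each child then quantizes only the residual left after the parent's reconstruction, so most of the low-bit integer weights are shared across models.

```python
import numpy as np

def quantize(x, bits=2):
    """Uniform symmetric quantization: returns (int codes, scale) so that
    codes * scale approximates x. Assumed scheme, not the paper's exact one."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 1 for signed 2-bit codes {-1, 0, 1}
    amax = float(np.abs(x).max())
    scale = amax / qmax if amax > 0 else 1.0
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

def residual_quantize_tree(deltas, bits=2):
    """Two-level sketch of the top-down phase: the root node quantizes the
    shared component of a group of similar deltas; each leaf quantizes only
    its residual from the root's reconstruction."""
    shared = np.mean(deltas, axis=0)          # shared component of the group
    root = quantize(shared, bits)             # one set of codes for all models
    root_hat = dequantize(*root)
    leaves = [quantize(d - root_hat, bits) for d in deltas]  # per-model residuals
    return root, leaves

def reconstruct(root, leaf):
    """Recover one model's delta: shared reconstruction plus its residual."""
    return dequantize(*root) + dequantize(*leaf)
```

Because similar models share the root's integer codes, the amortized storage per model approaches the residual bitwidth plus a small shared overhead, which is the mechanism behind the ~2-bit effective bitwidth the abstract reports.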