Thanh-Toan Do
2026
Layer-Wise High-Impact Parameter Ratio Optimization in Post-Training Quantization for Large Language Models
Cuong Pham | Anh Dung Hoang | Cuong C. Nguyen | Trung Le | Gustavo Carneiro | Thanh-Toan Do
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have advanced natural language processing, but their massive parameter counts create computational and memory challenges during deployment. Post-training quantization (PTQ) has emerged as a promising approach to mitigate these challenges. While existing PTQ methods can effectively quantize LLMs, they experience substantial accuracy loss at extremely low bit-widths due to high-impact parameters. Several approaches address this by retaining high-impact parameters in FP16 format, but they apply fixed ratios across all layers, overlooking layer-wise sensitivity variations. We propose a quadratic optimization framework that determines layer-specific ratios of high-impact parameters while considering inter-layer dependencies. We quantize high-impact parameters to moderate bit-widths while the remaining parameters are quantized to extremely low bit-widths. Under the same resource budget, this preserves more high-impact parameters than methods retaining a few in FP16 format. Our framework enables leveraging advanced quantization methods for high-impact parameters while applying computationally lightweight quantization methods to the rest, achieving an effective balance between computational efficiency and accuracy during the quantization process.
2025
MixLoRA-DSI: Dynamically Expandable Mixture-of-LoRA Experts for Rehearsal-Free Generative Retrieval over Dynamic Corpora
Tuan-Luc Huynh | Thuy-Trang Vu | Weiqing Wang | Trung Le | Dragan Gasevic | Yuan-Fang Li | Thanh-Toan Do
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Continually updating model-based indexes in generative retrieval with new documents remains challenging, as full retraining is computationally expensive and impractical under resource constraints. We propose MixLoRA-DSI, a novel framework that combines an expandable mixture of Low-Rank Adaptation experts with a layer-wise out-of-distribution (OOD)-driven expansion strategy. Instead of allocating new experts for each new corpus, our proposed expansion strategy enables sublinear parameter growth by selectively introducing new experts only when a significant number of OOD documents are detected. Experiments on NQ320k and MS MARCO Passage demonstrate that MixLoRA-DSI outperforms full-model update baselines, with minimal parameter overhead and substantially lower training costs.