Chi-Chih Chang


2025

FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference
Yu-Chen Lu | Chong-Yan Chen | Chi-Chih Chang | Yu-Fang Hu | Kai-Chiang Wu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational cost, but applying a uniform compression ratio across all layers often causes significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer and incorporates progressive low-rank decoding to preserve text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC: on summarization tasks it achieves up to a 17% improvement in ROUGE-L over state-of-the-art low-rank compression methods, establishing a more robust and efficient framework for LLM inference.
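The abstract does not spell out FLRC's algorithm, but the underlying idea of low-rank compression can be illustrated generically: factor a weight matrix with a truncated SVD so that a rank-r pair of factors replaces the full matrix, trading a little accuracy for fewer parameters and FLOPs. The sketch below is an assumption-laden illustration of that standard technique, not FLRC's actual method (which additionally allocates a different rank per layer).

```python
import numpy as np

def low_rank_compress(W, rank):
    """Generic truncated-SVD factorization W ~= A @ B.

    Keeps the top `rank` singular values; A has shape (out, rank)
    and B has shape (rank, in), so the factor pair stores
    rank * (out + in) parameters instead of out * in.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

# Illustrative numbers (not from the paper): compress a 256x256 weight to rank 32.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
A, B = low_rank_compress(W, 32)

ratio = (A.size + B.size) / W.size   # 2*256*32 / 256^2 = 0.25
print(f"parameter ratio: {ratio}")
```

A per-layer rank allocator, as FLRC's name suggests, would choose `rank` separately for each layer under a global budget rather than applying one uniform ratio; the factorization step itself stays the same.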