Yushi Huang
2026
Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing
Lingkun Long | Yushi Huang | Shihao Bai | Ruihao Gong | Jun Zhang | Ao Zhou | Jianlei Yang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Lingkun Long | Yushi Huang | Shihao Bai | Ruihao Gong | Jun Zhang | Ao Zhou | Jianlei Yang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits the inference efficiency. Although sparse attention is promising, existing methods remain ineffective. This stems from the need to estimate attention importance for tokens yet to be decoded, while the unmasked token positions are unknown during diffusion. In this paper, we present **Focus-dLLM**, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence strongly correlates across adjacent steps, we first design a *past confidence-guided indicator* to predict unmasked regions. Built upon this, we propose a *sink-aware pruning strategy* to accurately estimate and remove redundant attention computation, while preserving highly influential attention sinks. To further reduce overhead, this strategy reuses identified sink locations across layers, leveraging the observed cross-layer consistency. Experimental results show that our method offers more than 29× lossless speedup under 32K context length.
2024
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit
Ruihao Gong | Yang Yong | Shiqiao Gu | Yushi Huang | Chengtao Lv | Yunchen Zhang | Dacheng Tao | Xianglong Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Ruihao Gong | Yang Yong | Shiqiao Gu | Yushi Huang | Chengtao Lv | Yunchen Zhang | Dacheng Tao | Xianglong Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence with their remarkable emergent abilities and reasoning capabilities. However, the substantial computational and memory requirements limit the widespread adoption. Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating LLMs, albeit with potential risks to accuracy. Numerous studies have aimed to minimize the accuracy loss associated with quantization. However, their quantization configurations vary from each other and cannot be fairly compared. In this paper, we present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. LLMC integrates dozens of algorithms, models, and hardware, offering high extensibility from integer to floating-point quantization, from LLM to vision-language (VLM) model, from fixed-bit to mixed precision, and from quantization to sparsification. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats, providing novel insights and detailed analyses for further research and practical guidance for users. Our toolkit is available at https://github.com/ModelTC/llmc.