Li Jiang
2026
Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference
Zhouxuwen | Fangxin Liu | Chao Wang | Xiao Zheng | Hao Zheng | Min He | Li Jiang | Haibing Guan
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of "Frequency-Guided Candidate Selection and Probability-Guarded Acceptance," CSD incorporates two lightweight modules: Online Correction Memory, which aggregates historical rejections to propose recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios instead of exact token matching. Our evaluation across diverse large language models demonstrates that CSD outperforms existing methods, achieving a peak throughput speedup of 2.33x. CSD preserves model accuracy across all tasks while further boosting performance on complex reasoning datasets. These results establish CSD as a highly effective, lightweight solution for practical LLM deployments.
2025
FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization
Fangxin Liu | Zongwu Wang | Jinhong Xia | Junping Zhao | Shouren Zhao | Jinjin Li | Jian Liu | Li Jiang | Haibing Guan
Findings of the Association for Computational Linguistics: EMNLP 2025
The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce memory overhead, existing methods predominantly rely on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths during each token generation. FlexQuant provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management. Evaluations demonstrate that FlexQuant achieves a 1.3× end-to-end speedup across diverse language tasks with negligible accuracy loss. This framework offers a flexible and adaptive solution for efficient LLM deployment.
2011
AIR-based light clients for supporting Moses engine training
Jeffrey Rueppel | Li Jiang | Gong Yu | Ray Flournoy
Proceedings of Machine Translation Summit XIII: System Presentations