Mingqi Wu
2026
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang | Yujiong Shen | Jingyi Deng | Yuhui Wang | Huayu Sha | Kexin Tan | Qiyuan Peng | Yue Zhang | Junzhe Wang | Shichun Liu | Yueyuan Huang | Jingqi Tong | Changhao Jiang | Yilong Wu | Zhihao Zhang | Mingqi Wu | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. LLMEval-Fair is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 30-month longitudinal study of nearly 60 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-Fair offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
Yuming Yang | Mingyoung Lai | Wanxu Zhao | Xiaoran Fan | Zhiheng Xi | Mingqi Wu | Chiyue Huang | Jun Zhao | Haijun Lv | Jian Tong | Yunhua Zhou | Yicheng Zou | Qipeng Guo | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model’s current behavior but overlooking more informative ones. To address this, we propose Rank–Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory’s average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
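The ratio described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rank_surprisal_ratio`, the 1-indexed rank convention (rank 1 is the most probable token), the tie handling, and the plain-list input format are all assumptions made for illustration, and the paper may apply further steps such as clipping or normalization that are not stated here.

```python
import math


def rank_surprisal_ratio(token_probs, token_ids):
    """Illustrative RSR: average token rank divided by average surprisal.

    token_probs: one probability list over the vocabulary per position,
                 as produced by the student model (hypothetical input format).
    token_ids:   index of the trajectory's actual token at each position.
    """
    if not token_ids or len(token_probs) != len(token_ids):
        raise ValueError("need one probability vector per observed token")

    ranks = []
    surprisals = []
    for probs, tok in zip(token_probs, token_ids):
        p = probs[tok]
        if p <= 0.0:
            raise ValueError("observed token must have positive probability")
        # Assumed convention: rank 1 is the most probable token, so the rank
        # is one plus the number of strictly more probable tokens.
        ranks.append(1 + sum(1 for q in probs if q > p))
        # Surprisal is the token-level negative log-likelihood.
        surprisals.append(-math.log(p))

    avg_rank = sum(ranks) / len(ranks)
    avg_nll = sum(surprisals) / len(surprisals)
    # A fully deterministic trajectory has zero surprisal; report infinity.
    return avg_rank / avg_nll if avg_nll > 0.0 else math.inf


# Toy example: a 3-token vocabulary and a 2-token trajectory.
probs = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
ids = [1, 1]
rsr = rank_surprisal_ratio(probs, ids)
```

Under this reading, a trajectory whose tokens rank near the top of the student's distribution while still carrying high surprisal produces a smaller ratio.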
2025
Pretraining Context Compressor for Large Language Models with Embedding-Based Memory
Yuhong Dai | Jianxun Lian | Yitian Huang | Wei Zhang | Mingyang Zhou | Mingqi Wu | Xing Xie | Hao Liao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Efficient processing of long contexts in large language models (LLMs) is essential for real-world applications like retrieval-augmented generation and in-context learning, especially in resource-constrained environments such as edge computing. This paper explores embedding-based context compression to reduce inference costs while preserving the downstream LLM configurations. We propose a decoupled compressor-LLM framework, pretrained on text reconstruction and completion tasks, designed to effectively preserve essential contextual information within condensed embedding representations. Our extensive experiments investigate pretraining, model configurations, compression rates, efficiency across tasks, and adaptability to various LLMs. Results demonstrate that our approach outperforms competitive baselines in three domains and across eight datasets while being adaptable to different downstream LLMs. We find that thorough pretraining and carefully selected compression rates, such as 4x and 16x, enable a lightweight compressor to achieve a good balance between accuracy and speed. These findings underscore the potential of embedding-based compression to enhance LLM efficiency and motivate further research in this area.
Co-authors
- Tao Gui 2
- Xuan-Jing Huang (黄萱菁) 2
- Zhiheng Xi 2
- Qi Zhang 2
- Mingxu Chai 1
- Yuhong Dai 1
- Jingyi Deng 1
- Shihan Dou 1
- Xiaoran Fan 1
- Qipeng Guo 1
- Chiyue Huang 1
- Yitian Huang 1
- Yueyuan Huang 1
- Changhao Jiang 1
- Mingyoung Lai 1
- Jianxun Lian 1
- Hao Liao 1
- Shichun Liu 1
- Haijun Lv 1
- Qiyuan Peng 1
- Huayu Sha 1
- Yujiong Shen 1
- Kexin Tan 1
- Jian Tong 1
- Jingqi Tong 1
- Junzhe Wang 1
- Yuhui Wang 1
- Yilong Wu 1
- Xing Xie 1
- Yuming Yang 1
- Ming Zhang 1
- Wei Zhang 1
- Yue Zhang 1
- Zhihao Zhang 1
- Jun Zhao 1
- Wanxu Zhao 1
- Mingyang Zhou 1
- Yunhua Zhou 1
- Yicheng Zou 1
Venues
- ACL 3