Jingqi Tong
2026
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang | Yujiong Shen | Jingyi Deng | Yuhui Wang | Huayu Sha | Kexin Tan | Qiyuan Peng | Yue Zhang | Junzhe Wang | Shichun Liu | Yueyuan Huang | Jingqi Tong | Changhao Jiang | Yilong Wu | Zhihao Zhang | Mingqi Wu | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Ming Zhang | Yujiong Shen | Jingyi Deng | Yuhui Wang | Huayu Sha | Kexin Tan | Qiyuan Peng | Yue Zhang | Junzhe Wang | Shichun Liu | Yueyuan Huang | Jingqi Tong | Changhao Jiang | Yilong Wu | Zhihao Zhang | Mingqi Wu | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. LLMEval-Fair is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. An 30-month longitudinal study of nearly 60 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-Fair offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
VideoPro: Adaptive Program Reasoning for Long Video Understanding
Chenglin Li | Feng Han | Yikun Wang | Ruilin Li | Shuai Dong | Haowen Hou | Haitao Li | Qianglong Chen | Feng Tao | Jingqi Tong | Yin Zhang | Jiaqi Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chenglin Li | Feng Han | Yikun Wang | Ruilin Li | Shuai Dong | Haowen Hou | Haitao Li | Qianglong Chen | Feng Tao | Jingqi Tong | Yin Zhang | Jiaqi Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Understanding long videos remains challenging due to the sparsity of visual evidence relevant to a given query. Prior work has explored program-based visual grounding, typically relying on executable programs generated by auxiliary large language models. However, when scaling to long videos, existing approaches face several critical limitations: (1) frame-centric vision modules are often insufficient for long video processing; (2) naively applying program-based reasoning to all queries incurs considerable computational overhead; and (3) errors arising from low-confidence predictions and imperfect program execution are difficult to recover from. To address these challenges, we propose VideoPro, a unified framework that enables VideoLLMs to adaptively reason over long videos and refine their predictions through executable programs. VideoPro first performs adaptive reasoning, dynamically determining whether a query can be resolved directly by the native VideoLLM or requires explicit multi-step program reasoning. For complex queries, the model decomposes the task into executable programs that invoke specialized vision modules for precise temporal and semantic grounding. To further improve robustness, VideoPro incorporates a self-refinement mechanism that leverages execution feedback and confidence signals to correct erroneous executions and refine low-confidence reasoning programs. By tightly integrating adaptive reasoning with self-refinement, VideoPro consistently outperforms prior methods across multiple long-video understanding benchmarks, yielding an average 6.7% improvement for Qwen3-VL-8B.
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models
Qiguang Chen | Chengyu Luan | Jiajun Wu | Qiming Yu | Yi Yang | Yizhuo Li | Jingqi Tong | Xiachong Feng | Libo Qin | Wanxiang Che
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Qiguang Chen | Chengyu Luan | Jiajun Wu | Qiming Yu | Yi Yang | Yizhuo Li | Jingqi Tong | Xiachong Feng | Libo Qin | Wanxiang Che
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resources for studying and improving multi-image reasoning in LVLMs.
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
Changhao Jiang | Ming Zhang | Yifei Cao | Junjie Ye | Xiaoran Fan | Shihan Dou | Zhiheng Xi | Jiajun Sun | Yi Dong | Yujiong Shen | Jingqi Tong | Baoyu Fan | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Changhao Jiang | Ming Zhang | Yifei Cao | Junjie Ye | Xiaoran Fan | Shihan Dou | Zhiheng Xi | Jiajun Sun | Yi Dong | Yujiong Shen | Jingqi Tong | Baoyu Fan | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work address this gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. SMI is validated through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves R² > 0.7 in predicting QA accuracy for models above 1B parameters, without additional training. The analysis further reveals diminishing returns from scaling data and model size and provides evidence for an intrinsic upper bound on knowledge retention achievable by pre-training alone, motivating retrieval and other augmentation strategies.
2025
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Ming Zhang | Yujiong Shen | Zelin Li | Huayu Sha | Binze Hu | Yuhui Wang | Chenhao Huang | Shichun Liu | Jingqi Tong | Changhao Jiang | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2025
Ming Zhang | Yujiong Shen | Zelin Li | Huayu Sha | Binze Hu | Yuhui Wang | Chenhao Huang | Shichun Liu | Jingqi Tong | Changhao Jiang | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2025
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Medicine, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains.
2024
Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning Through Trap Problems
Jun Zhao | Jingqi Tong | Yurong Mou | Ming Zhang | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Jun Zhao | Jingqi Tong | Yurong Mou | Ming Zhang | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Human cognition exhibits systematic compositionality, the algebraic ability to generate infinite novel combinations from finite learned components, which is the key to understanding and reasoning about complex logic. In this work, we investigate the compositionality of large language models (LLMs) in mathematical reasoning. Specifically, we construct a new dataset MathTrap by introducing carefully designed logical traps into the problem descriptions of MATH and GSM8K. Since problems with logical flaws are quite rare in the real world, these represent “unseen” cases to LLMs. Solving these requires the models to systematically compose (1) the mathematical knowledge involved in the original problems with (2) knowledge related to the introduced traps. Our experiments show that while LLMs possess both components of requisite knowledge, they do not spontaneously combine them to handle these novel cases. We explore several methods to mitigate this deficiency, such as natural language prompts, few-shot demonstrations, and fine-tuning. We find that LLMs’ performance can be improved through the above external intervention. Overall, systematic compositionality remains an open challenge for large language models.
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-Training
Tong Zhu | Xiaoye Qu | Daize Dong | Jiacheng Ruan | Jingqi Tong | Conghui He | Yu Cheng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Tong Zhu | Xiaoye Qu | Daize Dong | Jiacheng Ruan | Jingqi Tong | Conghui He | Yu Cheng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data-hungry and instability problems. Motivated by this limit, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model by: (1) Expert Construction, which partitions the parameters of original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual pre-training, which further trains the transformed MoE model and additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models could maintain language abilities and route the input tokens to specific experts with part of the parameters activated. Empirically, by training 200B tokens, LLaMA-MoE-3.5B models significantly outperform dense models that contain similar activation parameters.
Search
Fix author
Co-authors
- Xuan-Jing Huang (黄萱菁) 4
- Ming Zhang 4
- Shihan Dou 3
- Tao Gui 3
- Changhao Jiang 3
- Yujiong Shen 3
- Zhiheng Xi 3
- Mingxu Chai 2
- Shichun Liu 2
- Huayu Sha 2
- Qi Zhang 2
- Qi Zhang 2
- Yifei Cao 1
- Wanxiang Che (车万翔) 1
- Qianglong Chen 1
- Qiguang Chen (陈麒光) 1
- Yu Cheng 1
- Jingyi Deng 1
- Daize Dong 1
- Shuai Dong 1
- Yi Dong 1
- Baoyu Fan 1
- Xiaoran Fan 1
- Xiachong Feng 1
- Feng Han 1
- Conghui He 1
- Haowen Hou 1
- Binze Hu 1
- Chenhao Huang 1
- Yueyuan Huang 1
- Chenglin Li 1
- Haitao Li 1
- Ruilin Li 1
- Yizhuo Li 1
- Zelin Li 1
- Chengyu Luan 1
- Yurong Mou 1
- Qiyuan Peng 1
- Libo Qin 1
- Xiaoye Qu 1
- Jiacheng Ruan 1
- Jiajun Sun 1
- Kexin Tan 1
- Feng Tao 1
- Jiaqi Wang 1
- Junzhe Wang 1
- Yikun Wang 1
- Yuhui Wang 1
- Yuhui Wang 1
- Jiajun Wu 1
- Mingqi Wu 1
- Yilong Wu 1
- Yi Yang 1
- Junjie Ye (叶俊杰) 1
- Qiming Yu 1
- Yin Zhang 1
- Yue Zhang 1
- Zhihao Zhang 1
- Jun Zhao 1
- Tong Zhu (朱桐) 1