Ming Zhang
张明; Fudan
Other people with similar names: Ming Zhang (Peking)
Unverified author pages with similar names: Ming Zhang
2026
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang | Yujiong Shen | Jingyi Deng | Yuhui Wang | Huayu Sha | Kexin Tan | Qiyuan Peng | Yue Zhang | Junzhe Wang | Shichun Liu | Yueyuan Huang | Jingqi Tong | Changhao Jiang | Yilong Wu | Zhihao Zhang | Mingqi Wu | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. LLMEval-Fair is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 30-month longitudinal study of nearly 60 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-Fair offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
Changhao Jiang | Ming Zhang | Yifei Cao | Junjie Ye | Xiaoran Fan | Shihan Dou | Zhiheng Xi | Jiajun Sun | Yi Dong | Yujiong Shen | Jingqi Tong | Baoyu Fan | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work addresses this gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduces a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. SMI is validated through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves R² > 0.7 in predicting QA accuracy for models above 1B parameters, without additional training. The analysis further reveals diminishing returns from scaling data and model size and provides evidence for an intrinsic upper bound on knowledge retention achievable by pre-training alone, motivating retrieval and other augmentation strategies.
VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training
Dingwei Zhu | Shihan Dou | Zhiheng Xi | Senjie Jin | Guoqiang Zhang | Jiazheng Zhang | Junjie Ye | Mingxu Chai | Enyu Zhou | Ming Zhang | Yuhui Wang | Caishuang Huang | Chenhao Huang | Yunke Zhang | Yuran Wang | Tao Gui | Qi Zhang | Xipeng Qiu | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning (RL) in real-world environments often suffers from ambiguous or incomplete reward supervision, which undermines policy stability and generalization. Such noise may cause models to ignore key information or even collapse in advantage estimation. We find that a strong value model is essential for absorbing unstable signals and producing reliable advantages, offering denser and more robust supervision than the reward model. To better optimize under noisy supervision, we propose VRPO, a framework that enhances value modeling for robust RL in LLM post-training. VRPO integrates (1) auxiliary losses guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck, enabling the value model to filter noise and capture key words. This design allows the value model to correct noisy rewards and generate more reliable advantage estimates, transforming it from a passive predictor into an active noise regulator. Experiments on multi-turn dialogue, math reasoning, and science QA with both rule-based and model-based rewards show that VRPO consistently outperforms baselines such as PPO and GRPO. Our work highlights the central role of the value model in robust RL and provides a principled and practical approach to policy optimization under noisy supervision.
2025
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Ming Zhang | Yujiong Shen | Zelin Li | Huayu Sha | Binze Hu | Yuhui Wang | Chenhao Huang | Shichun Liu | Jingqi Tong | Changhao Jiang | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2025
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks fall into three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains.
PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts
Ming Zhang | Yuhui Wang | Yujiong Shen | Tingyi Yang | Changhao Jiang | Yilong Wu | Shihan Dou | Qinhao Chen | Zhiheng Xi | Zhihao Zhang | Yi Dong | Zhen Wang | Zhihui Fei | Mingyang Wan | Tao Liang | Guojun Ma | Qi Zhang | Tao Gui | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2025
Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct the Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on the PlantUML specification, each UML flowchart is converted into atomic dialogue units, i.e., structured five-tuples. Experimental results demonstrate that a 7B model trained with merely 800 samples and a 0.5B model trained on the full dataset can both surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o by up to 43.88%, with an average of 11.00%. We further evaluate models’ performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released at https://github.com/KongLongGeFDU/PFDial.
2024
Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning Through Trap Problems
Jun Zhao | Jingqi Tong | Yurong Mou | Ming Zhang | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Human cognition exhibits systematic compositionality, the algebraic ability to generate infinite novel combinations from finite learned components, which is the key to understanding and reasoning about complex logic. In this work, we investigate the compositionality of large language models (LLMs) in mathematical reasoning. Specifically, we construct a new dataset, MathTrap, by introducing carefully designed logical traps into the problem descriptions of MATH and GSM8K. Since problems with logical flaws are quite rare in the real world, these represent “unseen” cases to LLMs. Solving these requires the models to systematically compose (1) the mathematical knowledge involved in the original problems with (2) knowledge related to the introduced traps. Our experiments show that while LLMs possess both components of requisite knowledge, they do not spontaneously combine them to handle these novel cases. We explore several methods to mitigate this deficiency, such as natural language prompts, few-shot demonstrations, and fine-tuning. We find that LLMs’ performance can be improved through such external interventions. Overall, systematic compositionality remains an open challenge for large language models.
TransferTOD: A Generalizable Chinese Multi-Domain Task-Oriented Dialogue System with Transfer Capabilities
Ming Zhang | Caishuang Huang | Yilong Wu | Shichun Liu | Huiyuan Zheng | Yurui Dong | Yujiong Shen | Shihan Dou | Jun Zhao | Junjie Ye | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Task-oriented dialogue (TOD) systems aim to efficiently handle task-oriented conversations, including information collection. How to utilize TOD accurately, efficiently and effectively for information collection has always been a critical and challenging task. Recent studies have demonstrated that Large Language Models (LLMs) excel in dialogue, instruction generation, and reasoning, and can significantly enhance the performance of TOD through fine-tuning. However, current datasets primarily cater to user-led systems and are limited to predefined specific scenarios and slots, thereby necessitating improvements in the proactiveness, diversity, and capabilities of TOD. In this study, we present a detailed multi-domain task-oriented data construction process for conversations, and a Chinese dialogue dataset generated based on this process, **TransferTOD**, which authentically simulates human-computer dialogues in 30 popular life service scenarios. Leveraging this dataset, we trained a model called **TransferTOD-7B** using full-parameter fine-tuning, showcasing notable abilities in slot filling and questioning. Our work has demonstrated its strong generalization capabilities in various downstream scenarios, significantly enhancing both data utilization efficiency and system performance. The data is released at https://github.com/KongLongGeFDU/TransferTOD.
Co-authors
- Xuan-Jing Huang (黄萱菁) 7
- Shihan Dou 6
- Tao Gui 6
- Yujiong Shen 5
- Zhiheng Xi 5
- Changhao Jiang 4
- Jingqi Tong 4
- Qi Zhang 4
- Mingxu Chai 3
- Shichun Liu 3
- Yilong Wu 3
- Junjie Ye (叶俊杰) 3
- Qi Zhang 3
- Yi Dong 2
- Caishuang Huang 2
- Chenhao Huang 2
- Huayu Sha 2
- Yuhui Wang 2
- Yuhui Wang 2
- Jun Zhao 2
- Yifei Cao 1
- Qinhao Chen 1
- Jingyi Deng 1
- Yurui Dong 1
- Baoyu Fan 1
- Xiaoran Fan 1
- Zhihui Fei 1
- Binze Hu 1
- Yueyuan Huang 1
- Senjie Jin 1
- Zelin Li 1
- Tao Liang 1
- Guojun Ma 1
- Yurong Mou 1
- Qiyuan Peng 1
- Xipeng Qiu (邱锡鹏) 1
- Jiajun Sun 1
- Kexin Tan 1
- Mingyang Wan 1
- Junzhe Wang 1
- Yuran Wang 1
- Zhen Wang 1
- Mingqi Wu 1
- Tingyi Yang 1
- Guoqiang Zhang 1
- Jiazheng Zhang 1
- Yue Zhang 1
- Yunke Zhang 1
- Zhihao Zhang 1
- Zhihao Zhang 1
- Huiyuan Zheng 1
- Enyu Zhou 1
- Dingwei Zhu 1