Yuansheng Ni
2026
SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding
Songcheng Cai | Zhiheng Lyu | Yuansheng Ni | Xiangchao Chen | Baichuan Zhou | Shenzhe Zhu | Yi Lu | Haozhe Wang | Chi Ruan | Benjamin Schneider | Weixu Zhang | Xiang Li | Andy Zheng | Yuyu Zhang | Ping Nie | Wenhu Chen
Findings of the Association for Computational Linguistics: ACL 2026
Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook long-tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.
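The difficulty calibration step described in this abstract (dropping questions that direct-answer baselines already solve) can be illustrated with a minimal sketch. The `Question` type, the `direct_scores` field, and the `solve_threshold` value are hypothetical names and settings chosen for illustration; the abstract does not specify how "solvable" is scored or thresholded.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    qid: str
    # Hypothetical: one score in [0, 1] per direct-answer baseline,
    # e.g. assigned by a grader comparing the answer to a reference.
    direct_scores: list[float] = field(default_factory=list)


def calibrate(questions: list[Question], solve_threshold: float = 0.8) -> list[Question]:
    """Keep only questions that no direct-answer baseline solves.

    A question counts as "solved" when its best baseline score reaches
    solve_threshold (an assumed cutoff, not taken from the paper).
    """
    return [
        q for q in questions
        if max(q.direct_scores, default=0.0) < solve_threshold
    ]


pool = [
    Question("q1", [0.9, 0.4]),  # one baseline solves it -> filtered out
    Question("q2", [0.3, 0.5]),  # no baseline solves it -> kept
    Question("q3", []),          # no baseline attempts -> kept
]
kept_ids = [q.qid for q in calibrate(pool)]
print(kept_ids)
```

Filtering on the best baseline score, rather than the mean, is the stricter reading of "solvable by direct-answer baselines": a single baseline that answers correctly is enough to remove the question.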
2025
VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation
Yuansheng Ni | Ping Nie | Kai Zou | Xiang Yue | Wenhu Chen
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) often struggle with visualization tasks such as plotting diagrams and charts, where success depends on both code correctness and visual semantics. Existing instruction-tuning datasets lack execution-grounded supervision and offer limited support for iterative code correction, resulting in fragile and unreliable plot generation. We present **VisCode-200K**, a large-scale instruction tuning dataset for Python-based visualization and self-correction. It contains over 200K examples from two sources: (1) validated plotting code from open-source repositories, paired with natural language instructions and rendered plots; and (2) 45K multi-turn correction dialogues from Code-Feedback, enabling models to revise faulty code using runtime feedback. We fine-tune Qwen2.5-Coder-Instruct on VisCode-200K to create **VisCoder**, and evaluate it on PandasPlotBench. VisCoder significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4o-mini. We further adopt a self-debug evaluation protocol to assess iterative repair, demonstrating the benefits of feedback-driven learning for executable, visually accurate code generation.
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Xiang Yue | Tianyu Zheng | Yuansheng Ni | Yubo Wang | Kai Zhang | Shengbang Tong | Yuxuan Sun | Botao Yu | Ge Zhang | Huan Sun | Yu Su | Wenhu Chen | Graham Neubig
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models’ true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly “see” and “read” simultaneously, testing a core human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future multimodal research.
2024
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
Xuan He | Dongfu Jiang | Ge Zhang | Max Ku | Achint Soni | Sherman Siu | Haonan Chen | Abhranil Chandra | Ziyan Jiang | Aaran Arulraj | Kai Wang | Quy Duc Do | Yuansheng Ni | Bohan Lyu | Yaswanth Narsupalli | Rongqi Fan | Zhiheng Lyu | Bill Yuchen Lin | Wenhu Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metrics is able to provide reliable scores over generated videos. The main barrier is the lack of a large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman’s correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further results on other held-out EvalCrafter, GenAI-Bench, and VBench show that VideoScore has consistently much higher correlation with human judges than other metrics. Due to these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress and (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
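The agreement measure this abstract reports is Spearman's rank correlation between metric scores and human ratings. The following is a minimal pure-Python sketch of that statistic, not the authors' evaluation code; the sample `human` and `metric` values are invented for illustration. The coefficient lies in [-1, 1], so a reported figure such as 77.1 presumably corresponds to the coefficient multiplied by 100.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: human ratings vs. an automatic metric on five videos.
human = [1, 2, 3, 4, 5]
metric = [0.10, 0.35, 0.30, 0.80, 0.90]
rho = spearman(human, metric)
print(round(rho, 3))
```

Because the statistic depends only on ranks, it rewards a metric for ordering videos the way human raters do, regardless of the metric's numeric scale.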
EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models
Peng Wang | Ningyu Zhang | Bozhong Tian | Zekun Xi | Yunzhi Yao | Ziwen Xu | Mengru Wang | Shengyu Mao | Xiaohan Wang | Siyuan Cheng | Kangwei Liu | Yuansheng Ni | Guozhou Zheng | Huajun Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Large Language Models (LLMs) usually suffer from knowledge cutoff or fallacy issues, which means they are unaware of unseen events or generate text with incorrect facts owing to outdated/noisy data. To this end, many knowledge editing approaches for LLMs have emerged, aiming to subtly inject/edit updated knowledge or adjust undesired behavior while minimizing the impact on unrelated inputs. Nevertheless, due to significant differences among various knowledge editing methods and the variations in task setups, there is no standard implementation framework available for the community, which hinders practitioners from applying knowledge editing to applications. To address these issues, we propose EasyEdit, an easy-to-use knowledge editing framework for LLMs. It supports various cutting-edge knowledge editing approaches and can be readily applied to many well-known LLMs such as T5, GPT-J, LLaMA, etc. Empirically, we report the knowledge editing results on LLaMA-2 with EasyEdit, demonstrating that knowledge editing surpasses traditional fine-tuning in terms of reliability and generalization. We have released the source code on GitHub, along with Google Colab tutorials and comprehensive documentation for beginners to get started. In addition, we present an online system for real-time knowledge editing and a demo video.
Co-authors
- Wenhu Chen 4
- Zhiheng Lyu 2
- Ping Nie 2
- Xiang Yue 2
- Ge Zhang 2
- Aaran Arulraj 1
- Songcheng Cai 1
- Abhranil Chandra 1
- Haonan Chen 1
- Huajun Chen 1
- Xiangchao Chen 1
- Siyuan Cheng 1
- Quy Duc Do 1
- Rongqi Fan 1
- Xuan He 1
- Dongfu Jiang 1
- Ziyan Jiang 1
- Max Ku 1
- Xiang Li 1
- Bill Yuchen Lin 1
- Kangwei Liu 1
- Yi Lu 1
- Bohan Lyu 1
- Shengyu Mao 1
- Yaswanth Narsupalli 1
- Graham Neubig 1
- Chi Ruan 1
- Benjamin Schneider 1
- Sherman Siu 1
- Achint Soni 1
- Yu Su 1
- Huan Sun 1
- Yuxuan Sun 1
- Bozhong Tian 1
- Shengbang Tong 1
- Haozhe Wang 1
- Kai Wang 1
- Mengru Wang 1
- Peng Wang 1
- Xiaohan Wang 1
- Yubo Wang 1
- Zekun Xi 1
- Ziwen Xu 1
- Yunzhi Yao 1
- Botao Yu 1
- Kai Zhang 1
- Ningyu Zhang 1
- Weixu Zhang 1
- Yuyu Zhang 1
- Andy Zheng 1
- Guozhou Zheng 1
- Tianyu Zheng 1
- Baichuan Zhou 1
- Shenzhe Zhu 1
- Kai Zou 1