Yuxuan Sun
2026
SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?
Yuxuan Sun | Yuze Zhao | Yufeng Wang | Yao Du | Zhiyuan Ma | Jinbo Wang | Mengdi Zhang | Kai Zhang | Zhenya Huang
Findings of the Association for Computational Linguistics: ACL 2026
Evaluating software engineering capabilities has become a core component of modern large language models (LLMs); however, the key bottleneck hindering further scaling lies not in the scarcity of high-quality solutions, but in the lack of high-quality test suites. Test suites are indispensable both for synthesizing program repair trajectories and for providing precise feedback signals in reinforcement learning. Unfortunately, due to the high cost and difficulty of annotation, high-quality test suites have long been hard to obtain, while those automatically generated by LLMs tend to be superficial and lack sufficient discriminative power. As a first step toward constructing high-quality test suites, we introduce SWE-Mutation, a benchmark for evaluating LLM-generated test suites. The benchmark characterizes test suites by introducing systematically mutated solutions that attempt to “fool” the test suites and pass validation. We further propose an agentic, language-agnostic framework for automatically generating complex mutants. Our benchmark consists of 2,636 mutated variants derived from 800 original instances and includes a multilingual subset spanning nine programming languages. Experiments on seven LLMs reveal that even DeepSeek-V3.1 achieves only 10.20% verification and 36.15% detection rates, highlighting the inadequacy of current LLMs. Additionally, our agentic mutation strategy enhances realism, reducing average detection rates from 71.04% to 39.81% compared to conventional methods. These findings expose persistent deficiencies in the ability of current LLMs to generate reliable and discriminative test suites.
2025
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Xiang Yue | Tianyu Zheng | Yuansheng Ni | Yubo Wang | Kai Zhang | Shengbang Tong | Yuxuan Sun | Botao Yu | Ge Zhang | Huan Sun | Yu Su | Wenhu Chen | Graham Neubig
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models’ true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly “see” and “read” simultaneously, testing a core human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, with decreases ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future multimodal research.
TestAgent: An Adaptive and Intelligent Expert for Human Assessment
Junhao Yu | Yan Zhuang | Yuxuan Sun | Weibo Gao | Qi Liu | Mingyue Cheng | Zhenya Huang | Enhong Chen
Findings of the Association for Computational Linguistics: ACL 2025
Accurately assessing internal human states is key to understanding preferences, offering personalized services, and identifying challenges in real-world applications. Originating from psychometrics, adaptive testing has become the mainstream method for human measurement and has now been widely applied in education, healthcare, sports, and sociology. It customizes assessments by selecting the fewest test questions. However, current adaptive testing methods face several challenges. The mechanized nature of most algorithms leads to guessing behavior and difficulties with open-ended questions. Additionally, subjective assessments suffer from noisy response data and coarse-grained test outputs, further limiting their effectiveness. To move closer to an ideal adaptive testing process, we propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement. This is the first application of LLMs in adaptive testing. TestAgent supports personalized question selection, captures test-takers’ responses and anomalies, and provides precise outcomes through dynamic, conversational interactions. Experiments on psychological, educational, and lifestyle assessments show our approach achieves more accurate results with 20% fewer questions than state-of-the-art baselines, and testers preferred it in speed, smoothness, and other dimensions.
2020
RIVA: A Pre-trained Tweet Multimodal Model Based on Text-image Relation for Multimodal NER
Lin Sun | Jiquan Wang | Yindu Su | Fangsheng Weng | Yuxuan Sun | Zengwei Zheng | Yuanyi Chen
Proceedings of the 28th International Conference on Computational Linguistics
Multimodal named entity recognition (MNER) for tweets has received increasing attention recently. Most multimodal methods use attention mechanisms to capture text-related visual information. However, unrelated or weakly related text-image pairs account for a large proportion of tweets. Visual clues unrelated to the text can have uncertain or even negative effects on multimodal model learning. In this paper, we propose a novel pre-trained multimodal model based on Relationship Inference and Visual Attention (RIVA) for tweets. The RIVA model controls the attention-based visual clues with a gate that reflects the role of the image in the semantics of the text. We use a teacher-student semi-supervised paradigm to leverage a large unlabeled multimodal tweet corpus together with a labeled data set for text-image relation classification. In the multimodal NER task, the experimental results show the significance of text-related visual features for the visual-linguistic model, and our approach achieves SOTA performance on the MNER datasets.
Co-authors
- Zhenya Huang 2
- Yuanyi Chen 1
- Wenhu Chen 1
- Enhong Chen 1
- Mingyue Cheng 1
- Yao Du 1
- Weibo Gao 1
- Qi Liu 1
- Zhiyuan Ma 1
- Graham Neubig 1
- Yuansheng Ni 1
- Yindu Su 1
- Yu Su 1
- Lin Sun 1
- Huan Sun 1
- Shengbang Tong 1
- Yufeng Wang 1
- Jinbo Wang 1
- Jiquan Wang 1
- Yubo Wang 1
- Fangsheng Weng 1
- Botao Yu 1
- Junhao Yu 1
- Xiang Yue 1
- Mengdi Zhang 1
- Kai Zhang 1
- Kai Zhang 1
- Ge Zhang 1
- Yuze Zhao 1
- Zengwei Zheng 1
- Tianyu Zheng 1
- Yan Zhuang 1