Chenyuan Yang
2026
MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
Junjie Ye | Caishuang Huang | Zhuohan Chen | Wenjie Fu | Chenyuan Yang | Leyi Yang | Yilong Wu | Peng Wang | Meng Zhou | Xiaolong Yang | Tao Gui | Qi Zhang | Zhongchao Shi | Jianping Fan | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Instruction following refers to the ability of large language models (LLMs) to generate outputs that satisfy all specified constraints. Existing research has primarily focused on constraint categories, offering limited evaluation dimensions and little guidance for improving instruction-following abilities. To address this gap, we introduce MulDimIF, a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Based on this framework, we design a controllable instruction generation pipeline. Through constraint expansion, conflict detection, and instruction rewriting, we construct 9,106 code-verifiable samples. We evaluate 18 LLMs from six model families and find marked performance differences across constraint settings. For instance, average accuracy decreases from 80.82% at Level I to 36.76% at Level IV. Moreover, training with data generated by our framework significantly improves instruction following without compromising general performance. In-depth analysis indicates that these gains stem largely from parameter updates in attention modules, which strengthen constraint recognition and adherence. Code and data are available at https://github.com/Junjie-Ye/MulDimIF.
2025
TestEval: Benchmarking Large Language Models for Test Case Generation
Wenhan Wang | Chenyuan Yang | Zhijie Wang | Yuheng Huang | Zhaoyang Chu | Da Song | Lingming Zhang | An Ran Chen | Lei Ma
Findings of the Association for Computational Linguistics: NAACL 2025
For programming languages, testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program under test. Recently, researchers have recognized the potential of large language models (LLMs) in software testing. However, there remains a lack of fair comparisons between different LLMs in terms of test case generation capabilities. In this paper, we propose TestEval, a novel benchmark for test case generation with LLMs. We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. We further evaluate 17 popular LLMs, including both commercial and open-source ones, on TestEval. We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs, indicating a lack of ability to comprehend program logic and execution paths.