2025
pdf
bib
abs
Revisit Self-Debugging with Self-Generated Tests for Code Generation
Xiancai Chen
|
Zhengwei Tao
|
Kechi Zhang
|
Changzhi Zhou
|
Xinyu Zhang
|
Wanli Gu
|
Yuanpeng He
|
Mengdi Zhang
|
Xunliang Cai
|
Haiyan Zhao
|
Zhi Jin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have demonstrated significant advancements in code generation, yet they still face challenges when tackling tasks that extend beyond their basic capabilities. Recently, the concept of self-debugging has been proposed as a way to enhance code generation performance by leveraging execution feedback from tests. However, the availability of high-quality tests in real-world scenarios is often limited. In this context, self-debugging with self-generated tests emerges as a promising solution, though its limitations and practical potential have not been fully explored. To address this gap, we investigate the efficacy of self-debugging in code generation tasks. We propose and analyze two distinct paradigms for the self-debugging process: post-execution and in-execution self-debugging. Our findings reveal that post-execution self-debugging struggles with the test bias introduced by self-generated tests, which can lead to misleading feedback. In contrast, in-execution self-debugging enables LLMs to mitigate this bias and leverage intermediate states during program execution. By focusing on runtime information rather than relying solely on potentially flawed self-generated tests, this approach demonstrates significant promise for improving the robustness and accuracy of LLMs in code generation tasks.
pdf
bib
abs
The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights
Yufang Liu
|
Yao Du
|
Tao Ji
|
Jianing Wang
|
Yang Liu
|
Yuanbin Wu
|
Aimin Zhou
|
Mengdi Zhang
|
Xunliang Cai
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent research has increasingly focused on multimodal mathematical reasoning, particularly emphasizing the creation of relevant datasets and benchmarks. Despite this, the role of visual information in reasoning has been underexplored. Our findings show that existing multimodal mathematical models minimally leverage visual information, and model performance remains largely unaffected by changes to or removal of images in the dataset. We attribute this to the dominance of textual information and answer options that inadvertently guide the model to correct answers. To improve evaluation methods, we introduce the HC-M3D dataset, specifically designed to require image reliance for problem-solving and to challenge models with similar, yet distinct, images that change the correct answer. In testing leading models, their failure to detect these subtle visual differences suggests limitations in current visual perception capabilities. Additionally, we observe that the common approach of improving general VQA capabilities by combining various types of image encoders does not contribute to math reasoning performance. This finding also presents a challenge to enhancing visual reliance during math reasoning.
pdf
bib
abs
LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning
Jin Jiang
|
Yuchen Yan
|
Yang Liu
|
Jianing Wang
|
Shuai Peng
|
Xunliang Cai
|
Yixin Cao
|
Mengdi Zhang
|
Liangcai Gao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In this paper, we propose a new data synthesis method called LogicPro, which leverages LeetCode-style algorithm Problems and their corresponding Program solutions to synthesize Complex Logical Reasoning data in text format. First, we synthesize complex reasoning problems through source algorithm problems and test cases. Then, standard answers and intermediate variable outputs are obtained for each problem based on standard python solutions and test cases. Finally, with the guidance of code intermediate variables, we synthesize the text reasoning process for each reasoning problems. Through this method, we can synthesize data that is difficult, scalable, effective, and comes with golden standard answers and high-quality reasoning processes. As a result, with our 540K synthesized dataset constructed solely from 2,360 algorithm problems, our approach achieves significant improvements in multiple models for the datasets BBH^27, LogicBench, DROP, AR-LSAT, and GSM8K, etc. outperforming a wide range of existing reasoning datasets.
2024
pdf
bib
abs
DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning
Yejie Wang
|
Keqing He
|
Guanting Dong
|
Pei Wang
|
Weihao Zeng
|
Muxi Diao
|
Weiran Xu
|
Jingang Wang
|
Mengdi Zhang
|
Xunliang Cai
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code Large Language Models (Code LLMs) have demonstrated outstanding performance in code-related tasks. Various instruction finetuning approaches have been proposed to boost the code generation performance of pre-trained Code LLMs. In this paper, we introduce a diverse instruction model DolphCoder with self-evaluating for code generation. It learns diverse instruction targets and combines a code evaluation objective to enhance its code generation ability. Our model achieves superior performance on the HumanEval and MBPP benchmarks, demonstrating new insights for future code instruction tuning work. Our key findings are: (1) Augmenting more diverse responses with more distinct reasoning paths increases the code capability of LLMs. (2) Improving one’s ability to evaluate the correctness of code also enhances their ability to create it.
pdf
bib
abs
How Do Your Code LLMs perform? Empowering Code Instruction Tuning with Really Good Data
Yejie Wang
|
Keqing He
|
Dayuan Fu
|
Zhuoma GongQue
|
Heyang Xu
|
Yanxu Chen
|
Zhexu Wang
|
Yujia Fu
|
Guanting Dong
|
Muxi Diao
|
Jingang Wang
|
Mengdi Zhang
|
Xunliang Cai
|
Weiran Xu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recently, there has been a growing interest in studying how to construct better code instruction tuning data. However, we observe Code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which dataset genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show Xcoder achieves new state-of-the-art performance using fewer training data, which verify the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis on the data composition and find existing code datasets have different characteristics according to their construction methods, which provide new insights for future code LLMs.
2022
pdf
bib
abs
Incorporating Dynamic Semantics into Pre-Trained Language Model for Aspect-based Sentiment Analysis
Kai Zhang
|
Kun Zhang
|
Mengdi Zhang
|
Hongke Zhao
|
Qi Liu
|
Wei Wu
|
Enhong Chen
Findings of the Association for Computational Linguistics: ACL 2022
Aspect-based sentiment analysis (ABSA) predicts sentiment polarity towards a specific aspect in the given sentence. While pre-trained language models such as BERT have achieved great success, incorporating dynamic semantic changes into ABSA remains challenging. To this end, in this paper, we propose to address this problem by Dynamic Re-weighting BERT (DR-BERT), a novel method designed to learn dynamic aspect-oriented semantics for ABSA. Specifically, we first take the Stack-BERT layers as a primary encoder to grasp the overall semantic of the sentence and then fine-tune it by incorporating a lightweight Dynamic Re-weighting Adapter (DRA). Note that the DRA can pay close attention to a small region of the sentences at each step and re-weigh the vitally important words for better aspect-aware sentiment understanding. Finally, experimental results on three benchmark datasets demonstrate the effectiveness and the rationality of our proposed model and provide good interpretable insights for future semantic modeling.
2015
pdf
bib
Learning Topic Hierarchies for Wikipedia Categories
Linmei Hu
|
Xuzhong Wang
|
Mengdi Zhang
|
Juanzi Li
|
Xiaoli Li
|
Chao Shao
|
Jie Tang
|
Yongbin Liu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)