2025
pdf
bib
abs
Velocitune: A Velocity-based Dynamic Domain Reweighting Method for Continual Pre-training
Zheheng Luo
|
Xin Zhang
|
Xiao Liu
|
Haoling Li
|
Yeyun Gong
|
Qi Chen
|
Peng Cheng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
It is well-known that a diverse corpus is critical for training large language models, which are typically constructed from a mixture of various domains. In general, previous efforts resort to either sampling training data from different domains with static proportions or dynamically adjusting these proportions during training to optimise pretraining performance. However, few methods addressed the complexities of domain-adaptive continual pre-training. To fill this gap, we propose Velocitune, a novel framework that dynamically assesses learning velocity and adjusts data proportions accordingly, favouring slower learning domains while de-emphasising faster learning ones, which is guided by a scaling law to estimate the desired learning goal for each domain with a less associated cost. To evaluate the effectiveness of Velocitune, we conduct experiments on a dataset focused on reasoning tasks with CodeLlama, as well as on a corpus of system commands using Llama3 and Mistral. Velocitune achieves performance gains in both math and code reasoning tasks and command-line generation benchmarks. Further analysis reveals that key factors driving Velocitune’s effectiveness include target estimation and data ordering.
pdf
bib
abs
Teaching Your Models to Understand Code via Focal Preference Alignment
Jie Wu
|
Haoling Li
|
Xin Zhang
|
Xiao Liu
|
Yangyu Huang
|
Jianwen Luo
|
Yizhen Zhang
|
Zuchao Li
|
Ruihang Chu
|
Yujiu Yang
|
Scarlett Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated based on test case success rates, with the candidate demonstrating a higher pass rate being labeled as positive and its counterpart with a lower pass rate as negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships. As a result, the model is unable to learn more informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To facilitate it, we introduce the CodeFlow dataset, where samples are iteratively refined until passing tests, with modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, model and datasets are in: https://github.com/JieWu02/Target-DPO.
pdf
bib
abs
Process-based Self-Rewarding Language Models
Shimao Zhang
|
Xiao Liu
|
Xin Zhang
|
Junxiao Liu
|
Zheheng Luo
|
Shujian Huang
|
Yeyun Gong
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. Human-annotated preference data is used for training to further improve LLMs’ performance, which is constrained by the upper limit of human performance. Therefore, Self-Rewarding method has been proposed, where LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the immense potential of process-based self-rewarding to achieve LLM reasoning that may surpass human capabilities.
pdf
bib
abs
Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning
Sungjin Park
|
Xiao Liu
|
Yeyun Gong
|
Edward Choi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Despite recent advances in large language models, open-source models often struggle to consistently perform well on complex reasoning tasks. Existing ensemble methods, whether applied at the token or output levels, fail to address these challenges. In response, we present Language model Ensemble with Monte Carlo Tree Search (LE-MCTS), a novel framework for process-level ensembling of language models. LE-MCTS formulates step-by-step reasoning with an ensemble of language models as a Markov decision process. In this framework, states represent intermediate reasoning paths, while actions consist of generating the next reasoning step using one of the language models selected from a predefined pool. Guided by a process-based reward model, LE-MCTS performs a tree search over the reasoning steps generated by different language models, identifying the most accurate reasoning chain. Experimental results on five mathematical reasoning benchmarks demonstrate that our approach outperforms both single language model decoding algorithms and language model ensemble methods. Notably, LE-MCTS improves performance by 3.6% and 4.3% on the MATH and MQA datasets, respectively, highlighting its effectiveness in solving complex reasoning problems.
2024
pdf
bib
abs
DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
Yiming Huang
|
Jianwen Luo
|
Yan Yu
|
Yitong Zhang
|
Fangyu Lei
|
Yifan Wei
|
Shizhu He
|
Lifu Huang
|
Xiao Liu
|
Jun Zhao
|
Kang Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages, including Python and SQL, to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously designed the evaluation suite to ensure the accuracy and robustness of the evaluation. We developed the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at [link](https://github.com/yiyihum/dabench)
pdf
bib
abs
Competition-Level Problems are Effective LLM Evaluators
Yiming Huang
|
Zhenghao Lin
|
Xiao Liu
|
Yeyun Gong
|
Shuai Lu
|
Fangyu Lei
|
Yaobo Liang
|
Yelong Shen
|
Chen Lin
|
Nan Duan
|
Weizhu Chen
Findings of the Association for Computational Linguistics: ACL 2024
Large language models (LLMs) have demonstrated impressive reasoning capabilities, yet there is ongoing debate about these abilities and the potential data contamination problem recently. This paper aims to evaluate the reasoning capacities of LLMs, specifically in solving recent competition-level programming problems in Codeforces, which are expert-crafted and unique, requiring deep understanding and robust reasoning skills. We first provide a comprehensive evaluation of GPT-4’s perceived zero-shot performance on this task, considering various aspects such as problems’ release time, difficulties, and types of errors encountered. Surprisingly, the perceived performance of GPT-4 has experienced a cliff like decline in problems after September 2021 consistently across all the difficulties and types of problems, which shows the potential data contamination, as well as the challenges for any existing LLM to solve unseen complex reasoning problems. We further explore various approaches such as fine-tuning, Chain-of-Thought prompting and problem description simplification. Unfortunately, none of them is able to consistently mitigate the challenges. Through our work, we emphasize the importance of this excellent data source for assessing the genuine reasoning capabilities of LLMs, and foster the development of LLMs with stronger reasoning abilities and better generalization in the future.
2023
pdf
bib
abs
Boosting Event Extraction with Denoised Structure-to-Text Augmentation
Bo Wang
|
Heyan Huang
|
Xiaochi Wei
|
Ge Shi
|
Xiao Liu
|
Chong Feng
|
Tong Zhou
|
Shuaiqiang Wang
|
Dawei Yin
Findings of the Association for Computational Linguistics: ACL 2023
Event extraction aims to recognize pre-defined event triggers and arguments from texts, which suffer from the lack of high-quality annotations. In most NLP applications, involving a large scale of synthetic training data is a practical and effective approach to alleviate the problem of data scarcity. However, when applying to the task of event extraction, recent data augmentation methods often neglect the problem of grammatical incorrectness, structure misalignment, and semantic drifting, leading to unsatisfactory performances. In order to solve these problems, we propose a denoised structure-to-text augmentation framework for event extraction (DAEE), which generates additional training data through the knowledge-based structure-to-text generation model and selects the effective subset from the generated data iteratively with a deep reinforcement learning agent. Experimental results on several datasets demonstrate that the proposed method generates more diverse text representations for event extraction and achieves comparable results with the state-of-the-art.
2022
pdf
bib
abs
Dynamic Prefix-Tuning for Generative Template-based Event Extraction
Xiao Liu
|
Heyan Huang
|
Ge Shi
|
Bo Wang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We consider event extraction in a generative manner with template-based conditional generation. Although there is a rising trend of casting the task of event extraction as a sequence generation problem with prompts, these generation-based methods have two significant challenges, including using suboptimal prompts and static event type information. In this paper, we propose a generative template-based event extraction method with dynamic prefix (GTEE-DynPref) by integrating context information with type-specific prefixes to learn a context-specific prefix for each context. Experimental results show that our model achieves competitive results with the state-of-the-art classification-based model OneIE on ACE 2005 and achieves the best performances on ERE.Additionally, our model is proven to be portable to new types of events effectively.