2025
WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models
Huawen Feng | Pu Zhao | Qingfeng Sun | Can Xu | Fangkai Yang | Lu Wang | Qianli Ma | Qingwei Lin | Saravan Rajmohan | Dongmei Zhang | Qi Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite recent progress achieved by code large language models (LLMs), their remarkable abilities are largely dependent on fine-tuning on high-quality data, posing challenges for data collection and annotation. To address this, current methods often design various data flywheels to collect complex code instructions, enabling models to handle more intricate tasks. However, these approaches typically rely on off-the-shelf datasets and data augmentation from a limited set of proprietary LLMs (e.g., Claude, GPT-4, and so on), which restricts the diversity of the constructed data and makes it prone to systemic biases. In this paper, we propose **WarriorCoder**, a novel paradigm that learns from expert battles to address these limitations. Specifically, we create an arena where leading expert code LLMs challenge each other, with evaluations conducted by impartial judges. This competitive framework generates novel training data from scratch, leveraging the strengths of all participants. Experimental results show that **WarriorCoder** achieves state-of-the-art performance compared to previous models of the same size, even without relying on proprietary LLMs.
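A minimal sketch of the battle-and-judge loop described above, with hypothetical `expert_answer` and `judge_vote` callables standing in for the expert code LLMs and the impartial judges:

```python
import itertools
from collections import defaultdict

def expert_answer(expert: str, instruction: str) -> str:
    """Stand-in for querying an expert code LLM (hypothetical)."""
    return f"{expert}'s solution to: {instruction}"

def judge_vote(instruction: str, answer_a: str, answer_b: str) -> int:
    """Stand-in for an impartial judge LLM; 0 means A wins, 1 means B wins."""
    return 0 if len(answer_a) >= len(answer_b) else 1

def run_arena(experts, instructions):
    """Pair experts on each instruction and keep the judged winner's
    answer as a new (instruction, response) training example."""
    training_data, scores = [], defaultdict(int)
    for instruction in instructions:
        for a, b in itertools.combinations(experts, 2):
            answers = [expert_answer(a, instruction), expert_answer(b, instruction)]
            winner = judge_vote(instruction, *answers)
            scores[(a, b)[winner]] += 1
            training_data.append({"instruction": instruction,
                                  "response": answers[winner]})
    return training_data, scores

data, scores = run_arena(["expert-A", "expert-B", "expert-C"],
                         ["Reverse a linked list in O(1) space."])
print(len(data), dict(scores))
```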
ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation
Minghua He | Yue Chen | Fangkai Yang | Pu Zhao | Wenjie Yin | Yu Kang | Qingwei Lin | Saravan Rajmohan | Dongmei Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Code translation is a crucial activity in the software development and maintenance process, and researchers have recently begun to focus on using pre-trained large language models (LLMs) for code translation. However, existing LLMs only learn the contextual semantics of code during pre-training, neglecting executability information closely related to the execution state of the code, which results in unguaranteed code executability and unreliable automated code translation. To address this issue, we propose ExeCoder, an LLM specifically designed for code translation, aimed at utilizing executability representations such as functional semantics, syntax structures, and variable dependencies to enhance the capabilities of LLMs in code translation. To evaluate the effectiveness of ExeCoder, we manually enhanced the widely used benchmark TransCoder-test, resulting in TransCoder-test-X, a benchmark suited to evaluating LLMs. Evaluation on TransCoder-test-X indicates that ExeCoder achieves state-of-the-art performance in code translation, surpassing existing open-source code LLMs by 10.88% to 38.78% and by 27.44% to 42.97% on two metrics, and even outperforming the renowned closed-source LLM GPT-4o. Code is available at https://aka.ms/execoder
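As a rough illustration of one executability signal named above, variable dependencies, the sketch below extracts a dependency map from Python source using the standard `ast` module; the paper's actual representations and target languages may differ:

```python
import ast

def variable_dependencies(source: str) -> dict:
    """Map each assigned variable to the names its defining
    expression reads: one rough form of a dependency graph."""
    deps = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            reads = {n.id for n in ast.walk(node.value)
                     if isinstance(n, ast.Name)}
            for target in node.targets:
                if isinstance(target, ast.Name):
                    deps[target.id] = sorted(reads)
    return deps

print(variable_dependencies("a = 1\nb = a + 2\nc = a * b"))
# {'a': [], 'b': ['a'], 'c': ['a', 'b']}
```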
Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation
Kaikai An | Fangkai Yang | Liqun Li | Junting Lu | Sitao Cheng | Shuzheng Si | Lu Wang | Pu Zhao | Lele Cao | Qingwei Lin | Saravan Rajmohan | Dongmei Zhang | Baobao Chang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recent advances in retrieval-augmented generation (RAG) have substantially improved question-answering systems, particularly for factoid ‘5Ws’ questions. However, significant challenges remain when addressing ‘1H’ questions, specifically how-to questions, which are integral for decision-making and require dynamic, step-by-step responses. The key limitation lies in the prevalent chunk-based data organization paradigm, which divides documents into fixed-size segments and disrupts the logical coherence and connections within the context. To address this, we propose THREAD, a novel data organization paradigm enabling systems to handle how-to questions more effectively. Specifically, we introduce a new knowledge granularity, the ‘logic unit’ (LU), where large language models transform documents into more structured and loosely interconnected LUs. Extensive experiments across both open-domain and industrial settings show that THREAD significantly outperforms existing paradigms, improving the success rate of handling how-to questions by 21% to 33%. Additionally, THREAD demonstrates high adaptability across diverse document formats, reducing retrieved information by up to 75% compared to chunk-based organization, and also generalizes better to ‘5Ws’ questions, such as multi-hop questions, outperforming other paradigms.
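A minimal sketch of the logic-unit idea, assuming an illustrative `LogicUnit` structure with header, body, and linker fields; answering a how-to question then follows linker edges between units instead of re-ranking fixed-size chunks:

```python
from dataclasses import dataclass, field

@dataclass
class LogicUnit:
    uid: str
    header: str            # what this step accomplishes
    body: str              # the instructions themselves
    linker: list = field(default_factory=list)  # uids of possible next steps

def walk_units(units: dict, start_uid: str) -> list:
    """Assemble a step-by-step answer by following linker edges."""
    steps, uid = [], start_uid
    while uid is not None:
        unit = units[uid]
        steps.append(f"{unit.header}: {unit.body}")
        uid = unit.linker[0] if unit.linker else None
    return steps

units = {
    "lu1": LogicUnit("lu1", "Check service status", "Run the health probe.", ["lu2"]),
    "lu2": LogicUnit("lu2", "Restart if unhealthy", "Issue a rolling restart.", []),
}
print("\n".join(walk_units(units, "lu1")))
```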
Token-level Proximal Policy Optimization for Query Generation
Yichen Ouyang | Lu Wang | Fangkai Yang | Pu Zhao | Chenghua Huang | Jianfeng Liu | Bochen Pang | Yaming Yang | Yuefeng Zhan | Hao Sun | Qingwei Lin | Saravan Rajmohan | Weiwei Deng | Dongmei Zhang | Feng Sun
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Query generation is a critical task for web search engines (e.g., Google, Bing) and recommendation systems. Recently, state-of-the-art query generation methods leverage Large Language Models (LLMs) for their strong capabilities in context understanding and text generation. However, they still face challenges in generating high-quality queries in terms of inferring user intent from their web search interaction history. In this paper, we propose Token-level Proximal Policy Optimization (TPPO), a novel approach designed to empower LLMs to perform better in query generation through fine-tuning. TPPO is based on the Reinforcement Learning from AI Feedback (RLAIF) paradigm, consisting of a token-level reward model and a token-level proximal policy optimization module to address the sparse reward challenge in traditional RLAIF frameworks. We conducted experiments on both an open-source dataset and an industrial dataset collected from a globally used search engine, demonstrating that TPPO significantly improves the performance of query generation for LLMs and outperforms its existing competitors.
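To make the sparse-versus-dense reward point concrete, the sketch below assigns placeholder per-token rewards and computes advantages with generalized advantage estimation (GAE), a standard choice for PPO-style training; the paper's reward model and exact objective are not reproduced here:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over per-token rewards."""
    adv, running = np.zeros_like(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Sparse, sequence-level reward (classic RLAIF): signal only at the final token.
sparse = np.array([0.0, 0.0, 0.0, 1.0])
# Dense token-level rewards from a token-level reward model (placeholder values).
dense = np.array([0.2, 0.1, 0.4, 0.3])
values = np.array([0.5, 0.4, 0.3, 0.2])

print(gae_advantages(sparse, values))  # early tokens get only discounted credit
print(gae_advantages(dense, values))   # every token receives direct feedback
```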
AdaptFlow: Adaptive Workflow Optimization via Meta-Learning
Runchuan Zhu | Bowen Jiang | Lingrui Mei | Fangkai Yang | Lu Wang | Haoxiang Gao | Fengshuo Bai | Pu Zhao | Qingwei Lin | Saravan Rajmohan | Dongmei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advances in large language models (LLMs) have sparked growing interest in agentic workflows—structured sequences of LLM invocations designed to solve complex tasks. However, existing approaches often rely on static templates or manually designed workflows, which limit adaptability to diverse tasks and hinder scalability. We propose AdaptFlow, a natural language-based meta-learning framework inspired by model-agnostic meta-learning (MAML). AdaptFlow uses a bi-level optimization process: the inner loop performs task-specific adaptation via LLM-generated feedback, while the outer loop consolidates these refinements into a shared, generalizable initialization. Evaluated across question answering, code generation, and mathematical reasoning benchmarks, AdaptFlow consistently outperforms both manually crafted and automatically searched baselines, achieving state-of-the-art results with strong generalization across tasks and models.
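A rough sketch of the bi-level loop the abstract outlines, with hypothetical `adapt` and `consolidate` stubs standing in for the LLM-generated inner-loop feedback and outer-loop merge:

```python
def adapt(workflow: str, task: str) -> str:
    """Inner loop: task-specific refinement; in AdaptFlow this edit
    comes from LLM-generated feedback (stubbed here)."""
    return workflow + f" | tuned-for:{task}"

def consolidate(workflow: str, adapted: list) -> str:
    """Outer loop: fold task-specific refinements back into the shared
    initialization; in AdaptFlow an LLM performs this merge (stubbed)."""
    tasks = ",".join(a.rsplit(":", 1)[1] for a in adapted)
    return workflow + f" | generalized-over:[{tasks}]"

workflow = "plan -> call-tool -> verify"  # shared workflow initialization
for epoch in range(2):
    adapted = [adapt(workflow, task) for task in ("qa", "code", "math")]
    workflow = consolidate(workflow, adapted)
print(workflow)
```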
Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation
Xing Zhang | Jiaheng Wen | Fangkai Yang | Yu Kang | Pu Zhao | Junhao Wang | Maoquan Wang | Yufan Huang | Shengyu Fu | Elsie Nallipogu | Qingwei Lin | Yingnong Dang | Saravan Rajmohan | Dongmei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Code translation benchmarks are essential for evaluating the accuracy and efficiency of LLM-based systems. Existing benchmarks mainly target individual functions, overlooking repository-level challenges like intermodule coherence and dependency management. Recent repository-level efforts exist, but suffer from poor maintainability and coarse evaluation granularity. We introduce Skeleton-Guided-Translation, a framework for benchmarking Java-to-C# translation at the repository level, featuring fine-grained quality evaluation. It follows a two-step process: first translating repository “skeletons”, then refining the entire repository guided by these skeletons. Based on this, we present TRANSREPO-BENCH, the first test-driven benchmark of high-quality Java repositories paired with C# skeletons, unit tests, and build configurations. Our adaptive unit tests support multiple and incremental translations without manual tuning, enhancing automation and scalability. We also propose fine-grained metrics that evaluate translation quality per test case, overcoming limitations of binary metrics in distinguishing build failures. Evaluations using TRANSREPO-BENCH reveal issues like broken cross-file references, showing that our structured approach reduces dependency errors and preserves interface consistency.
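A small sketch of the contrast between a binary build metric and per-test-case scoring, with illustrative test names and outcomes:

```python
def fine_grained_score(results: dict) -> float:
    """Score a translated repository as the fraction of unit tests that
    pass, instead of a single build-success bit that hides partial progress."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Per-test outcomes for a translated repository (illustrative).
results = {"test_parse": True, "test_serialize": True,
           "test_concurrency": False, "test_io": True}
binary = all(results.values())              # classic all-or-nothing metric
print(binary, fine_grained_score(results))  # False 0.75
```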
ICL-Bandit: Relevance Labeling in Advertisement Recommendation Systems via LLM
Lu Wang | Chiming Duan | Pu Zhao | Fangkai Yang | Yong Shi | Xuefeng Luo | Bingjing Xu | Weiwei Deng | Qingwei Lin | Dongmei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Measuring the relevance between user queries and advertisements is a critical task for advertisement (ad) recommendation systems, such as Microsoft Bing Ads and Google Ads. Traditionally, this requires expert data labeling, which is both costly and time-consuming. Recent advances have explored using Large Language Models (LLMs) for labeling, but these models often lack domain-specific knowledge. In-context learning (ICL), which involves providing a few demonstrations, is a common practice to enhance LLM performance on domain-specific tasks. However, retrieving high-quality demonstrations in a vast exploration space remains challenging. In this paper, we introduce ICL-Bandit, a practical and effective approach that leverages ICL to enhance the query-ad relevance labeling capabilities of LLMs. We develop a novel bandit learning method to identify and provide superior demonstrations for ICL, thereby improving labeling performance. Experimental results demonstrate that ICL-Bandit achieves state-of-the-art performance compared to existing methods. Additionally, ICL-Bandit has been deployed in Company X, which serves billions of users worldwide, confirming its robustness and effectiveness.
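A minimal sketch of one way to cast demonstration selection as a bandit problem, here with a standard UCB rule and a random reward standing in for observed labeling accuracy; the paper's bandit method may differ:

```python
import math
import random

def ucb_pick(pulls, rewards, t, c=1.0):
    """Pick the demonstration set with the highest upper confidence bound."""
    scores = []
    for arm in range(len(pulls)):
        if pulls[arm] == 0:
            return arm  # try every arm once first
        mean = rewards[arm] / pulls[arm]
        scores.append(mean + c * math.sqrt(math.log(t) / pulls[arm]))
    return max(range(len(scores)), key=scores.__getitem__)

demo_sets = ["demos-A", "demos-B", "demos-C"]  # candidate ICL demonstrations
pulls, rewards = [0] * len(demo_sets), [0.0] * len(demo_sets)
random.seed(0)
for t in range(1, 201):
    arm = ucb_pick(pulls, rewards, t)
    # Stand-in for the labeling accuracy observed with this demonstration set.
    reward = random.gauss([0.6, 0.7, 0.5][arm], 0.1)
    pulls[arm] += 1
    rewards[arm] += reward
print(demo_sets[max(range(len(pulls)), key=pulls.__getitem__)])  # likely demos-B
```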