Kai Chen - ACL Anthology

This page is part of a temporary preview of a proposed change that may be incomplete or contain mistakes. It is not official and will be removed when the change is merged or abandoned.

Kai Chen

Other people with similar names: Kai Chen, Kai Chen, Kai Chen, Kai Chen, Kai Chen, Kai Chen, Kai Chen

Unverified author pages with similar names: Kai Chen

2026

OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
Xiaozhe Li | Jixuan Chen | Xinyu Fang | Shengyuan Ding | Haodong Duan | Qingwen Liu | Kai Chen
Findings of the Association for Computational Linguistics: ACL 2026

Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and tool use. However, the fundamental cognitive faculties essential for problem-solving—perception, reasoning, and memory—remain the stable core of intelligence. Unlike memorizing specific patterns, humans succeed in novel environments by applying these intrinsic faculties to adapt and optimize. Yet, whether LLMs possess this essential capacity—namely, the ability to continuously refine solutions in response to dynamic environmental feedback—remains underexplored. To address this challenge, we introduce OPT-BENCH, a benchmark for evaluating self-improvement capabilities in large-scale search spaces. By combining 20 machine learning tasks with 10 classic NP-hard problems, OPT-BENCH provides a rigorous setting to assess whether agents can adapt through intrinsic self-reflection rather than rote tool application. We further propose OPT-Agent, a framework that emulates human-like cognitive adaptation. It operates via a general perception–memory–reasoning loop, iteratively refining solutions based on environmental feedback. Through extensive experiments on 19 LLMs from 7 model families, including reasoning models, general models, and open-source models ranging from 3B to 235B parameters, we demonstrate stronger models are more effective at leveraging feedback signals for self-improvement. However, this upper-bound adaptability remains fundamentally constrained by the models’ base capacity, and even the most advanced LLMs still fall short of human expert performance.

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
Xiaozhe Li | Xinyu Fang | Shengyuan Ding | Yang Li | Linyang Li | Haodong Duan | Qingwen Liu | Kai Chen
Findings of the Association for Computational Linguistics: ACL 2026

Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic and puzzles. However, existing benchmarks evaluate only correctness, overlooking optimality—the ability to find the best solutions under constraints. We propose , the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility (Success Rate) and quality (Quality Ratio); and quality-aware rewards enabling continuous improvement beyond binary correctness. Training on Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o (29.6% SR, 14.6% QR). Beyond optimization, training on transfers to diverse tasks: mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction-following (+6.1%). Our analysis reveals quality-aware rewards improve solutions by 28.8% over binary rewards, and task diversity drives generalization more than data quantity—offering insights into RLVR scaling for complex reasoning.

SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation
Shuang Cheng | Yihan Bian | Dawei Liu | Yuhua Jiang | Yihao Liu | Linfeng Zhang | Qian Yao | Zhongbo Tian | Wenhai Wang | Qipeng Guo | Kai Chen | Biqing Qi | Bowen Zhou
Findings of the Association for Computational Linguistics: ACL 2026

Autoregressive (AR) language modeling remains the dominant paradigm due to its dense supervision signal and highly optimized serving infrastructure, but its strictly causal, token-by-token decoding limits parallelism and non-causal modeling. While masked diffusion offers a promising path toward parallel generation, it faces two critical bottlenecks: training inefficiency stemming from sparse masked objectives, and high latency caused by iterative whole-sequence denoising. We present a systematic study of blockwise discrete diffusion, a pragmatic middle ground that preserves AR-compatible serving while enabling parallel intra-block generation. Our study proceeds in four steps: (i) a controlled, compute- and scale-matched comparison revealing that AR is a more effective backbone for blockwise hybrids than masked diffusion objectives; (ii) a scalable conversion recipe, SDAR, validating that AR models spanning 1.7B to 30B parameters can be adapted into block diffusion models with minimal compute while preserving backbone capabilities; and (iii) a systematic characterization of decoding dynamics, which reveals a virtuous cycle where larger models enable more aggressive parallel decoding, achieving theoretical speedups over 5× and wall-clock speedups of 2.3× on H200 GPUs in latency-critical regimes; and (iv) an investigation of local non-causal modeling capabilities, showing that SDAR’s local bidirectional attention overcomes causal bottlenecks in scientific domains (e.g., chemistry) and enables robust test-time scaling. We release the full model suite, the training framework, and our inference engines for further innovation in non-autoregressive generative paradigms.

PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism AI Psychological Counselor
Qianjun Pan | Junyi Wang | Jie Zhou | Yutao Yang | Junsong Li | Kaiyin Xu | Yougen Zhou | Yihan Li | JingYuan Zhao | Qin Chen | Ningning Zhou | Kai Chen | Liang He
Findings of the Association for Computational Linguistics: ACL 2026

To develop a reliable AI for psychological assessment, we introduce PsychEval, a multi-session, multi-therapy, and highly realistic benchmark designed to address three key challenges:**1) Can we train a highly realistic AI counselor?** Realistic counseling is a longitudinal task requiring sustained memory and dynamic goal tracking. We propose a multi-session benchmark (spanning 6-10 sessions across three distinct stages) that demands critical capabilities such as memory continuity, adaptive reasoning, and longitudinal planning. The dataset is annotated with extensive professional skills, comprising over 677 meta-skills and 4577 atomic skills. **2) How to train a multi-therapy AI counselor?** While existing models often focus on a single therapy, complex cases frequently require flexible strategies among various therapies. We construct a diverse dataset covering five therapeutic modalities alongside an integrative therapy with a unified three-stage clinical framework across six core psychological topics. **3) How to systematically evaluate an AI counselor?** We establish a holistic evaluation framework with 18 therapy-specific and therapy-shared metrics across Client-Level and Counselor-Level dimensions. To We also construct over 2,000 diverse client profiles. Extensive experimental analysis fully validates the superior quality and clinical fidelity of our dataset.Our datasets and evaluation framework are anonymously available at this repository.

Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic
Yichuan Ma | Linyang Li | Yongkang Chen | Peiji Li | Xiaozhe Li | Qipeng Guo | Dahua Lin | Kai Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based definition breaks down: tool latency decouples inference time from generation length. We propose Timely Machine, redefining test-time as wall-clock time, where models dynamically adjust strategies based on time budgets. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. By varying tool latency, we find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality. Moreover, existing models fail to adapt reasoning to time budgets. We propose Timely-RL to address this gap. After cold-start supervised fine-tuning, we use reinforcement learning to enhance temporal planning. Timely-RL improves time budget awareness and consistently boosts performance across Timely-Eval. We hope our work offers a new perspective on test-time scaling for the agentic era.

Powering Verifiable Learning via Automated Evolutionary Data Synthesis
He Du | Bowen Li | Aijun Yang | Siyang He | Qipeng Guo | Kai Chen | Dacheng Tao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reliable verifiable data has become a key driver of capability gains in modern language models, enabling stable reinforcement learning with verifiable rewards and effective distillation that transfers competence across math, coding, and agentic tasks. Yet constructing generalizable synthetic verifiable data remains difficult due to hallucination-prone generation, and weak or trivial verification artifacts that fail to separate strong from weak solutions. Existing approaches often rely on task-specific heuristics or post-hoc filters that do not transfer across domains and lack a principled, universal evaluator of verifiability. In this work, we introduce an evolutionary, task-agnostic, strategy-guided, executably-checkable data synthesis framework that, from minimal seed supervision, jointly synthesizes problems, diverse candidate solutions, and verification artifacts, and iteratively discovers strategies via a consistency-based evaluator that enforces agreement between human-annotated and strategy-induced checks. This pipeline upgrades filtering into principled synthesis: it reliably assembles coherent, verifiable training instances and generalizes without domain-specific rules. Our experiments demonstrate the effectiveness of the proposed approach under both RLVR and model distillation training paradigms. The results show that training with our synthesized data yields significant improvements on both the LiveCodeBench and AgentBench-OS tasks, highlighting the robust generalization of our framework.

SciExplore: Evaluating Autonomous Agents from Scientific Navigation to Information Integration
Yinhao Tang | Youqing Fang | Yanan Sun | Wenran Liu | Weiming Zhang | Bin Liu | Kuikun Liu | Wenwei Zhang | Kai Chen
Findings of the Association for Computational Linguistics: ACL 2026

Scientific research involves complex information-seeking and reasoning workflows across heterogeneous sources. However, existing benchmarks primarily emphasize general-domain retrieval or static scientific question answering, and therefore fail to assess key capabilities required in realistic scientific research workflows. We introduce SciExplore, a benchmark designed to evaluate scientific information-seeking and reasoning capabilities of LLMs and agents. SciExplore comprises four task types covering 103 expert-curated tasks across more than ten scientific disciplines: scientific database navigation, ambiguous literature retrieval, missing reference completion, and cross-source structured knowledge synthesis, which probe progressively higher-level abilities from entity-level reasoning and document-level identification to evidence-level grounding and domain-level synthesis. We evaluate over ten state-of-the-art LLMs and autonomous agents on SciExplore, revealing substantial performance gaps with performance degrading sharply as task complexity increases and extremely low accuracy on the most challenging structured synthesis tasks. These results highlight significant limitations of current models and agents in realistic scientific information-seeking scenarios.

2025

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
Shudong Liu | Hongwei Liu | Junnan Liu | Linchen Xiao | Songyang Gao | Chengqi Lyu | Yuzhe Gu | Wenwei Zhang | Derek F. Wong | Songyang Zhang | Kai Chen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also serves as the reward model to guide LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of meta error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate evaluation protocols and reinforcement learning research.

MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
Yicheng Chen | Yining Li | Kai Hu | Ma Zerun | HaochenYe HaochenYe | Kai Chen
Findings of the Association for Computational Linguistics: ACL 2025

Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Xiangyu Zhao | Shengyuan Ding | Zicheng Zhang | Haian Huang | Maosong Cao | Weiyun Wang | Jiaqi Wang | Xinyu Fang | Wenhai Wang | Guangtao Zhai | Haodong Duan | Hua Yang | Kai Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs’ alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs’ alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities.

Training Language Models to Critique With Multi-agent Feedback
Tian Lan | Wenwei Zhang | Chengqi Lyu | Shuaibin Li | Chen Xu | Heyan Huang | Dahua Lin | Xian-Ling Mao | Kai Chen
Findings of the Association for Computational Linguistics: EMNLP 2025

Critique ability, a meta-cognitive capability of humans, presents significant challenges for LLMs to improve. While utilizing human annotation can enhance critique ability effectively, most recent works primarily rely on supervised fine-tuning (SFT) using critiques generated by a single LLM like GPT-4, which is more scalable and cost-effective.However, such model-generated critiques often suffer from inherent flaws due to the complexity of critique. Consequently, fine-tuning LLMs on these flawed critiques not only limits performance but also propagates errors into the learned model.To address this issue, we propose MultiCritique, a unified framework that leverages multi-agent feedback to improve critique ability in both the supervised fine-tuning (SFT) and reinforcement learning (RL) stages.In the SFT stage, MultiCritique aggregates high-quality multi-agent critiques through a fine-grained meta-critique mechanism. In the RL stage, preference critiques are constructed and refined by validating their contributions to revisions, thereby enhancing robustness of RL in improving critique ability.Based on MultiCritique, we construct SFT and RL datasets. Extensive experimental results on two benchmarks highlight the key benefits of our dataset, including superior quality, enhanced data efficiency, strong generalization on unseen tasks, and improvements in the general capability of LLMs.Notably, our fine-tuned 7B model significantly surpasses advanced 7B-13B models, approaching advanced 70B LLMs and GPT-4.Resources have been publicly available.

Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
Maosong Cao | Taolin Zhang | Mo Li | Chuyu Zhang | Yunxin Liu | Haodong Duan | Songyang Zhang | Kai Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, the availability of high-quality human-annotated SFT data has become a significant bottleneck for LLMs, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a two-stage synthetic data generation framework that incorporates World Knowledge Trees and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples achieves superior performance compared to instruct model trained with RLHF. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling of synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.

Are Your LLMs Capable of Stable Reasoning?
Junnan Liu | Hongwei Liu | Linchen Xiao | Ziyi Wang | Kuikun Liu | Songyang Gao | Wenwei Zhang | Songyang Zhang | Kai Chen
Findings of the Association for Computational Linguistics: ACL 2025

The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performances and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, especially in complex reasoning tasks where both accuracy and consistency are essential. In this paper, we introduce **G-Pass@**k, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts, quantifying both the model’s performance potential and its stability. Through extensive experiments on various public and newly constructed benchmarks, we employ G-Pass@k in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency. Our findings reveal a significant opportunity to enhance the realistic reasoning abilities of LLMs, underscoring the necessity for more robust evaluation metrics.

Scaling up the State Size of RNN LLMs for Long-Context Scenarios
Kai Liu | Jianfei Gao | Kai Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The Transformer architecture has become the standard LLM architecture due to its powerful self-attention mechanism. However, it suffers from quadratic computational complexity and linear memory complexity. RNN-based LLMs have been proposed as alternatives. Yet, RNN models struggle in long-context scenarios, making it challenging to replace self-attention with RNNs. We identify the state size as a critical bottleneck, which is significantly smaller than that of Transformers with a basic context length of 2k. However, simply increasing the state size significantly raises the number of parameters and lowers training efficiency. In this paper, we propose an efficient scaling method to scale the state size of RNN models to match the 2k context length of Transformers, with small parameters overhead. Experimental results demonstrate that scaling the state size significantly enhances long-context understanding. Retrieval performance scales almost linearly with state size, with a 454M model featuring an expanded state achieving performance comparable to a 1.47B model on FDA, a recall-intensive task. These findings highlight state scaling as a promising approach for advancing RNN-based LLMs.

SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution
Chengxing Xie | Bowen Li | Chang Gao | He Du | Wai Lam | Difan Zou | Kai Chen
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on GitHub by fixing code based on the issues reported by the users. However, many current approaches rely on proprietary LLMs, which limits reproducibility, accessibility, and transparency. The critical components of LLMs for addressing software engineering issues and how their capabilities can be effectively enhanced remain unclear. To address these challenges, we introduce SWE-Fixer, a novel open-source framework designed to effectively and efficiently resolve GitHub issues. SWE-Fixer comprises two essential modules: a code file retrieval module and a code editing module. The retrieval module employs BM25 along with a lightweight model to achieve coarse-to-fine file retrieval. Subsequently, the code editing module utilizes the other model to generate patches for the identified files. To mitigate the lack of publicly available datasets, we compile an extensive dataset that includes 110K GitHub issues along with their corresponding patches and train the two models of SWE-Fixer separately. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving competitive performance among open-source models with scores of 22.0% and 30.2%. Furthermore, SWE-Fixer reaches state-of-the-art performance (24.7% on Lite and 32.8% on Verified) with PASS_TO_PASS (P2P) filtering. Additionally, our approach requires only two model calls per instance, making it significantly more efficient than existing methods. These results highlight the effectiveness of SWE-Fixer in real-world code-fixing scenarios.We will make our model, dataset, and code publicly available at https://github.com/InternLM/SWE-Fixer.

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Yuhang Zang | Xiaoyi Dong | Pan Zhang | Yuhang Cao | Ziyu Liu | Shengyuan Ding | Shenxi Wu | Yubo Ma | Haodong Duan | Wenwei Zhang | Kai Chen | Dahua Lin | Jiaqi Wang
Findings of the Association for Computational Linguistics: ACL 2025

Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training. We integrate IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data.

Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law
Qiming Ge | Shuhao Xing | Songyang Gao | Yunhua Zhou | Yicheng Zou | Songyang Zhang | Zhi Chen | Hang Yan | Qi Zhang | Qipeng Guo | Kai Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Scaling law builds the relationship between training computation and validation loss, enabling researchers to effectively predict the loss trending of models across different levels of computation. However, a gap still remains between validation loss and the model’s downstream capabilities, making it untrivial to apply scaling law to direct performance prediction for downstream tasks. The loss typically represents a cumulative penalty for predicted tokens, which are implicitly considered to have equal importance. Nevertheless, our studies have shown evidence that when considering different training data distributions, we cannot directly model the relationship between downstream capability and computation or token loss. To bridge the gap between validation loss and downstream task capabilities, in this work, we introduce Capability Salience Vector, which decomposes the overall loss and assigns different importance weights to tokens to assess a specific meta-capability, aligning the validation loss with downstream task performance in terms of the model’s capabilities. Experiments on various popular benchmarks demonstrate that our proposed Capability Salience Vector could significantly improve the predictability of language model performance on downstream tasks.

CritiQ: Mining Data Quality Criteria from Human Preferences
Honglin Guo | Kai Lv | Qipeng Guo | Tianyi Liang | Zhiheng Xi | Demin Song | Qiuyinzhe Zhang | Yu Sun | Kai Chen | Xipeng Qiu | Tao Gui
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language model heavily depends on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, orcareful prompt engineering, which require significant expert experience and human annotation effort while introduce biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and have greater reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.2 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.

Redundancy Principles for MLLMs Benchmarks
Zicheng Zhang | Xiangyu Zhao | Xinyu Fang | Chunyi Li | Xiaohong Liu | Xiongkuo Min | Haodong Duan | Kai Chen | Guangtao Zhai
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back and critically assess the current state of redundancy and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through the comprehensive analysis over hundreds of MLLMs’ performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy lies in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively.

What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
Zhi Chen | Qiguang Chen | Libo Qin | Qipeng Guo | Haijun Lv | Yicheng Zou | Hang Yan | Kai Chen | Dahua Lin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in large language models (LLMs) with extended context windows have significantly improved various tasks. To improve long-context capabilities, much work focuses on augmenting LLM’s capabilities with synthetic data. Existing methods often leverage the Self-Instruct framework to generate long-context instruction-tuning data. However, our preliminary experiments show that fewer than 35% of samples generated by Qwen-2-72B are multi-hop, and over 40% exhibit poor quality, limiting comprehensive understanding and further research. To address this, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, which integrates a quality verification agent, a single-hop question generation agent, a multiple question sampling strategy, and a multi-hop question merger agent. This framework significantly improves data quality, with high-quality, multi-hop, and diverse data. Furthermore, we conduct a thorough analysis of document selection, question merging, and validation techniques through extensive experiments across various models. Our results demonstrate that synthetic high-quality long-context instruction data can enhance model performance, surpassing even models trained on larger amounts of human-annotated data.

UnitCoder: Scalable Code Synthesis from Pre-training Corpora
Yichuan Ma | Yunfan Shao | Peiji Li | Demin Song | Qipeng Guo | Linyang Li | Xipeng Qiu | Kai Chen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Despite the abundant sources of code data, constructing high-quality training datasets at scale poses a significant challenge. Pre-training code data typically suffers from inconsistent data quality issues. Conversely, instruction-based methods which use a high-quality subset as seed samples suffer from limited task diversity. In this paper, we introduce UnitCoder, which directly supervises pre-training data quality through automatically generated unit tests, while ensuring the correctness via an iterative fix and refine flow. Code synthesized by UnitCoder benefits from both the diversity of pre-training corpora and the high quality ensured by unit test supervision. Our experiments demonstrate that models fine-tuned on our synthetic dataset exhibit consistent performance improvements. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora, demonstrating the potential for producing diverse and high-quality post-training data at scale. All code and data will be released.

2024

Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia
Zhejian Zhou | JIayu Wang | Dahua Lin | Kai Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
Jingming Zhuo | Songyang Zhang | Xinyu Fang | Haodong Duan | Dahua Lin | Kai Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) have demonstrated impressive capabilities across various tasks, but their performance is highly sensitive to the prompts utilized. This variability poses challenges for accurate assessment and user satisfaction. Current research frequently overlooks instance-level prompt variations and their implications on subjective evaluations. To address these shortcomings, we introduce ProSA, a framework designed to evaluate and comprehend prompt sensitivity in LLMs. ProSA incorporates a novel sensitivity metric, PromptSensiScore, and leverages decoding confidence to elucidate underlying mechanisms. Our extensive study, spanning multiple tasks, uncovers that prompt sensitivity fluctuates across datasets and models, with larger models exhibiting enhanced robustness. We observe that few-shot examples can alleviate this sensitivity issue, and subjective evaluations are also susceptible to prompt sensitivities, particularly in complex, reasoning-oriented tasks. Furthermore, our findings indicate that higher model confidence correlates with increased prompt robustness. We believe this work will serve as a helpful tool in studying prompt sensitivity of LLMs. The project is released at: https://github.com/open-compass/ProSA.

LawBench: Benchmarking Legal Knowledge of Large Language Models
Zhiwei Fei | Xiaoyu Shen | Dawei Zhu | Fengzhe Zhou | Zhuo Han | Alan Huang | Songyang Zhang | Kai Chen | Zhixin Yin | Zongwen Shen | Jidong Ge | Vincent Ng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We present LawBench, the first evaluation benchmark composed of 20 tasks aimed to assess the ability of Large Language Models (LLMs) to perform Chinese legal-related tasks. LawBench is meticulously crafted to enable precise assessment of LLMs’ legal capabilities from three cognitive levels that correspond to the widely accepted Bloom’s cognitive taxonomy. Using LawBench, we present a comprehensive evaluation of 21 popular LLMs and the first comparative analysis of the empirical results in order to reveal their relative strengths and weaknesses. All data, model predictions and evaluation code are accessible from https://github.com/open-compass/LawBench.

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
Hongwei Liu | Zilong Zheng | Yuxuan Qiao | Haodong Duan | Zhiwei Fei | Fengzhe Zhou | Wenwei Zhang | Songyang Zhang | Dahua Lin | Kai Chen
Findings of the Association for Computational Linguistics: ACL 2024

Recent advancements in large language models (LLMs) have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, which fall short in providing a holistic assessment of the LLMs’ math capabilities. To address this gap, we introduce MathBench, a new benchmark that rigorously assesses the mathematical capabilities of large language models. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills. The benchmark progresses through five distinct stages, from basic arithmetic to college mathematics, and is structured to evaluate models at various depths of knowledge. Each stage includes theoretical questions and application problems, allowing us to measure a model’s mathematical proficiency and its ability to apply concepts in practical scenarios. MathBench aims to enhance the evaluation of LLMs’ mathematical abilities, providing a nuanced view of their knowledge understanding levels and problem solving skills in a bilingual context.

Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models
Zehui Chen | Kuikun Liu | Qiuchen Wang | Wenwei Zhang | Jiangning Liu | Dahua Lin | Kai Chen | Feng Zhao
Findings of the Association for Computational Linguistics: ACL 2024

Open-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks, however, they are still far inferior to API-based models when acting as agents. How to integrate agent ability into general LLMs becomes a crucial and urgent problem.This paper first delivers three key observations: (1) the current agent training corpus is entangled with both formats following and agent reasoning, which significantly shifts from the distribution of its pre-training data; (2) LLMs exhibit different learning speeds on the capabilities required by agent tasks; and (3) current approaches have side-effects when improving agent abilities by introducing hallucinations. Based on the above findings, we propose Agent-FLAN to effectively Fine-tune LANguage models for Agents.Through careful decomposition and redesign of the training corpus, Agent-FLAN enables Llama2-7B to outperform prior best works by 3.5% across various agent evaluation datasets. With comprehensively constructed negative samples, Agent-FLAN greatly alleviates the hallucination issues based on our established evaluation benchmark. Besides, it consistently improves the agent capability of LLMs when scaling model sizes while slightly enhancing the general capability of LLMs. The code and models are available at https://github.com/InternLM/Agent-FLAN.

LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
Xi Chen | Songyang Zhang | Qibing Bai | Kai Chen | Satoshi Nakamura
Findings of the Association for Computational Linguistics: ACL 2024

We introduces ***LLaST***, a framework for building high-performance Large Language model based Speech-to-text Translation systems. We address the limitations of end-to-end speech translation (E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. Our approach demonstrates superior performance on the CoVoST-2 benchmark and showcases exceptional scaling capabilities powered by LLMs.We believe this effective method will serve as a strong baseline for speech translation and provide insights for futureimprovements of the LLM-based speech translation framework.

Co-authors

Shengyuan Ding 4

Xipeng Qiu (邱锡鹏) 2

Guangtao Zhai 2

Zicheng Zhang 2

Qiguang Chen (陈麒光) 1

Yongkang Chen 1

HaochenYe HaochenYe 1

He-Yan Huang (黄河燕) 1

Shudong Liu (刘树东) 1

Jiangning Liu 1

Xian-Ling Mao 1

Satoshi Nakamura 1

Derek F. Wong (黄辉) 1

Chengxing Xie 1

Linfeng Zhang 1

Qiuyinzhe Zhang 1

Weiming Zhang 1

JingYuan Zhao 1

Ningning Zhou 1

Jingming Zhuo 1

Venues