Jingqing Ruan
2026
PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
Ai Jian | Jingqing Ruan | Xing Ma | Dailin Li | Weipeng Zhang | Ke Zeng | Xunliang Cai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. Generative reward models (GRMs) provide greater interpretability than traditional scalar RMs, but they come with a critical trade-off: pairwise methods are hindered by a training-inference mismatch, while pointwise methods require expensive absolute annotations. To bridge this gap, we propose the Preference-aware Task-adaptive Reward Model (PaTaRM). Unlike prior approaches, PaTaRM enables robust pointwise training using readily available pairwise data via a novel Preference-Aware Reward (PAR) mechanism, eliminating the need for explicit rating labels. Furthermore, it incorporates a task-adaptive rubric system that dynamically generates instance-specific criteria for precise evaluation. Extensive experiments demonstrate that PaTaRM achieves an average relative improvement of 8.7% over the corresponding base models on RewardBench and RMBench across the Qwen3-8B and Qwen3-14B backbones. Crucially, when used as a reward model for downstream RLHF, it yields an average relative improvement of 13.6% over the corresponding base policies on IFEval and InfoBench, validating its effectiveness for policy alignment. Our code, data, and checkpoints are available at https://huggingface.co/AIJian/PaTaRM
Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
Xiaoyun Zhang | Xiaojian Yuan | Di Huang | Wang You | Chen Hu | Jingqing Ruan | Kejiang Chen | Xing Hu
Findings of the Association for Computational Linguistics: ACL 2026
Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability. Code is available at https://anonymous.4open.science/r/AER-ACL.
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
FengXian Dong | Zhi Zheng | Xiao Han | Wei Chen | Jingqing Ruan | Tong Xu | Yong Chen | Enhong Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automated feature generation extracts informative features from raw tabular data without manual intervention and is crucial for accurate, generalizable machine learning. Traditional methods rely on predefined operator libraries and cannot leverage task semantics, limiting their ability to produce diverse, high-value features for complex tasks. Recent Large Language Model (LLM)-based approaches introduce richer semantic signals, but still suffer from a restricted feature space due to fixed generation patterns and from the absence of feedback from the learning objective. To address these challenges, we propose a Memory-Augmented LLM-based Multi-Agent System (MALMAS) for automated feature generation. MALMAS decomposes the generation process into agents with distinct responsibilities, and a Router Agent activates an appropriate subset of agents per iteration, further broadening exploration of the feature space. We further integrate a memory module comprising procedural memory, feedback memory, and conceptual memory, enabling iterative refinement that adaptively guides subsequent feature generation and improves feature quality and diversity. Extensive experiments on multiple public datasets against state-of-the-art baselines demonstrate the effectiveness of our approach.
2025
When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning
Xiaoyun Zhang | Jingqing Ruan | Xing Ma | Yawen Zhu | Haodong Zhao | Hao Li | Jiansong Chen | Ke Zeng | Xunliang Cai
Findings of the Association for Computational Linguistics: EMNLP 2025
Large reasoning models (LRMs) achieve remarkable performance via long reasoning chains, but often incur excessive computational overhead due to redundant reasoning, especially on simple tasks. In this work, we systematically quantify the upper bounds of LRMs under both Long-Thinking and No-Thinking modes, and uncover the phenomenon of “Internal Self-Recovery Mechanism” where models implicitly supplement reasoning during answer generation. Building on this insight, we propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. By introducing accuracy-aware length reward regulation, ASRR adaptively allocates reasoning effort according to problem difficulty, achieving high efficiency with negligible performance sacrifice. Experiments across multiple benchmarks and models show that, compared with GRPO, ASRR reduces reasoning budget by up to 32.5% (1.5B) and 25.7% (7B) with minimal accuracy loss (1.2% and 0.6% pass@1), and significantly boosts harmless rates on safety benchmarks (up to +21.7%). Our results highlight the potential of ASRR for enabling efficient, adaptive, and safer reasoning in LRMs.
AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models
Qi Liu | Jingqing Ruan | Hao Li | Haodong Zhao | Desheng Wang | Jiansong Chen | Wan Guanglu | Xunliang Cai | Zhi Zheng | Tong Xu
Findings of the Association for Computational Linguistics: ACL 2025
Existing multi-objective preference alignment methods for large language models (LLMs) face two limitations: (1) an inability to effectively balance various preference dimensions, and (2) a reliance on auxiliary reward/reference models, which introduces computational complexity. To address these challenges, we propose Adaptive Multi-objective Preference Optimization (AMoPO), a novel framework that achieves dynamic balance across preference dimensions. By introducing the multi-objective optimization paradigm to use the dimension-aware generation metrics as implicit rewards, AMoPO aligns LLMs with diverse preferences without additional reward models or reference models. We introduce an adaptive weight assignment mechanism that models the generation space as a Gaussian distribution, allowing dynamic prioritization of preference dimensions. Empirical results demonstrate that AMoPO outperforms state-of-the-art baselines by 28.5%, and the experiments on 7B, 14B, and 32B models reveal the scaling ability of AMoPO. Moreover, additional analysis of multiple dimensions verifies its adaptability and effectiveness. These findings validate AMoPO’s capability to achieve dimension-aware preference alignment, highlighting its superiority. Our code and datasets are available at https://github.com/Javkonline/AMoPO.
A Reasoner for Real-World Event Detection: Scaling Reinforcement Learning via Adaptive Perplexity-Aware Sampling Strategy
Xiaoyun Zhang | Jingqing Ruan | Xing Ma | Yawen Zhu | Jiansong Chen | Ke Zeng | Xunliang Cai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Detecting abnormal events in real-world customer service dialogues is highly challenging due to the complexity of business data and the dynamic nature of customer interactions. Moreover, models must demonstrate strong out-of-domain (OOD) generalization to enable rapid adaptation across different business scenarios and maximize commercial value. In this work, we propose a novel Adaptive Perplexity-Aware Reinforcement Learning (APARL) framework that leverages the advanced reasoning capabilities of large language models for abnormal event detection. APARL introduces a dual-loop dynamic curriculum learning architecture, enabling the model to progressively focus on more challenging samples as its proficiency increases. This design effectively addresses performance bottlenecks and significantly enhances OOD transferability. Extensive evaluations on food delivery dialogue tasks show that our model achieves significantly enhanced adaptability and robustness, attaining the highest F1 score with an average improvement of 17.19%, and an average improvement of 9.59% in OOD transfer tests. This method provides a superior solution for industrial deployment of anomaly detection models, contributing to improved operational efficiency and commercial benefits.
2024
TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Industry Systems
Yilun Kong | Jingqing Ruan | YiHong Chen | Bin Zhang | Tianpeng Bao | Shi Shiwei | du Guo Qing | Xiaoru Hu | Hangyu Mao | Ziyue Li | Xingyu Zeng | Rui Zhao | Xueqian Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large Language Models (LLMs) have demonstrated proficiency in addressing tasks that necessitate a combination of task planning and the usage of external tools, such as weather and calculator APIs. However, real-world industrial systems present prevalent challenges in task planning and tool usage: numerous APIs in the real system make it intricate to invoke the appropriate one, while the inherent limitations of LLMs pose challenges in orchestrating an accurate sub-task sequence and API-calling order. This paper introduces a comprehensive framework aimed at enhancing the Task Planning and Tool Usage (TPTU) abilities of LLM-based agents in industry. Our framework comprises three key components designed to address these challenges: (1) the API Retriever selects the most pertinent APIs among the extensive API set; (2) the Demo Selector retrieves task-level demonstrations, which are then used for in-context learning to aid LLMs in accurately decomposing subtasks and effectively invoking hard-to-distinguish APIs; (3) the LLM Finetuner tunes a base LLM to enhance its capability for task planning and API calling. We validate our methods using a real-world industry system and an open-sourced academic dataset, demonstrating the efficacy of each individual component as well as the integrated framework. The code is available here.
Co-authors
- Xunliang Cai 4
- Jiansong Chen 3
- Ke Zeng 3
- Xiaoyun Zhang 3
- Hao Li 2
- Xing Ma 2
- Tong Xu 2
- Haodong Zhao 2
- Zhi Zheng 2
- Yawen Zhu 2
- Tianpeng Bao 1
- Yihong Chen 1
- Kejiang Chen 1
- Wei Chen 1
- Yong Chen 1
- Enhong Chen 1
- FengXian Dong 1
- Wan Guanglu 1
- Xiao Han 1
- Xiaoru Hu 1
- Chen Hu 1
- Xing Hu 1
- Di Huang 1
- Ai Jian 1
- Yilun Kong 1
- Dailin Li 1
- Ziyue Li 1
- Qi Liu 1
- Xing Ma 1
- Hangyu Mao 1
- du Guo Qing 1
- Shi Shiwei 1
- Xueqian Wang 1
- Desheng Wang 1
- Wang You 1
- Xiaojian Yuan 1
- Xingyu Zeng 1
- Weipeng Zhang 1
- Bin Zhang 1
- Rui Zhao 1