Jingqing Ruan
2026
PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
Ai Jian | Jingqing Ruan | Xing Ma | Dailin Li | Weipeng Zhang | Ke Zeng | Xunliang Cai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. Generative reward models (GRMs) provide greater interpretability than traditional scalar RMs, but they come with a critical trade-off: pairwise methods are hindered by a training-inference mismatch, while pointwise methods require expensive absolute annotations. To bridge this gap, we propose the Preference-aware Task-adaptive Reward Model (PaTaRM). Unlike prior approaches, PaTaRM enables robust pointwise training using readily available pairwise data via a novel Preference-Aware Reward (PAR) mechanism, eliminating the need for explicit rating labels. Furthermore, it incorporates a task-adaptive rubric system that dynamically generates instance-specific criteria for precise evaluation. Extensive experiments demonstrate that PaTaRM achieves an average relative improvement of 8.7% over the corresponding base models on RewardBench and RMBench across the Qwen3-8B and Qwen3-14B backbones. Crucially, when used as a reward model for downstream RLHF, it yields an average relative improvement of 13.6% over the corresponding base policies on IFEval and InfoBench, validating its effectiveness for policy alignment. Our code, data, and checkpoints are available at https://huggingface.co/AIJian/PaTaRM
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
FengXian Dong | Zhi Zheng | Xiao Han | Wei Chen | Jingqing Ruan | Tong Xu | Yong Chen | Enhong Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automated feature generation extracts informative features from raw tabular data without manual intervention and is crucial for accurate, generalizable machine learning. Traditional methods rely on predefined operator libraries and cannot leverage task semantics, limiting their ability to produce diverse, high-value features for complex tasks. Recent Large Language Model (LLM)-based approaches introduce richer semantic signals, but still suffer from a restricted feature space due to fixed generation patterns and from the absence of feedback from the learning objective. To address these challenges, we propose a Memory-Augmented LLM-based Multi-Agent System (MALMAS) for automated feature generation. MALMAS decomposes the generation process into agents with distinct responsibilities, and a Router Agent activates an appropriate subset of agents per iteration, further broadening exploration of the feature space. We further integrate a memory module comprising procedural memory, feedback memory, and conceptual memory, enabling iterative refinement that adaptively guides subsequent feature generation and improves feature quality and diversity. Extensive experiments on multiple public datasets against state-of-the-art baselines demonstrate the effectiveness of our approach.
Harmonizing Dense and Sparse Signals in Multi-turn RL: Dual-Horizon Credit Assignment for Industrial Sales Agents
Haojin Yang | Ai Jian | Yiwei Wang | Xinyue Huang | Weipeng Zhang | Ke Zeng | Xunliang Cai | Jingqing Ruan
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Optimizing large language models for industrial sales requires balancing long-term commercial objectives (e.g., conversion rate) with immediate linguistic constraints such as fluency and compliance. Conventional reinforcement learning often merges these heterogeneous goals into a single reward, causing high-magnitude session-level rewards to overwhelm subtler turn-level signals, which leads to unstable training or reward hacking. To address this issue, we propose **Dual-Horizon Credit Assignment (DuCA)**, a framework that disentangles optimization across time scales. Its core, **Horizon-Independent Advantage Normalization (HIAN)**, separately normalizes advantages from turn-level and session-level rewards before fusion, ensuring balanced gradient contributions from both immediate and long-term objectives to the policy update. Extensive experiments with a high-fidelity user simulator show DuCA outperforms the state-of-the-art GRPO baseline, achieving a 6.82% relative improvement in conversion rate, reducing inter-sentence repetition by 82.28%, and lowering identity detection rate by 27.35%, indicating a substantial improvement for an industrial sales scenario that effectively balances the dual demands of strategic performance and naturalistic language generation.
2025
When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning
Xiaoyun Zhang | Jingqing Ruan | Xing Ma | Yawen Zhu | Haodong Zhao | Hao Li | Jiansong Chen | Ke Zeng | Xunliang Cai
Findings of the Association for Computational Linguistics: EMNLP 2025
Large reasoning models (LRMs) achieve remarkable performance via long reasoning chains, but often incur excessive computational overhead due to redundant reasoning, especially on simple tasks. In this work, we systematically quantify the upper bounds of LRMs under both Long-Thinking and No-Thinking modes, and uncover the phenomenon of “Internal Self-Recovery Mechanism” where models implicitly supplement reasoning during answer generation. Building on this insight, we propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. By introducing accuracy-aware length reward regulation, ASRR adaptively allocates reasoning effort according to problem difficulty, achieving high efficiency with negligible performance sacrifice. Experiments across multiple benchmarks and models show that, compared with GRPO, ASRR reduces reasoning budget by up to 32.5% (1.5B) and 25.7% (7B) with minimal accuracy loss (1.2% and 0.6% pass@1), and significantly boosts harmless rates on safety benchmarks (up to +21.7%). Our results highlight the potential of ASRR for enabling efficient, adaptive, and safer reasoning in LRMs.
AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models
Qi Liu | Jingqing Ruan | Hao Li | Haodong Zhao | Desheng Wang | Jiansong Chen | Wan Guanglu | Xunliang Cai | Zhi Zheng | Tong Xu
Findings of the Association for Computational Linguistics: ACL 2025
Existing multi-objective preference alignment methods for large language models (LLMs) face two limitations: (1) an inability to effectively balance various preference dimensions, and (2) a reliance on auxiliary reward/reference models that introduces computational complexity. To address these challenges, we propose Adaptive Multi-objective Preference Optimization (AMoPO), a novel framework that achieves dynamic balance across preference dimensions. By introducing the multi-objective optimization paradigm to use the dimension-aware generation metrics as implicit rewards, AMoPO aligns LLMs with diverse preferences without additional reward models or reference models. We introduce an adaptive weight assignment mechanism that models the generation space as a Gaussian distribution, allowing dynamic prioritization of preference dimensions. Empirical results demonstrate that AMoPO outperforms state-of-the-art baselines by 28.5%, and the experiments on 7B, 14B, and 32B models reveal the scaling ability of AMoPO. Moreover, additional analysis of multiple dimensions verifies its adaptability and effectiveness. These findings validate AMoPO’s capability to achieve dimension-aware preference alignment, highlighting its superiority. Our code and datasets are available at https://github.com/Javkonline/AMoPO.
A Reasoner for Real-World Event Detection: Scaling Reinforcement Learning via Adaptive Perplexity-Aware Sampling Strategy
Xiaoyun Zhang | Jingqing Ruan | Xing Ma | Yawen Zhu | Jiansong Chen | Ke Zeng | Xunliang Cai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Detecting abnormal events in real-world customer service dialogues is highly challenging due to the complexity of business data and the dynamic nature of customer interactions. Moreover, models must demonstrate strong out-of-domain (OOD) generalization to enable rapid adaptation across different business scenarios and maximize commercial value. In this work, we propose a novel Adaptive Perplexity-Aware Reinforcement Learning (APARL) framework that leverages the advanced reasoning capabilities of large language models for abnormal event detection. APARL introduces a dual-loop dynamic curriculum learning architecture, enabling the model to progressively focus on more challenging samples as its proficiency increases. This design effectively addresses performance bottlenecks and significantly enhances OOD transferability. Extensive evaluations on food delivery dialogue tasks show that our model achieves significantly enhanced adaptability and robustness, attaining the highest F1 score with an average improvement of 17.19%, and an average improvement of 9.59% in OOD transfer tests. This method provides a superior solution for industrial deployment of anomaly detection models, contributing to improved operational efficiency and commercial benefits.
2024
TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Industry Systems
Yilun Kong | Jingqing Ruan | YiHong Chen | Bin Zhang | Tianpeng Bao | Shi Shiwei | du Guo Qing | Xiaoru Hu | Hangyu Mao | Ziyue Li | Xingyu Zeng | Rui Zhao | Xueqian Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large Language Models (LLMs) have demonstrated proficiency in addressing tasks that necessitate a combination of task planning and the usage of external tools, such as weather and calculator APIs. However, real-world industrial systems present prevalent challenges in task planning and tool usage: numerous APIs in the real system make it intricate to invoke the appropriate one, while the inherent limitations of LLMs pose challenges in orchestrating an accurate sub-task sequence and API-calling order. This paper introduces a comprehensive framework aimed at enhancing the Task Planning and Tool Usage (TPTU) abilities of LLM-based agents in industry. Our framework comprises three key components designed to address these challenges: (1) the API Retriever selects the most pertinent APIs among the extensive API set; (2) the Demo Selector retrieves task-level demonstrations, which are further used for in-context learning to aid LLMs in accurately decomposing subtasks and effectively invoking hard-to-distinguish APIs; (3) the LLM Finetuner tunes a base LLM to enhance its capability for task planning and API calling. We validate our methods using a real-world industry system and an open-sourced academic dataset, demonstrating the efficacy of each individual component as well as the integrated framework. The code is available here.
Co-authors
- Xunliang Cai 5
- Ke Zeng 4
- Jiansong Chen 3
- Ai Jian 2
- Hao Li 2
- Xing Ma 2
- Tong Xu 2
- Weipeng Zhang 2
- Xiaoyun Zhang 2
- Haodong Zhao 2
- Zhi Zheng 2
- Yawen Zhu 2
- Tianpeng Bao 1
- Enhong Chen 1
- Wei Chen 1
- Yihong Chen 1
- Yong Chen 1
- FengXian Dong 1
- Wan Guanglu 1
- Xiao Han 1
- Xiaoru Hu 1
- Xinyue Huang 1
- Yilun Kong 1
- Dailin Li 1
- Ziyue Li 1
- Qi Liu 1
- Xing Ma 1
- Hangyu Mao 1
- du Guo Qing 1
- Shi Shiwei 1
- Desheng Wang 1
- Xueqian Wang 1
- Yiwei Wang 1
- Haojin Yang 1
- Xingyu Zeng 1
- Bin Zhang 1
- Rui Zhao 1