Reward Mixology: Crafting Hybrid Signals for Reinforcement Learning Driven In-Context Learning

Changshuo Zhang, Ang Gao, Xiao Zhang, Yong Liu, Deyang Li, Fangchao Liu, Xinyu Zhang


Abstract
In-context learning (ICL) performance relies heavily on the quality and ordering of demonstrations. Iterative selection (IS) is a promising way to address this issue, but existing IS methods face two key challenges: process reward signals that guide intermediate steps are oversimplified (often reduced to single-dimensional metrics), and outcome reward signals that directly optimize final-task accuracy are missing (methods rely solely on binary terminal feedback such as correct/incorrect predictions). To address these issues, we propose R-Mix, a reinforcement learning method that models iterative demonstration selection as a Markov Decision Process (MDP) and crafts hybrid reward signals that combine outcome-based accuracy signals (i.e., outcome rewards) with process-oriented signals (i.e., process rewards) such as stepwise influence and label-entropy improvement. Our analysis reveals that outcome and process rewards are positively related yet trade off against each other, underscoring the importance of both components for effective policy optimization. We further introduce a dual-head policy architecture that explicitly decouples input-semantic relevance from label-content compatibility. Experiments across NLP benchmarks demonstrate superior performance over state-of-the-art methods, and ablation studies validate the necessity of both reward components and of the architectural disentanglement. Our work further explores the potential of ICL through effective demonstration selection.
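The hybrid reward described in the abstract can be illustrated with a minimal sketch. The function names, the weighting coefficients `alpha`/`beta`/`gamma`, and the specific form of the process signals below are hypothetical assumptions for illustration only, not the paper's actual implementation; they merely show how a binary outcome signal might be mixed with process signals such as stepwise influence and label-entropy improvement.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy of the label distribution in the current demonstration set."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def hybrid_reward(prev_demos, new_demo, influence, is_correct,
                  alpha=1.0, beta=0.5, gamma=0.5):
    """Mix an outcome reward with two process rewards (hypothetical weights).

    influence  -- stepwise influence of adding `new_demo`, e.g. the change in the
                  model's score for the gold label; assumed to be provided.
    is_correct -- whether the final prediction with the updated demo set is correct.
    """
    # Process reward 1: stepwise influence of the newly selected demonstration.
    r_influence = influence
    # Process reward 2: improvement in label-distribution entropy (label diversity).
    prev_labels = [d["label"] for d in prev_demos]
    new_labels = prev_labels + [new_demo["label"]]
    prev_entropy = label_entropy(prev_labels) if prev_labels else 0.0
    r_entropy = label_entropy(new_labels) - prev_entropy
    # Outcome reward: binary accuracy signal from the final prediction.
    r_outcome = 1.0 if is_correct else 0.0
    return alpha * r_outcome + beta * r_influence + gamma * r_entropy


# Example use (hypothetical data): score the selection of one new demonstration.
demos = [{"text": "great movie", "label": "positive"}]
candidate = {"text": "dull plot", "label": "negative"}
print(hybrid_reward(demos, candidate, influence=0.12, is_correct=True))
```

In an MDP formulation of iterative selection, such a mixed scalar would serve as the per-step reward that the selection policy is trained to maximize; the exact signals and weights used by R-Mix are specified in the paper.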
Anthology ID:
2025.findings-emnlp.234
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4373–4383
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.234/
DOI:
10.18653/v1/2025.findings-emnlp.234
Cite (ACL):
Changshuo Zhang, Ang Gao, Xiao Zhang, Yong Liu, Deyang Li, Fangchao Liu, and Xinyu Zhang. 2025. Reward Mixology: Crafting Hybrid Signals for Reinforcement Learning Driven In-Context Learning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 4373–4383, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Reward Mixology: Crafting Hybrid Signals for Reinforcement Learning Driven In-Context Learning (Zhang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.234.pdf
Checklist:
2025.findings-emnlp.234.checklist.pdf