Reward Mixology: Crafting Hybrid Signals for Reinforcement Learning Driven In-Context Learning

Changshuo Zhang, Ang Gao, Xiao Zhang, Yong Liu, Deyang Li, Fangchao Liu, Xinyu Zhang


Abstract
In-context learning (ICL) performance relies heavily on the quality and ordering of demonstrations. Iterative selection (IS) is a promising way to address this issue, but existing IS methods face two key challenges: process reward signals that guide intermediate steps are oversimplified (often reduced to single-dimensional metrics), and outcome reward signals that directly optimize final-task accuracy are missing (methods rely solely on binary terminal feedback such as correct/incorrect predictions). To address these issues, we propose R-Mix, a reinforcement learning method that models iterative demonstration selection as a Markov Decision Process (MDP) and crafts hybrid reward signals that combine outcome-based accuracy signals (i.e., outcome rewards) with process-oriented signals (i.e., process rewards) such as stepwise influence and label-entropy improvement. Our analysis reveals that outcome and process rewards are positively related yet trade off against each other, underscoring the importance of both components for effective policy optimization. We further introduce a dual-head policy architecture that explicitly decouples input-semantic relevance from label-content compatibility. Experiments across NLP benchmarks demonstrate superior performance over state-of-the-art methods, and ablation studies validate the necessity of both reward components and of the architectural disentanglement. Our work further explores the potential of ICL through effective demonstration selection.
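The hybrid reward described in the abstract can be illustrated with a minimal sketch. The function names, the weighting coefficients `alpha`/`beta`/`gamma`, and the specific form of the process signals below are hypothetical assumptions for illustration only, not the paper's actual implementation; they merely show how a binary outcome signal might be mixed with process signals such as stepwise influence and label-entropy improvement.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy of the label distribution in the current demonstration set."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def hybrid_reward(prev_demos, new_demo, influence, is_correct,
                  alpha=1.0, beta=0.5, gamma=0.5):
    """Mix an outcome reward with two process rewards (hypothetical weights).

    influence  -- stepwise influence of adding `new_demo`, e.g. the change in the
                  model's score for the gold label; assumed to be provided.
    is_correct -- whether the final prediction with the updated demo set is correct.
    """
    # Process reward 1: stepwise influence of the newly selected demonstration.
    r_influence = influence
    # Process reward 2: improvement in label-distribution entropy (label diversity).
    prev_labels = [d["label"] for d in prev_demos]
    new_labels = prev_labels + [new_demo["label"]]
    prev_entropy = label_entropy(prev_labels) if prev_labels else 0.0
    r_entropy = label_entropy(new_labels) - prev_entropy
    # Outcome reward: binary accuracy signal from the final prediction.
    r_outcome = 1.0 if is_correct else 0.0
    return alpha * r_outcome + beta * r_influence + gamma * r_entropy


# Example use (hypothetical data): score the selection of one new demonstration.
demos = [{"text": "great movie", "label": "positive"}]
candidate = {"text": "dull plot", "label": "negative"}
print(hybrid_reward(demos, candidate, influence=0.12, is_correct=True))
```

In an MDP formulation of iterative selection, such a mixed scalar would serve as the per-step reward that the selection policy is trained to maximize; the exact signals and weights used by R-Mix are specified in the paper.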
Anthology ID:
2025.findings-emnlp.234
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4373–4383
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.234/
DOI:
10.18653/v1/2025.findings-emnlp.234
Cite (ACL):
Changshuo Zhang, Ang Gao, Xiao Zhang, Yong Liu, Deyang Li, Fangchao Liu, and Xinyu Zhang. 2025. Reward Mixology: Crafting Hybrid Signals for Reinforcement Learning Driven In-Context Learning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 4373–4383, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Reward Mixology: Crafting Hybrid Signals for Reinforcement Learning Driven In-Context Learning (Zhang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.234.pdf
Checklist:
2025.findings-emnlp.234.checklist.pdf