Reward Mixology: Crafting Hybrid Signals for Reinforcement Learning Driven In-Context Learning
Changshuo Zhang, Ang Gao, Xiao Zhang, Yong Liu, Deyang Li, Fangchao Liu, Xinyu Zhang
Abstract
In-context learning (ICL) performance heavily relies on the quality and ordering of demonstrations. Iterative selection (IS) is a promising approach to address this issue, but existing IS methods face two key challenges: the oversimplification of process reward signals that guide intermediate steps (often using single-dimensional metrics) and the lack of outcome reward signals that directly optimize final-task accuracy (relying solely on binary terminal feedback such as correct/incorrect predictions). To address these issues, we propose a reinforcement learning method, R-Mix, which models iterative demonstration selection as a Markov Decision Process (MDP) and crafts hybrid reward signals that combine outcome-based accuracy signals (i.e., outcome rewards) with process-oriented signals (i.e., process rewards) such as stepwise influence and label entropy improvement. Our analysis reveals a relationship between outcome rewards and process rewards that is positive overall but involves a trade-off, underscoring the importance of both components for effective policy optimization. We further introduce a dual-head policy architecture that explicitly decouples input-semantic relevance from label-content compatibility. Experiments across NLP benchmarks demonstrate superior performance over state-of-the-art methods, with ablation studies validating the necessity of both reward components and the architectural disentanglement. Our work offers a deeper exploration of the potential of ICL through demonstration selection.
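As a rough illustration of the hybrid reward idea described in the abstract, the sketch below blends a binary outcome reward with two process-style signals (a stepwise influence gain and a label-entropy improvement). The function names, weights, and exact signal definitions are illustrative assumptions, not the paper's reported formulation.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy of the label distribution over the selected demonstrations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def hybrid_reward(outcome_correct, influence_gain, labels_before, labels_after,
                  w_outcome=1.0, w_influence=0.5, w_entropy=0.5):
    """Blend a terminal outcome reward with process rewards for one selection step.

    The weights and the specific process signals (influence gain, entropy
    improvement) are assumptions for illustration, not the paper's exact recipe.
    """
    r_outcome = 1.0 if outcome_correct else 0.0  # binary accuracy signal
    r_entropy = label_entropy(labels_after) - label_entropy(labels_before)  # label diversity gain
    r_process = w_influence * influence_gain + w_entropy * r_entropy
    return w_outcome * r_outcome + r_process

# Example: the newly selected demonstration balances the label distribution
# and the final prediction turns out correct.
reward = hybrid_reward(
    outcome_correct=True,
    influence_gain=0.12,  # hypothetical stepwise influence estimate
    labels_before=["pos", "pos", "pos"],
    labels_after=["pos", "pos", "pos", "neg"],
)
print(round(reward, 3))
```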
- Anthology ID:
- 2025.findings-emnlp.234
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4373–4383
- URL:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.234/
- DOI:
- 10.18653/v1/2025.findings-emnlp.234
- Cite (ACL):
- Changshuo Zhang, Ang Gao, Xiao Zhang, Yong Liu, Deyang Li, Fangchao Liu, and Xinyu Zhang. 2025. Reward Mixology: Crafting Hybrid Signals for Reinforcement Learning Driven In-Context Learning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 4373–4383, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Reward Mixology: Crafting Hybrid Signals for Reinforcement Learning Driven In-Context Learning (Zhang et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.234.pdf