VANE: Guiding High-Value Exploration in RLVR via Outcome-Process Novelty Shaping

Xu He, Jialiang Guo, Fucheng Xiong, Haodong Zhao, Xingyang Li, Ke Zeng, Xunliang Cai


Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) frequently suffers from mode collapse because its feedback signals are inherently sparse. Strategies such as entropy regularization introduce randomness but lack directionality, and simply adding diversity rewards is one-sided: it fails to identify potential logical errors or hallucinations. To address these limitations, we propose VANE (Value-Aligned Novelty Exploration), a method that simultaneously quantifies novelty in the outcome space (via reward or solution divergence) and in the semantic process space (via semantic process divergence). VANE further employs a value-alignment mechanism that symmetrically amplifies scarce, high-quality solutions while explicitly penalizing diverse yet erroneous reasoning paths. Extensive experiments on models such as Qwen2.5-Math-7B across eight benchmarks, encompassing both large-scale mathematical reasoning and out-of-distribution (OOD) tasks, demonstrate the effectiveness and generalization of the proposed method.
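The abstract does not give VANE's formulas, so the following is only a minimal sketch of the general idea it describes: score each sampled response for novelty in both the outcome space and the process space, then let correctness decide the sign of the novelty term. Everything here is an assumption made for illustration, not the authors' implementation: the function names `cosine` and `shaped_rewards`, the weight `alpha`, the use of answer frequency as outcome novelty, the use of mean cosine similarity between reasoning embeddings as process novelty, and the equal averaging of the two.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def shaped_rewards(answers, correct, embeddings, alpha=0.5):
    """Hypothetical value-aligned novelty shaping over one group of samples.

    answers    -- final answer string of each sampled response
    correct    -- verifier verdict (True/False) for each response
    embeddings -- one vector per response summarizing its reasoning trace
    alpha      -- assumed weight of the novelty term
    """
    n = len(answers)
    counts = Counter(answers)
    shaped = []
    for i in range(n):
        # Outcome novelty (assumed): rarer final answers score higher.
        outcome_nov = 1.0 - counts[answers[i]] / n
        # Process novelty (assumed): low similarity to the other traces.
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        process_nov = 1.0 - (sum(sims) / len(sims) if sims else 0.0)
        novelty = 0.5 * (outcome_nov + process_nov)
        # Value alignment (assumed form): novelty is a bonus for correct
        # responses and a penalty for incorrect ones.
        base = 1.0 if correct[i] else 0.0
        sign = 1.0 if correct[i] else -1.0
        shaped.append(base + sign * alpha * novelty)
    return shaped


answers = ["42", "42", "42", "7"]
correct = [True, True, True, False]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
rewards = shaped_rewards(answers, correct, embeddings)
```

In this toy group, the third response is correct and follows a reasoning direction unlike the others, so it receives the largest shaped reward, while the fourth response, which is novel in outcome but wrong, is pushed below zero. A real system would need an actual embedding model and a tuned weighting, neither of which is specified in the abstract.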
Anthology ID:
2026.findings-acl.1434
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
28721–28739
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1434/
Cite (ACL):
Xu He, Jialiang Guo, Fucheng Xiong, Haodong Zhao, Xingyang Li, Ke Zeng, and Xunliang Cai. 2026. VANE: Guiding High-Value Exploration in RLVR via Outcome-Process Novelty Shaping. In Findings of the Association for Computational Linguistics: ACL 2026, pages 28721–28739, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
VANE: Guiding High-Value Exploration in RLVR via Outcome-Process Novelty Shaping (He et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1434.pdf
Checklist:
 2026.findings-acl.1434.checklist.pdf