VANE: Guiding High-Value Exploration in RLVR via Outcome-Process Novelty Shaping
Xu He, Jialiang Guo, Fucheng Xiong, Haodong Zhao, Xingyang li, Ke Zeng, Xunliang Cai
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) frequently suffers from mode collapse due to the inherent sparsity of feedback signals. While strategies such as entropy regularization introduce randomness, they lack directionality. Simply incorporating diversity rewards is overly one-sided and fails to identify potential logical errors or hallucinations. To address these limitations, we propose VANE (Value-Aligned Novelty Exploration), a method that simultaneously quantifies novelty across the outcome space (via reward or solution divergence) and the semantic process space (via semantic process divergence). Moreover, VANE employs a value-alignment mechanism that symmetrically amplifies scarce, high-quality solutions while explicitly penalizing diverse yet erroneous reasoning paths. Extensive experiments on models such as Qwen2.5-Math-7B across eight benchmarks, encompassing both large-scale mathematical reasoning and out-of-distribution (OOD) tasks, demonstrate the effectiveness and generalization of the proposed method.
- Anthology ID:
- 2026.findings-acl.1434
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 28721–28739
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1434/
- Cite (ACL):
- Xu He, Jialiang Guo, Fucheng Xiong, Haodong Zhao, Xingyang li, Ke Zeng, and Xunliang Cai. 2026. VANE: Guiding High-Value Exploration in RLVR via Outcome-Process Novelty Shaping. In Findings of the Association for Computational Linguistics: ACL 2026, pages 28721–28739, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- VANE: Guiding High-Value Exploration in RLVR via Outcome-Process Novelty Shaping (He et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1434.pdf