NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Wei Liu, Siya Qi, Xinyu Wang, Chen Qian, Yali Du, Yulan He


Abstract
Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely from the final-answer part of a language model's output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding, where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER (NO-VERifier Reinforcement Learning), a general reinforcement learning framework that requires only standard supervised fine-tuning data and no external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms same-size models distilled from large reasoning models such as DeepSeek R1 671B by 7.7%. Moreover, NOVER's flexibility opens up new ways of optimizing large language models, such as inverse incentive training.
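As a minimal illustration of the verifier-based reward scheme the abstract contrasts against, the sketch below computes a rule-based reward solely from the final-answer span of a completion, in the style of R1-Zero incentive training. The tag format, extraction regex, and exact-match check are illustrative assumptions, not the paper's implementation; NOVER's contribution is precisely to remove the need for such an external check.

```python
# Sketch of a rule-based, verifier-style reward: only the final-answer
# span contributes to the reward. The <answer>...</answer> tag convention
# and exact-match comparison are assumptions for illustration.
import re

def extract_answer(completion: str) -> str | None:
    """Pull the final-answer span out of a completion that wraps its
    answer in <answer>...</answer> tags (a common R1-Zero-style format)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else None

def verifier_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the extracted final answer exactly matches the
    reference, else 0.0. Intermediate reasoning tokens receive no direct
    reward, which is what incentivizes the model to discover useful
    reasoning steps on its own."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0  # malformed output: no parsable answer span
    return 1.0 if answer == reference.strip() else 0.0

# The reward ignores everything except the final answer span.
out = "<think>17 + 25 = 42</think><answer>42</answer>"
print(verifier_reward(out, "42"))  # 1.0
```

Exact-match checks of this kind are only reliable in domains like mathematics and coding; NOVER dispenses with them, computing its training signal from standard supervised fine-tuning data alone.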
Anthology ID:
2025.emnlp-main.378
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7450–7469
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.378/
Cite (ACL):
Wei Liu, Siya Qi, Xinyu Wang, Chen Qian, Yali Du, and Yulan He. 2025. NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7450–7469, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning (Liu et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.378.pdf
Checklist:
2025.emnlp-main.378.checklist.pdf