DARL: Encouraging Diverse Answers for General Reasoning without Verifiers

Chongxuan Huang, Lei Lin, Xiaodong Shi, Wenping Hu, Ruiming Tang


Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated promising gains in enhancing the reasoning capabilities of large language models. However, its dependence on domain-specific verifiers significantly restricts its applicability to open and general domains. Recent efforts such as RLPR have extended RLVR to general domains, enabling training on broader datasets and achieving improvements over RLVR. A notable limitation of these methods, though, is their tendency to overfit to reference answers, which constrains the model’s ability to generate diverse outputs. This limitation is particularly pronounced in open-ended tasks such as writing, where multiple plausible answers exist. To address this, we propose DARL, a simple yet effective reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference, preserving alignment with it. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers. Extensive experiments on thirteen benchmarks demonstrate overall improvements in reasoning performance. Notably, DARL surpasses RLPR, achieving average gains of 1.3 points on six reasoning benchmarks and 9.5 points on seven general benchmarks, highlighting its effectiveness in improving both reasoning accuracy and output diversity.
Anthology ID: 2026.findings-acl.1530
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 30649–30665
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1530/
Cite (ACL): Chongxuan Huang, Lei Lin, Xiaodong Shi, Wenping Hu, and Ruiming Tang. 2026. DARL: Encouraging Diverse Answers for General Reasoning without Verifiers. In Findings of the Association for Computational Linguistics: ACL 2026, pages 30649–30665, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): DARL: Encouraging Diverse Answers for General Reasoning without Verifiers (Huang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1530.pdf
Checklist: 2026.findings-acl.1530.checklist.pdf