Reasoning-Guided Exploration for Online DPO
Zetian Hu, Shunyu Liu, Ting-En Lin, Fei Huang, Yongbin Li, Dacheng Tao
Abstract
Recent work has aimed to enhance the reasoning capabilities of language models, but these methods are often limited to domains with objectively verifiable answers. To overcome this limitation, we introduce Reasoning-Guided Exploration for Online DPO (RGE-DPO), a novel self-play framework designed to improve reasoning on general-domain data. RGE-DPO employs a dual-reward mechanism that evaluates responses along two dimensions: (1) reasoning quality, using a self-rewarding rubric that provides a structured evaluation of logical coherence, reasoning depth, and verification behaviors; and (2) response quality, using an established reward model trained for aspects such as helpfulness and correctness. These two orthogonal evaluation signals enable a comprehensive assessment of different response dimensions without conflating reasoning processes with response content. We then integrate the two signals through a weighted ranking mechanism to construct preference pairs, which ensures that responses with superior reasoning processes are preferred when response quality is comparable. Experiments demonstrate that RGE-DPO achieves substantial improvements on instruction-following benchmarks while maintaining competitive performance on verifiable academic benchmarks.
- Anthology ID:
- 2026.findings-acl.1370
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 27526–27542
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1370/
- Cite (ACL):
- Zetian Hu, Shunyu Liu, Ting-En Lin, Fei Huang, Yongbin Li, and Dacheng Tao. 2026. Reasoning-Guided Exploration for Online DPO. In Findings of the Association for Computational Linguistics: ACL 2026, pages 27526–27542, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Reasoning-Guided Exploration for Online DPO (Hu et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1370.pdf
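
The abstract describes combining a reasoning-quality signal and a response-quality signal through a weighted ranking to build preference pairs, but it does not give the weights, the tie handling, or the pair-selection rule. The following is a minimal illustrative sketch under stated assumptions, not the authors' implementation: the `Candidate` fields, the `w_response` weight of 0.7, the use of average ranks for ties, and the choice of best versus worst combined rank as the (chosen, rejected) pair are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    text: str
    reasoning_score: float  # hypothetical: score from a self-rewarding rubric
    response_score: float   # hypothetical: score from an external reward model


def ranks(scores: List[float]) -> List[float]:
    """Return ranks where 0 is best (highest score); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def build_preference_pair(
    cands: List[Candidate], w_response: float = 0.7
) -> Tuple[Candidate, Candidate]:
    """Pick (chosen, rejected) by a weighted sum of the two rank lists.

    Assumption: response quality carries the larger weight, so reasoning
    quality mainly separates candidates whose response ranks are close.
    """
    if len(cands) < 2:
        raise ValueError("need at least two candidates to form a pair")
    resp = ranks([c.response_score for c in cands])
    reas = ranks([c.reasoning_score for c in cands])
    combined = [w_response * a + (1 - w_response) * b for a, b in zip(resp, reas)]
    best = min(range(len(cands)), key=lambda i: combined[i])
    worst = max(range(len(cands)), key=lambda i: combined[i])
    return cands[best], cands[worst]
```

In this sketch, two candidates with equal response scores receive the same response rank, so the one with the higher reasoning score gets the better combined rank, which mirrors the behavior the abstract describes for comparable response quality.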