Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths
Yew Ken Chia, Guizhen Chen, Weiwen Xu, Anh Tuan Luu, Soujanya Poria, Lidong Bing
Abstract
Advanced models such as OpenAI o1 exhibit impressive problem-solving capabilities through step-by-step reasoning. However, they may still falter on more complex problems, making errors that disrupt their reasoning paths. We attribute this to the expansive solution space, where each step has the risk of diverging into mistakes. To enhance language model reasoning, we introduce a specialized training framework called Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model’s overall problem-solving performance. Reasoning Paths Optimization does not rely on large-scale human-annotated rationales or outputs from closed-source models, making it scalable and data-efficient. We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions. The experiments demonstrate that our framework significantly enhances the reasoning performance of large language models, with up to 3.1% and 4.3% improvement on GSM8K and MMLU (STEM) respectively. Our data and code can be found at https://reasoning-paths.github.io.- Anthology ID:
- 2024.findings-emnlp.977
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2024
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 16763–16780
- Language:
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2024.findings-emnlp.977/
- DOI:
- 10.18653/v1/2024.findings-emnlp.977
- Cite (ACL):
- Yew Ken Chia, Guizhen Chen, Weiwen Xu, Anh Tuan Luu, Soujanya Poria, and Lidong Bing. 2024. Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16763–16780, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths (Chia et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2024.findings-emnlp.977.pdf