Abstract
Teaching small-scale language models to perform math reasoning is a valuable yet challenging task. Besides obtaining labeled data from human experts, one of the most common ways to collect high-quality data is to sample from a larger, more powerful language model. Although previous work has demonstrated the effectiveness of this method, such a knowledge distillation paradigm can be costly and unstable, especially given that many large language models, such as GPT-4, are closed-source, proprietary, and unpredictable in their behavior. In this work, to avoid relying on outputs from large models, we demonstrate that the reasoning abilities of small-scale language models can be enhanced through self-training, in which models are trained on their own outputs. We also show that vanilla self-training can be further augmented by an alignment algorithm, direct preference optimization (DPO). We empirically find that models trained with the DPO objective produce better generations that largely benefit multi-turn self-training. Our experiments show that our models outperform state-of-the-art models of comparable size on a series of downstream math reasoning tasks with minimal resource requirements.
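For reference, the DPO objective mentioned in the abstract is commonly written as follows. This is the standard formulation from Rafailov et al. (2023), not quoted from the paper itself; the notation (trained policy \(\pi_\theta\), frozen reference model \(\pi_{\mathrm{ref}}\), preferred and dispreferred responses \(y_w\) and \(y_l\), and temperature \(\beta\)) is assumed for illustration.
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
\]
In a self-training setting, the preference pairs \((y_w, y_l)\) would plausibly be the model's own sampled chains of thought, ranked by answer correctness, though the exact construction is described in the full paper.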
- Anthology ID: 2024.acl-long.643
- Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month: August
- Year: 2024
- Address: Bangkok, Thailand
- Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 11917–11928
- URL: https://aclanthology.org/2024.acl-long.643
- DOI: 10.18653/v1/2024.acl-long.643
- Cite (ACL): Tianduo Wang, Shichen Li, and Wei Lu. 2024. Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11917–11928, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal): Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning (Wang et al., ACL 2024)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.acl-long.643.pdf