Reflect, Rewrite, Repeat: How Simple Arithmetic Enables Advanced Reasoning in Small Language Models

Mengdie Flora Wang, Haochen Xie, Mun Young Kim, Baishali Chaudhury, Meghana Ashok, Suren Gunturu, Sungmin Hong, Jae Oh Woo


Abstract
Contemporary advances in language model reasoning typically require computationally intensive reinforcement learning (RL) and massive datasets, creating barriers for resource-constrained teams. In this work, we demonstrate that high-quality, iterative training on minimal data can rival modern RL approaches. We introduce a resource-efficient framework that combines Direct Preference Optimization (DPO) and Supervised Fine-Tuning (SFT) with selective guidance from larger models, iteratively refining solutions through a "reflect, rewrite, repeat" cycle (R3). Using Qwen 2.5 7B and Qwen 2.5 Math 7B as base models, our method shows meaningful performance improvements across arithmetic, symbolic, and cognitive reasoning benchmarks, including GSM8K (83.1% → 88.6%), AIME’25@10 (20.0% → 30.0%), and LastLetterConcat (40.7% → 53.3%). The model-agnostic nature of our R3 framework is further demonstrated through substantial improvements when applied to Mistral and LLaMA-based models. Remarkably, these gains are achieved using just 700 basic arithmetic training samples, in stark contrast to the hundreds of thousands of examples typically required by RL-based systems. Our results suggest that reasoning improvements need not strictly depend on large-scale data. By emphasizing strategically curated training grounded in foundational principles, we achieve competitive generalization with minimal resource overhead. Our R3 pipeline also generates high-quality SFT data with high-fidelity reasoning traces as a byproduct, further enabling scalable and annotation-free fine-tuning. Code is available at https://github.com/aws-samples/sample-for-reflect-rewrite-repeat.
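The abstract describes the R3 cycle only at a high level; the sketch below is a minimal, hypothetical illustration of that loop in Python. Every name in it (draft, reflect, rewrite) is a placeholder for a model call, not the authors' API, and the exact-arithmetic verifier stands in for the DPO/SFT training machinery and the larger guidance model that the paper actually uses.

import random

def draft(problem: str) -> str:
    """Stand-in for the small model's first attempt; sometimes wrong."""
    a, b = map(int, problem.split("+"))
    return str(a + b + random.choice([0, 0, 1]))

def reflect(problem: str, answer: str) -> bool:
    """Stand-in for reflection: verify the answer against ground truth."""
    a, b = map(int, problem.split("+"))
    return int(answer) == a + b

def rewrite(problem: str) -> str:
    """Stand-in for guided rewriting (a larger model assists in the paper)."""
    a, b = map(int, problem.split("+"))
    return str(a + b)

def r3(problem: str, max_rounds: int = 3) -> tuple[str, list[str]]:
    """Iterate reflect -> rewrite until the solution verifies.

    The accumulated trace is the kind of artifact the abstract says can be
    reused as high-quality SFT data.
    """
    answer = draft(problem)
    trace = [answer]
    for _ in range(max_rounds):
        if reflect(problem, answer):
            break
        answer = rewrite(problem)
        trace.append(answer)
    return answer, trace

if __name__ == "__main__":
    print(r3("17+25"))  # e.g. ('42', ['43', '42']) when the draft was wrong

In the actual framework, failed drafts and their corrected rewrites would supply preference pairs for DPO while verified traces feed SFT; this sketch shows only the control flow of the cycle.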
Anthology ID: 2026.findings-eacl.69
Volume: Findings of the Association for Computational Linguistics: EACL 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1341–1363
URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.69/
Cite (ACL): Mengdie Flora Wang, Haochen Xie, Mun Young Kim, Baishali Chaudhury, Meghana Ashok, Suren Gunturu, Sungmin Hong, and Jae Oh Woo. 2026. Reflect, Rewrite, Repeat: How Simple Arithmetic Enables Advanced Reasoning in Small Language Models. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1341–1363, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Reflect, Rewrite, Repeat: How Simple Arithmetic Enables Advanced Reasoning in Small Language Models (Wang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.69.pdf
Checklist: 2026.findings-eacl.69.checklist.pdf