Offloaded Reasoning: Efficient Inference for Large Language Models via Modular Reasoning and Refinement
Ishan Jindal, Jayant Taneja, Badrinath Chandana, Vikas Kapur, Sachin Dev Sharma
Abstract
Large language models (LLMs) demonstrate strong reasoning capabilities but are expensive to run at inference time, limiting their practical deployment. We propose Offloaded Reasoning (OR), a modular strategy where a lightweight model generates intermediate reasoning traces that are then used by a larger model to produce the final answer. We further introduce Offloaded Reasoning with Refinement (ORR), where the large model first edits or improves the reasoning trace before answering. Unlike token-level acceleration methods, OR and ORR operate at the reasoning level and require no retraining of the large model. Experiments on GSM8K and Math500 show that OR achieves up to 8x faster inference than full large-model reasoning with minimal accuracy loss, while ORR recovers or exceeds full accuracy at substantially lower cost. Our results highlight the potential of modular, delegation-based reasoning for building more efficient and adaptable LLM systems.
- Anthology ID:
- 2025.findings-emnlp.393
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 7450–7458
- URL:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.393/
- DOI:
- 10.18653/v1/2025.findings-emnlp.393
- Cite (ACL):
- Ishan Jindal, Jayant Taneja, Badrinath Chandana, Vikas Kapur, and Sachin Dev Sharma. 2025. Offloaded Reasoning: Efficient Inference for Large Language Models via Modular Reasoning and Refinement. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 7450–7458, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Offloaded Reasoning: Efficient Inference for Large Language Models via Modular Reasoning and Refinement (Jindal et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.393.pdf
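The two pipelines described in the abstract can be sketched as follows. This is a minimal illustrative stub, not the authors' implementation: all function names and model behaviors below are assumptions, with simple string functions standing in for the small and large models.

```python
import re

def small_model_reason(question: str) -> str:
    """Stand-in for the lightweight model: drafts an intermediate reasoning trace."""
    return f"Step 1: read the question '{question}'. Step 2: compute 2 + 3 = 5."

def large_model_refine(question: str, trace: str) -> str:
    """Stand-in for the large model's refinement pass (ORR only).

    Here refinement just marks the trace as checked; in the paper the large
    model edits or improves the trace before producing the final answer.
    """
    return trace + " [checked]"

def large_model_answer(question: str, trace: str) -> str:
    """Stand-in for the large model answering conditioned on a given trace."""
    numbers = re.findall(r"\d+", trace)
    return numbers[-1] if numbers else ""

def offloaded_reasoning(question: str) -> str:
    # OR: the small model drafts the trace; the large model answers from it.
    trace = small_model_reason(question)
    return large_model_answer(question, trace)

def offloaded_reasoning_with_refinement(question: str) -> str:
    # ORR: the large model first refines the drafted trace, then answers.
    trace = small_model_reason(question)
    refined = large_model_refine(question, trace)
    return large_model_answer(question, refined)
```

The efficiency claim follows from this structure: the long reasoning trace is generated by the cheap model, so the expensive model only consumes (or, in ORR, lightly edits) the trace rather than generating it token by token.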