Offloaded Reasoning: Efficient Inference for Large Language Models via Modular Reasoning and Refinement

Ishan Jindal, Jayant Taneja, Badrinath Chandana, Vikas Kapur, Sachin Dev Sharma


Abstract
Large language models (LLMs) demonstrate strong reasoning capabilities but are expensive to run at inference time, limiting their practical deployment. We propose Offloaded Reasoning (OR), a modular strategy where a lightweight model generates intermediate reasoning traces that are then used by a larger model to produce the final answer. We further introduce Offloaded Reasoning with Refinement (ORR), where the large model first edits or improves the reasoning trace before answering. Unlike token-level acceleration methods, OR and ORR operate at the reasoning level and require no retraining of the large model. Experiments on GSM8K and Math500 show that OR achieves up to 8x faster inference than full large-model reasoning with minimal accuracy loss, while ORR recovers or exceeds full accuracy at substantially lower cost. Our results highlight the potential of modular, delegation-based reasoning for building more efficient and adaptable LLM systems.
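The delegation pipeline the abstract describes can be sketched in a few lines. This is a minimal, hypothetical illustration of the control flow only: `small_model` and `large_model` stand in for real LLM calls (they are toy stubs here, not the authors' implementation), and the refine flag switches between OR (answer directly from the trace) and ORR (edit the trace before answering).

```python
# Sketch of Offloaded Reasoning (OR) and Offloaded Reasoning with
# Refinement (ORR). The two model functions are hypothetical stubs that
# stand in for calls to a small and a large LLM, respectively.

def small_model(question: str) -> str:
    """Cheap model: drafts the intermediate reasoning trace (stub)."""
    return f"Step 1: parse '{question}'. Step 2: compute the result."

def large_model(prompt: str, refine: bool = False) -> str:
    """Expensive model: answers from the trace; under ORR it first
    edits/improves the trace before producing the answer (stub)."""
    if refine:
        prompt = prompt.replace("Step 2", "Step 2 (refined)")
    return f"ANSWER based on: {prompt}"

def offloaded_reasoning(question: str, refine: bool = False) -> str:
    # 1. The lightweight model generates the long reasoning trace,
    #    which is the expensive part of chain-of-thought inference.
    trace = small_model(question)
    # 2. The large model only consumes (OR) or first refines (ORR)
    #    the trace, so its own generation stays short and cheap.
    return large_model(f"{question}\n{trace}", refine=refine)
```

Because the large model never generates the full trace itself, the bulk of decoding happens on the cheap model, which is the source of the reported speedup; ORR trades back a little of that speed for accuracy by letting the large model edit the trace first.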
Anthology ID:
2025.findings-emnlp.393
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7450–7458
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.393/
DOI:
10.18653/v1/2025.findings-emnlp.393
Cite (ACL):
Ishan Jindal, Jayant Taneja, Badrinath Chandana, Vikas Kapur, and Sachin Dev Sharma. 2025. Offloaded Reasoning: Efficient Inference for Large Language Models via Modular Reasoning and Refinement. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 7450–7458, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Offloaded Reasoning: Efficient Inference for Large Language Models via Modular Reasoning and Refinement (Jindal et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.393.pdf
Checklist:
2025.findings-emnlp.393.checklist.pdf