Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference
Nearchos Potamitis, Lars Henning Klein, Bardia Mohammadi, Chongyang Xu, Attreyee Mukherjee, Niket Tandon, Laurent Bindschaedler, Akhil Arora
Abstract
Inference constitutes the majority of costs throughout the lifecycle of a large language model (LLM). While numerous LLM inference engines focusing primarily on low-level optimizations have been developed, there is a scarcity of non-intrusive client-side frameworks that perform high-level optimizations. In this paper, we introduce Cache Saver, a modular, plug-and-play, and asynchronous framework that facilitates high-level inference optimizations, thereby integrating cleanly into existing systems without requiring changes to the end-user application logic or the underlying LLM. The key novelty is a *namespace-aware list-valued cache* that ensures *statistical integrity* of LLM responses by generating *i.i.d.* responses within a namespace as well as ensuring *reproducibility*. Moreover, as a direct consequence of operating at a high level, Cache Saver supports both local and online models. We conduct extensive experiments with five representative state-of-the-art reasoning strategies, five diverse benchmark tasks, and three different LLMs. On average across all methods, tasks, and LLMs, Cache Saver reduces cost by ≃ 25% and CO2 by ≃ 35%. Notably, Cache Saver excels in practical machine learning scenarios such as benchmarking across multiple methods or conducting ablation analysis of a specific method, obtaining substantial cost and carbon footprint reduction of ≃ 60%. Cache Saver is publicly available at [https://github.com/au-clan/cachesaver](https://github.com/au-clan/cachesaver).
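To make the central idea in the abstract concrete, the sketch below illustrates one plausible reading of a namespace-aware list-valued cache: responses to a prompt are cached as a growing list, a given namespace (e.g., one reasoning run) never sees the same cached response twice (so repeated samples stay i.i.d.-like within the run), while other namespaces reuse the stored list and avoid new LLM calls. This is a minimal, hypothetical Python sketch; the class name `NamespaceAwareListCache`, the `generate_fn` callback, and the namespace strings are assumptions for illustration and are not the actual Cache Saver API.

```python
from collections import defaultdict


class NamespaceAwareListCache:
    """Minimal sketch of a namespace-aware list-valued cache (illustrative only).

    Each prompt maps to a list of previously generated responses. Within a
    namespace, a cached response is served at most once, so repeated sampling
    of the same prompt still yields distinct responses; across namespaces,
    responses are reused, which is where the cost savings come from.
    """

    def __init__(self, generate_fn):
        # generate_fn(prompt) -> str stands in for the (expensive) LLM call.
        self.generate_fn = generate_fn
        self.responses = defaultdict(list)   # prompt -> list of cached responses
        self.served = defaultdict(int)       # (namespace, prompt) -> responses already served

    def sample(self, namespace: str, prompt: str) -> str:
        idx = self.served[(namespace, prompt)]
        if idx < len(self.responses[prompt]):
            # Reuse a cached response this namespace has not seen yet.
            response = self.responses[prompt][idx]
        else:
            # Cache exhausted for this namespace: call the model and extend the list.
            response = self.generate_fn(prompt)
            self.responses[prompt].append(response)
        self.served[(namespace, prompt)] += 1
        return response


if __name__ == "__main__":
    # Hypothetical usage: two runs (namespaces) sample the same prompt.
    import random

    def fake_llm(prompt):
        return f"{prompt} -> answer #{random.randint(0, 9999)}"

    cache = NamespaceAwareListCache(fake_llm)
    a1 = cache.sample("run-A", "Solve 2+2")  # fresh LLM call
    a2 = cache.sample("run-A", "Solve 2+2")  # second fresh call, distinct within run-A
    b1 = cache.sample("run-B", "Solve 2+2")  # reuses run-A's first response, no new call
    print(a1, a2, b1, sep="\n")
```

Storing a list of responses rather than a single value is what the abstract calls statistical integrity: within a namespace, resampling a prompt never silently returns an identical cached completion, while across namespaces the list is reused to cut cost and carbon footprint.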
- Anthology ID: 2025.findings-emnlp.1402
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 25703–25724
- URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1402/
- DOI: 10.18653/v1/2025.findings-emnlp.1402
- Cite (ACL): Nearchos Potamitis, Lars Henning Klein, Bardia Mohammadi, Chongyang Xu, Attreyee Mukherjee, Niket Tandon, Laurent Bindschaedler, and Akhil Arora. 2025. Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 25703–25724, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference (Potamitis et al., Findings 2025)
- PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1402.pdf