LLMs on a Budget? Say HOLA

Zohaib Hasan Siddiqui; Jiechao Gao; Ebad Shabbir; Mohammad Anas Azeez; Rafiq Ali; Gautam Siddharth Kashyap; Usman Naseem

Abstract

Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands—posing a barrier for real-time applications in industries like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and Retrieval-Augmented Generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with Lo-Bi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: +17.6% EMA on GSM8K, +10.5% MCA on ARC, and reduced latency and memory on edge devices like Jetson Nano—proving both scalable and production-ready. Our code is available at: https://github.com/zohaibhasan066/HOLA_Codebase

Anthology ID: 2025.emnlp-industry.71
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month: November
Year: 2025
Address: Suzhou (China)
Editors: Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 1035–1043
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.71/
Cite (ACL): Zohaib Hasan Siddiqui, Jiechao Gao, Ebad Shabbir, Mohammad Anas Azeez, Rafiq Ali, Gautam Siddharth Kashyap, and Usman Naseem. 2025. LLMs on a Budget? Say HOLA. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1035–1043, Suzhou (China). Association for Computational Linguistics.
Cite (Informal): LLMs on a Budget? Say HOLA (Siddiqui et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.71.pdf
