ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training
Maryam Dialameh, Rezaul Karim, Hossein Rajabzadeh, Omar Mohamed Awad, Boxing Chen, Hyock Ju Kwon, Walid Ahmed, Yang Liu
Abstract
This paper introduces ECHO-LLaMA, an efficient variant of the LLaMA architecture designed to improve both training speed and inference throughput while maintaining learning capacity. ECHO-LLaMA converts LLaMA models to share KV caches across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher token-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.
- Anthology ID:
- 2025.emnlp-industry.156
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou (China)
- Editors:
- Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2252–2269
- URL:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-industry.156/
- DOI:
- 10.18653/v1/2025.emnlp-industry.156
- Cite (ACL):
- Maryam Dialameh, Rezaul Karim, Hossein Rajabzadeh, Omar Mohamed Awad, Boxing Chen, Hyock Ju Kwon, Walid Ahmed, and Yang Liu. 2025. ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2252–2269, Suzhou (China). Association for Computational Linguistics.
- Cite (Informal):
- ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training (Dialameh et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-industry.156.pdf
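To make the cross-layer KV sharing described in the abstract concrete, below is a minimal PyTorch-style sketch of attention layers that reuse the key/value projections computed by an earlier "producer" layer. This is an illustrative sketch only: the grouping pattern, module names, and group size are assumptions, not the paper's exact ECHO-LLaMA design or adaptation mechanism.

```python
# Minimal sketch of cross-layer KV sharing (illustrative, not the paper's exact design):
# layers in a sharing group reuse the K/V projections produced by the group's first
# layer, so only that layer pays the KV projection and cache cost.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedKVAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, produces_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # Only "producer" layers own K/V projections; consumer layers
        # reuse the producer's cached K/V tensors.
        self.produces_kv = produces_kv
        if produces_kv:
            self.k_proj = nn.Linear(d_model, d_model, bias=False)
            self.v_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_kv=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        if self.produces_kv:
            k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            shared_kv = (k, v)  # cached and reused by later layers in the group
        else:
            k, v = shared_kv  # reuse K/V from the group's producer layer
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), shared_kv


# Example: 8 layers in groups of 4; the first layer of each group produces K/V
# (hypothetical configuration, chosen only for illustration).
d_model, n_heads, n_layers, group_size = 256, 8, 8, 4
layers = nn.ModuleList(
    SharedKVAttention(d_model, n_heads, produces_kv=(i % group_size == 0))
    for i in range(n_layers)
)
x = torch.randn(2, 16, d_model)
shared_kv = None
for layer in layers:
    h, shared_kv = layer(x, shared_kv)
    x = x + h  # residual connection (MLP blocks omitted for brevity)
```

Because consumer layers skip their own K/V projections and cache entries, both the KV compute and the KV cache footprint shrink roughly in proportion to the sharing group size, which is the source of the throughput gains the abstract reports.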