Efficient Inference for Large Language Models – Algorithm, Model, and System

Xuefei Ning, Guohao Dai, Haoli Bai, Lu Hou, Yu Wang


Abstract
The inference of LLMs incurs high computational cost, memory access overhead, and memory usage, leading to inefficiencies in latency, throughput, power consumption, and storage. To this end, this tutorial focuses on the increasingly important topic of Efficient Inference for LLMs and aims to provide a systematic understanding of the key facts and methodologies from a designer’s perspective. We start by introducing the basic concepts of modern LLMs, software, and hardware, and then define the efficiency optimization problem. To equip the audience with a designer’s mindset, we briefly explain how to diagnose the efficiency bottlenecks of a given workload on specific hardware. After covering these basics, we present our full-stack taxonomy of efficient inference methods for LLMs. We walk through each category of methodology, using one to three representative methods as examples for each leaf subcategory, and elaborate on the design logic behind each method and the inefficiency factors it primarily addresses. Finally, we wrap up with a takeaway summary and future research directions.
Anthology ID:
2025.emnlp-tutorials.1
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Valentina Pyatkin, Andreas Vlachos
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1–3
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-tutorials.1/
Cite (ACL):
Xuefei Ning, Guohao Dai, Haoli Bai, Lu Hou, and Yu Wang. 2025. Efficient Inference for Large Language Models – Algorithm, Model, and System. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, pages 1–3, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Efficient Inference for Large Language Models – Algorithm, Model, and System (Ning et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-tutorials.1.pdf