POP: Prefill-Only Pruning for Efficient Large Model Inference

Junhui He, Zhihui Fu, Jun Wang, Qingan Li


Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by high computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles of the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37× speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.
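The stage-aware schedule described in the abstract can be illustrated with a small planning sketch. This is one reading of the abstract, not the authors' implementation: the names `LayerPlan`, `pop_prefill_plan`, `full_work_fraction`, `keep_layers`, and `boundary_tokens` are hypothetical, and the sketch assumes that boundary handling means running the last prompt token(s) through every layer, which the abstract does not specify. It only counts which tokens each layer would process in full and which would receive KV projections alone, so that every layer's cache still covers the whole prompt before decoding starts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerPlan:
    layer: int
    full_tokens: int      # tokens that run the whole layer (attention + MLP)
    kv_only_tokens: int   # tokens that only get K/V entries written to the cache


def pop_prefill_plan(num_layers: int, keep_layers: int, prompt_len: int,
                     boundary_tokens: int = 1) -> List[LayerPlan]:
    """Toy prefill schedule for a POP-style strategy (illustrative assumption).

    Layers below `keep_layers` process every prompt token in full.
    Deeper layers are skipped for most of the prompt and only fill their
    KV cache; the final `boundary_tokens` tokens still run through them
    (ASSUMPTION: this is how the first generated token stays accurate).
    """
    if not 0 < keep_layers <= num_layers:
        raise ValueError("keep_layers must be in 1..num_layers")
    if not 0 <= boundary_tokens <= prompt_len:
        raise ValueError("boundary_tokens must be in 0..prompt_len")

    plan = []
    for layer in range(num_layers):
        if layer < keep_layers:
            plan.append(LayerPlan(layer, prompt_len, 0))
        else:
            plan.append(LayerPlan(layer, boundary_tokens,
                                  prompt_len - boundary_tokens))
    return plan


def full_work_fraction(plan: List[LayerPlan], prompt_len: int) -> float:
    """Share of (layer, token) pairs that still run a full layer in prefill."""
    return sum(p.full_tokens for p in plan) / (len(plan) * prompt_len)


if __name__ == "__main__":
    for p in pop_prefill_plan(num_layers=4, keep_layers=2, prompt_len=10):
        print(p)
```

In this toy setting every layer ends prefill with cache entries for all prompt tokens, so the decode stage can run the full model unchanged. The fraction returned by `full_work_fraction` is a rough count of layer-token pairs, not a latency estimate, and should not be compared with the speedups reported in the paper.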
Anthology ID:
2026.findings-acl.291
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5864–5877
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.291/
Cite (ACL):
Junhui He, Zhihui Fu, Jun Wang, and Qingan Li. 2026. POP: Prefill-Only Pruning for Efficient Large Model Inference. In Findings of the Association for Computational Linguistics: ACL 2026, pages 5864–5877, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
POP: Prefill-Only Pruning for Efficient Large Model Inference (He et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.291.pdf
Checklist:
 2026.findings-acl.291.checklist.pdf