Task-aware Block Pruning with Output Distribution Signals for Large Language Models

Song-ha Jo, Youngrok Ko, Sang-goo Lee, Jinseok Seol


Abstract
Large language models (LLMs) provide excellent performance, but their practical deployment is limited by the substantial compute and memory demands of large models and the latency of auto-regressive decoding. To mitigate these inefficiencies, block pruning reduces the number of executed transformer blocks, effectively lowering latency while preserving architectural coherence. However, existing methods typically rely on representation similarity or computationally expensive sensitivity analyses to estimate block importance, thereby neglecting task-aware model behavior. To address this limitation, we introduce Task-aware Block Pruning (TaBP), a novel approach that directly captures task-specific inference dynamics by quantifying block-level uncertainty from the statistics of each block’s early-exited output distribution on a calibration dataset. Since output distributions reflect the model’s confidence and decision uncertainty conditioned on downstream tasks, these statistics provide a principled signal for identifying blocks that are less critical for task performance. Extensive experiments demonstrate that TaBP preserves downstream task performance while substantially reducing inference latency and computational cost, without relying on cost-heavy sensitivity analyses. To facilitate reproducibility and further research, we release our implementation of TaBP on [GitHub](https://github.com/Song-haJo/TaBP).
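The abstract describes scoring transformer blocks by statistics of their early-exited output distributions on a calibration set, then pruning the least task-critical blocks. Below is a minimal NumPy sketch of that general idea, not the paper's actual algorithm: all function names, the entropy-based statistic, and the difference-based importance proxy are illustrative assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, axis=-1):
    # Shannon entropy of each token-level distribution (nats).
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def block_uncertainty_scores(hidden_states, lm_head):
    """Mean early-exit entropy per block on calibration tokens.

    hidden_states: (num_blocks, num_tokens, d_model) hidden states
        captured after each block (hypothetical early-exit hook).
    lm_head: (d_model, vocab_size) shared unembedding matrix applied
        at every block to obtain an early-exited output distribution.
    """
    logits = hidden_states @ lm_head            # (blocks, tokens, vocab)
    probs = softmax(logits, axis=-1)
    return entropy(probs, axis=-1).mean(axis=1) # (blocks,)

def select_blocks_to_prune(scores, num_prune):
    """Illustrative proxy: treat blocks whose entropy barely changes
    relative to the preceding block as contributing least, and prune
    the num_prune blocks with the smallest absolute change."""
    deltas = np.abs(np.diff(scores, prepend=scores[0]))
    return np.argsort(deltas)[:num_prune]

# Toy usage on random calibration activations (shapes are arbitrary).
rng = np.random.default_rng(0)
states = rng.normal(size=(6, 10, 16))   # 6 blocks, 10 tokens, d_model=16
head = rng.normal(size=(16, 100))       # vocab_size=100
scores = block_uncertainty_scores(states, head)
pruned = select_blocks_to_prune(scores, num_prune=2)
```

The per-block entropy profile stands in for the "output distribution statistics" the abstract refers to; the paper's exact statistic and selection rule should be taken from the released implementation linked above.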
Anthology ID:
2026.findings-eacl.320
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6089–6107
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.320/
Cite (ACL):
Song-ha Jo, Youngrok Ko, Sang-goo Lee, and Jinseok Seol. 2026. Task-aware Block Pruning with Output Distribution Signals for Large Language Models. In Findings of the Association for Computational Linguistics: EACL 2026, pages 6089–6107, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Task-aware Block Pruning with Output Distribution Signals for Large Language Models (Jo et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.320.pdf
Checklist:
2026.findings-eacl.320.checklist.pdf