DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration

Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng

Abstract
Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks that fail to capture the heterogeneous attention patterns models actually exhibit. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces Dynamic Attention Mask (DAM), a sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, DAM requires neither fine-tuning nor predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing memory and compute overhead. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models (LLMs) without sacrificing retrieval performance. DAM is available at: https://github.com/HanzhiZhang-Ulrica/DAM.
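To make the idea above concrete, the following is a minimal, illustrative PyTorch sketch of dynamic, per-head sparse attention masking: each head and each query row keeps only its highest-scoring keys, so the sparsity pattern adapts to the content rather than following one fixed template. The function name, the keep_ratio parameter, and the top-k selection rule are assumptions made for illustration, not the paper's actual mask-assignment algorithm; a real implementation would also avoid materializing the dense score map that this sketch computes.

```python
import torch

def dynamic_masked_attention(q, k, v, keep_ratio=0.1):
    """Attention with a content-adaptive sparse mask per head and query.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    keep_ratio: fraction of keys each query is allowed to attend to.
    Illustrative stand-in for DAM, not the paper's algorithm.
    """
    L, d = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5        # (B, H, L, L)

    # Standard causal mask for autoregressive decoding.
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))

    # Dynamic sparsity: every head / query row keeps only its top-k
    # highest-scoring keys, so each attention map gets its own mask
    # instead of one predefined pattern shared across all heads.
    k_keep = max(1, int(keep_ratio * L))
    top_idx = scores.topk(k_keep, dim=-1).indices
    keep = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, top_idx, True)
    scores = scores.masked_fill(~keep, float("-inf"))

    return torch.softmax(scores, dim=-1) @ v           # (B, H, L, head_dim)

# Toy usage: 2 sequences, 4 heads, 1024 tokens, 64-dim heads.
x = torch.randn(2, 4, 1024, 64)
out = dynamic_masked_attention(x, x, x, keep_ratio=0.05)
print(out.shape)  # torch.Size([2, 4, 1024, 64])
```

Because the diagonal score is always finite, each query row retains at least one valid key after top-k selection, so the softmax stays well defined.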
Anthology ID: 2025.findings-acl.242
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4663–4676
URL: https://preview.aclanthology.org/landing_page/2025.findings-acl.242/
Cite (ACL):
Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, and Yunhe Feng. 2025. DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4663–4676, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration (Zhang et al., Findings 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.findings-acl.242.pdf