DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

Hossein Entezari Zarch; Lei Gao; Chaoyi Jiang; Murali Annavaram

Abstract

Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire growing sequence. One approach to reduce this latency is to evict entries from the key-value (KV) cache, thereby reducing the active context used in attention computation. However, such sparse attention methods suffer from severe accuracy degradation on reasoning tasks due to cumulative selection errors and the evolving importance of tokens over long derivations. We present DELTA, a training-free sparse attention mechanism that improves computational efficiency without sacrificing model accuracy. DELTA partitions transformer layers into three groups: initial layers that use full attention, a small set of Δ-layers that identify salient tokens via aggregated head-level attention scores, and subsequent sparse-attention layers that attend only to the selected subset. This design preserves the full KV cache in GPU memory for accuracy, while avoiding expensive full-attention computation over many layers. On reasoning benchmarks such as AIME and GPQA-Diamond, DELTA matches or surpasses full attention in accuracy, while reducing the number of attended tokens by up to 4.25× and delivering 1.54× end-to-end speedup. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning. The code is available at https://github.com/hoenza/DELTA.
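The three-group layer schedule described in the abstract can be illustrated with a small NumPy sketch of a single decoding step. This is not the authors' implementation. The function names, the summation across heads as the aggregation rule, the fixed token budget, and the layer counts are all assumptions made for illustration; the paper and the linked repository define the actual selection rule.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(q, K):
    """q: (heads, d), K: (heads, T, d) -> per-head weights of shape (heads, T)."""
    scores = np.einsum("hd,htd->ht", q, K) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)

def select_salient_tokens(weights, budget):
    """Aggregate head-level weights (assumed: plain sum over heads) and
    return the sorted indices of the `budget` highest-scoring tokens."""
    agg = weights.sum(axis=0)                      # (T,)
    budget = min(budget, agg.shape[0])
    top = np.argpartition(agg, -budget)[-budget:]  # unordered top-k
    return np.sort(top)

def attend(q, K, V, token_idx=None):
    """One decoding-step attention; restrict to token_idx when given.
    The full K and V stay in memory; only the attended subset shrinks."""
    if token_idx is not None:
        K, V = K[:, token_idx], V[:, token_idx]
    w = attention_weights(q, K)
    return np.einsum("ht,htd->hd", w, V), w

def layer_kind(layer, n_full, delta_layers):
    """Hypothetical schedule: first n_full layers are full attention,
    layers in delta_layers re-select tokens, all others are sparse."""
    if layer < n_full:
        return "full"
    return "delta" if layer in delta_layers else "sparse"

# Toy configuration (all values are illustrative assumptions).
rng = np.random.default_rng(0)
heads, T, d, n_layers = 4, 64, 8, 6
n_full, delta_layers, budget = 2, {2}, 16

selected = None
attended = []  # number of tokens attended to at each layer
for layer in range(n_layers):
    q = rng.standard_normal((heads, d))
    K = rng.standard_normal((heads, T, d))
    V = rng.standard_normal((heads, T, d))
    kind = layer_kind(layer, n_full, delta_layers)
    if kind == "sparse":
        out, w = attend(q, K, V, selected)  # reuse the latest selection
    else:
        out, w = attend(q, K, V)            # full attention over all T tokens
        if kind == "delta":
            selected = select_salient_tokens(w, budget)
    attended.append(w.shape[1])
```

In this toy setup the first three layers attend to all 64 cached tokens and the remaining three attend to 16, which is where the reduction in attended tokens comes from. The KV cache itself is never pruned, so a later Δ-layer can still pick tokens that an earlier selection left out.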

Anthology ID: 2026.findings-acl.558
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 11502–11518
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.558/
Cite (ACL): Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, and Murali Annavaram. 2026. DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 11502–11518, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning (Zarch et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.558.pdf
Checklist: 2026.findings-acl.558.checklist.pdf
