Octopus: Gated Selective Attention for Memory-Bounded Long-Context Inference in Large Language Models
Chien Van Nguyen, Ryan A. Rossi, Linh Ngo Van, Franck Dernoncourt, Thien Huu Nguyen
Abstract
Transformer inference becomes increasingly memory-bound as the Key–Value (KV) cache grows linearly with sequence length. While subquadratic architectures offer constant-memory inference, they rely on aggressive state compression that degrades performance on complex reasoning tasks. We propose Octopus, a framework that confers fixed-memory inference onto pretrained Transformers without the information loss of linearization. Octopus retrofits attention layers with Gated Selective Attention, a learnable module that enforces an adaptive sparsity policy over the context history. By dynamically scoring and retaining only high-utility KV states, this mechanism transforms the unbounded cache into a compact, evolving memory budget that filters out uninformative noise. Empirically, on the GSM8K benchmark, it outperforms state-of-the-art linearized baselines by over 36 points under identical memory constraints. Remarkably, Octopus also surpasses its own full-cache teacher, demonstrating that learned sparse retention serves as an effective regularizer for long-horizon reasoning.
- Anthology ID:
- 2026.acl-long.1631
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 35311–35323
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1631/
- Cite (ACL):
- Chien Van Nguyen, Ryan A. Rossi, Linh Ngo Van, Franck Dernoncourt, and Thien Huu Nguyen. 2026. Octopus: Gated Selective Attention for Memory-Bounded Long-Context Inference in Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 35311–35323, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Octopus: Gated Selective Attention for Memory-Bounded Long-Context Inference in Large Language Models (Van Nguyen et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1631.pdf
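
As a rough illustration of the retention idea in the abstract (score each KV state and keep only the highest-utility ones within a fixed budget), the following Python sketch may help. It is not the authors' implementation: the class `BoundedKVCache`, its methods, and the scalar `score` argument are hypothetical names introduced here, and the score is simply passed in by the caller, whereas the abstract describes a learnable gating module. The eviction rule (drop the lowest-scoring entry, oldest first on ties) is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# One cached entry: (score, position, key vector, value vector).
Entry = Tuple[float, int, List[float], List[float]]


@dataclass
class BoundedKVCache:
    """Hypothetical fixed-budget KV cache (illustrative only).

    Holds at most `budget` entries. The utility score of each entry is
    supplied by the caller; in a real system it would come from a
    learned gate, which is not modeled here.
    """

    budget: int
    entries: List[Entry] = field(default_factory=list)

    def append(self, position: int, key: List[float],
               value: List[float], score: float) -> None:
        """Add a new KV state, then evict one entry if over budget."""
        self.entries.append((score, position, key, value))
        if len(self.entries) > self.budget:
            # Assumed policy: evict the lowest score; on ties, the oldest.
            victim = min(
                range(len(self.entries)),
                key=lambda i: (self.entries[i][0], self.entries[i][1]),
            )
            del self.entries[victim]

    def positions(self) -> List[int]:
        """Return the token positions currently retained, in order."""
        return sorted(entry[1] for entry in self.entries)


if __name__ == "__main__":
    cache = BoundedKVCache(budget=3)
    # Made-up scores for six tokens; keys and values are placeholders.
    for pos, score in enumerate([0.9, 0.1, 0.5, 0.8, 0.2, 0.7]):
        cache.append(pos, [float(pos)], [float(pos)], score)
    print(cache.positions())  # [0, 3, 5]
```

With a budget of 3 and these made-up scores, the cache never holds more than three entries at any step, so memory stays constant regardless of sequence length, and the three highest-scoring positions (0, 3, and 5) survive.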