Octopus: Gated Selective Attention for Memory-Bounded Long-Context Inference in Large Language Models

Chien Van Nguyen, Ryan A. Rossi, Linh Ngo Van, Franck Dernoncourt, Thien Huu Nguyen


Abstract
Transformer inference becomes increasingly memory-bound as the Key–Value (KV) cache grows linearly with sequence length. While subquadratic architectures offer constant-memory inference, they rely on aggressive state compression that degrades performance on complex reasoning tasks. We propose Octopus, a framework that brings fixed-memory inference to pretrained Transformers without the information loss of linearization. Octopus retrofits attention layers with Gated Selective Attention, a learnable module that enforces an adaptive sparsity policy over the context history. By dynamically scoring and retaining only high-utility KV states, this mechanism transforms the unbounded cache into a compact, evolving memory with a fixed budget that filters out uninformative states. Empirically, on the GSM8K benchmark, Octopus outperforms state-of-the-art linearized baselines by over 36 points under identical memory constraints. Remarkably, Octopus also surpasses its own full-cache teacher, demonstrating that learned sparse retention serves as an effective regularizer for long-horizon reasoning.
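The idea of scoring KV states and keeping only the most useful ones under a fixed budget can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the gate, the eviction rule, or any training details, so the names `gate_score` and `BoundedKVCache`, the sigmoid gate, and the evict-lowest-score policy are all assumptions made for illustration.

```python
import math
import random


def gate_score(key, gate_weights):
    """Hypothetical gate: sigmoid of a dot product between a key and fixed weights.

    In a real system these weights would be learned; here they are constants.
    """
    z = sum(k * w for k, w in zip(key, gate_weights))
    return 1.0 / (1.0 + math.exp(-z))


class BoundedKVCache:
    """Toy KV cache that never holds more than `budget` entries (assumed policy)."""

    def __init__(self, budget, gate_weights):
        self.budget = budget
        self.gate_weights = gate_weights
        self.entries = []  # each entry is (score, position, key, value)

    def append(self, position, key, value):
        # Score the incoming state once, when it enters the cache.
        score = gate_score(key, self.gate_weights)
        self.entries.append((score, position, key, value))
        if len(self.entries) > self.budget:
            # Over budget: evict the entry with the lowest gate score.
            worst = min(range(len(self.entries)), key=lambda i: self.entries[i][0])
            self.entries.pop(worst)

    def attend(self, query):
        # Standard scaled dot-product attention, restricted to retained entries.
        scale = math.sqrt(len(query))
        logits = [sum(q * k for q, k in zip(query, e[2])) / scale for e in self.entries]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]  # numerically stable softmax
        total = sum(weights)
        dim_v = len(self.entries[0][3])
        return [
            sum(w * e[3][j] for w, e in zip(weights, self.entries)) / total
            for j in range(dim_v)
        ]


random.seed(0)
cache = BoundedKVCache(budget=4, gate_weights=[0.5, -0.25, 0.1, 0.3])
for pos in range(20):
    key = [random.gauss(0.0, 1.0) for _ in range(4)]
    value = [random.gauss(0.0, 1.0) for _ in range(4)]
    cache.append(pos, key, value)

print(len(cache.entries))  # 4: the cache never exceeds its budget
print(cache.attend([0.1, 0.2, 0.3, 0.4]))
```

After 20 appended tokens the cache still holds only 4 entries, so memory stays constant in sequence length, while an unbounded cache would hold all 20. Evicting the single lowest-scoring entry on each overflow is the simplest policy that keeps the highest-scoring states seen so far; a learned retention policy could differ substantially.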
Anthology ID:
2026.acl-long.1631
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
35311–35323
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1631/
Cite (ACL):
Chien Van Nguyen, Ryan A. Rossi, Linh Ngo Van, Franck Dernoncourt, and Thien Huu Nguyen. 2026. Octopus: Gated Selective Attention for Memory-Bounded Long-Context Inference in Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 35311–35323, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Octopus: Gated Selective Attention for Memory-Bounded Long-Context Inference in Large Language Models (Van Nguyen et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1631.pdf
Checklist:
 2026.acl-long.1631.checklist.pdf