Octopus: Gated Selective Attention for Memory-Bounded Long-Context Inference in Large Language Models

Chien Van Nguyen; Ryan A. Rossi; Linh Ngo Van; Franck Dernoncourt; Thien Huu Nguyen

Abstract

Transformer inference becomes increasingly memory-bound as the Key–Value (KV) cache grows linearly with sequence length. While subquadratic architectures offer constant-memory inference, they rely on aggressive state compression that degrades performance on complex reasoning tasks. We propose Octopus, a framework that confers fixed-memory inference onto pretrained Transformers without the information loss of linearization. Octopus retrofits attention layers with Gated Selective Attention, a learnable module that enforces an adaptive sparsity policy over the context history. By dynamically scoring KV states and retaining only the high-utility ones, this mechanism replaces the unbounded cache with a compact, evolving memory of fixed budget that filters out uninformative entries. Empirically, on the GSM8K benchmark, Octopus outperforms state-of-the-art linearized baselines by over 36 points under identical memory constraints. Remarkably, Octopus also surpasses its own full-cache teacher, suggesting that learned sparse retention serves as an effective regularizer for long-horizon reasoning.
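The retention idea in the abstract can be illustrated with a toy fixed-budget cache. This is only a minimal sketch under stated assumptions, not the paper's method: the class `BoundedKVCache`, the helper `run_demo`, the lowest-score eviction rule, and the hand-picked utility scores are all hypothetical stand-ins for whatever the learned gate actually produces, and real key and value tensors are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BoundedKVCache:
    """Toy cache holding at most `budget` entries (hypothetical, for illustration)."""

    budget: int
    # Each entry is (token position, utility score). In a real model the score
    # would come from a learned gate; here it is supplied by the caller.
    entries: List[Tuple[int, float]] = field(default_factory=list)

    def append(self, position: int, score: float) -> None:
        """Add one entry, then evict the lowest-scored entry if over budget."""
        self.entries.append((position, score))
        if len(self.entries) > self.budget:
            # min() returns the first minimal index, so ties evict the oldest entry.
            victim = min(range(len(self.entries)), key=lambda i: self.entries[i][1])
            del self.entries[victim]

    def positions(self) -> List[int]:
        """Return retained token positions in their original order."""
        return [pos for pos, _ in self.entries]


def run_demo() -> List[int]:
    """Stream six tokens with made-up scores through a budget-3 cache."""
    cache = BoundedKVCache(budget=3)
    scores = [0.9, 0.1, 0.5, 0.7, 0.2, 0.8]  # assumed gate outputs
    for pos, score in enumerate(scores):
        cache.append(pos, score)
    return cache.positions()


if __name__ == "__main__":
    print(run_demo())  # [0, 3, 5]
```

With a budget of 3, memory stays constant no matter how many tokens are streamed, and only the three highest-scored positions survive. How the scores are learned, and whether eviction is per-token or batched, is not specified in the abstract.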

Anthology ID: 2026.acl-long.1631
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 35311–35323
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1631/
Cite (ACL): Chien Van Nguyen, Ryan A. Rossi, Linh Ngo Van, Franck Dernoncourt, and Thien Huu Nguyen. 2026. Octopus: Gated Selective Attention for Memory-Bounded Long-Context Inference in Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 35311–35323, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Octopus: Gated Selective Attention for Memory-Bounded Long-Context Inference in Large Language Models (Van Nguyen et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1631.pdf
Checklist: 2026.acl-long.1631.checklist.pdf
