EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices
Jiyu Chen | Shuang Peng | Daxiong Luo | Fan Yang | Renshou Wu | Fangyuan Li | Xiaoxin Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), 2025
Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices due to the quadratic complexity of attention mechanisms and the growing memory demands of the Key-Value (KV) cache. Existing KV cache optimizations struggle with irreversible token eviction in long-output tasks, while alternative sequence modeling architectures prove costly to adopt within established Transformer infrastructure. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs through a trainable memory-gating module. This approach maintains full compatibility with standard Transformer architectures, requires fine-tuning only a small subset of parameters, and enables selective activation of the memory-gating module for routing between long- and short-context tasks. Experimental results show that EdgeInfinite achieves performance comparable to baseline Transformer-based LLMs on long-context benchmarks while reducing memory consumption and time to first token.
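The memory-gating idea described in the abstract can be illustrated with a short sketch: a readout from a compressed memory that summarizes past context is blended with standard local attention output through a small trainable gate, which would be the only component fine-tuned while the base Transformer stays frozen. The class names, tensor shapes, and the linear-attention-style memory readout below are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed names and shapes), not the EdgeInfinite code itself.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryGate(nn.Module):
    """Blend local attention output with a compressed-memory readout."""

    def __init__(self, head_dim: int):
        super().__init__()
        # Small trainable gate: only this module would be fine-tuned,
        # keeping the pretrained Transformer weights untouched.
        self.gate = nn.Parameter(torch.zeros(head_dim))

    def forward(self, local_out: torch.Tensor, memory_out: torch.Tensor) -> torch.Tensor:
        # local_out, memory_out: (batch, heads, seq_len, head_dim)
        g = torch.sigmoid(self.gate)          # elementwise gate in (0, 1)
        return g * memory_out + (1.0 - g) * local_out


def read_memory(q: torch.Tensor, mem: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # Hypothetical linear-attention-style readout from a compressed memory
    # matrix `mem` (head_dim x head_dim) with normalizer `z` (head_dim,).
    sigma_q = F.elu(q) + 1.0                  # positive feature map of queries
    return (sigma_q @ mem) / (sigma_q @ z.unsqueeze(-1) + 1e-6)
```

For short-context inputs, such a gate path could simply be skipped and the model would behave as the unmodified Transformer, which is one way to realize the selective activation and task routing mentioned above.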