Runsong Zhao
2026
CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
Runsong Zhao | Shilei Liu | Jiwei Tang | Langming Liu | Haibin Chen | Weidong Zhang | Yujin Yuan | Tong Xiao | JingBo Zhu | Wenbo Su | Bo Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the **Co**llaborative **Me**mory **T**ransformer (CoMeT), a novel architecture that enables LLMs to handle arbitrarily long sequences with constant memory usage and linear time complexity. Designed as an efficient, plug-in module, CoMeT can be integrated into pre-trained models with only minimal fine-tuning. It operates on sequential data chunks, using a dual-memory system to manage context: a temporary memory on a FIFO queue for recent events, and a global memory with a gated update rule for long-range dependencies. These memories then act as a dynamic soft prompt for the next chunk. The effectiveness of our approach is remarkable: a model equipped with CoMeT and fine-tuned on 32k contexts can accurately retrieve a passkey from any position within a 1M token sequence. On the SCROLLS benchmark, CoMeT surpasses other efficient methods and achieves performance comparable to a full-attention baseline on summarization tasks. Its practical effectiveness is further validated on real-world agent and user behavior QA tasks, supported by a novel layer-level pipeline parallel training strategy that enables fine-tuning on extremely long contexts. The code is available at: https://github.com/LivingFutureLab/Comet
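The dual-memory idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `summarize`, `DualMemory`, `process`, the scalar global state, and the gate formula are all hypothetical stand-ins chosen so the example runs with the standard library alone. It only shows that a bounded FIFO queue plus a gated running state keeps the memory size fixed no matter how many chunks are consumed.

```python
import math
from collections import deque


def summarize(chunk):
    # Hypothetical stand-in for a learned chunk encoder: the mean token value.
    return sum(chunk) / len(chunk)


class DualMemory:
    def __init__(self, temp_slots=2):
        # Temporary memory: a bounded FIFO queue; the oldest entry is evicted.
        self.temporary = deque(maxlen=temp_slots)
        # Global memory: a single fixed-size state (a scalar in this toy).
        self.global_state = 0.0

    def update(self, chunk):
        summary = summarize(chunk)
        self.temporary.append(summary)
        # Hypothetical gate in (0, 1); a real model would learn this value.
        gate = 1.0 / (1.0 + math.exp(-abs(summary - self.global_state)))
        # Gated update: blend the old global state with the new summary.
        self.global_state = (1.0 - gate) * self.global_state + gate * summary

    def soft_prompt(self):
        # Memory state that would be prepended to the next chunk.
        return [self.global_state, *self.temporary]


def process(tokens, chunk_size=2):
    # Consume the sequence chunk by chunk, carrying only the memory forward.
    memory = DualMemory()
    for start in range(0, len(tokens), chunk_size):
        memory.update(tokens[start:start + chunk_size])
    return memory


memory = process(list(range(10)))
print(list(memory.temporary))
print(len(memory.soft_prompt()))
```

Whether the input has 10 tokens or 1,000, `soft_prompt()` returns the same number of entries, which is the constant-memory property the abstract describes.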
Read As Human: Compressing Context via Parallelizable Close Reading and Skimming
Jiwei Tang | Shilei Liu | Zhicheng Zhang | Qingsong Lv | Runsong Zhao | Tingwei Lu | Langming Liu | Haibin Chen | Yujin Yuan | Hai-Tao Zheng | Wenbo Su | Bo Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) demonstrate exceptional capability across diverse tasks. However, their deployment in long-context scenarios is hindered by two challenges: computational inefficiency and redundant information. To address these challenges, we propose RAM (Read As HuMan), a context compression framework that adopts an adaptive hybrid reading strategy. Inspired by human reading behavior (i.e., close reading important content while skimming less relevant content), RAM partitions the context into segments and encodes them with the input query in parallel. High-relevance segments are fully retained (close reading), while low-relevance ones are compressed under query guidance into compact summary vectors (skimming). Both the explicit textual segments and the implicit summary vectors are concatenated and fed into the decoder, achieving both superior performance and natural-language interpretability. To refine the decision boundary between close reading and skimming, we further introduce a contrastive learning objective based on positive and negative query–segment pairs. Experiments demonstrate that RAM outperforms existing baselines on multiple question answering and summarization benchmarks across two backbones, while delivering up to a 12x end-to-end speedup on long inputs (average length 16K; maximum length 32K).
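The close-reading versus skimming split can be sketched as follows. This is an illustration under stated assumptions, not the RAM model: `relevance` uses simple word overlap as a hypothetical stand-in for the learned query-segment scorer, the `threshold` and `segment_len` values are arbitrary, and a skimmed segment is represented by a placeholder tuple rather than real summary vectors.

```python
def relevance(query, segment):
    # Hypothetical scorer: fraction of query words that appear in the segment.
    query_words = set(query.lower().split())
    segment_words = set(segment.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & segment_words) / len(query_words)


def read_as_human(query, context, segment_len=5, threshold=0.5):
    words = context.split()
    # Partition the context into fixed-length segments.
    segments = [
        " ".join(words[i:i + segment_len])
        for i in range(0, len(words), segment_len)
    ]
    mixed = []
    for segment in segments:
        # Each segment is scored independently of the others,
        # so this loop could run in parallel.
        if relevance(query, segment) >= threshold:
            mixed.append(("text", segment))  # close reading: keep the text
        else:
            # Skimming: placeholder for a summary vector (stores word count).
            mixed.append(("summary", len(segment.split())))
    return mixed


context = "the sky is blue today the passkey value is 42 grass is green"
result = read_as_human("passkey value", context)
for item in result:
    print(item)
```

Only the segment that mentions the query terms survives as text; the other two are reduced to placeholders, giving the mixed text-and-vector input the abstract describes.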
2025
Position IDs Matter: An Enhanced Position Layout for Efficient Context Compression in Large Language Models
Runsong Zhao | Xin Liu | Xinyu Liu | Pengcheng Huang | Chunyang Xiao | Tong Xiao | JingBo Zhu
Findings of the Association for Computational Linguistics: EMNLP 2025
Using special tokens (e.g., gist, memory, or compressed tokens) to compress context information is a common practice for large language models (LLMs). However, existing approaches often neglect that position encodings inherently induce local inductive biases in models, causing the compression process to ignore holistic contextual dependencies. We propose **Enhanced Position Layout (EPL)**, a simple yet effective method that improves the context compression capability of LLMs by adjusting only the position IDs, the numerical identifiers that specify token positions. EPL minimizes the distance between context tokens and their corresponding special tokens while maintaining the sequence order in position IDs among context tokens, special tokens, and the subsequent tokens. Integrating EPL into our best-performing context compression model yields an average improvement of 1.9 ROUGE-1 F1 on out-of-domain question answering datasets. When extended to multimodal scenarios, EPL brings an average accuracy gain of 2.6 to vision compression LLMs.
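The abstract does not spell out the exact layout, so the following is only a toy illustration of the stated goal of reducing the position-ID distance between context tokens and their special tokens. It assumes, hypothetically, that each special token is responsible for one contiguous group of context tokens. `default_layout` appends the special tokens after the context, `spread_layout` gives each one the position ID at the centre of its group, and `mean_distance` compares the two. None of these names come from the paper.

```python
def default_layout(num_context, num_special):
    # Special tokens take the position IDs that follow the context.
    return list(range(num_context, num_context + num_special))


def spread_layout(num_context, num_special):
    # Assumed alternative: each special token sits at the centre of its group.
    group = num_context // num_special
    return [j * group + group // 2 for j in range(num_special)]


def mean_distance(num_context, special_ids):
    # Average |position gap| between each context token and the special
    # token assumed to compress its group.
    group = num_context // len(special_ids)
    total = 0
    for j, special_id in enumerate(special_ids):
        for context_id in range(j * group, (j + 1) * group):
            total += abs(special_id - context_id)
    return total / num_context


appended = mean_distance(8, default_layout(8, 2))
spread = mean_distance(8, spread_layout(8, 2))
print(appended, spread)
```

With 8 context tokens and 2 special tokens, the appended layout has a mean distance of 5.0 and the spread layout 1.0, which shows how a change to position IDs alone can bring special tokens closer to the content they compress.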
2024
Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-Context Models
Xinyu Liu | Runsong Zhao | Pengcheng Huang | Chunyang Xiao | Bei Li | Jingang Wang | Tong Xiao | JingBo Zhu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Numerous recent works aim to extend the effective context length of language models, and various methods, tasks, and benchmarks exist to measure a model’s effective memory length. However, through thorough investigation, we find limitations in existing evaluations of model memory. We provide an extensive survey of these limitations and propose a new method, the forgetting curve, to measure the memorization capability of long-context models. We show that the forgetting curve is robust to the tested corpus and the experimental settings, does not rely on prompts, and can be applied to models of any size. We apply the forgetting curve to a wide variety of models, covering both transformer and RNN/SSM-based architectures. Our measurements provide empirical evidence for the effectiveness of transformer extension techniques while raising questions about the effective length of RNN/SSM-based models. We also examine the differences between our measurement and existing benchmarks, as well as popular metrics, for various models.