@inproceedings{ren-etal-2023-context,
    title = "Context Compression for Auto-regressive Transformers with Sentinel Tokens",
    author = "Ren, Siyu  and
      Jia, Qi  and
      Zhu, Kenny",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2023.emnlp-main.794/",
    doi = "10.18653/v1/2023.emnlp-main.794",
    pages = "12860--12867",
    abstract = "The quadratic complexity of the attention module makes it gradually become the bulk of compute in Transformer-based LLMs during generation. Moreover, the excessive key-value cache that arises when dealing with long inputs also brings severe issues on memory footprint and inference latency. In this work, we propose a plug-and-play approach that is able to incrementally compress the intermediate activation of a specified span of tokens into compact ones, thereby reducing both memory and computational cost when processing subsequent context. Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach over sparse attention baselines in terms of fluency, n-gram matching, and semantic similarity. At last, we comprehensively profile the benefit of context compression on improving the system throughout. Code is available at \url{https://github.com/DRSY/KV_Compression}."
}

Markdown (Informal)
[Context Compression for Auto-regressive Transformers with Sentinel Tokens](https://preview.aclanthology.org/ingest-emnlp/2023.emnlp-main.794/) (Ren et al., EMNLP 2023)