Yifeng Gao
2026
Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling
Yujie Chen | Tailai Chen | Yifeng Gao | Zoe Wanying He | Yijue Xu | Shaobo Wang | Linfeng Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yujie Chen | Tailai Chen | Yifeng Gao | Zoe Wanying He | Yijue Xu | Shaobo Wang | Linfeng Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixing points, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git .
2025
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Zichen Wen | Yifeng Gao | Weijia Li | Conghui He | Linfeng Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Zichen Wen | Yifeng Gao | Weijia Li | Conghui He | Linfeng Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, abundant works have been proposed to solve this problem with token pruning, which identifies the redundant tokens in MLLMs and then prunes them to reduce the computation and KV storage costs, leading to significant acceleration without training. While these methods claim efficiency gains, critical questions about their fundamental design and evaluation remain unanswered: Why do many existing approaches underperform even compared to naive random token selection? Are attention-based scoring sufficient for reliably identifying redundant tokens? Is language information really helpful during token pruning? What makes a good trade-off between token importance and duplication? Are current evaluation protocols comprehensive and unbiased? The ignorance of previous research on these problems hinders the long-term development of token pruning. In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods. Codes are available in the supplementary materials.
Stop Looking for “Important Tokens” in Multimodal Language Models: Duplication Matters More
Zichen Wen | Yifeng Gao | Shaobo Wang | Junyuan Zhang | Qintong Zhang | Weijia Li | Conghui He | Linfeng Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Zichen Wen | Yifeng Gao | Shaobo Wang | Junyuan Zhang | Qintong Zhang | Weijia Li | Conghui He | Linfeng Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Vision tokens in multimodal large language models often dominate huge computational overhead due to their excessive length compared to linguistic modality. Abundant recent methods aim to solve this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that the importance is not an ideal indicator to decide whether a token should be pruned. Surprisingly, it usually results in inferior performance than random token pruning and leading to incompatibility to efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on its duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% vision tokens while maintaining comparable performance, leading to a 1.99× and 2.99× speed-up in total time and prefilling stage, respectively, with good compatibility to efficient attention operators.