Zichen Wen
2026
Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
Chenfei Liao | Wensong Wang | Zichen Wen | Xu Zheng | Yiyu Wang | Haocong He | Yuanhuiyi Lyu | Lutao Jiang | Xin Zou | Yuqian Fu | Bin Ren | Linfeng Zhang | Xuming Hu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent efforts to accelerate inference in Multimodal Large Language Models (MLLMs) have largely focused on visual token compression. The effectiveness of these methods is commonly evaluated by measuring the accuracy drop on existing MLLM benchmarks before and after compression. However, these benchmarks were originally designed to assess general perception and reasoning abilities rather than the specific challenges posed by visual token compression, leading to a fundamental task mismatch. In this work, we uncover a counterintuitive yet consistent phenomenon: simple image downsampling outperforms many advanced visual token compression methods across multiple widely used benchmarks. Through a comprehensive empirical study spanning eight popular benchmarks and multiple state-of-the-art compression techniques, we show that (i) current benchmarks contain substantial noise (task-irrelevant samples) for evaluating visual token compression, and (ii) downsampling can act as an effective data filter that distinguishes between simple and difficult samples with respect to compression sensitivity. Motivated by these findings, we propose VTC-Bench, an evaluation framework that explicitly leverages downsampling as a discriminator to denoise existing benchmarks, enabling a fairer and more meaningful assessment of visual token compression methods.
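One plausible reading of "downsampling as a discriminator" can be sketched as follows. The abstract does not specify the filtering rule, so the rule used here is an assumption: a sample counts as "simple" if the model still answers it correctly from a downsampled image, and as "difficult" otherwise. The function names `split_by_downsampling` and `accuracy` are hypothetical and not taken from VTC-Bench.

```python
def split_by_downsampling(downsample_correct):
    """Split sample indices by whether a downsampled-image baseline got them right.

    Assumption: a sample the baseline answers correctly is "simple" with respect
    to compression; one it gets wrong is "difficult".
    """
    simple = [i for i, ok in enumerate(downsample_correct) if ok]
    difficult = [i for i, ok in enumerate(downsample_correct) if not ok]
    return simple, difficult


def accuracy(method_correct, ids):
    """Accuracy of a compression method restricted to the given sample indices."""
    if not ids:
        return None  # no samples in this group
    return sum(1 for i in ids if method_correct[i]) / len(ids)


# Toy per-sample correctness flags (made up purely for illustration).
downsample_correct = [True, False, True, False]
method_correct = [True, True, False, False]

simple_ids, difficult_ids = split_by_downsampling(downsample_correct)
difficult_acc = accuracy(method_correct, difficult_ids)
```

Reporting accuracy on the difficult group separately is one way to keep samples that any cheap baseline can solve from dominating the comparison.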
StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding
Junxi Wang | Te Sun | Jiayi Zhu | Junxian Li | Haowen Xu | Zichen Wen | Xuming Hu | Zhiyu Li | Linfeng Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Vision agent memory has shown remarkable effectiveness in long-video understanding; however, storing such memory for videos incurs substantial overhead, leading to high costs in both storage and computation. To address this issue, we propose StreamMeCo, an efficient Stream Agent Memory Compression framework. Specifically, based on the connectivity of the memory graph, StreamMeCo introduces edge-free minmax sampling for isolated nodes and edge-aware weight pruning for connected nodes, evicting redundant memory nodes while maintaining accuracy. In addition, we introduce a time-decay memory retrieval mechanism to mitigate the performance degradation caused by memory compression. Extensive experiments on three challenging benchmark datasets (M3-Bench-robot, M3-Bench-web, and Video-MME-Long) demonstrate that under 70% memory graph compression, StreamMeCo achieves a 1.87× speedup in memory retrieval while delivering an average accuracy improvement of 1.0%. Our code is available at https://github.com/Celina-love-sweet/StreamMeCo.
AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems
Yulang Chen | Haoxuan Peng | Jinyan Liu | Zichen Wen | Dongrui Liu | Linfeng Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Model-based Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in complex tasks. However, manually designing optimal communication topologies is labor-intensive, while automated expansion methods often result in bloated structures with redundant agents, leading to excessive token consumption. To address this problem, we introduce AgentSlimming, a plug-and-play compression framework for graph-structured multi-agent workflows. Motivated by the AgentPruner and AgentQuant in neural networks, AgentSlimming compresses workflows by first estimating the importance score of each agent with a hybrid mechanism and then removing redundant agents or replacing them with low-cost ones; each operation is validated with a baseline-anchored acceptance rule to prevent performance collapse. Experiments show that AgentSlimming reduces average token cost by up to 78.9% with negligible performance degradation, and sometimes even improves accuracy, achieving a strong Pareto-optimal trade-off between cost and quality.
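The score-then-validate loop described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption: the function `slim_workflow`, the `tolerance` parameter, the greedy lowest-importance-first order, and the form of the acceptance rule (keep a change only if the score stays within `tolerance` of the original baseline) are hypothetical. The sketch covers agent removal only, not replacement with low-cost agents.

```python
def slim_workflow(agents, importance, evaluate, tolerance=0.01):
    """Greedily drop low-importance agents, validating each removal.

    agents:     list of agent names in the workflow
    importance: dict mapping agent name -> importance score (assumed given)
    evaluate:   callable returning a quality score for a list of agents
    tolerance:  hypothetical allowed drop relative to the baseline score
    """
    baseline = evaluate(agents)  # anchor: score of the uncompressed workflow
    current = list(agents)
    # Try the least important agents first.
    for agent in sorted(agents, key=lambda a: importance[a]):
        candidate = [a for a in current if a != agent]
        if not candidate:
            continue  # never remove the last remaining agent
        # Accept the removal only if quality stays close to the baseline.
        if evaluate(candidate) >= baseline - tolerance:
            current = candidate
    return current


# Toy workflow: quality needs both the planner and the coder (made-up scores).
agents = ["planner", "critic", "coder"]
importance = {"planner": 0.9, "critic": 0.1, "coder": 0.8}


def toy_evaluate(workflow):
    return 0.8 if "planner" in workflow and "coder" in workflow else 0.3


slimmed = slim_workflow(agents, importance, toy_evaluate)
```

Comparing against the original baseline rather than the previous step's score keeps small accepted losses from compounding across many removals.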
2025
Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning
Shaobo Wang | Xiangqi Jin | Ziming Wang | Jize Wang | Jiajun Zhang | Kaixin Li | Zichen Wen | Zhong Li | Conghui He | Xuming Hu | Linfeng Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model’s predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4× speedup.
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Zichen Wen | Yifeng Gao | Weijia Li | Conghui He | Linfeng Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Multimodal large language models (MLLMs) have shown remarkable performance in cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, many methods have been proposed to solve this problem with token pruning, which identifies the redundant tokens in MLLMs and then prunes them to reduce the computation and KV storage costs, leading to significant acceleration without training. While these methods claim efficiency gains, critical questions about their fundamental design and evaluation remain unanswered: Why do many existing approaches underperform even compared to naive random token selection? Is attention-based scoring sufficient for reliably identifying redundant tokens? Is language information really helpful during token pruning? What makes a good trade-off between token importance and duplication? Are current evaluation protocols comprehensive and unbiased? That previous research has overlooked these questions hinders the long-term development of token pruning. In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods. Code is available in the supplementary materials.
Stop Looking for “Important Tokens” in Multimodal Language Models: Duplication Matters More
Zichen Wen | Yifeng Gao | Shaobo Wang | Junyuan Zhang | Qintong Zhang | Weijia Li | Conghui He | Linfeng Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Vision tokens in multimodal large language models often account for a large share of the computational overhead because they are far more numerous than language tokens. Many recent methods aim to solve this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator for deciding whether a token should be pruned. Surprisingly, it usually yields worse performance than random token pruning and is incompatible with efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance, leading to 1.99× and 2.99× speed-ups in total time and the prefilling stage, respectively, with good compatibility with efficient attention operators.
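The pivot-then-retain step can be sketched in a few lines of plain Python. This is a minimal illustration, not DART's implementation. Two assumptions are made that the abstract does not state: duplication is measured as cosine similarity, and a token's duplication score is its maximum similarity to any pivot. The function names and the way pivots are supplied by the caller are also hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prune_by_duplication(tokens, pivot_ids, keep):
    """Keep the pivots plus the `keep` non-pivot tokens least similar to them.

    tokens:    list of feature vectors, one per vision token
    pivot_ids: indices of pivot tokens (selection strategy is left to the caller)
    keep:      number of non-pivot tokens to retain
    """
    pivot_set = set(pivot_ids)
    scored = []
    for i, tok in enumerate(tokens):
        if i in pivot_set:
            continue
        # Assumed duplication score: highest similarity to any pivot.
        dup = max(cosine(tok, tokens[p]) for p in pivot_ids)
        scored.append((dup, i))
    scored.sort()  # lowest duplication first
    kept = [i for _, i in scored[:keep]]
    return sorted(pivot_set | set(kept))  # preserve original token order


# Toy 2-D "tokens": token 1 nearly duplicates the pivot (token 0).
tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]]
retained = prune_by_duplication(tokens, pivot_ids=[0], keep=1)
```

Because the score depends only on token features and not on attention maps, a rule of this kind does not require access to attention weights, which is consistent with the compatibility with efficient attention operators that the abstract reports.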
Co-authors
- Linfeng Zhang 6
- Conghui He 3
- Xuming Hu 3
- Yifeng Gao 2
- Weijia Li 2
- Shaobo Wang 2
- Yulang Chen 1
- Yuqian Fu 1
- Haocong He 1
- Lutao Jiang 1
- Xiangqi Jin 1
- Junxian Li 1
- Zhiyu Li 1
- Kaixin Li 1
- Zhong Li 1
- Chenfei Liao 1
- Jinyan Liu 1
- Dongrui Liu 1
- Yuanhuiyi Lyu 1
- Haoxuan Peng 1
- Bin Ren 1
- Te Sun 1
- Wensong Wang 1
- Yiyu Wang 1
- Junxi Wang 1
- Ziming Wang 1
- Jize Wang 1
- Haowen Xu 1
- Jiajun Zhang 1
- Junyuan Zhang 1
- Qintong Zhang 1
- Xu Zheng 1
- Jiayi Zhu 1
- Xin Zou 1