CROP: Contextual Region-Oriented Visual Token Pruning

Jiawei Guo; Feifei Zhai; Pu Jian; Qianrun Wei; Yu Zhou

Abstract

Current VLM-based VQA methods often process entire images, leading to excessive visual tokens that include redundant information irrelevant to the posed question. This abundance of unnecessary image details creates numerous visual tokens, drastically increasing memory and computational requirements in VLMs. To address this, we propose Contextual Region-Oriented Visual Token Pruning (CROP), a novel framework to compress visual tokens through a two-step process: Localization and Pruning. Specifically, CROP first employs an efficient model to identify the contextual region relevant to the input query. Subsequently, two distinct strategies are introduced for pruning: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early LLM layers guided by the identified contextual region. Extensive experiments on a wide range of VQA tasks demonstrate that CROP significantly outperforms existing visual token pruning methods and achieves state-of-the-art performance.

Anthology ID: 2025.emnlp-main.492
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 9767–9783
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.492/
Cite (ACL): Jiawei Guo, Feifei Zhai, Pu Jian, Qianrun Wei, and Yu Zhou. 2025. CROP: Contextual Region-Oriented Visual Token Pruning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9767–9783, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): CROP: Contextual Region-Oriented Visual Token Pruning (Guo et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.492.pdf
Checklist: 2025.emnlp-main.492.checklist.pdf
