Zahidul Islam



2025

AdaptMerge: Inference Time Adaptive Visual and Language-Guided Token Merging for Efficient Large Multimodal Models
Zahidul Islam | Mrigank Rochan
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent advances in Large Multimodal Models (LMMs) have showcased impressive visual understanding and vision-language reasoning capabilities, yet their computational cost hinders practical deployment, especially in resource-constrained settings. A key bottleneck is the large number of visual tokens generated by their vision encoders, which increases latency and memory demands. Existing token reduction methods often require costly fine-tuning or apply fixed token reduction ratios, ignoring image complexity and vision-language interactions. We propose AdaptMerge, a training-free, inference-time token merging strategy that adaptively reduces visual tokens by leveraging feature diversity and language-guided relevance. By dynamically adjusting to image complexity and ensuring multimodal coherence, AdaptMerge significantly lowers floating-point operations while improving performance. Extensive experiments on Google’s latest Gemma 3 models (4B and 12B parameters) across four challenging benchmarks demonstrate that AdaptMerge outperforms state-of-the-art token reduction techniques, achieving both reduced computational costs and improved performance, thereby providing a practical pathway to more efficient LMMs.
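
The abstract only sketches the idea at a high level. The snippet below is a minimal, hypothetical illustration of inference-time token merging of that flavor, not the paper's actual AdaptMerge algorithm: the function name `adaptive_token_merge`, the mean-pairwise-similarity diversity heuristic, and the rule of merging low-relevance tokens into their nearest kept token are all assumptions made for illustration.

```python
# Illustrative sketch only -- NOT the authors' AdaptMerge implementation.
# Demonstrates the general recipe from the abstract: training-free,
# inference-time merging of visual tokens, with a merge ratio that adapts
# to feature diversity and a relevance score guided by the text query.
import torch
import torch.nn.functional as F

def adaptive_token_merge(visual_tokens, text_embedding, max_merge_ratio=0.5):
    """visual_tokens: (N, D); text_embedding: (D,). Returns (M, D), M <= N."""
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_embedding, dim=-1)

    # Language-guided relevance: cosine similarity of each visual token
    # to the text query embedding.
    relevance = v @ t                                    # (N,)

    # Diversity heuristic (assumption): mean pairwise token similarity.
    # Homogeneous images (high similarity, low diversity) tolerate more
    # merging than visually complex ones.
    sim = v @ v.T                                        # (N, N)
    n = sim.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)].mean()
    merge_ratio = max_merge_ratio * off_diag.clamp(0.0, 1.0).item()
    num_merge = int(n * merge_ratio)
    if num_merge == 0:
        return visual_tokens

    # Keep the most text-relevant tokens; merge each dropped token into
    # its most similar kept token by averaging.
    order = relevance.argsort(descending=True)
    keep_idx, drop_idx = order[: n - num_merge], order[n - num_merge :]
    merged = visual_tokens[keep_idx].clone()
    counts = torch.ones(len(keep_idx), 1)
    nearest = sim[drop_idx][:, keep_idx].argmax(dim=-1)  # (num_merge,)
    for d, k in zip(drop_idx, nearest):
        merged[k] += visual_tokens[d]
        counts[k] += 1
    return merged / counts

# Toy usage: 576 visual tokens of width 1152 (a SigLIP-style encoder shape).
tokens = torch.randn(576, 1152)
query = torch.randn(1152)
print(tokens.shape, "->", adaptive_token_merge(tokens, query).shape)
```

Because the merge ratio is computed per image at inference time, a simple scene sheds many tokens while a cluttered one keeps most of them, which matches the abstract's claim of adapting to image complexity without any fine-tuning.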