Jiasen Lu
2026
Reducing Token Redundancy in LVLMs: A Systematic Review of Token Pruning Methods
Hanzhang Yuan | Mengxuan Hu | Wenhao Zhang | Tianlong Wang | Zhongliang Zhou | Jiasen Lu | Sheng Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Vision-Language Models (LVLMs) excel at visual understanding but face severe computational bottlenecks when processing high-resolution images and long videos due to massive visual token counts. Token pruning mitigates this by selectively removing less informative tokens while maintaining performance. However, existing methods vary widely in pruning location (vision encoder vs. LLM decoder), importance criteria (attention vs. similarity vs. learned scores), and application strategy, and they have not been systematically compared. This survey presents the first comprehensive review of token pruning for LVLMs. We propose a taxonomy categorizing methods into vision-side, LLM-side, and hybrid paradigms, and systematically analyze token selection mechanisms and pruning strategies. We further discuss evaluation protocols and identify key challenges including prompt-adaptive pruning and hardware-aware design. Our survey provides a structured foundation for this rapidly growing research area.
2020
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Jaemin Cho | Jiasen Lu | Dustin Schwenk | Hannaneh Hajishirzi | Aniruddha Kembhavi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This raises the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family – LXMERT – finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements including: discretizing visual representations, using uniform masking with a large range of masking ratios, and aligning the right pre-training datasets to the right objectives, which together enable it to paint. X-LXMERT’s image generation capabilities rival state-of-the-art generative models while its question answering and captioning abilities remain comparable to LXMERT. Finally, we demonstrate the generality of these training refinements by adding image generation capabilities into UNITER to produce X-UNITER.
2017
ParlAI: A Dialog Research Software Platform
Alexander Miller | Will Feng | Dhruv Batra | Antoine Bordes | Adam Fisch | Jiasen Lu | Devi Parikh | Jason Weston
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We introduce ParlAI (pronounced “par-lay”), an open-source software platform for dialog research implemented in Python, available at http://parl.ai. Its goal is to provide a unified framework for sharing, training and testing dialog models; integration of Amazon Mechanical Turk for data collection, human evaluation, and online/reinforcement learning; and a repository of machine learning models for comparing with others’ models, and improving upon existing architectures. Over 20 tasks are supported in the first release, including popular datasets such as SQuAD, bAbI tasks, MCTest, WikiQA, QACNN, QADailyMail, CBT, bAbI Dialog, Ubuntu, OpenSubtitles and VQA. Several models are integrated, including neural models such as memory networks, seq2seq and attentive LSTMs.