Boqi Chen
2026
Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding
Boqi Chen | Xudong Liu | Jianing Qiu
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage the object-centric attention of self-supervised Vision Transformers: by removing the most salient visual evidence, we construct an auxiliary view that disrupts visually unsupported tokens and yields a stronger contrast signal. Our method is prompt-agnostic and model-agnostic, and plugs seamlessly into the existing VCD pipeline with little computational overhead (a single additional, cacheable forward pass). Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
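The contrastive decoding step described in the abstract can be sketched as follows. This is a minimal toy illustration of the standard VCD logit combination, \((1+\alpha)\,\ell_{\text{orig}} - \alpha\,\ell_{\text{aux}}\); the function name, toy logits, and `alpha` value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def contrastive_decode(logits_orig, logits_aux, alpha=1.0):
    """Visual contrastive decoding: amplify tokens supported by the
    original image relative to an auxiliary (here, masked) view.
    Standard VCD combination: (1 + alpha) * l_orig - alpha * l_aux."""
    contrast = (1 + alpha) * logits_orig - alpha * logits_aux
    # softmax over the (toy) vocabulary
    e = np.exp(contrast - contrast.max())
    return e / e.sum()

# Toy 3-token vocabulary. A hallucinated token scores high in both
# views; a grounded token drops sharply once its evidence is masked.
l_orig = np.array([2.0, 2.0, 0.0])   # grounded, hallucinated, other
l_aux  = np.array([0.5, 2.0, 0.0])   # grounded token loses support
p = contrastive_decode(l_orig, l_aux)
# after contrasting, the grounded token (index 0) dominates
```

The intuition: masking the salient object removes the visual support for grounded tokens but leaves language-prior (hallucinated) tokens largely unchanged, so subtracting the auxiliary logits suppresses exactly the unsupported ones.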
2025
Seeing Beyond: Enhancing Visual Question Answering with Multi-Modal Retrieval
Boqi Chen | Anuj Khare | Gaurav Kumar | Arjun Akula | Pradyumna Narayana
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Multi-modal large language models (MLLMs) have made significant strides in complex content understanding and reasoning. However, they still suffer from hallucination and a lack of specific knowledge when facing challenging questions. To address these limitations, retrieval-augmented generation (RAG) has emerged as an effective solution. While incorporating knowledge has led to improvements, it also highlights the need for a more robust knowledge selection strategy. For multi-modal tasks such as visual question answering (VQA), integrating all modalities is crucial to providing comprehensive information for accurate answers. We therefore propose an encoder model that extracts a joint embedding from all modalities, aligning each query with its corresponding knowledge through contrastive learning. To further improve performance, we introduce an additional MLLM re-selection step, which selects the best-matching knowledge from the top-k results retrieved by our alignment model. We evaluate our method, SeBe-VQA, on the Encyclopedic VQA dataset. Our knowledge retrieval results demonstrate the benefit of our multi-modal framework. By incorporating the retrieved knowledge along with the question, we achieve a significant performance improvement over the previous method and over scenarios without knowledge provision.
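The retrieval stage described above (embedding alignment followed by top-k selection) can be sketched as below. This is a hedged toy example: the embeddings are hand-made vectors rather than outputs of a trained joint encoder, and the function names are illustrative; the subsequent MLLM re-selection step is not shown.

```python
import numpy as np

def normalize(x):
    """L2-normalize embeddings along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve_topk(query_emb, knowledge_embs, k=2):
    """Rank knowledge entries by cosine similarity of their joint
    embeddings to the query embedding, and return the top-k indices.
    An MLLM re-selection model would then choose among these k."""
    sims = normalize(knowledge_embs) @ normalize(query_emb)
    return np.argsort(-sims)[:k]

# Toy 3-dimensional embeddings (illustrative, not from a real encoder).
query = np.array([1.0, 0.0, 0.5])
knowledge = np.array([
    [0.9, 0.1, 0.4],   # close to the query
    [0.0, 1.0, 0.0],   # unrelated
    [1.0, 0.0, 0.6],   # closest
])
topk = retrieve_topk(query, knowledge, k=2)  # indices of the 2 best matches
```

Contrastive training would pull each query embedding toward its paired knowledge embedding and push it away from non-matching ones, so that this cosine-similarity ranking surfaces the right candidates for re-selection.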
2021
Detecting Frames in News Headlines and Lead Images in U.S. Gun Violence Coverage
Isidora Tourni | Lei Guo | Taufiq Husada Daryanto | Fabian Zhafransyah | Edward Edberg Halim | Mona Jalal | Boqi Chen | Sha Lai | Hengchang Hu | Margrit Betke | Prakash Ishwar | Derry Tanti Wijaya
Findings of the Association for Computational Linguistics: EMNLP 2021
News media structure their reporting of events or issues using certain perspectives. When describing an incident involving gun violence, for example, some journalists may focus on mental health or gun regulation, while others may emphasize the discussion of gun rights. Such perspectives are called “frames” in communication research. We study, for the first time, the value of combining lead images and their contextual information with text to identify the frame of a given news article. We observe that using multiple modes of information (article- and image-derived features) improves prediction of news frames over any single mode when the images are relevant to the frames of the headlines. We also observe that frame–image relevance is related to the ease of conveying frames via images, which we call frame concreteness. Additionally, we release the first multimodal news framing dataset related to gun violence in the U.S., curated and annotated by communication researchers. The dataset will allow researchers to further examine the use of multiple information modalities for studying media framing.