Jaehyun Jeon
2025
Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation
Yejin Choi
|
Jaewoo Park
|
Janghan Yoon
|
Saejin Kim
|
Jaehyun Jeon
|
Youngjae Yu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Rapid advances in Multimodal Large Language Models (MLLMs) have extended information retrieval beyond text, enabling access to complex real-world documents that combine both textual and visual content. However, most documents are private, either owned by individuals or confined within corporate silos, and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Unlike earlier multimodal retrievers that embed entire documents as a single vector, PREMIR leverages preQs, decomposed from documents into finer token-level representations across modalities, enabling richer contextual understanding. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings, outperforming strong baselines across all metrics. We confirm the contribution of each component through in-depth ablation studies, and qualitative analyses of the generated preQs further highlight the framework’s robustness in real-world settings.
2024
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!
Jiwan Chung
|
Seungwon Lim
|
Jaehyun Jeon
|
Seungbeen Lee
|
Youngjae Yu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Humans possess multimodal literacy, allowing them to actively integrate information from various modalities to form reasoning. Faced with challenges like lexical ambiguity in text, we supplement this with other modalities, such as thumbnail images or textbook illustrations. Is it possible for machines to achieve a similar multimodal understanding capability?In response, we present Understanding Pun with Image Explanations (UNPIE), a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Puns serve as the ideal subject for this evaluation due to their intrinsic ambiguity. Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings. We pose three multimodal challenges with the annotations to assess different aspects of multimodal literacy; Pun Grounding, Disambiguation, and Reconstruction. The results indicate that various Socratic Models and Visual-Language Models improve over the text-only models when given visual context, particularly as the complexity of the tasks increases.
Search
Fix author
Co-authors
- Youngjae Yu 2
- Yejin Choi 1
- Jiwan Chung 1
- Saejin Kim 1
- Seungbeen Lee 1
- show all...