Jaewoo Park
2026
GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance
Junhyeok Kim | Jaewoo Park | Junhee Park | Sangeyl Lee | Jiwan Chung | Jisung Kim | Ji Hoon Joung | Youngjae Yu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Safe and independent navigation remains a major challenge for people affected by blindness and low vision (BLV), a population of over 2.2 billion individuals worldwide. Although multimodal large language models (MLLMs) offer new opportunities for assistive navigation, progress has been limited by the scarcity of accessibility-aware datasets, which require labor-intensive expert annotation. To this end, we introduce GuideDog, a novel dataset containing 22K image-description pairs (2K human-verified) capturing real-world pedestrian scenes across 46 countries. Our human-AI pipeline shifts annotation from generation to verification and is grounded in established BLV guidance standards from experts and research, improving scalability while maintaining quality. We also present GuideDogQA, an 818-sample benchmark evaluating object recognition and depth perception. Experiments reveal that depth perception and adherence to these standards remain challenging for current MLLMs. Code and dataset will be publicly available.
2025
Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation
Yejin Choi | Jaewoo Park | Janghan Yoon | Saejin Kim | Jaehyun Jeon | Youngjae Yu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Rapid advances in Multimodal Large Language Models (MLLMs) have extended information retrieval beyond text, enabling access to complex real-world documents that combine textual and visual content. However, most documents are private, either owned by individuals or confined within corporate silos, and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Unlike earlier multimodal retrievers that embed entire documents as a single vector, PREMIR uses preQs, decomposed from documents into finer token-level representations across modalities, enabling richer contextual understanding. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings, outperforming strong baselines across all metrics. We confirm the contribution of each component through in-depth ablation studies, and qualitative analyses of the generated preQs further highlight the framework’s robustness in real-world settings.