Janghan Yoon
2026
Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding
Jaehyun Jeon | Min Soo Kim | Janghan Yoon | Sumin Shim | Yejin Choi | Hanbin Kim | Dae Hyun Kim | Youngjae Yu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
User interface (UI) design goes beyond visuals to shape user experience (UX), underscoring the shift toward UI/UX as a unified concept. While recent studies have explored UI evaluation using Multimodal Large Language Models (MLLMs), they largely focus on surface-level features, overlooking how design choices influence user behavior at scale. To fill this gap, we introduce WiserUI-Bench, a novel benchmark for multimodal understanding of how UI/UX design affects user behavior, built on 300 real-world UI image pairs from industry A/B tests, with empirically validated winners that induced more user actions. For future design progress in practice, post-hoc understanding of why such winners succeed with mass users is also required; we support this via expert-curated key interpretations for each instance. Experiments across multiple MLLMs on WiserUI-Bench for two main tasks, (1) predicting the more effective UI image between an A/B-tested pair, and (2) explaining it post-hoc in alignment with expert interpretations, show that models exhibit limited understanding of the behavioral impact of UI/UX design. We believe our work will foster research on leveraging MLLMs for visual design in user behavior contexts.
2025
C2: Scalable Auto-Feedback for LLM-based Chart Generation
Woosung Koh | Janghan Yoon | MinHyung Lee | Youngjin Song | Jaegwan Cho | Jaehyun Kang | Taehyeon Kim | Se-Young Yun | Youngjae Yu | Bongshin Lee
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation
Yejin Choi | Jaewoo Park | Janghan Yoon | Saejin Kim | Jaehyun Jeon | Youngjae Yu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Rapid advances in Multimodal Large Language Models (MLLMs) have extended information retrieval beyond text, enabling access to complex real-world documents that combine both textual and visual content. However, most documents are private, either owned by individuals or confined within corporate silos, and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Unlike earlier multimodal retrievers that embed entire documents as a single vector, PREMIR leverages preQs, decomposed from documents into finer token-level representations across modalities, enabling richer contextual understanding. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings, outperforming strong baselines across all metrics. We confirm the contribution of each component through in-depth ablation studies, and qualitative analyses of the generated preQs further highlight the framework’s robustness in real-world settings.
Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?
Jiwan Chung | Janghan Yoon | Junhyeong Park | Sangeyl Lee | Joowon Yang | Sooyeon Park | Youngjae Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Any-to-any generative models aim to enable seamless interpretation and generation across multiple modalities within a unified framework, yet their ability to preserve relationships across modalities remains uncertain. Do unified models truly achieve cross-modal coherence, or is this coherence merely perceived? To explore this, we introduce ACON, a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs to evaluate cross-modal transfers rigorously. Using three consistency criteria—cyclic consistency, forward equivariance, and conjugated equivariance—our experiments reveal that any-to-any models do not consistently demonstrate greater cross-modal consistency than specialized models in pointwise evaluations such as cyclic consistency. However, equivariance evaluations uncover weak but observable consistency through structured analyses of the intermediate latent space enabled by multiple editing operations. We release our code and data at https://github.com/JiwanChung/ACON.