Jeonghun Baek
2026
MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
Jeonghun Baek | Kazuki Egashira | Shota Onohara | Atsuyuki Miyai | Yuki Imajuku | Hikaru Ikuta | Kiyoharu Aizawa
Findings of the Association for Computational Linguistics: EACL 2026
Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
2025
Harnessing PDF Data for Improving Japanese Large Multimodal Models
Jeonghun Baek | Akiko Aizawa | Kiyoharu Aizawa
Findings of the Association for Computational Linguistics: ACL 2025
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 2.1% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs.
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation
Shota Onohara | Atsuyuki Miyai | Yuki Imajuku | Kazuki Egashira | Jeonghun Baek | Xiang Yue | Graham Neubig | Kiyoharu Aizawa
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)