MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, Kiyoharu Aizawa
Abstract
Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.- Anthology ID:
- 2026.findings-eacl.284
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2026
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 5349–5370
- Language:
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.284/
- DOI:
- Cite (ACL):
- Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, and Kiyoharu Aizawa. 2026. MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5349–5370, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding (Baek et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.284.pdf