Exploring the Capability of Multimodal LLMs with Yonkoma Manga: The YManga Dataset and Its Challenging Tasks

Qi Yang, Jingjie Zeng, Liang Yang, Zhihao Yang, Hongfei Lin


Abstract
Yonkoma Manga, characterized by its four-panel structure, presents unique challenges due to its rich contextual information and strong sequential features. To address the limitations of current multimodal large language models (MLLMs) in understanding this type of data, we construct a novel dataset, YManga, collected from the Internet. After filtering out low-quality content, we obtain 1,015 yonkoma strips with 10,150 human annotations. We then define three challenging tasks for this dataset: panel sequence detection, generation of the author's creative intention, and description generation for masked panels. These tasks progressively increase in the complexity of understanding and utilizing such image-text data. To the best of our knowledge, YManga is the first dataset specifically designed for understanding yonkoma manga strips. Extensive experiments conducted on this dataset reveal significant challenges faced by current multimodal large language models. Our results show a substantial performance gap between models and humans across all three tasks.
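To give a concrete sense of the panel sequence detection task, the sketch below shows one hypothetical way to frame it as a prompt for an MLLM: the four panels of a strip are shuffled and the model is asked to recover the original order. The function name, record structure, and text-only prompt format are illustrative assumptions, not the authors' released evaluation code or data format.

```python
import random

def build_sequence_prompt(panel_descriptions):
    """Build a hypothetical panel-sequence-detection prompt: the four panels of a
    yonkoma strip are shuffled and the model must recover the original order.
    (Illustrative sketch only; not the YManga authors' released code.)"""
    order = list(range(len(panel_descriptions)))   # original order, e.g. [0, 1, 2, 3]
    shuffled = order[:]
    random.shuffle(shuffled)
    lines = [f"Panel {i + 1}: {panel_descriptions[j]}" for i, j in enumerate(shuffled)]
    prompt = (
        "The following four yonkoma panels are shown out of order.\n"
        + "\n".join(lines)
        + "\nList the panel numbers in the correct narrative order."
    )
    # Gold answer: for each original position, the shuffled slot where that panel now appears.
    gold = [shuffled.index(k) + 1 for k in order]
    return prompt, gold

# Example with placeholder captions standing in for the panel images/annotations.
prompt, gold = build_sequence_prompt([
    "A cat stares at a closed door.",
    "The cat scratches the door.",
    "The owner opens the door.",
    "The cat walks away, uninterested.",
])
print(prompt)
print("Expected order:", gold)
```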
Anthology ID:
2024.findings-emnlp.506
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8672–8687
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.506/
DOI:
10.18653/v1/2024.findings-emnlp.506
Cite (ACL):
Qi Yang, Jingjie Zeng, Liang Yang, Zhihao Yang, and Hongfei Lin. 2024. Exploring the Capability of Multimodal LLMs with Yonkoma Manga: The YManga Dataset and Its Challenging Tasks. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8672–8687, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Exploring the Capability of Multimodal LLMs with Yonkoma Manga: The YManga Dataset and Its Challenging Tasks (Yang et al., Findings 2024)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.506.pdf