Jeonghun Baek
2025
Harnessing PDF Data for Improving Japanese Large Multimodal Models
Jeonghun Baek | Akiko Aizawa | Kiyoharu Aizawa
Findings of the Association for Computational Linguistics: ACL 2025
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 2.1% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs.
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation
Shota Onohara | Atsuyuki Miyai | Yuki Imajuku | Kazuki Egashira | Jeonghun Baek | Xiang Yue | Graham Neubig | Kiyoharu Aizawa
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)