@inproceedings{baek-etal-2025-harnessing,
title = "Harnessing {PDF} Data for Improving {J}apanese Large Multimodal Models",
author = "Baek, Jeonghun and
Aizawa, Akiko and
Aizawa, Kiyoharu",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/display_plenaries/2025.findings-acl.108/",
pages = "2108--2123",
ISBN = "979-8-89176-256-5",
abstract = "Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 2.1{\%} to 13.8{\%} on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs."
}