Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model

Keito Sasagawa, Koki Maeda, Issa Sugiura, Shuhei Kurita, Naoaki Okazaki, Daisuke Kawahara


Abstract
To develop high-performing Visual Language Models (VLMs), it is essential to prepare multimodal resources such as image-text pairs, interleaved data, and instruction data. While multimodal resources for English are abundant, there is a significant lack of corresponding resources for non-English languages such as Japanese. To address this problem, we take Japanese as a non-English language and propose Japanese multimodal datasets for rapidly developing a Japanese multimodal model. We collect Japanese image-text pairs and interleaved data from web archives and generate Japanese instruction data using an existing large language model and a VLM. Our experimental results show that a VLM trained on these native datasets outperforms those relying on machine-translated content. The resulting VLM, the datasets, and the code used for training are publicly available.
Anthology ID:
2025.naacl-demo.38
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Nouha Dziri, Sean (Xiang) Ren, Shizhe Diao
Venues:
NAACL | WS
Publisher:
Association for Computational Linguistics
Pages:
470–484
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-demo.38/
Cite (ACL):
Keito Sasagawa, Koki Maeda, Issa Sugiura, Shuhei Kurita, Naoaki Okazaki, and Daisuke Kawahara. 2025. Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), pages 470–484, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model (Sasagawa et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-demo.38.pdf