Kazuma Kobayashi
2026
Building Effective Japanese Medical LLMs with an Open Recipe for Domain Adaptation through Continued Pre-training
Akiko Aizawa | Yuki Arase | Fei Cheng | Jiahao Huang | Zhiyi Huang | Junfeng Jiang | Teruhito Kanazawa | Daisuke Kawahara | Kazuma Kobayashi | Takashi Kodama | Sadao Kurohashi | Yusuke Oda | Yuma Tsuta | Zhen Wan | Zhishen Yang | Rio Yokota
Proceedings of the Fifteenth Language Resources and Evaluation Conference
In high-stakes domains such as medicine, ensuring transparency of the training corpus is essential, with careful consideration of local healthcare landscapes; however, the majority of existing medical large language models (LLMs) have not disclosed the details of their training corpora. Here, we introduce an open recipe for adapting LLMs to the Japanese medical domain. We employed fully open-source Japanese general-domain LLMs as base models, whose pre-training datasets are also disclosed. To establish effective corpora for domain adaptation through continued pre-training, we started with small-scale medical datasets and ultimately constructed a medical corpus of 79.6B tokens, incorporating local clinical guidelines, medical textbooks, and other domain-specific resources. The resulting continued pre-trained LLM, SIP-med-llm-8x13B, with an active parameter count of 22B, demonstrated favorable accuracy on benchmarks including the Japanese National Medical Examination. This performance was comparable to that of 70B-parameter open-weight models whose construction details remain non-transparent. This is the first case in the Japanese medical field in which complete corpus details have been disclosed for a model developed fully from scratch, providing important insights for future efforts to construct medical LLMs tailored to the specific characteristics of local contexts. The model is publicly available at this Hugging Face repository: https://huggingface.co/SIP-med-LLM/SIP-jmed-llm-2-8x13b-OP-instruct.
2025
Leveraging High-Resource English Corpora for Cross-lingual Domain Adaptation in Low-Resource Japanese Medicine via Continued Pre-training
Kazuma Kobayashi | Zhen Wan | Fei Cheng | Yuma Tsuta | Xin Zhao | Junfeng Jiang | Jiahao Huang | Zhiyi Huang | Yusuke Oda | Rio Yokota | Yuki Arase | Daisuke Kawahara | Akiko Aizawa | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2025
Limited low-resource-language corpora in professional domains such as medicine hinder cross-lingual domain adaptation of pre-trained large language models (PLMs). While abundant English medical corpora could compensate for this scarcity, the effective mixture of English and target-language text, including machine-translated content, remains underexplored. We examined how linguistic features (e.g., token sizes and language proportions) affect performance on a Japanese–English medical knowledge benchmark. Through continued pre-training of a bilingual PLM on multilingual corpora with varying proportions of English and Japanese text (both original and machine-translated), we analyzed correlations between linguistic features and fine-grained task performance. Our findings suggest a practical approach to optimizing multilingual corpora for cross-lingual domain adaptation: leverage specialized knowledge from English corpora while ensuring sufficient coverage of language-specific expressions in the target language (Japanese). Such insights will contribute to the development of multilingual models that effectively leverage English-language resources in professional domains with low-resource languages.
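The corpus-mixing setup the abstract describes can be illustrated with a minimal sketch. This is not the authors' actual pipeline: the greedy sampling strategy, the per-language token budgets, and the `tokenize` stand-in (defaulting to whitespace splitting, where a real setup would use the base model's tokenizer, especially for Japanese) are all assumptions for illustration only.

```python
# Minimal sketch (illustrative, not the paper's method): build a bilingual
# continued-pre-training mixture with a target English/Japanese token ratio.
# `tokenize` is a placeholder for the base model's real tokenizer.

def mix_corpora(en_docs, ja_docs, total_tokens, en_ratio, tokenize=str.split):
    """Greedily sample documents per language until each token budget is met.

    Returns the sampled (language, document) pairs and the token counts
    actually used per language (the last sampled document may overshoot).
    """
    budgets = {"en": int(total_tokens * en_ratio),
               "ja": int(total_tokens * (1 - en_ratio))}
    pools = {"en": en_docs, "ja": ja_docs}
    mixture = []
    used = {"en": 0, "ja": 0}
    for lang, docs in pools.items():
        for doc in docs:
            if used[lang] >= budgets[lang]:
                break  # this language's token budget is filled
            mixture.append((lang, doc))
            used[lang] += len(tokenize(doc))
    return mixture, used
```

Varying `en_ratio` across runs would reproduce the kind of language-proportion sweep the abstract refers to, with each resulting mixture fed to a separate continued pre-training run.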