J-ClinicalBench: A Benchmark for Evaluating Large Language Models on Practical Clinical Tasks in Japanese

Seiji Shimizu, Tomohiro Nishiyama, HISADA Shohei, Yamato Himi, Shoko Wakamiya, Yuki Yanagisawa, Masami Tsuchiya, Satoko Hori, Eiji ARAMAKI


Abstract
Recent advances in large language models (LLMs) have accelerated NLP applications in the medical and clinical domains. However, evaluation remains limited for non-English languages such as Japanese, where clinical corpora are particularly scarce. To address this gap, we present J-ClinicalBench, a publicly available benchmark designed to reflect realistic Japanese clinical tasks. We first created 227 expert-authored clinical documents and constructed five new datasets for core clinical tasks. Building on these datasets, J-ClinicalBench comprises nine clinical tasks spanning clinical language reasoning, generation, and understanding. We establish baseline performance on J-ClinicalBench by evaluating state-of-the-art proprietary and open-source Japanese LLMs, providing the first assessment of their utility in practical clinical scenarios. By releasing this benchmark, we aim to foster the development and evaluation of clinically applicable LLMs in Japanese healthcare, bridging the current gap between clinical NLP research and clinical practice.
Anthology ID:
2026.lrec-main.28
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
419–430
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.28/
Cite (ACL):
Seiji Shimizu, Tomohiro Nishiyama, HISADA Shohei, Yamato Himi, Shoko Wakamiya, Yuki Yanagisawa, Masami Tsuchiya, Satoko Hori, and Eiji ARAMAKI. 2026. J-ClinicalBench: A Benchmark for Evaluating Large Language Models on Practical Clinical Tasks in Japanese. International Conference on Language Resources and Evaluation, main:419–430.
Cite (Informal):
J-ClinicalBench: A Benchmark for Evaluating Large Language Models on Practical Clinical Tasks in Japanese (Shimizu et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.28.pdf