Kaleb Smith


2025

pdf bib
MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters
Amin Dada | Osman Koras | Marie Bauer | Amanda Butler | Kaleb Smith | Jens Kleesiek | Julian Friedrich
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)

While increasing patients’ access to medical documents improves medical care, this benefit is limited by varying health literacy levels and complex medical terminology. Large language models (LLMs) offer solutions by simplifying medical information. However, evaluating LLMs for safe and patient-friendly text generation is difficult due to the lack of standardized evaluation resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset created from MIMIC-IV discharge summaries through an automated pipeline combining LLM-based question-answer generation with manual quality checks. We use this dataset to evaluate various LLMs on patient-oriented question-answering. Our findings reveal that general-purpose LLMs frequently surpass biomedical-adapted models, while automated metrics correlate with human judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the development of LLMs to enhance patient understanding and ultimately improve care outcomes.

2023

pdf bib
On the Impact of Cross-Domain Data on German Language Models
Amin Dada | Aokun Chen | Cheng Peng | Kaleb Smith | Ahmad Idrissi-Yaghir | Constantin Seibold | Jianning Li | Lars Heiliger | Christoph Friedrich | Daniel Truhn | Jan Egger | Jiang Bian | Jens Kleesiek | Yonghui Wu
Findings of the Association for Computational Linguistics: EMNLP 2023

Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to 4.45% over the previous state-of-the-art.