RareSyn: Health Record Synthesis for Rare Disease Diagnosis

Huimin Wang, Yutian Zhao, Yefeng Zheng, Xian Wu


Abstract
Diagnosis based on Electronic Health Records (EHRs) often struggles with data scarcity and privacy concerns. To address these issues, we introduce RareSyn, a data synthesis approach designed to augment and de-identify EHRs, with a focus on rare diseases. The core idea of RareSyn is to use seed EHRs of rare diseases to recall similar records from both common and rare diseases, and then to use Large Language Models to substitute the key medical information (e.g., symptoms or examination details) in these records with information from a knowledge graph, thereby generating new EHRs. We first train a Transformer encoder with contrastive learning to integrate various types of medical knowledge. RareSyn then iterates over four steps (recalling similar EHRs, structuring them, revising them, and generating new EHRs) until the produced EHRs achieve extensive coverage of the rare disease knowledge. We assess RareSyn on its utility for diagnosis modeling, the diversity of medical knowledge it incorporates, and the privacy of the synthesized EHRs. Extensive experiments demonstrate its effectiveness in improving disease diagnosis, enhancing diversity, and maintaining privacy.
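The recall, revise, and generate loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: Jaccard overlap stands in for the contrastively trained encoder, a simple set substitution stands in for the LLM-based revision, records are assumed to be already structured as sets of findings, and every name below (`recall_similar`, `revise`, `synthesize`, the placeholder findings and knowledge items) is hypothetical.

```python
def jaccard(a, b):
    """Set-overlap similarity; a toy stand-in for a learned encoder."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recall_similar(seed, pool, k):
    """Return the k records in the pool most similar to the seed record."""
    return sorted(pool, key=lambda rec: jaccard(seed, rec), reverse=True)[:k]


def revise(record, new_fact):
    """Swap one finding for a knowledge item (stand-in for LLM substitution)."""
    findings = sorted(record)
    return frozenset(findings[1:]) | {new_fact}


def synthesize(seeds, pool, knowledge, k=2, max_rounds=10):
    """Iterate recall -> revise -> generate until the knowledge is covered."""
    covered = set()
    synthetic = []
    for _ in range(max_rounds):
        if covered >= knowledge:
            break  # every knowledge item appears in some synthetic record
        for seed in seeds:
            for record in recall_similar(seed, pool, k):
                remaining = sorted(knowledge - covered)
                if not remaining:
                    break
                synthetic.append(revise(record, remaining[0]))
                covered.add(remaining[0])
    return synthetic, covered


# Placeholder data (hypothetical, not taken from the paper).
seeds = [frozenset({"symptom_a", "symptom_b", "symptom_c"})]
pool = [
    frozenset({"symptom_a", "symptom_x"}),
    frozenset({"symptom_b", "symptom_c", "symptom_y"}),
    frozenset({"symptom_z"}),
]
knowledge = {"kg_fact_1", "kg_fact_2", "kg_fact_3"}

synthetic, covered = synthesize(seeds, pool, knowledge)
for record in synthetic:
    print(sorted(record))
```

The stopping rule is the part that mirrors the abstract most directly: generation continues until the synthetic records jointly cover the target knowledge set, with `max_rounds` added here only as a safety bound.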
Anthology ID:
2025.emnlp-main.620
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
12322–12338
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.620/
Cite (ACL):
Huimin Wang, Yutian Zhao, Yefeng Zheng, and Xian Wu. 2025. RareSyn: Health Record Synthesis for Rare Disease Diagnosis. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12322–12338, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
RareSyn: Health Record Synthesis for Rare Disease Diagnosis (Wang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.620.pdf
Checklist:
 2025.emnlp-main.620.checklist.pdf