DualAlign: Generating Clinically Grounded Synthetic Data

Rumeng Li, XWang, Hong Yu


Abstract
Synthetic clinical data are essential for advancing AI in healthcare, given strict privacy constraints on electronic health records (EHRs), the scarcity of annotated data for rare or slowly progressing conditions, and demographic biases in observational cohorts. Large language models (LLMs) can generate fluent clinical text, but ensuring that such outputs are both clinically grounded and useful for downstream modeling remains challenging. We present DualAlign, a disease-agnostic framework for generating privacy-preserving, clinically faithful synthetic EHR narratives. DualAlign improves generation fidelity through two complementary alignment mechanisms: persona alignment, which conditions generation on patient demographics and risk factors, and symptom-trajectory alignment, which grounds narratives in empirically observed longitudinal symptom patterns. Using Alzheimer’s disease (AD) as a case study, DualAlign produces context-aware, symptom-rich sentences that more closely reflect real-world clinical documentation. Augmenting limited gold-standard data with DualAlign substantially improves AD symptom classification, outperforming both gold-only training and unconstrained synthetic baselines. Overall, DualAlign provides a generalizable approach for generating high-utility synthetic clinical text in chronic and progressive diseases, reducing annotation burden while enabling scalable and privacy-conscious clinical NLP research.
Anthology ID:
2026.findings-acl.1405
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
28189–28208
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1405/
Cite (ACL):
Rumeng Li, XWang, and Hong Yu. 2026. DualAlign: Generating Clinically Grounded Synthetic Data. In Findings of the Association for Computational Linguistics: ACL 2026, pages 28189–28208, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
DualAlign: Generating Clinically Grounded Synthetic Data (Li et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1405.pdf
Checklist:
2026.findings-acl.1405.checklist.pdf