Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

Priyanshi Garg, Ishita Rao, Jieqiong Ding, Amandalynne Paullada


Abstract
Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that foreground clinician judgment, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty, and that labeling patterns differ across insurance status. We argue the clinical NLP community should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.
Anthology ID:
2026.clpsych-1.9
Volume:
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Aya Zirikly, Kfir Bar, Sean MacAvaney, Molly Ireland, Yaakov Ophir, Dana Atzil-Slonim, Vasudha Varadarajan, Steven Bedrick, Bart Desmet
Venues:
CLPsych | WS
Publisher:
Association for Computational Linguistics
Pages:
119–127
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.clpsych-1.9/
Cite (ACL):
Priyanshi Garg, Ishita Rao, Jieqiong Ding, and Amandalynne Paullada. 2026. Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text. In Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), pages 119–127, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text (Garg et al., CLPsych 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.clpsych-1.9.pdf