Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text
Priyanshi Garg, Ishita Rao, Jieqiong Ding, Amandalynne Paullada
Abstract
Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that foreground clinician judgment, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty, and that labeling patterns differ across insurance status. We argue the clinical NLP community should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.
- Anthology ID:
- 2026.clpsych-1.9
- Volume:
- Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Editors:
- Aya Zirikly, Kfir Bar, Sean MacAvaney, Molly Ireland, Yaakov Ophir, Dana Atzil-Slonim, Vasudha Varadarajan, Steven Bedrick, Bart Desmet
- Venues:
- CLPsych | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 119–127
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.clpsych-1.9/
- Cite (ACL):
- Priyanshi Garg, Ishita Rao, Jieqiong Ding, and Amandalynne Paullada. 2026. Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text. In Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), pages 119–127, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal):
- Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text (Garg et al., CLPsych 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.clpsych-1.9.pdf