A Dataset of Brazilian Portuguese Clinical Notes for Anaphylaxis Detection
Matheus Machado, Vinícius Vanzin, Dilvan Moreira, Luis Felipe Ensina, Fábio Lario
Abstract
Anaphylaxis is an acute, potentially life-threatening allergic reaction that requires rapid recognition in clinical settings. Natural language processing (NLP) approaches for automatic detection of anaphylaxis in clinical narratives can support large-scale analysis of health records and retrospective clinical research. However, such approaches depend on high-quality labeled corpora, and resources for Portuguese remain scarce. This paper introduces a corpus of Brazilian Portuguese clinical notes annotated by domain specialists for the presence or absence of anaphylaxis. The dataset comprises 969 clinical narratives drawn from three sources: clinician-authored synthetic clinical notes designed to represent realistic scenarios, case reports from the medical literature rewritten into note-like format by specialists, and a subset of de-identified notes from the publicly available SemClinBr corpus. All texts were reviewed and labeled by allergists using established clinical diagnostic criteria, and the corpus reflects realistic prevalence conditions, with approximately 5% positive cases. We describe the corpus design, data sources, annotation methodology, and composition, discuss potential research applications, and address ethical considerations. The corpus is intended as a reusable resource for Portuguese clinical NLP, supporting future work on document classification, information extraction, and language modeling in the medical domain.
- Anthology ID: 2026.propor-2.15
- Volume: Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
- Month: April
- Year: 2026
- Address: Salvador, Brazil
- Editors: Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
- Venue: PROPOR
- Publisher: Association for Computational Linguistics
- Pages: 78–87
- URL: https://preview.aclanthology.org/ingest-dnd/2026.propor-2.15/
- Cite (ACL): Matheus Machado, Vinícius Vanzin, Dilvan Moreira, Luis Felipe Ensina, and Fábio Lario. 2026. A Dataset of Brazilian Portuguese Clinical Notes for Anaphylaxis Detection. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2, pages 78–87, Salvador, Brazil. Association for Computational Linguistics.
- Cite (Informal): A Dataset of Brazilian Portuguese Clinical Notes for Anaphylaxis Detection (Machado et al., PROPOR 2026)
- PDF: https://preview.aclanthology.org/ingest-dnd/2026.propor-2.15.pdf