Abstract
As Electronic Health Records (EHRs) become ubiquitous in healthcare systems worldwide, including in Arabic-speaking countries, the dual imperative of safeguarding patient privacy and leveraging data for research and quality improvement grows. This paper presents a first-of-its-kind automated de-identification pipeline for medical text tailored specifically to the Arabic language. The pipeline includes accurate medical Named Entity Recognition (NER) for identifying personal information; data obfuscation models that replace sensitive entities with fake entities; and an implementation that natively scales to large datasets on commodity clusters. This research makes two contributions. First, we adapt two existing NER architectures, BERT For Token Classification (BFTC) and BiLSTM-CNN-Char, to accommodate the unique syntactic and morphological characteristics of the Arabic language. Comparative analysis suggests that BFTC models outperform BiLSTM models, achieving higher F1 scores for both identifying and redacting personally identifiable information (PII) in Arabic medical texts. Second, we augment the deep learning models with a contextual parser engine to handle commonly missed entities. Experiments show that the combined pipeline achieves micro F1 scores ranging from 0.94 to 0.98 across 17 sensitive entity types on the test dataset, a translated version of the i2b2 2014 de-identification challenge dataset. This level of accuracy is in line with that achieved by manual de-identification by domain experts, suggesting that a fully automated and scalable process is now viable.
- Anthology ID:
- 2023.arabicnlp-1.4
- Volume:
- Proceedings of ArabicNLP 2023
- Month:
- December
- Year:
- 2023
- Address:
- Singapore (Hybrid)
- Editors:
- Hassan Sawaf, Samhaa El-Beltagy, Wajdi Zaghouani, Walid Magdy, Ahmed Abdelali, Nadi Tomeh, Ibrahim Abu Farha, Nizar Habash, Salam Khalifa, Amr Keleg, Hatem Haddad, Imed Zitouni, Khalil Mrini, Rawan Almatham
- Venues:
- ArabicNLP | WS
- SIG:
- SIGARAB
- Publisher:
- Association for Computational Linguistics
- Pages:
- 33–40
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2023.arabicnlp-1.4/
- DOI:
- 10.18653/v1/2023.arabicnlp-1.4
- Cite (ACL):
- Veysel Kocaman, Youssef Mellah, Hasham Haq, and David Talby. 2023. Automated De-Identification of Arabic Medical Records. In Proceedings of ArabicNLP 2023, pages 33–40, Singapore (Hybrid). Association for Computational Linguistics.
- Cite (Informal):
- Automated De-Identification of Arabic Medical Records (Kocaman et al., ArabicNLP 2023)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2023.arabicnlp-1.4.pdf
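
The abstract reports micro F1 scores across 17 sensitive entity types. As a reminder of what micro-averaging means, the sketch below pools true positives, false positives, and false negatives over all entity types before computing a single F1. It is a minimal illustration only: the function name `micro_f1`, the entity labels, and all counts are hypothetical and are not taken from the paper or its evaluation code.

```python
from typing import Dict, Tuple

# Per-entity counts as (true positives, false positives, false negatives).
Counts = Dict[str, Tuple[int, int, int]]


def micro_f1(counts: Counts) -> float:
    """Pool TP/FP/FN over all entity types, then compute one F1 score."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        # No correct predictions: define F1 as 0 to avoid division by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, invented for illustration; NOT results from the paper.
example_counts: Counts = {
    "NAME": (90, 5, 5),
    "DATE": (180, 10, 10),
    "PHONE": (18, 2, 2),
}

score = micro_f1(example_counts)
print(f"micro F1 = {score:.4f}")
```

Because the counts are pooled, frequent entity types (here the hypothetical `DATE`) weigh more heavily in the micro average than rare ones, which is worth keeping in mind when reading a single micro F1 figure for a label set with uneven frequencies.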