De-Identification of Emails: Pseudonymizing Privacy-Sensitive Data in a German Email Corpus

Elisabeth Eder, Ulrike Krieg-Holz, Udo Hahn


Abstract
We address the pseudonymization of those stretches of text in emails that might allow real individuals to be identified. The task is decomposed into two steps. First, named entities carrying privacy-sensitive information (e.g., names of persons, locations, phone numbers, or dates) are identified; second, these privacy-bearing entities are replaced by synthetically generated surrogates (e.g., a person originally named ‘John Doe’ is renamed ‘Bill Powers’). We describe a system architecture for surrogate generation and evaluate our approach on CodeAlltag, a German email corpus.
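The second step of the abstract (replacing recognized entities with surrogates) can be illustrated with a minimal sketch. This is not the authors' system: the function name `pseudonymize`, the `SURROGATES` pools, the entity type labels, and the span format are all hypothetical, and the entity spans are assumed to come from some upstream recognizer that is not shown here. The sketch only demonstrates one property a surrogate generator plausibly needs, namely that repeated mentions of the same entity receive the same replacement.

```python
import random
from typing import Dict, List, Set, Tuple

# Hypothetical surrogate pools (illustrative only, not taken from the paper).
SURROGATES: Dict[str, List[str]] = {
    "PERSON": ["Bill Powers", "Anna Schmidt", "Jonas Weber"],
    "LOCATION": ["Neustadt", "Altdorf", "Bergheim"],
}

# A span is (start offset, end offset, entity type); assumed to be
# non-overlapping and produced by an entity recognizer (step one).
Span = Tuple[int, int, str]


def pseudonymize(text: str, spans: List[Span], seed: int = 0) -> str:
    """Replace each annotated span with a surrogate of the same type.

    Identical mentions of the same type map to the same surrogate, and a
    surrogate is not reused for a different original. If a pool is missing
    or exhausted, a generic placeholder such as "[PERSON]" is used.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    mapping: Dict[Tuple[str, str], str] = {}
    used: Set[str] = set()
    pieces: List[str] = []
    cursor = 0
    for start, end, etype in sorted(spans):
        original = text[start:end]
        key = (etype, original)
        if key not in mapping:
            candidates = [
                s for s in SURROGATES.get(etype, [])
                if s not in used and s != original
            ]
            mapping[key] = rng.choice(candidates) if candidates else f"[{etype}]"
            used.add(mapping[key])
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(mapping[key])        # surrogate in place of the entity
        cursor = end
    pieces.append(text[cursor:])           # remaining text after the last span
    return "".join(pieces)


if __name__ == "__main__":
    email = "John Doe met Mary in Berlin. John Doe left."
    annotated = [(0, 8, "PERSON"), (13, 17, "PERSON"),
                 (21, 27, "LOCATION"), (29, 37, "PERSON")]
    print(pseudonymize(email, annotated))
```

Keeping the mapping keyed on entity type and surface string is a deliberately simple design choice; it treats ‘John Doe’ and ‘John’ as unrelated entities, so a fuller treatment would need some form of coreference or name normalization.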
Anthology ID:
R19-1030
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
259–269
URL:
https://aclanthology.org/R19-1030
DOI:
10.26615/978-954-452-056-4_030
Cite (ACL):
Elisabeth Eder, Ulrike Krieg-Holz, and Udo Hahn. 2019. De-Identification of Emails: Pseudonymizing Privacy-Sensitive Data in a German Email Corpus. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 259–269, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
De-Identification of Emails: Pseudonymizing Privacy-Sensitive Data in a German Email Corpus (Eder et al., RANLP 2019)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/R19-1030.pdf