Abstract
We deal with the pseudonymization of those stretches of text in emails that might allow real individual persons to be identified. This task is decomposed into two steps. First, named entities carrying privacy-sensitive information (e.g., names of persons, locations, phone numbers or dates) are identified, and, second, these privacy-bearing entities are replaced by synthetically generated surrogates (e.g., a person originally named ‘John Doe’ is renamed as ‘Bill Powers’). We describe a system architecture for surrogate generation and evaluate our approach on CodeAlltag, a German email corpus.
- Anthology ID:
- R19-1030
- Volume:
- Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
- Month:
- September
- Year:
- 2019
- Address:
- Varna, Bulgaria
- Editors:
- Ruslan Mitkov, Galia Angelova
- Venue:
- RANLP
- Publisher:
- INCOMA Ltd.
- Pages:
- 259–269
- URL:
- https://aclanthology.org/R19-1030
- DOI:
- 10.26615/978-954-452-056-4_030
- Cite (ACL):
- Elisabeth Eder, Ulrike Krieg-Holz, and Udo Hahn. 2019. De-Identification of Emails: Pseudonymizing Privacy-Sensitive Data in a German Email Corpus. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 259–269, Varna, Bulgaria. INCOMA Ltd.
- Cite (Informal):
- De-Identification of Emails: Pseudonymizing Privacy-Sensitive Data in a German Email Corpus (Eder et al., RANLP 2019)
- PDF:
- https://preview.aclanthology.org/proper-vol2-ingestion/R19-1030.pdf