“Beste Grüße, Maria Meyer” — Pseudonymization of Privacy-Sensitive Information in Emails
Elisabeth Eder, Michael Wiegand, Ulrike Krieg-Holz, Udo Hahn
Abstract
The exploding amount of user-generated content has spurred NLP research to deal with documents from various digital communication formats (tweets, chats, emails, etc.). Using these texts as language resources implies complying with legal data privacy regulations. To protect the personal data of individuals and preclude their identification, we employ pseudonymization. More precisely, we identify those text spans that carry information revealing an individual’s identity (e.g., names of persons, locations, phone numbers, or dates) and subsequently substitute them with synthetically generated surrogates. Based on CodE Alltag, a German-language email corpus, we address two tasks. The first task is to evaluate various architectures for the automatic recognition of privacy-sensitive entities in raw data. The second task examines the applicability of pseudonymized data as training data for such systems since models learned on original data cannot be published for reasons of privacy protection. As outputs of both tasks, we, first, generate a new pseudonymized version of CodE Alltag compliant with the legal requirements of the General Data Protection Regulation (GDPR). Second, we make accessible a tagger for recognizing privacy-sensitive information in German emails and similar text genres, which is trained on already pseudonymized data.
- Anthology ID:
- 2022.lrec-1.79
- Volume:
- Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 741–752
- URL:
- https://aclanthology.org/2022.lrec-1.79
- Cite (ACL):
- Elisabeth Eder, Michael Wiegand, Ulrike Krieg-Holz, and Udo Hahn. 2022. “Beste Grüße, Maria Meyer” — Pseudonymization of Privacy-Sensitive Information in Emails. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 741–752, Marseille, France. European Language Resources Association.
- Cite (Informal):
- “Beste Grüße, Maria Meyer” — Pseudonymization of Privacy-Sensitive Information in Emails (Eder et al., LREC 2022)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/2022.lrec-1.79.pdf