Abstract
De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data. It has been well-studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stackoverflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve these baselines, we experiment with BERT representations and with distantly related auxiliary data via multi-task learning. Our results show that auxiliary data helps to improve de-identification performance. While BERT representations improve performance, surprisingly, “vanilla” BERT turned out to be more effective than BERT trained on Stackoverflow-related data.
- Anthology ID:
- 2021.nodalida-main.21
- Volume:
- Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
- Month:
- 31 May–2 June
- Year:
- 2021
- Address:
- Reykjavik, Iceland (Online)
- Editors:
- Simon Dobnik, Lilja Øvrelid
- Venue:
- NoDaLiDa
- Publisher:
- Linköping University Electronic Press, Sweden
- Pages:
- 210–221
- URL:
- https://aclanthology.org/2021.nodalida-main.21
- Cite (ACL):
- Kristian Nørgaard Jensen, Mike Zhang, and Barbara Plank. 2021. De-identification of Privacy-related Entities in Job Postings. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 210–221, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden.
- Cite (Informal):
- De-identification of Privacy-related Entities in Job Postings (Jensen et al., NoDaLiDa 2021)
- PDF:
- https://preview.aclanthology.org/fix-dup-bibkey/2021.nodalida-main.21.pdf
- Code
- kris927b/JobStack
- Data
- JobStack