De-identification of Privacy-related Entities in Job Postings

Kristian Nørgaard Jensen; Mike Zhang; Barbara Plank

Abstract

De-identification is the task of detecting privacy-related entities in text, such as person names, emails, and contact data. It has been well studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stackoverflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve these baselines, we experiment with BERT representations and with distantly related auxiliary data via multi-task learning. Our results show that auxiliary data helps to improve de-identification performance. While BERT representations improve performance, surprisingly, “vanilla” BERT turned out to be more effective than BERT trained on Stackoverflow-related data.

Anthology ID: 2021.nodalida-main.21
Volume: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
Month: May 31–June 2
Year: 2021
Address: Reykjavik, Iceland (Online)
Editors: Simon Dobnik, Lilja Øvrelid
Venue: NoDaLiDa
Publisher: Linköping University Electronic Press, Sweden
Pages: 210–221
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2021.nodalida-main.21/
Cite (ACL): Kristian Nørgaard Jensen, Mike Zhang, and Barbara Plank. 2021. De-identification of Privacy-related Entities in Job Postings. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 210–221, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden.
Cite (Informal): De-identification of Privacy-related Entities in Job Postings (Jensen et al., NoDaLiDa 2021)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2021.nodalida-main.21.pdf
Code: kris927b/JobStack
Data: JobStack
