Detecting Personal Information in Training Corpora: an Analysis

Nishant Subramani, Sasha Luccioni, Jesse Dodge, Margaret Mitchell


Abstract
Large language models are trained on increasing quantities of unstructured text, the largest sources of which are scraped from the Web. These Web scrapes are mainly composed of heterogeneous collections of text from multiple domains with minimal documentation. While some work has been done to identify and remove toxic, biased, or sexual language, the topic of personal information (PI) in textual data used for training Natural Language Processing (NLP) models is relatively under-explored. In this work, we draw from definitions of PI across multiple countries to define the first PI taxonomy of its kind, categorized by type and risk level. We then conduct a case study on the Colossal Clean Crawled Corpus (C4) and the Pile, to detect some of the highest-risk personal information, such as email addresses and credit card numbers, and examine the differences between automatic and regular expression-based approaches for their detection. We identify shortcomings in modern approaches for PI detection, and propose a reframing of the problem that is informed by global perspectives and the goals in personal information detection.
Anthology ID:
2023.trustnlp-1.18
Volume:
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anaelia Ovalle, Kai-Wei Chang, Ninareh Mehrabi, Yada Pruksachatkun, Aram Galystan, Jwala Dhamala, Apurv Verma, Trista Cao, Anoop Kumar, Rahul Gupta
Venue:
TrustNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
208–220
Language:
URL:
https://aclanthology.org/2023.trustnlp-1.18
DOI:
10.18653/v1/2023.trustnlp-1.18
Bibkey:
Cite (ACL):
Nishant Subramani, Sasha Luccioni, Jesse Dodge, and Margaret Mitchell. 2023. Detecting Personal Information in Training Corpora: an Analysis. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 208–220, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Detecting Personal Information in Training Corpora: an Analysis (Subramani et al., TrustNLP 2023)
Copy Citation:
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2023.trustnlp-1.18.pdf