Detecting Personal Information in Training Corpora: an Analysis
Nishant Subramani, Sasha Luccioni, Jesse Dodge, Margaret Mitchell
Abstract
Large language models are trained on increasing quantities of unstructured text, the largest sources of which are scraped from the Web. These Web scrapes are mainly composed of heterogeneous collections of text from multiple domains with minimal documentation. While some work has been done to identify and remove toxic, biased, or sexual language, the topic of personal information (PI) in textual data used for training Natural Language Processing (NLP) models is relatively under-explored. In this work, we draw from definitions of PI across multiple countries to define the first PI taxonomy of its kind, categorized by type and risk level. We then conduct a case study on the Colossal Clean Crawled Corpus (C4) and the Pile, to detect some of the highest-risk personal information, such as email addresses and credit card numbers, and examine the differences between automatic and regular expression-based approaches for their detection. We identify shortcomings in modern approaches for PI detection, and propose a reframing of the problem that is informed by global perspectives and the goals in personal information detection.
- Anthology ID:
- 2023.trustnlp-1.18
- Volume:
- Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anaelia Ovalle, Kai-Wei Chang, Ninareh Mehrabi, Yada Pruksachatkun, Aram Galstyan, Jwala Dhamala, Apurv Verma, Trista Cao, Anoop Kumar, Rahul Gupta
- Venue:
- TrustNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 208–220
- URL:
- https://aclanthology.org/2023.trustnlp-1.18
- DOI:
- 10.18653/v1/2023.trustnlp-1.18
- Cite (ACL):
- Nishant Subramani, Sasha Luccioni, Jesse Dodge, and Margaret Mitchell. 2023. Detecting Personal Information in Training Corpora: an Analysis. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 208–220, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- Detecting Personal Information in Training Corpora: an Analysis (Subramani et al., TrustNLP 2023)
- PDF:
- https://preview.aclanthology.org/emnlp22-frontmatter/2023.trustnlp-1.18.pdf