Data Anonymization for Privacy-Preserving Large Language Model Fine-Tuning on Call Transcripts
Shayna Gardiner, Tania Habib, Kevin Humphreys, Masha Azizi, Frederic Mailhot, Anne Paling, Preston Thomas, Nathan Zhang
Abstract
Large language models in public-facing industrial applications must accurately process data for the domain in which they are deployed, but they must not leak sensitive or confidential information when used. We present a process for anonymizing training data, a framework for quantitatively and qualitatively assessing the effectiveness of this process, and an assessment of the effectiveness of models fine-tuned on anonymized data in comparison with commercially available LLM APIs.
- Anthology ID: 2024.caldpseudo-1.8
- Volume: Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)
- Month: March
- Year: 2024
- Address: St. Julian’s, Malta
- Editors: Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu
- Venues: CALD-pseudo | WS
- Publisher: Association for Computational Linguistics
- Pages: 64–75
- URL: https://aclanthology.org/2024.caldpseudo-1.8
- Cite (ACL): Shayna Gardiner, Tania Habib, Kevin Humphreys, Masha Azizi, Frederic Mailhot, Anne Paling, Preston Thomas, and Nathan Zhang. 2024. Data Anonymization for Privacy-Preserving Large Language Model Fine-Tuning on Call Transcripts. In Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), pages 64–75, St. Julian’s, Malta. Association for Computational Linguistics.
- Cite (Informal): Data Anonymization for Privacy-Preserving Large Language Model Fine-Tuning on Call Transcripts (Gardiner et al., CALD-pseudo-WS 2024)
- PDF: https://preview.aclanthology.org/naacl-24-ws-corrections/2024.caldpseudo-1.8.pdf