Data Anonymization for Privacy-Preserving Large Language Model Fine-Tuning on Call Transcripts

Shayna Gardiner, Tania Habib, Kevin Humphreys, Masha Azizi, Frederic Mailhot, Anne Paling, Preston Thomas, Nathan Zhang


Abstract
Large language models in public-facing industrial applications must accurately process data for the domain in which they are deployed, but they must not leak sensitive or confidential information when used. We present a process for anonymizing training data, a framework for quantitatively and qualitatively assessing the effectiveness of this process, and an assessment of the effectiveness of models fine-tuned on anonymized data in comparison with commercially available LLM APIs.
Anthology ID:
2024.caldpseudo-1.8
Volume:
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu
Venues:
CALD-pseudo | WS
Publisher:
Association for Computational Linguistics
Pages:
64–75
URL:
https://aclanthology.org/2024.caldpseudo-1.8
Cite (ACL):
Shayna Gardiner, Tania Habib, Kevin Humphreys, Masha Azizi, Frederic Mailhot, Anne Paling, Preston Thomas, and Nathan Zhang. 2024. Data Anonymization for Privacy-Preserving Large Language Model Fine-Tuning on Call Transcripts. In Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), pages 64–75, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Data Anonymization for Privacy-Preserving Large Language Model Fine-Tuning on Call Transcripts (Gardiner et al., CALD-pseudo-WS 2024)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2024.caldpseudo-1.8.pdf
Video:
https://preview.aclanthology.org/naacl-24-ws-corrections/2024.caldpseudo-1.8.mp4