Resource-Efficient Anonymization of Textual Data via Knowledge Distillation from Large Language Models
Tobias Deußer, Max Hahnbück, Tobias Uelwer, Cong Zhao, Christian Bauckhage, Rafet Sifa
Abstract
Protecting personal and sensitive information in textual data is increasingly crucial, especially when leveraging large language models (LLMs) that may pose privacy risks due to their API-based access. We introduce a novel approach and pipeline for anonymizing text across arbitrary domains without the need for manually labeled data or extensive computational resources. Our method employs knowledge distillation from LLMs into smaller encoder-only models via named entity recognition (NER) coupled with regular expressions to create a lightweight model capable of effective anonymization while preserving the semantic and contextual integrity of the data. This reduces computational overhead, enabling deployment on less powerful servers or even personal computing devices. Our findings suggest that knowledge distillation offers a scalable, resource-efficient pathway for anonymization, balancing privacy preservation with model performance and computational efficiency.
- Anthology ID:
- 2025.coling-industry.20
- Volume:
- Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
- Month:
- January
- Year:
- 2025
- Address:
- Abu Dhabi, UAE
- Editors:
- Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Darwish, Apoorv Agarwal
- Venue:
- COLING
- Publisher:
- Association for Computational Linguistics
- Pages:
- 243–250
- URL:
- https://preview.aclanthology.org/Author-page-Marten-During-lu/2025.coling-industry.20/
- Cite (ACL):
- Tobias Deußer, Max Hahnbück, Tobias Uelwer, Cong Zhao, Christian Bauckhage, and Rafet Sifa. 2025. Resource-Efficient Anonymization of Textual Data via Knowledge Distillation from Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 243–250, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal):
- Resource-Efficient Anonymization of Textual Data via Knowledge Distillation from Large Language Models (Deußer et al., COLING 2025)
- PDF:
- https://preview.aclanthology.org/Author-page-Marten-During-lu/2025.coling-industry.20.pdf
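The anonymization scheme described in the abstract — an NER model coupled with regular expressions that replace sensitive spans with type tags — can be sketched as follows. This is a minimal illustration with hypothetical patterns and a stubbed NER interface, not the authors' implementation; the distilled encoder-only model that supplies the entity spans is omitted for brevity.

```python
import re

# Hypothetical regex rules for machine-readable identifiers; the paper
# couples rules like these with an NER model for names, places, etc.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s()/-]{7,}\d"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def anonymize(text: str, ner_spans=None) -> str:
    """Replace NER-detected spans and regex matches with entity-type tags."""
    # First mask spans found by an upstream NER model,
    # e.g. ner_spans = [("Alice", "PER")].
    for surface, label in (ner_spans or []):
        text = text.replace(surface, f"[{label}]")
    # Then apply the regular-expression rules for structured identifiers.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(anonymize("Contact Alice at alice@example.com or +49 228 1234567.",
                ner_spans=[("Alice", "PER")]))
# → Contact [PER] at [EMAIL] or [PHONE].
```

Replacing spans with type tags rather than deleting them is what preserves the contextual integrity the abstract mentions: downstream models still see that a person or phone number occurred, just not which one.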