DAPI: Domain Adaptive Toxicity Probe Vector Intervention, for Fine-Grained Detoxification

Cho Hyeonsu, Dooyoung Kim, Youngjoong Ko


Abstract
There have been attempts to use linear probes for detoxification, with existing studies relying on a single toxicity probe vector to reduce toxicity. However, toxicity can be divided into various fine-grained subcategories, making it difficult to remove certain types of toxicity with a single toxicity probe vector. To address this limitation, we propose a category-specific toxicity probe vector approach. First, we train multiple toxicity probe vectors for different toxicity categories. During generation, we dynamically select the most relevant toxicity probe vector based on the current context. Finally, the selected vector is dynamically scaled and subtracted from the model. Our method successfully mitigates toxicity from categories that the single probe vector approach fails to detoxify. Experiments demonstrate that our approach achieves up to a 78.52% reduction in toxicity on the evaluation dataset, while fluency remains nearly unchanged, with only a 0.052% drop compared to the unsteered model.
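The three steps in the abstract (per-category probe vectors, context-based selection, scaled subtraction) can be illustrated with a minimal sketch. The abstract does not state the selection or scaling rules, so everything here is an assumption: the function name `steer`, the parameter `alpha`, choosing the category by largest dot product with the hidden state, and scaling by that same score are all hypothetical and may differ from the authors' method. The probe vectors are toy values rather than trained probes.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Return v rescaled to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steer(hidden, probes, alpha=1.0):
    """Subtract the best-matching category probe from one hidden state.

    hidden: list of floats, a hypothetical hidden state at one position.
    probes: dict mapping category name -> probe vector of the same length.
    alpha:  hypothetical global steering strength (not from the paper).
    Returns (steered_hidden, chosen_category, scale).
    """
    units = {name: normalize(vec) for name, vec in probes.items()}
    # Assumed selection rule: pick the category whose probe direction
    # aligns most strongly with the current hidden state.
    scores = {name: dot(hidden, u) for name, u in units.items()}
    best = max(scores, key=scores.get)
    # Assumed scaling rule: proportional to the alignment, clipped at zero
    # so a context with no toxic component is left unchanged.
    scale = alpha * max(scores[best], 0.0)
    steered = [h - scale * u for h, u in zip(hidden, units[best])]
    return steered, best, scale


# Toy probes for two illustrative categories (values are made up).
probes = {"insult": [1.0, 0.0, 0.0], "threat": [0.0, 2.0, 0.0]}
hidden = [0.5, 3.0, 1.0]
steered, category, scale = steer(hidden, probes)
print(category, scale, steered)
```

On this toy input the "threat" probe aligns best with the hidden state, so only the component along that direction is removed and the other coordinates are untouched. With `alpha=1.0` the sketch reduces to projecting out the selected direction; a real system would apply the intervention inside the model at each generation step.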
Anthology ID:
2025.findings-acl.779
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
15059–15069
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.779/
Cite (ACL):
Cho Hyeonsu, Dooyoung Kim, and Youngjoong Ko. 2025. DAPI: Domain Adaptive Toxicity Probe Vector Intervention, for Fine-Grained Detoxification. In Findings of the Association for Computational Linguistics: ACL 2025, pages 15059–15069, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
DAPI: Domain Adaptive Toxicity Probe Vector Intervention, for Fine-Grained Detoxification (Hyeonsu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.779.pdf