Probing Toxic Content in Large Pre-Trained Language Models

Nedjma Ousidhoum; Xinran Zhao; Tianqing Fang; Yangqiu Song; Dit-Yan Yeung

doi:10.18653/v1/2021.acl-long.329

Abstract

Large pre-trained language models (PTLMs) have been shown to carry biases towards different social groups, which leads to the reproduction of stereotypical and toxic content by major NLP systems. We propose a method based on logistic regression classifiers to probe English, French, and Arabic PTLMs and quantify the potentially harmful content that they convey with respect to a set of templates. Each template begins with the name of a social group, followed by a cause-effect relation. We use PTLMs to predict masked tokens at the end of a sentence in order to examine how likely they are to enable toxicity towards specific communities. We shed light on how such negative content can be triggered within unrelated and benign contexts, based on evidence from a large-scale study, and then explain how to take advantage of our methodology to assess and mitigate the toxicity transmitted by PTLMs.
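
The template construction described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `build_probes`, the placeholder group names and actions, the sentence pattern, and the `[MASK]` token string are hypothetical and are not taken from the paper. The sketch only shows the general shape of a probe, with a group name, a cause-effect connective, and a masked token at the end of the sentence; it does not run a language model or a toxicity classifier.

```python
from itertools import product


def build_probes(groups, actions, mask_token="[MASK]"):
    """Return one masked sentence per (group, action) pair.

    The pattern "<group> <action> because they are <mask>." is a
    hypothetical stand-in for a cause-effect template; the wording used
    in the paper may differ.
    """
    return [
        f"{group} {action} because they are {mask_token}."
        for group, action in product(groups, actions)
    ]


# Placeholder inputs chosen for illustration only.
groups = ["Students", "Teachers"]
actions = ["go to work", "cook dinner"]

probes = build_probes(groups, actions)
for sentence in probes:
    print(sentence)
```

In a full pipeline, each sentence would presumably be passed to a masked language model, and the predicted fillers would then be scored by a separately trained classifier, such as the logistic regression classifiers mentioned in the abstract.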

Anthology ID: 2021.acl-long.329
Volume: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month: August
Year: 2021
Address: Online
Venues: ACL | IJCNLP
Publisher: Association for Computational Linguistics
Pages: 4262–4274
URL: https://aclanthology.org/2021.acl-long.329
DOI: 10.18653/v1/2021.acl-long.329
Cite (ACL): Nedjma Ousidhoum, Xinran Zhao, Tianqing Fang, Yangqiu Song, and Dit-Yan Yeung. 2021. Probing Toxic Content in Large Pre-Trained Language Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4262–4274, Online. Association for Computational Linguistics.
Cite (Informal): Probing Toxic Content in Large Pre-Trained Language Models (Ousidhoum et al., ACL 2021)
PDF: https://preview.aclanthology.org/update-css-js/2021.acl-long.329.pdf
Data: ATOMIC, Hate Speech and Offensive Language
