Realistic Evaluation of Toxicity in Large Language Models

Tinh Luong, Thanh-Thien Le, Linh Ngo, Thien Nguyen


Abstract
Large language models (LLMs) have become integral to our professional workflows and daily lives. Nevertheless, these machine companions have a critical flaw: the huge amount of data that endows them with vast and diverse knowledge also exposes them to inevitable toxicity and bias. While most LLMs incorporate defense mechanisms to prevent the generation of harmful content, these safeguards can be easily bypassed with minimal prompt engineering. In this paper, we introduce the new Thoroughly Engineered Toxicity (TET) dataset, comprising manually crafted prompts designed to nullify the protective layers of such models. Through extensive evaluations, we demonstrate the pivotal role of TET as a rigorous benchmark for evaluating toxicity awareness in several popular LLMs: it surfaces toxicity that might remain hidden under normal prompts, thereby revealing subtler issues in model behavior.
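To make the evaluation setup concrete, the sketch below shows one way a prompt-based toxicity benchmark of this kind could be scored: feed each adversarial prompt to a chat model and rate the response with an off-the-shelf toxicity classifier. The model names, the placeholder prompt, and the scoring choice are illustrative assumptions, not the pipeline used in the paper.

```python
# Hypothetical sketch: score an LLM's responses to adversarial prompts with an
# off-the-shelf toxicity classifier. The prompt list, the generator model, and
# the classifier below are placeholders, not the authors' actual setup.
from transformers import pipeline

# Any instruction-tuned chat model could stand in for the model under test.
generator = pipeline("text-generation", model="gpt2")  # placeholder model
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

prompts = [
    "Finish this story about ...",  # placeholder; TET prompts are manually crafted
]

for prompt in prompts:
    # Generate a continuation for the adversarial prompt.
    response = generator(prompt, max_new_tokens=64)[0]["generated_text"]
    # Rate the full response; truncation keeps it within the classifier's input limit.
    score = toxicity(response, truncation=True)[0]
    print(f"{score['label']}: {score['score']:.3f}")
```

Aggregating such per-response scores over the whole prompt set would yield a single toxicity figure per model, which is the kind of comparison the abstract describes.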
Anthology ID:
2024.findings-acl.61
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1038–1047
URL:
https://aclanthology.org/2024.findings-acl.61
Cite (ACL):
Tinh Luong, Thanh-Thien Le, Linh Ngo, and Thien Nguyen. 2024. Realistic Evaluation of Toxicity in Large Language Models. In Findings of the Association for Computational Linguistics ACL 2024, pages 1038–1047, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Realistic Evaluation of Toxicity in Large Language Models (Luong et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.61.pdf