Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders

Jinseok Kim, Jaewon Jung, Sangyeop Kim, Sohhyung Park, Sungzoon Cho


Abstract
Despite the impressive capabilities of Large Language Models (LLMs) in various tasks, their vulnerability to unsafe prompts remains a critical issue. These prompts can lead LLMs to generate responses on illegal or sensitive topics, posing a significant threat to their safe and ethical use. Existing approaches address this issue using classification models, divided into LLM-based and API-based methods. LLM-based models demand substantial resources and large datasets, whereas API-based models are cost-effective but might overlook linguistic nuances. With the increasing complexity of unsafe prompts, similarity search-based techniques that identify specific features of unsafe content provide a more robust and effective solution to this evolving problem. This paper investigates the potential of sentence encoders to distinguish safe from unsafe content. We introduce new pairwise datasets and the Categorical Purity (CP) metric to measure this capability. Our findings reveal both the effectiveness and limitations of existing sentence encoders, and we propose directions for improving sentence encoders so that they operate as robust safety detectors.
Anthology ID:
2024.knowllm-1.13
Volume:
Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Sha Li, Manling Li, Michael JQ Zhang, Eunsol Choi, Mor Geva, Peter Hase, Heng Ji
Venues:
KnowLLM | WS
Publisher:
Association for Computational Linguistics
Pages:
156–170
URL:
https://aclanthology.org/2024.knowllm-1.13
DOI:
10.18653/v1/2024.knowllm-1.13
Cite (ACL):
Jinseok Kim, Jaewon Jung, Sangyeop Kim, Sohhyung Park, and Sungzoon Cho. 2024. Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders. In Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024), pages 156–170, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders (Kim et al., KnowLLM-WS 2024)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2024.knowllm-1.13.pdf