InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning

Zhexin Zhang, Jiale Cheng, Hao Sun, Jiawen Deng, Minlie Huang


Abstract
Safety detection has become increasingly important in recent years, and the rapid development of large language models makes reliable safety detection systems even more necessary. However, currently available safety detection systems are limited in versatility and interpretability. In this paper, we first introduce InstructSafety, a safety detection framework that unifies 7 common safety detection sub-tasks by casting them into a similar form through task-specific instructions. We then conduct a comprehensive survey of existing safety detection datasets and process 39 human-annotated datasets for instruction tuning. We also construct adversarial samples to enhance the model’s robustness. By fine-tuning Flan-T5 on the collected data, we develop Safety-Flan-T5, a multidimensional and explainable safety detector. Comprehensive experiments on a variety of datasets and tasks demonstrate the strong performance of Safety-Flan-T5 compared with supervised baselines and served APIs (Perspective API, ChatGPT and InstructGPT). We will release the processed data, the fine-tuned Safety-Flan-T5 and related code for public use.
Anthology ID:
2023.findings-emnlp.700
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10421–10436
URL:
https://aclanthology.org/2023.findings-emnlp.700
DOI:
10.18653/v1/2023.findings-emnlp.700
Cite (ACL):
Zhexin Zhang, Jiale Cheng, Hao Sun, Jiawen Deng, and Minlie Huang. 2023. InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10421–10436, Singapore. Association for Computational Linguistics.
Cite (Informal):
InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning (Zhang et al., Findings 2023)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2023.findings-emnlp.700.pdf