InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning
Zhexin Zhang, Jiale Cheng, Hao Sun, Jiawen Deng, Minlie Huang
Abstract
Safety detection has been an increasingly important topic in recent years and it has become even more necessary to develop reliable safety detection systems with the rapid development of large language models. However, currently available safety detection systems have limitations in terms of their versatility and interpretability. In this paper, we first introduce InstructSafety, a safety detection framework that unifies 7 common sub-tasks for safety detection. These tasks are unified into a similar form through different instructions. We then conduct a comprehensive survey of existing safety detection datasets and process 39 human-annotated datasets for instruction tuning. We also construct adversarial samples to enhance the model’s robustness. After fine-tuning Flan-T5 on the collected data, we have developed Safety-Flan-T5, a multidimensional and explainable safety detector. We conduct comprehensive experiments on a variety of datasets and tasks, and demonstrate the strong performance of Safety-Flan-T5 in comparison to supervised baselines and served APIs (Perspective API, ChatGPT and InstructGPT). We will release the processed data, fine-tuned Safety-Flan-T5 and related code for public use.
- Anthology ID:
- 2023.findings-emnlp.700
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2023
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Houda Bouamor, Juan Pino, Kalika Bali
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 10421–10436
- URL:
- https://aclanthology.org/2023.findings-emnlp.700
- DOI:
- 10.18653/v1/2023.findings-emnlp.700
- Cite (ACL):
- Zhexin Zhang, Jiale Cheng, Hao Sun, Jiawen Deng, and Minlie Huang. 2023. InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10421–10436, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning (Zhang et al., Findings 2023)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2023.findings-emnlp.700.pdf