Multitask-Bench: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning
Essa Jan, Nouar Aldahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, Yasir Zaki
Abstract
Recent breakthroughs in Large Language Models (LLMs) have led to their adoption across a wide range of tasks, from code generation to machine translation and sentiment analysis. Red-teaming and safety-alignment efforts show that fine-tuning models on benign (non-harmful) data can compromise safety. However, it remains unclear to what extent this phenomenon is influenced by different variables, such as the fine-tuning task and model calibration. This paper explores the task-wise safety degradation caused by fine-tuning on downstream tasks such as summarization, code generation, translation, and classification across various calibrations. Our results reveal that: 1) fine-tuning LLMs for code generation and translation leads to the highest degradation in safety guardrails; 2) LLMs generally have weaker guardrails for translation and classification, with 73–92% of harmful prompts answered across the baseline and other calibrations, falling into one of two concern categories; 3) current solutions, including guards and safety-tuning datasets, lack cross-task robustness. To address these issues, we developed a new multitask safety dataset that effectively reduces attack success rates across a range of tasks without compromising the model’s overall helpfulness. Our work underscores the need for generalized alignment measures to ensure safer and more robust models.
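The abstract's headline metric is the attack success rate (ASR). As a minimal sketch of how a per-task ASR could be computed (not the paper's actual evaluation pipeline), the record format and the `is_harmful_response` judge below are hypothetical; published safety evaluations typically use an LLM-based or human judge rather than the toy keyword judge shown in the example run:

```python
from collections import defaultdict

def attack_success_rate(records, is_harmful_response):
    """Per-task attack success rate (ASR): the fraction of harmful
    prompts for which the fine-tuned model produced a compliant
    (harmful) answer rather than a refusal."""
    totals, successes = defaultdict(int), defaultdict(int)
    for task, prompt, response in records:
        totals[task] += 1
        if is_harmful_response(prompt, response):
            successes[task] += 1
    return {task: successes[task] / totals[task] for task in totals}

# Toy run with a keyword judge (hypothetical data, for illustration only).
records = [
    ("translation", "harmful prompt A", "Sure, here is how..."),
    ("translation", "harmful prompt B", "I can't help with that."),
    ("code generation", "harmful prompt C", "Sure, here is how..."),
]
judge = lambda _prompt, response: response.lower().startswith("sure")
print(attack_success_rate(records, judge))
# {'translation': 0.5, 'code generation': 1.0}
```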
- Anthology ID: 2025.coling-main.606
- Volume: Proceedings of the 31st International Conference on Computational Linguistics
- Month: January
- Year: 2025
- Address: Abu Dhabi, UAE
- Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
- Venue: COLING
- Publisher: Association for Computational Linguistics
- Pages: 9025–9043
- URL: https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.606/
- Cite (ACL): Essa Jan, Nouar Aldahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, and Yasir Zaki. 2025. Multitask-Bench: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9025–9043, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal): Multitask-Bench: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning (Jan et al., COLING 2025)
- PDF: https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.606.pdf