Multitask-Bench: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning
Essa Jan, Nouar Aldahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, Yasir Zaki
Abstract
Recent breakthroughs in Large Language Models (LLMs) have led to their adoption across a wide range of tasks, from code generation to machine translation and sentiment analysis. Red-teaming and safety-alignment efforts show that fine-tuning models on benign (non-harmful) data can compromise safety. However, it remains unclear to what extent this phenomenon is influenced by different variables, such as the fine-tuning task and model calibration. This paper explores the task-wise safety degradation caused by fine-tuning on downstream tasks such as summarization, code generation, translation, and classification across various calibrations. Our results reveal that: 1) fine-tuning LLMs for code generation and translation leads to the highest degradation in safety guardrails; 2) LLMs generally have weaker guardrails for translation and classification, with 73–92% of harmful prompts answered across the baseline and other calibrations, falling into one of two concern categories; 3) current solutions, including guards and safety-tuning datasets, lack cross-task robustness. To address these issues, we developed a new multitask safety dataset that effectively reduces attack success rates across a range of tasks without compromising the model’s overall helpfulness. Our work underscores the need for generalized alignment measures to ensure safer and more robust models.
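The abstract's headline metric is the attack success rate (ASR). As a minimal sketch of how a per-task ASR could be computed (not the paper's actual evaluation pipeline), the record format and the `is_harmful_response` judge below are hypothetical; published safety evaluations typically use an LLM-based or human judge rather than the toy keyword judge shown in the example run:

```python
from collections import defaultdict

def attack_success_rate(records, is_harmful_response):
    """Per-task attack success rate (ASR): the fraction of harmful
    prompts for which the fine-tuned model produced a compliant
    (harmful) answer rather than a refusal."""
    totals, successes = defaultdict(int), defaultdict(int)
    for task, prompt, response in records:
        totals[task] += 1
        if is_harmful_response(prompt, response):
            successes[task] += 1
    return {task: successes[task] / totals[task] for task in totals}

# Toy run with a keyword judge (hypothetical data, for illustration only).
records = [
    ("translation", "harmful prompt A", "Sure, here is how..."),
    ("translation", "harmful prompt B", "I can't help with that."),
    ("code generation", "harmful prompt C", "Sure, here is how..."),
]
judge = lambda _prompt, response: response.lower().startswith("sure")
print(attack_success_rate(records, judge))
# {'translation': 0.5, 'code generation': 1.0}
```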
- Anthology ID: 2025.coling-main.606
- Volume: Proceedings of the 31st International Conference on Computational Linguistics
- Month: January
- Year: 2025
- Address: Abu Dhabi, UAE
- Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
- Venue: COLING
- Publisher: Association for Computational Linguistics
- Pages: 9025–9043
- URL: https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.606/
- Cite (ACL): Essa Jan, Nouar Aldahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, and Yasir Zaki. 2025. Multitask-Bench: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9025–9043, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal): Multitask-Bench: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning (Jan et al., COLING 2025)
- PDF: https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.606.pdf