Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
Hao Li, Lijun Li, Zhenghao Lu, Xianyi Wei, Rui Li, Jing Shao, Lei Sha
Abstract
With the rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step in adapting them to real-world applications, which makes the safety of the fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even fine-tuning on seemingly benign downstream datasets can compromise the safety of aligned LLMs, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. LARF identifies safety-sensitive layers within the LLM and leverages their representations to detect which samples in the post-training dataset carry safety-degrading features. Experimental results demonstrate that LARF effectively identifies benign data with safety-degrading features; after such data is removed, the safety alignment degradation caused by fine-tuning is mitigated.
- Anthology ID: 2025.emnlp-main.406
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 8041–8061
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.406/
- Cite (ACL): Hao Li, Lijun Li, Zhenghao Lu, Xianyi Wei, Rui Li, Jing Shao, and Lei Sha. 2025. Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8041–8061, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment (Li et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.406.pdf
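For readers who want to experiment with the idea described in the abstract, below is a minimal sketch of representation-based data filtering in the spirit of LARF. It is not the authors' implementation: the model name, reference prompt sets, difference-of-means scoring, layer-selection heuristic, and drop fraction are all illustrative assumptions; the paper's actual layer-identification and filtering criteria are given in the PDF above.

```python
# Minimal sketch of representation-based fine-tuning data filtering in the
# spirit of LARF. All specifics below (model, reference prompts, the
# difference-of-means direction, the layer-selection heuristic, and the drop
# fraction) are assumptions for illustration, not the paper's procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed aligned model

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def layer_rep(text: str, layer: int) -> torch.Tensor:
    """Mean-pooled hidden state of `text` at a given transformer layer."""
    ids = tok(text, return_tensors="pt").to(model.device)
    out = model(**ids, output_hidden_states=True)
    # hidden_states[0] holds the embeddings; 1..n are the transformer layers.
    return out.hidden_states[layer].mean(dim=1).squeeze(0)


def safety_direction(harmful: list[str], benign: list[str], layer: int) -> torch.Tensor:
    """Difference-of-means direction separating harmful from benign prompts."""
    h = torch.stack([layer_rep(t, layer) for t in harmful]).mean(0)
    b = torch.stack([layer_rep(t, layer) for t in benign]).mean(0)
    return h - b


def pick_sensitive_layer(harmful: list[str], benign: list[str]) -> int:
    """Pick the layer where harmful and benign representations separate most.
    A simple stand-in for the paper's safety-sensitive-layer criterion."""
    layers = range(1, model.config.num_hidden_layers + 1)
    return max(layers, key=lambda i: safety_direction(harmful, benign, i).norm().item())


def filter_dataset(samples: list[str], harmful: list[str], benign: list[str],
                   drop_frac: float = 0.05) -> list[str]:
    """Drop the samples whose representations align most strongly with the
    safety-degrading direction at the sensitive layer."""
    layer = pick_sensitive_layer(harmful, benign)
    d = safety_direction(harmful, benign, layer)
    d = d / d.norm()
    scores = [torch.dot(layer_rep(s, layer), d).item() for s in samples]
    k = int(len(samples) * drop_frac)
    flagged = set(sorted(range(len(samples)), key=lambda i: -scores[i])[:k])
    return [s for i, s in enumerate(samples) if i not in flagged]
```

The difference-of-means probe is a common way to approximate a safety-relevant direction in hidden states; the point the abstract makes is that scoring should happen at a safety-sensitive layer rather than at a fixed depth, since benign-looking but safety-degrading samples separate from truly benign ones only at certain layers.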