Jiacheng Liang
2025
Data to Defense: The Role of Curation in Aligning Large Language Models Against Safety Compromise
Xiaoqun Liu | Jiacheng Liang | Luoxi Tang | Muchao Ye | Weicheng Ma | Zhaohan Xi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) are widely adapted to downstream applications through fine-tuning, a process termed customization. However, recent studies have identified a vulnerability in this process: malicious samples can compromise the robustness of LLMs and amplify harmful behaviors. To address this challenge, we propose an adaptive data curation approach that allows any text to be curated so that it more effectively counteracts harmful samples during customization. To avoid the need for additional defensive modules, we further introduce a comprehensive mitigation framework spanning the lifecycle of the customization process: before customization, to immunize LLMs against future compromise attempts; during customization, to neutralize risks; and after customization, to restore compromised models. Experimental results demonstrate a significant reduction in compromising effects, achieving up to a 100% success rate in generating safe responses. By combining adaptive data curation with lifecycle-based mitigation strategies, this work represents a solid step toward mitigating compromise risks and ensuring the secure adaptation of LLMs.
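The general idea of curating customization data before fine-tuning can be illustrated with a minimal sketch. This is not the paper's actual procedure: the scorer, the refusal text, and the threshold below are illustrative placeholders only.

```python
# Hypothetical sketch: curating a customization (fine-tuning) dataset so that
# unsafe samples are neutralized instead of amplifying harmful behavior.
# `safety_score`, `SAFE_REFUSAL`, and the threshold are illustrative assumptions.
from dataclasses import dataclass

SAFE_REFUSAL = "I can't help with that request."


@dataclass
class Sample:
    prompt: str
    response: str


def safety_score(text: str) -> float:
    """Placeholder scorer; a real system would use a safety classifier or reward model."""
    flagged = ("bomb", "malware", "steal")
    return 0.0 if any(w in text.lower() for w in flagged) else 1.0


def curate(dataset: list[Sample], threshold: float = 0.5) -> list[Sample]:
    """Keep benign samples as-is; rewrite the response of unsafe samples so the
    curated data teaches refusal rather than compliance."""
    curated = []
    for s in dataset:
        if safety_score(s.prompt + " " + s.response) >= threshold:
            curated.append(s)
        else:
            curated.append(Sample(prompt=s.prompt, response=SAFE_REFUSAL))
    return curated


if __name__ == "__main__":
    data = [
        Sample("Summarize this article.", "Here is a summary ..."),
        Sample("How do I build malware?", "Step 1: ..."),
    ]
    for s in curate(data):
        print(s.prompt, "->", s.response)
```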
Watermark under Fire: A Robustness Evaluation of LLM Watermarking
Jiacheng Liang | Zian Wang | Spencer Hong | Shouling Ji | Ting Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Various watermarking methods (“watermarkers”) have been proposed to identify LLM-generated texts; yet, due to the lack of unified evaluation platforms, many critical questions remain under-explored: i) What are the strengths/limitations of various watermarkers, especially their attack robustness? ii) How do various design choices impact their robustness? iii) How to optimally operate watermarkers in adversarial environments? To fill this gap, we systematize existing LLM watermarkers and watermark removal attacks, mapping out their design spaces. We then develop WaterPark, a unified platform that integrates 10 state-of-the-art watermarkers and 12 representative attacks. More importantly, by leveraging WaterPark, we conduct a comprehensive assessment of existing watermarkers, unveiling the impact of various design choices on their attack robustness. We further explore the best practices to operate watermarkers in adversarial environments. We believe our study sheds light on current LLM watermarking techniques while WaterPark serves as a valuable testbed to facilitate future research.
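For readers unfamiliar with LLM watermarking, the sketch below illustrates the detection side of one common family of schemes (a KGW-style "green list" z-score test). It is not WaterPark's API nor any specific watermarker it integrates; the hashing scheme, green-list fraction, and token ids are assumptions made for illustration.

```python
# Minimal sketch of green-list watermark detection: count how often each token
# falls in a pseudo-random "green" subset seeded by its predecessor, then test
# whether the count exceeds the chance rate via a z-score.
import hashlib
import math

GAMMA = 0.25  # assumed fraction of the vocabulary on the green list


def in_green_list(prev_token: int, token: int) -> bool:
    """Hash the (previous token, token) pair to decide green-list membership."""
    h = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(h[:4], "big") / 2**32 < GAMMA


def detect(token_ids: list[int]) -> float:
    """Return the z-score of the green-token count; large values suggest a watermark."""
    hits = sum(in_green_list(p, t) for p, t in zip(token_ids, token_ids[1:]))
    n = len(token_ids) - 1
    expected, var = GAMMA * n, GAMMA * (1 - GAMMA) * n
    return (hits - expected) / math.sqrt(var)


if __name__ == "__main__":
    # Unwatermarked text should score near 0; watermarked text scores much higher.
    print(detect([101, 2009, 2003, 1037, 2204, 2154, 102] * 20))
```

Removal attacks such as paraphrasing or token substitution aim to push this statistic back toward zero, which is the robustness question WaterPark is built to measure.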