Zirui Hu

2026

Large language models (LLMs) have achieved remarkable performance across diverse tasks, largely driven by large-scale pretraining. However, this data abundance introduces test data contamination, where benchmark datasets overlap with pretraining corpora, undermining the reliability of model evaluation by confounding memorization with genuine generalization. To mitigate this issue, existing training data detectors attempt to identify clean (unseen) samples from contaminated test sets, but often suffer from residual contamination due to the black-box nature of LLMs. As a result, contaminated data may be mistakenly retained, leading to unreliable evaluation. To address this challenge, we propose FTD (FDR-controlled Training Data detection), a principled framework that detects and filters contaminated evaluation data while providing a statistical guarantee: the proportion of contaminated samples mistakenly retained as clean, the false discovery rate (FDR), is provably controlled below a user-specified threshold. FTD combines multiple complementary detectors via an adaptive weighting strategy, and we theoretically show it achieves high statistical power under valid FDR control. Extensive experiments on real-world benchmarks demonstrate that FTD significantly reduces residual contamination compared to existing methods while preserving evaluation consistency.

SAFER: A Controllable Safeguard for LLMs against Backdoor Attacks
Zirui Hu | Zheng Zhang | Yingjie Wang | Dacheng Tao
Findings of the Association for Computational Linguistics: ACL 2026

Large language models (LLMs) have achieved remarkable performance across a wide range of natural language processing (NLP) tasks. However, they remain susceptible to backdoor attacks, where adversaries embed hidden triggers in the input to induce malicious, attacker-specified behaviors. While existing inference-time defenses aim to mitigate such threats by detecting and filtering poisoned inputs, they often lack explicit control over the false acceptance rate (FAR)—a critical requirement in safety-sensitive settings where even rare failures can lead to catastrophic consequences. To address this challenge, we propose SAFER, a novel inference-time defense framework that provides explicit and provable control over FAR without requiring prior knowledge of backdoor samples. SAFER leverages distributional information from available data to estimate the likelihood that an input is clean and selects inputs accordingly. From a theoretical perspective, we demonstrate that SAFER asymptotically guarantees control of the true FAR. Empirical evaluations on three benchmark datasets across diverse backdoor attack scenarios show that SAFER consistently achieves reliable FAR control while maintaining high detection power, significantly outperforming existing inference-time defenses.

Co-authors

Rui Li 1

Qi Liu 1

Venues

ACL 1
Findings 1
