Wenjie Jacky Mo

2025

pdf bib abs
Rethinking Backdoor Detection Evaluation for Language Models
Jun Yan | Wenjie Jacky Mo | Xiang Ren | Robin Jia
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Backdoor attacks, in which a model behaves maliciously when given an attacker-specified trigger, pose a major security risk for practitioners who depend on publicly released language models. As a countermeasure, backdoor detection methods aim to detect whether a released model contains a backdoor. While existing backdoor detection methods have high accuracy in detecting backdoored models on standard benchmarks, it is unclear whether they can robustly identify backdoors in the wild. In this paper, we examine the robustness of backdoor detectors by manipulating different factors during backdoor planting. We find that the success of existing methods based on trigger inversion or meta classifiers highly depends on how intensely the model is trained on poisoned data. Specifically, backdoors planted with more aggressive or more conservative training are significantly more difficult to detect than the default ones. Our results highlight a lack of robustness of existing backdoor detectors and the limitations in current benchmark construction.

Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of testing time defense. This gap becomes pronounced in the context of Large Language Models (LLMs) deployed as Web Services, which typically offer only black-box access, rendering training-time defenses impractical. To bridge this gap, this study critically examines the use of demonstrations as a defense mechanism against backdoor attacks in black-box LLMs. With an identified task, we retrieve task-relevant demonstrations from a clean data pool and integrate them with user queries during testing. Importantly, this approach does not necessitate modifications or tuning of the model, nor does it require insight into the model’s internal architecture. The alignment properties inherent in in-context learning play a pivotal role in mitigating the impact of backdoor triggers, effectively recalibrating the behavior of compromised models. Our experimental analysis demonstrates that this method robustly defends against both instance-level and instruction-level backdoor attacks, outperforming existing defense baselines across most evaluation scenarios.

pdf bib abs
ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails
Xiaofei Wen | Wenxuan Zhou | Wenjie Jacky Mo | Muhao Chen
Findings of the Association for Computational Linguistics: ACL 2025

Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuned on critique-augmented data, the captured deliberative thinking ability drastically enhances the guardrail’s cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.

Co-authors

Venues

findings2
emnlp1

Fix author