Yanhao Jia
2026
Rethinking Reasoning: A Survey on Reasoning-based Backdoors in LLMs
Man Hu | Xinyi Wu | Zhufeng Suo | Jinbo Feng | Linghui Meng | Yanhao Jia | Anh Tuan Luu | Shuai Zhao
Findings of the Association for Computational Linguistics: ACL 2026
With the rise of advanced reasoning capabilities, large language models (LLMs) are receiving increasing attention. While reasoning enhances LLMs’ performance on downstream tasks, it also introduces new threat vectors, as adversaries can leverage these capabilities to conduct backdoor attacks. Prior surveys provide broad overviews of backdoor attacks and reasoning security; however, a systematic survey focused on backdoor attacks and defenses against LLM reasoning is still absent. In this paper, we take the first step toward providing a comprehensive review of reasoning-based backdoor attacks in LLMs by analyzing their underlying mechanisms, methodological frameworks, and unresolved challenges. Specifically, we introduce a new taxonomy that offers a unified perspective for summarizing existing approaches, categorizing reasoning-based backdoor attacks into associative, passive, and active. We also summarize defenses against such attacks and discuss current challenges alongside future research directions.
P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
Shuai Zhao | Xinyi Wu | Shiqian Zhao | Xiaobao Wu | Zhongliang Guo | Yanhao Jia | Anh Tuan Luu
Findings of the Association for Computational Linguistics: ACL 2026
Defending Large Language Models (LLMs) against backdoor attacks has long been trapped in a "cat-and-mouse" dilemma, where defenders passively react to ever-shifting attack strategies. To break this cycle, we posit that proactive immunization is inherently superior to reactive sanitization. In this study, we propose Poison-to-Poison (P2P), a general and effective defense algorithm that introduces a paradigm shift. Instead of waiting to detect malicious samples, P2P strategically implants benign triggers to reshape the model’s decision boundary, redirecting latent feature activation from malicious trajectories to a safe, controllable output space. This forces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of the original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summary generation tasks, involving multiple state-of-the-art LLMs. The results demonstrate that our P2P algorithm significantly reduces the attack success rate compared with baseline models. We hope that P2P can serve as a practical guideline for defending against backdoor attacks in the Model as a Service (MaaS) scenario, where benign prompts are embedded within the system to regulate model behavior.
2025
Uni-Retrieval: A Multi-Style Retrieval Framework for STEM’s Education
Yanhao Jia | Xinyi Wu | Li Hao | Qinglin Zhang | Yuxiao Hu | Shuai Zhao | Wenqi Fan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In AI-facilitated teaching, leveraging various query styles to interpret abstract text descriptions is crucial for ensuring high-quality teaching. However, current retrieval models primarily focus on natural text-image retrieval, making them insufficiently tailored to educational scenarios due to the ambiguities in the retrieval process. In this paper, we propose a diverse expression retrieval task tailored to educational scenarios, supporting retrieval based on multiple query styles and expressions. We introduce the STEM Education Retrieval Dataset (SER), which contains over 24,000 query pairs of different styles, and Uni-Retrieval, an efficient and style-diversified retrieval vision-language model based on prompt tuning. Uni-Retrieval extracts query style features as prototypes and builds a continuously updated Prompt Bank containing prompt tokens for diverse queries. This bank can be updated at test time to represent domain-specific knowledge for different subject retrieval scenarios. Our framework demonstrates scalability and robustness by dynamically retrieving prompt tokens based on prototype similarity, effectively facilitating learning for unknown queries. Experimental results indicate that Uni-Retrieval outperforms existing retrieval models in most retrieval tasks.
Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation
Shuai Zhao | Xiaobao Wu | Cong-Duy T Nguyen | Yanhao Jia | Meihuizi Jia | Feng Yichao | Anh Tuan Luu
Findings of the Association for Computational Linguistics: ACL 2025
Parameter-efficient fine-tuning (PEFT) can bridge the gap between large language models (LLMs) and downstream tasks. However, PEFT has been proven vulnerable to malicious attacks. Research indicates that poisoned LLMs, even after PEFT, retain the capability to activate internalized backdoors when input samples contain predefined triggers. In this paper, we introduce a novel weak-to-strong unlearning algorithm to defend against backdoor attacks based on feature alignment knowledge distillation, named W2SDefense. Specifically, we first train a small-scale language model through full-parameter fine-tuning to serve as the clean teacher model. Then, this teacher model guides the large-scale poisoned student model in unlearning the backdoor, leveraging PEFT. Theoretical analysis suggests that W2SDefense has the potential to enhance the student model’s ability to unlearn backdoor features, preventing the activation of the backdoor. We conduct comprehensive experiments on three state-of-the-art large language models and several different backdoor attack algorithms. Our empirical results demonstrate the outstanding performance of W2SDefense in defending against backdoor attacks without compromising model performance.