2025
pdf
bib
abs
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations
Huaizhi Ge
|
Yiming Li
|
Qifan Wang
|
Yongfeng Zhang
|
Ruixiang Tang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) are known to be vulnerable to backdoor attacks, where triggers embedded in poisoned samples can maliciously alter LLMs’ behaviors. In this paper, we move beyond attacking LLMs and instead examine backdoor attacks through the novel lens of natural language explanations. Specifically, we leverage LLMs’ generative capabilities to produce human-readable explanations for their decisions, enabling direct comparisons between explanations for clean and poisoned samples. Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data, a pattern consistent across classification and generation tasks for different backdoor attacks. Further analysis reveals key insights into the explanation generation process. At the token level, explanation tokens associated with poisoned samples only appear in the final few transformer layers. At the sentence level, attention dynamics indicate that poisoned inputs shift attention away from the original input context during explanation generation. These findings enhance our understanding of backdoor mechanisms in LLMs and present a promising framework for detecting vulnerabilities through explainability.
pdf
bib
abs
Understanding the Dark Side of LLMs’ Intrinsic Self-Correction
Qingjie Zhang
|
Di Wang
|
Haoting Qian
|
Yiming Li
|
Tianwei Zhang
|
Minlie Huang
|
Ke Xu
|
Hewu Li
|
Liu Yan
|
Han Qiu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Intrinsic self-correction was initially proposed to improve LLMs’ responses via feedback solely based on their inherent capability. However, recent works show that LLMs’ intrinsic self-correction fails without oracle labels as feedback. In this paper, our research goal is to *interpret LLMs’ intrinsic self-correction for different tasks, especially for those failure cases.* By including one simple task and three complex tasks with state-of-the-art (SOTA) LLMs like ChatGPT, Llama, and DeepSeek, we design three interpretation methods to reveal the dark side of LLMs’ intrinsic self-correction. We identify intrinsic self-correction can (1) cause LLMs to waver both intermedia and final answers and lead to prompt bias on simple factual questions; (2) introduce human-like cognitive bias on complex tasks. In light of our findings, we also provide two simple yet effective strategies for alleviation: question repeating and supervised fine-tuning with a few samples. We open-source our work at https://x-isc.info/.
pdf
bib
abs
Leveraging Large Language Models for Conversational Multi-Doc Question Answering: The First Place of WSDM Cup 2024
Yiming Li
|
Zhao Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Conversational multi-doc question answering aims to answer specific questions based on the retrieved documents as well as the contextual conversations. In this paper, we introduce our winning approach for the “Conversational Multi-Doc QA” challenge in WSDM Cup 2024, which exploits the superior natural language understanding and generation capability of Large Language Models (LLMs). We first adapt LLMs to the task, then devise a hybrid training strategy to make the most of in-domain unlabeled data. Moreover, an advanced text embedding model is adopted to filter out potentially irrelevant documents, and several approaches are designed and compared for the model ensemble. Equipped with all these techniques, our solution finally ranked 1st place in WSDM Cup 2024, surpassing its rivals to a large extent. The source codes have been released at https://github.com/zhangzhao219/WSDM-Cup-2024.
pdf
bib
abs
Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models
Haoyang Li
|
Xuejia Chen
|
Zhanchao Xu
|
Darian Li
|
Nicole Hu
|
Fei Teng
|
Yiming Li
|
Luyu Qiu
|
Chen Jason Zhang
|
Li Qing
|
Lei Chen
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing tasks, such as text generation and semantic understanding. However, their performance on numerical reasoning tasks, such as basic arithmetic, numerical retrieval, and magnitude comparison, remains surprisingly poor. This gap arises from their reliance on surface-level statistical patterns rather than understanding numbers as continuous magnitudes. Existing benchmarks primarily focus on either linguistic competence or structured mathematical problem-solving, neglecting fundamental numerical reasoning required in real-world scenarios. To bridge this gap, we propose NumericBench, a comprehensive benchmark to evaluate six fundamental numerical capabilities: number recognition, arithmetic operations, contextual retrieval, comparison, summary, and multi-step reasoning. NumericBench includes datasets ranging from synthetic number lists to crawled real-world data, addressing challenges like long contexts, noise, and multi-step reasoning. Extensive experiments on state-of-the-art LLMs, including GPT-4 and DeepSeek, reveal persistent weaknesses in numerical reasoning, highlighting the urgent need to improve numerically-aware language modeling. The benchmark is released in: https://github.com/TreeAI-Lab/NumericBench.
2024
pdf
bib
abs
BadActs: A Universal Backdoor Defense in the Activation Space
Biao Yi
|
Sishuo Chen
|
Yiming Li
|
Tong Li
|
Baolei Zhang
|
Zheli Liu
Findings of the Association for Computational Linguistics: ACL 2024
Backdoor attacks pose an increasingly severe security threat to Deep Neural Networks (DNNs) during their development stage. In response, backdoor sample purification has emerged as a promising defense mechanism, aiming to eliminate backdoor triggers while preserving the integrity of the clean content in the samples. However, existing approaches have been predominantly focused on the word space, which are ineffective against feature-space triggers and significantly impair performance on clean data. To address this, we introduce a universal backdoor defense that purifies backdoor samples in the activation space by drawing abnormal activations towards optimized minimum clean activation distribution intervals. The advantages of our approach are twofold: (1) By operating in the activation space, our method captures from surface-level information like words to higher-level semantic concepts such as syntax, thus counteracting diverse triggers; (2) the fine-grained continuous nature of the activation space allows for more precise preservation of clean content while removing triggers. Furthermore, we propose a detection module based on statistical information of abnormal activations, to achieve a better trade-off between clean accuracy and defending performance. Extensive experiments on diverse datasets and against diverse attacks (including syntax and style attacks) demonstrate that our defense achieves state-of-the-art performance.
pdf
bib
abs
LLM-Driven Knowledge Injection Advances Zero-Shot and Cross-Target Stance Detection
Zhao Zhang
|
Yiming Li
|
Jin Zhang
|
Hui Xu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Stance detection aims at inferring an author’s attitude towards a specific target in a text. Prior methods mainly consider target-related background information for a better understanding of targets while neglecting the accompanying input texts. In this study, we propose to prompt Large Language Models (LLMs) to explicitly extract the relationship between paired text and target as contextual knowledge. We then inject such LLM-driven knowledge into a generation model BART to exploit the rich contexts and semantics. Moreover, to further enhance the decoding capability of BART, a novel prototypical contrastive scheme is designed to align input contents with stance labels. Our experimental results demonstrate the state-of-the-art performance across several publicly available datasets, showcasing effectiveness in both zero-shot and cross-target stance detection scenarios. We publicly release our code to facilitate future research.