2025
AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models
Shilong Pan | Zhiliang Tian | Zhen Huang | Wanlong Yu | Zhihua Wen | Xinwang Liu | Kai Lu | Minlie Huang | Dongsheng Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
LLMs demonstrate remarkable utility but remain vulnerable to jailbreak attacks that aim to elicit harmful responses. Existing defenses, including post-training alignment and prompt engineering, rely on training with safety-annotated datasets or on safe prompt templates, and thus adapt poorly to out-of-distribution (OOD) attacks. Steering the internal representations of LLMs enables real-time adjustment against OOD attacks, but it struggles to maintain model utility, since modifying representations disrupts the forward pass of inference, and it largely ignores the competing objectives of helpfulness and harmlessness. We argue that adversarial game-based approaches offer a principled way to resolve the conflict between these two objectives. In this paper, we propose **A**dversarial **G**ame **D**efense (AGD), an adversarial game-based defense method that dynamically adjusts LLMs’ internal representations to achieve a balanced trade-off between helpfulness and harmlessness. AGD first applies an interquartile range (IQR) method to detect abnormal attention weights and corrects them via adversarial training. AGD then adopts bi-level optimization to play a two-player variable-sum game approaching a Nash Equilibrium (NE), in which the two players adversarially refine head activations for helpfulness and harmlessness, respectively. Furthermore, AGD applies an expert model to next-token sampling to generate safer responses. Experiments show that AGD significantly improves LLMs’ safety over all baselines.
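To make the abnormal-attention detection step concrete, here is a minimal sketch of a Tukey-style IQR outlier test over per-head mean attention weights. Pooling each head by its mean and the multiplier k=1.5 are illustrative assumptions, not details from the paper.

```python
import numpy as np

def iqr_abnormal_heads(head_weights, k=1.5):
    """Flag heads whose mean attention weight falls outside the fence
    [Q1 - k*IQR, Q3 + k*IQR]. k=1.5 is the conventional Tukey choice,
    an assumption rather than a detail from the paper."""
    scores = np.array([w.mean() for w in head_weights])
    q1, q3 = np.percentile(scores, [25, 75])
    fence = k * (q3 - q1)
    return np.where((scores < q1 - fence) | (scores > q3 + fence))[0]

# Toy usage: seven ordinary heads and one with inflated weights.
heads = [np.full((4, 4), 0.050 + 0.001 * i) for i in range(7)]
heads.append(np.full((4, 4), 0.9))
print(iqr_abnormal_heads(heads))  # -> [7]
```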
LLM-based Rumor Detection via Influence Guided Sample Selection and Game-based Perspective Analysis
Zhiliang Tian | Jingyuan Huang | Zejiang He | Zhen Huang | Menglong Lu | Linbo Qiao | Songzhu Mei | Yijie Wang | Dongsheng Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Rumor detection on social media has become an emerging research topic. Traditional deep learning-based methods model rumors based on content, propagation structure, or user behavior, but these approaches are constrained by limited modeling capacity and insufficient training corpora. Recent studies have explored using LLMs for rumor detection through supervised fine-tuning (SFT), but they face two issues: 1) unreliable samples can mislead model learning; 2) the model learns only the most salient input-output mapping and skips in-depth analysis of the rumor content. To address these issues, we propose an SFT-based LLM rumor detection model with Influence guided Sample selection and Game-based multi-perspective Analysis (ISGA). Specifically, we first introduce the Influence Score (IS) to assess the impact of each sample on model predictions and select samples for SFT. We also approximate IS via a Taylor expansion to reduce the computational complexity. Next, we use LLMs to generate in-depth analyses of news content from multiple perspectives and model their collaborative contribution to the prediction as a cooperative game. We then use the Shapley value to quantify the contribution of each perspective and select informative perspective analyses. Experiments show that ISGA outperforms existing SOTA methods on three datasets.
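The Shapley value step is standard cooperative game theory and is exactly computable when the number of perspectives is small. Below is a minimal sketch; the perspective names and the accuracy-based characteristic function are hypothetical stand-ins for the paper's setup.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley value of each player (perspective) under a
    characteristic function v: frozenset -> float."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(len(others) + 1):
            for coal in combinations(others, r):
                s = frozenset(coal)
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (v(s | {p}) - v(s))  # marginal contribution
        phi[p] = total
    return phi

# Hypothetical accuracies of perspective subsets (illustration only).
acc = {frozenset(): 0.5,
       frozenset({"content"}): 0.7,
       frozenset({"style"}): 0.6,
       frozenset({"content", "style"}): 0.8}
print(shapley_values(["content", "style"], acc.__getitem__))
# content contributes about 0.2, style about 0.1; they sum to v(N) - v(empty).
```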
GCML: Gradient Coherence Guided Meta-Learning for Cross-Domain Emerging Topic Rumor Detection
Zejiang He | Jingyuan Huang | Menglong Lu | Zhen Huang | Shanshan Liu | Zhiliang Tian | Dongsheng Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
With the emergence of new topics on social media as sources of rumor propagation, addressing the domain shift between source and target domains and the scarcity of target-domain samples remains a crucial task in cross-domain rumor detection. Traditional deep learning-based methods and LLM-based methods mostly focus on the in-domain setting and thus perform poorly in the cross-domain setting. Existing domain-adaptation rumor detection approaches ignore differences in how well tasks generalize and rely on large amounts of unlabeled target-domain samples to achieve adaptation, making them less effective for emerging-topic rumor detection. In this paper, we propose a Gradient Coherence guided Meta-Learning approach (GCML) for emerging-topic rumor detection. Firstly, we calculate a task generalization score for each source task (sampled from the source domain) from a gradient coherence perspective and selectively learn the more “generalizable” tasks that are more beneficial for adapting to the target domain. Secondly, we leverage meta-learning to alleviate the scarcity of target-domain samples, using the task generalization scores to re-weight meta-test gradients and adaptively update the learning rate. Extensive experiments on real-world datasets show that our method substantially outperforms SOTA baselines.
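As a rough illustration of scoring source tasks by gradient coherence, the sketch below rates each task by the cosine similarity between its gradient and the mean gradient of the remaining tasks, then re-weights per-task gradients with softmax-normalized scores. This alignment proxy and the softmax weighting are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def task_generalization_scores(task_grads):
    """Score each source task by the cosine similarity between its
    gradient and the mean gradient of the other tasks: tasks whose
    updates agree with the rest are treated as more generalizable."""
    g = np.asarray(task_grads, dtype=float)
    scores = []
    for i in range(len(g)):
        rest = np.delete(g, i, axis=0).mean(axis=0)
        cos = g[i] @ rest / (np.linalg.norm(g[i]) * np.linalg.norm(rest) + 1e-12)
        scores.append(cos)
    return np.array(scores)

# Re-weight per-task gradients with softmax-normalized scores.
grads = np.array([[1.0, 0.2], [0.9, 0.1], [-1.0, 0.3]])  # task 2 conflicts
s = task_generalization_scores(grads)
w = np.exp(s) / np.exp(s).sum()
print(s.round(3), (w[:, None] * grads).sum(axis=0).round(3))
```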
Correlation-Aware Example Selection for In-Context Learning with Nonsymmetric Determinantal Point Processes
Qiunan Du | Zhiliang Tian | Zhen Huang | Kailun Bian | Tianlun Liu | Zhaoning Zhang | Xinwang Liu | Feng Liu | Dongsheng Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
LLMs with in-context learning (ICL) achieve remarkable performance but are sensitive to the quality of ICL examples. Prior work on ICL example selection has explored unsupervised heuristic methods and supervised LLM-based methods, but it typically focuses on selecting individual examples and ignores correlations among them. Researchers have used the determinantal point process (DPP) to model negative correlations among examples and thereby select diverse examples. However, the DPP cannot model positive correlations, while ICL still requires positively correlated examples to ensure their consistency, which provides clear instruction to the LLM. In this paper, we propose an ICL example selection method based on the nonsymmetric determinantal point process (NDPP), which captures both positive and negative correlations and thus considers both the diversity and the relevance of ICL examples. Specifically, we optimize the NDPP via kernel decomposition-based MLE to fit a constructed pseudo-labeled dataset, where we also propose a low-rank decomposition to reduce the computational cost. Further, we perform query-aware kernel adaptation on our NDPP to customize it to the input query, and we select examples via MAP inference on the adapted NDPP. Experimental results show that our model outperforms strong baselines in ICL example selection.
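A common way to realize MAP inference for (N)DPP-based selection is greedy determinant maximization, sketched below. The toy kernel decomposes into a symmetric part plus a skew-symmetric part, the standard NDPP construction for encoding positive correlations; the specific numbers are illustrative, not from the paper.

```python
import numpy as np

def greedy_ndpp_map(L, k):
    """Greedy MAP: repeatedly add the example whose inclusion maximizes
    det(L_S) over the selected index set S. Valid for a nonsymmetric
    kernel whose principal minors are positive."""
    n, selected = L.shape[0], []
    for _ in range(k):
        best_j, best_det = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            idx = selected + [j]
            d = np.linalg.det(L[np.ix_(idx, idx)])
            if d > best_det:
                best_j, best_det = j, d
        selected.append(best_j)
    return selected

# Kernel = symmetric part (relevance/diversity) + skew-symmetric part
# (positive correlation between examples 0 and 1); numbers are toy.
S = np.array([[1.0, 0.3, 0.3], [0.3, 1.0, 0.9], [0.3, 0.9, 1.0]])
A = np.array([[0.0, 0.4, 0.0], [-0.4, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(greedy_ndpp_map(S + A, k=2))  # -> [0, 1]: the skew part makes 0 and 1 jointly likely
```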
MONTROSE: LLM-driven Monte Carlo Tree Search Self-Refinement for Cross-Domain Rumor Detection
Shanshan Liu | Menglong Lu | Zhen Huang | Zejiang He | Liu Liu | Zhigang Sun | Dongsheng Li
Findings of the Association for Computational Linguistics: ACL 2025
With the emergence of new topics on social media as sources of rumor dissemination, addressing the distribution shift between source and target domains remains a crucial task in cross-domain rumor detection. Existing feature alignment methods, which aim to reduce discrepancies between domains, are often susceptible to task interference during training. Data distribution alignment methods, which synthesize new training samples from existing data, inherently introduce noise. To deal with these challenges, we propose a new cross-domain rumor detection method, MONTROSE. It combines LLM-driven Monte Carlo Tree Search (MCTS) data synthesis, which generates high-quality synthetic data for the target domain, with a domain-sharpness-aware (DSAM) self-refinement approach that effectively trains rumor detection models on these synthetic data. Experiments demonstrate the superior performance of MONTROSE in cross-domain rumor detection.
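For readers unfamiliar with MCTS, the skeleton below shows the selection (UCB1) and backpropagation steps around which an LLM-driven synthesis loop could be built; the expansion step, where an LLM would rewrite a candidate sample and a judge would score it, is only indicated in comments. This is a generic sketch, not MONTROSE's actual search procedure.

```python
import math

class Node:
    """One candidate synthetic sample in the search tree."""
    def __init__(self, text, parent=None):
        self.text, self.parent = text, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct_select(node, c=1.4):
    # UCB1: balance high-reward rewrites against rarely tried ones.
    return max(node.children, key=lambda ch:
               ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def backpropagate(node, reward):
    # reward: e.g., a judge model's quality score for the sample
    # produced at a leaf (the LLM expansion step, omitted here).
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

root = Node("seed post")
root.children = [Node("rewrite A", root), Node("rewrite B", root)]
backpropagate(root.children[0], 0.9)
backpropagate(root.children[1], 0.4)
print(uct_select(root).text)  # -> rewrite A (higher average reward)
```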
2023
DaMSTF: Domain Adversarial Learning Enhanced Meta Self-Training for Domain Adaptation
Menglong Lu | Zhen Huang | Yunxiang Zhao | Zhiliang Tian | Yang Liu | Dongsheng Li
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Self-training has emerged as an important line of research on domain adaptation. By taking the model’s predictions as pseudo labels for the unlabeled data, self-training bootstraps the model with pseudo instances from the target domain. However, prediction errors in the pseudo labels (label noise) limit the performance of self-training. To address this problem, previous approaches retrain the model only on reliable pseudo instances, i.e., those with high prediction confidence. Although such strategies effectively reduce label noise, they are prone to missing hard examples. In this paper, we propose a new self-training framework for domain adaptation, namely the Domain adversarial learning enhanced Self-Training Framework (DaMSTF). Firstly, DaMSTF uses meta-learning to estimate the importance of each pseudo instance, so as to simultaneously reduce label noise and preserve hard examples. Secondly, we design a meta constructor for building the meta-validation set, which guarantees the effectiveness of the meta-learning module by improving the quality of that set. Thirdly, we find that the meta-learning module suffers from training guidance vanishment and tends to converge to an inferior optimum; to this end, we employ domain adversarial learning as a heuristic neural network initialization method that helps the meta-learning module converge to a better optimum. We demonstrate the effectiveness of DaMSTF both theoretically and experimentally: on cross-domain sentiment classification, DaMSTF improves the performance of BERT by an average of nearly 4%.
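A minimal sketch of meta-learned instance weighting, in the spirit of one-step meta-reweighting on a toy logistic-regression model: each pseudo instance is weighted by how well its loss gradient aligns with the gradient on the meta-validation set. DaMSTF's actual module is learned jointly with domain adversarial initialization; this only illustrates the core idea.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def instance_weights(Xp, yp, Xv, yv, w):
    """Weight each pseudo instance by the alignment of its loss gradient
    with the meta-validation gradient (clipped at zero): instances whose
    gradient also reduces validation loss get larger weights, so noisy
    pseudo labels are down-weighted without dropping hard examples."""
    g_i = (sigmoid(Xp @ w) - yp)[:, None] * Xp            # per-instance grads
    g_val = ((sigmoid(Xv @ w) - yv)[:, None] * Xv).mean(axis=0)
    eps = np.maximum(0.0, g_i @ g_val)                    # one-step approximation
    return eps / (eps.sum() + 1e-12)

rng = np.random.default_rng(1)
Xp, yp = rng.normal(size=(8, 3)), rng.integers(0, 2, size=8)   # pseudo-labeled
Xv, yv = rng.normal(size=(4, 3)), rng.integers(0, 2, size=4)   # meta-validation
print(instance_weights(Xp, yp, Xv, yv, w=np.zeros(3)).round(3))
```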
GRACE: Gradient-guided Controllable Retrieval for Augmenting Attribute-based Text Generation
Zhihua Wen | Zhiliang Tian | Zhen Huang | Yuxin Yang | Zexin Jian | Changjian Wang | Dongsheng Li
Findings of the Association for Computational Linguistics: ACL 2023
Attribute-based generation methods are of growing significance in controlling the generation of large pre-trained language models (PLMs). Existing studies control generation by (1) finetuning the model with attributes or (2) guiding the inference process toward control signals while freezing the PLM. However, finetuning approaches infuse domain bias into generation, making it hard to generate out-of-domain texts. Besides, many methods guide inference word by word, pushing the word probabilities toward the target attributes and resulting in less fluent sentences. We argue that distilling controlling information from natural texts can produce fluent sentences while maintaining high controllability. In this paper, we propose GRAdient-guided Controllable rEtrieval (GRACE), a retrieval-augmented generation framework that facilitates the generation of fluent sentences with high attribute relevance. GRACE memorizes semantic and attribute information from unlabeled corpora and applies controllable retrieval to obtain the desired information. For generation, we design techniques to eliminate the domain bias from the retrieval results and integrate them into the generation model. Additionally, we propose a gradient-guided generation scheme that iteratively steers generation toward higher attribute relevance. Experimental results and numerous examples verify the effectiveness of our method.
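The gradient-guided steering idea can be illustrated with a toy linear attribute classifier over a hidden representation: repeatedly step the representation along the gradient of the attribute log-probability. The linear classifier, step size, and step count are assumptions for illustration; GRACE operates on retrieval-augmented PLM states rather than a bare vector.

```python
import numpy as np

def steer_latent(h, w_attr, steps=5, lr=0.5):
    """Nudge hidden state h toward higher attribute probability under a
    toy linear classifier p = sigmoid(w_attr . h), by ascending the
    gradient of log p with respect to h, which is (1 - p) * w_attr."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w_attr @ h)))
        print(f"attribute prob: {p:.3f}")   # should rise each step
        h = h + lr * (1.0 - p) * w_attr
    return h

# Toy usage: a 3-d "hidden state" and a hypothetical attribute direction.
h = steer_latent(np.array([0.1, -0.2, 0.05]), np.array([1.0, 0.5, -0.3]))
```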