Qinghua Li

2026

DiSec: Mitigating Backdoors in Pre-trained Language Models via Disentanglement of Adversarial Weights for Secure Fine-Tuning
Sunanda Das | Qinghua Li
Findings of the Association for Computational Linguistics: ACL 2026

Task-agnostic backdoor attacks can contaminate pre-trained language models (PLMs) in a way that survives downstream adaptation, even under full fine-tuning, making it difficult for practitioners to trust third-party checkpoints. Existing defenses often rely on privileged assumptions (e.g., access to poisoned data or trigger/target knowledge), thereby limiting their applicability in realistic settings. We present DiSec, a robust and label-efficient purification framework that uses only clean auxiliary text and does not rely on downstream supervision or attack signatures. DiSec elicits model-internal signals from this clean data to separate suspicious parameter components that are inconsistent with benign behavior, and then flags anomalous structures by jointly leveraging complementary spectral and generative views of outliers. Finally, DiSec performs a structure-preserving repair via layer-local prototype-based mean correction, yielding an idempotent update that depends only on non-adversarial statistics. Across diverse downstream classification tasks and PLM backdoor strategies, DiSec substantially suppresses attack success while preserving clean-task utility, offering a practical path to securing fully fine-tuned PLMs before deployment. The code is publicly available at https://github.com/das-sunanda/DiSec.

2025

Seeing Through the Mask: AI-Generated Text Detection with Similarity-Guided Graph Reasoning
Nidhi Gupta | Qinghua Li
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

The rise of generative AI has led to challenges in distinguishing AI-generated text from human-written content, raising concerns about misinformation and content authenticity. Detecting AI-generated text remains challenging, especially across diverse stylistic domains and paraphrased inputs. We introduce SGG-ATD, a novel detection framework that models structural and contextual relationships between LLM-predicted and original-input text. By masking parts of the input and reconstructing them using a language model, we capture implicit coherence patterns. These are encoded in a graph where cosine-similarity and contextual links between keywords guide classification via a Graph Convolutional Network (GCN). SGG-ATD achieves strong performance across diverse datasets and shows resilience to adversarial rephrasing and out-of-distribution inputs, outperforming competitive baselines.

Co-authors

Sunanda Das 1
Nidhi Gupta 1

Venues

Findings 2
