Wen-Juan Hou

Also published as: Wen Juan Hou, Juan Wen


2026

The widespread adoption of large language models (LLMs) in commercial and research settings has intensified the need for robust intellectual property protection. Backdoor-based LLM fingerprinting has emerged as a promising solution to this challenge. In practice, LLM ensembling, a low-cost multi-model collaborative technique, combines diverse LLMs to leverage their complementary strengths and has attracted significant attention and adoption. However, the vulnerability of existing LLM fingerprinting methods in the ensemble scenario remains unexplored. To comprehensively assess the robustness of LLM fingerprinting, we propose two novel fingerprinting attack methods: the token filter attack (TFA) and the sentence verification attack (SVA). At each decoding step, TFA selects the next token from a unified token set produced by a token filter mechanism. SVA filters out fingerprint responses through a sentence verification mechanism based on perplexity and voting. Experiments show that the proposed methods effectively suppress fingerprint responses while maintaining ensemble performance, and that they outperform state-of-the-art attack methods. These findings call for more robust LLM fingerprinting.
Training and serving large language models (LLMs) is resource-intensive, making reliable intellectual property (IP) protection and black-box ownership verification increasingly important. Model fingerprinting enables such verification by injecting a small set of secret query–response behaviors, but many existing fingerprints rely on explicit markers or predetermined outputs that are weakly grounded in prompt semantics. This semantic mismatch yields atypical fingerprint responses, reduces stealthiness, and exposes fingerprints to removal by response normalization. We formalize this vulnerability via a new removal attack, Generation Revision Intervention (GRI), which applies system-prompt-level revision and response standardization to steer models toward typical answers, substantially compromising representative injected baselines. To close this semantic gap, we propose Implicit Fingerprints (ImF): we encode ownership information into a natural-looking target response y via linguistic steganography, then derive a chain-of-thought (CoT)-augmented query x that embeds semantic cues from y to guide the model toward an output sufficiently close to y for decoding-based verification. Experiments on 15 LLMs show that ImF improves stealthiness and remains verifiable under model updates and deployment-time prompt interventions; additional analyses show stability under common decoding variation and realistic partial merging with related models.

2025

The growing popularity of large language models has raised concerns about the potential misuse of AI-generated text (AIGT). It is increasingly critical to establish an AIGT detection method with strong generalization and robustness. However, existing methods focus on either generalization or robustness, and a unified mechanism that addresses both challenges simultaneously remains underexplored. In this paper, we first empirically reveal an intrinsic mechanism underlying model generalization and robustness in the AIGT detection task. We then propose a novel AIGT detection method, DP-Net, which introduces dynamic perturbations via reinforcement learning with carefully designed rewards and actions. Extensive experiments show that DP-Net significantly outperforms several state-of-the-art AIGT detection methods in generalization capacity across three cross-domain scenarios. DP-Net also achieves the best robustness under two text adversarial attacks.

2015

2013

2006

2004

2003