Xi Zhang

Other people with similar names: Xi Zhang

Unverified author pages with similar names: Xi Zhang


2026

Defending large language models (LLMs) against jailbreak attacks is essential for their safe and reliable deployment. Existing defenses often rely on shallow pattern matching, which struggles to generalize to novel and unseen attack strategies. To address this challenge, we propose the Cognitive-Driven Defense (CDD) framework, which targets the underlying structure of jailbreak prompts by applying meta-operations, defined as basic manipulations that conceal harmful intent. CDD emulates human cognitive reasoning through a structured reasoning chain. It begins with a global perception of the prompt and follows with a localized analysis to uncover hidden manipulations. By applying supervised fine-tuning on this structured chain, the model learns to identify and reason about known manipulation patterns. To enhance generalization to unseen threats, an entropy-guided reinforcement learning algorithm (EG-GRPO) is introduced to encourage exploration of new types and variants of meta-operations. Experiments demonstrate that CDD can achieve state-of-the-art defense performance and exhibit strong generalization to unseen jailbreak attacks.

2025

Faithfulness hallucinations are claims generated by a Large Language Model (LLM) not supported by contexts provided to the LLM. Lacking assessment standards, existing benchmarks focus on “factual statements” that rephrase source materials while overlooking “cognitive statements” that involve making inferences from the given context. Consequently, evaluating and detecting the hallucination of cognitive statements remains challenging. Inspired by how evidence is assessed in the legal domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and introduce the CogniBench dataset where we reveal insightful statistics. To keep pace with rapidly evolving LLMs, we further develop an automatic annotation pipeline that scales easily across different models. This results in a large-scale CogniBench-L dataset, which facilitates training accurate detectors for both factual and cognitive hallucinations. We release our model and datasets at: https://github.com/FUTUREEEEEE/CogniBench
Large language models (LLMs) have shown strong potential in complex reasoning tasks. However, as task complexity increases, their performance often degrades, resulting in hallucinations, errors, and logical inconsistencies. To enhance reasoning capabilities, Monte Carlo Tree Search (MCTS) has been introduced to guide the exploration of reasoning paths in a structured manner. Despite its advantages, traditional MCTS relies on fixed reasoning strategies, limiting the diversity of reasoning paths and the coverage of the solution space. To address these limitations, we propose Dynamic Strategy-Guided MCTS (DSG-MCTS), a novel framework that dynamically integrates multiple reasoning strategies, such as abductive and analogical reasoning, to expand the reasoning space. At the same time, DSG-MCTS enhances reasoning efficiency through a dynamic strategy selection mechanism that adapts to the task context. Experimental results on challenging reasoning benchmarks demonstrate that DSG-MCTS achieves improved accuracy and efficiency, outperforming existing state-of-the-art methods.