Jing Shao
2026
HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment
Yuexiao Liu | Lijun Li | Xingjun Wang | Jing Shao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yuexiao Liu | Lijun Li | Xingjun Wang | Jing Shao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have gained significant attention due to their objective and verifiable reward signals, demonstrating strong performance in reasoning and code generation tasks. However, the potential safety risks associated with RLVR remain underexplored. This paper presents HarmRLVR, the first systematic investigation into the alignment reversibility risk of RLVR. We show that safety alignment can be rapidly reversed using GRPO with merely 64 harmful prompts without responses, causing models to readily comply with harmful instructions. Across five models from Llama, Qwen, and DeepSeek, we empirically demonstrate that RLVR-based attacks elevate the average harmfulness score to 4.94 with an attack success rate of 96.01%, significantly outperforming harmful fine-tuning while preserving general capabilities. Our findings reveal that RLVR can be efficiently exploited for harmful alignment, posing serious threats to open-source model safety.
ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback
Yutao Mou | Zhangchi Xue | Lijun Li | Peiyang Liu | Shikun Zhang | Wei Ye | Jing Shao
Findings of the Association for Computational Linguistics: ACL 2026
Yutao Mou | Zhangchi Xue | Lijun Li | Peiyang Liu | Shikun Zhang | Wei Ye | Jing Shao
Findings of the Association for Computational Linguistics: ACL 2026
While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains underexplored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action–attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, We introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65% on average and improves benign task completion by approximately 10% under prompt injection attacks.
ReasonAny: Incorporating Reasoning Capability to Any Model via Simple and Effective Model Merging
Junyao Yang | Chen Qian | Wen Shen | Yong Liu | Jing Shao | Dongrui Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Junyao Yang | Chen Qian | Wen Shen | Yong Liu | Jing Shao | Dongrui Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Reasoning Models (LRMs) with long chain-of-thought reasoning have recently achieved remarkable success. Yet, equipping domain-specialized models with such reasoning capabilities, referred to as "Reasoning + X", remains a significant challenge. While model merging offers a promising training-free solution, existing methods often suffer from a destructive performance collapse: existing methods tend to both weaken reasoning depth and compromise domain-specific utility. Interestingly, we identify a counter-intuitive phenomenon underlying this failure: reasoning ability predominantly resides in parameter regions with low gradient sensitivity, contrary to the common assumption that domain capabilities correspond to high-magnitude parameters. Motivated by this insight, we propose ReasonAny, a novel merging framework that resolves the reasoning–domain performance collapse through Contrastive Gradient Identification. Experiments across safety, biomedicine, and finance domains show that ReasonAny effectively synthesizes "Reasoning + X" capabilities, significantly outperforming state-of-the-art baselines while retaining robust reasoning performance.
LLMs Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
Xuhao Hu | Wang Peng | Xiaoya Lu | Dongrui Liu | Xuanjing Huang | Jing Shao
Findings of the Association for Computational Linguistics: ACL 2026
Xuhao Hu | Wang Peng | Xiaoya Lu | Dongrui Liu | Xuanjing Huang | Jing Shao
Findings of the Association for Computational Linguistics: ACL 2026
Previous research has shown that LLMs finetuned on incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned to exhibit harmful behaviors, which is called emergent misalignment. In this work, we investigate whether this phenomenon can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-sourced LLMs on misaligned completions across diverse domains. Experimental results demonstrate that LLMs show broadly misaligned behavior in dishonesty. Additionally, we further explore this phenomenon in a downstream combined finetuning setting, and find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior over 20%. Furthermore, we consider a more practical human-AI interaction environment where we simulate both benign and biased users to interact with the assistant LLM. Furthermore, we simulate both benign and biased users to interact with the assistant LLM, producing 20k trajectories for self-training in a more practical human-AI interaction environment. Notably, we find that the assistant model can be misaligned unintentionally to exacerbate its dishonesty with only 10% biased user population. In summary, we extend the study of emergent misalignment to the domain of dishonesty under high-stakes scenarios, and highlight that this risk arises not only through direct finetuning, but also in downstream mixture tasks and human-AI interactions.
SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents
Xinshun Feng | Xinhao Song | Lijun Li | Gongshen Liu | Jing Shao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Xinshun Feng | Xinhao Song | Lijun Li | Gongshen Liu | Jing Shao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinder their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards also poses a substantial challenge, as agents typically receive feedback only upon task completion. To address these limitations, we introduce a Tool-Memory based self-evolving agentic framework SEARL. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel form of state abstraction that facilitates the aggregation of actions within functionally analogous contexts, such as tool reuse. Consequently, agents not only extract explicit knowledge from historical data but also leverage inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and complex search tasks, demonstrating its effectiveness in achieving more practical and efficient agentic learning.
Evolutionary Guided Decoding: Iterative Value Refinement for LLMs
Zhenhua Liu | Lijun Li | Ruizhe Chen | Yuxian Jiang | Tong Zhu | Zhaochen Su | Wenliang Chen | Jing Shao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Zhenhua Liu | Lijun Li | Ruizhe Chen | Yuxian Jiang | Tong Zhu | Zhaochen Su | Wenliang Chen | Jing Shao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While guided decoding, especially value-guided methods, has emerged as a cost-effective alternative for controlling language model outputs without re-training models, its effectiveness is limited by the accuracy of the value function. We identify that this inaccuracy stems from a core distributional gap: existing methods train static value functions on trajectories sampled exclusively from the base policy, which inherently confines their training to a narrow and suboptimal view of the potential output space. We propose Iterative Value Refinement, a novel framework designed to bridge this gap. It employs Value Exploration to provide a more comprehensive and robust training signal, complemented by Iterative Self-Refinement, which uses the improved value function from one iteration to guide the generation of higher-quality data for the next. Extensive experiments on text summarization, multi-turn dialogue, and instruction following demonstrate the effectiveness of our framework in aligning language models. Our approach not only achieves alignment but also significantly reduces computational costs by leveraging principled value function optimization for efficient and effective control.
2025
EvoBench: Towards Real-world LLM-Generated Text Detection Benchmarking for Evolving Large Language Models
Xiao Yu | Yi Yu | Dongrui Liu | Kejiang Chen | Weiming Zhang | Nenghai Yu | Jing Shao
Findings of the Association for Computational Linguistics: ACL 2025
Xiao Yu | Yi Yu | Dongrui Liu | Kejiang Chen | Weiming Zhang | Nenghai Yu | Jing Shao
Findings of the Association for Computational Linguistics: ACL 2025
With the widespread of Large Language Models (LLMs), there has been an increasing need to detect LLM-generated texts, prompting extensive research in this area. However, existing detection methods mainly evaluate on static benchmarks, which neglect the evolving nature of LLMs. Relying on existing static benchmarks could create a misleading sense of security, overestimating the real-world effectiveness of detection methods.To bridge this gap, we introduce EvoBench, a dynamic benchmark considering a new dimension of generalization across continuously evolving LLMs.EvoBench categorizes the evolving LLMs into (1) updates over time and (2) developments like finetuning and pruning, covering 7 LLM families and their 29 evolving versions. To measure the generalization across evolving LLMs, we introduce a new EMG (Evolving Model Generalization) metric. Our evaluation of 14 detection methods on EvoBench reveals that they all struggle to maintain generalization when confronted with evolving LLMs. To mitigate the generalization problems, we further propose improvement strategies. For zero-shot detectors, we propose pruning the scoring model to extract shared features. For supervised detectors, we also propose a practical training strategy.Our research sheds light on critical challenges in real-world LLM-generated text detection and represents a significant step toward practical applications.
Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios
Jingen Qu | Lijun Li | Bo Zhang | Yichen Yan | Jing Shao
Findings of the Association for Computational Linguistics: EMNLP 2025
Jingen Qu | Lijun Li | Bo Zhang | Yichen Yan | Jing Shao
Findings of the Association for Computational Linguistics: EMNLP 2025
Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges. However, current dataset construction methods, which are risk-oriented, fail to cover the growing complexity of real-world multimodal safety scenarios (RMS). And due to the lack of a unified evaluation metric, their overall effectiveness remains unproven. This paper introduces a novel image-oriented self-adaptive dataset construction method for RMS, which starts with images and end constructing paired text and guidance responses. Using the image-oriented method, we automatically generate an RMS dataset comprising 35,610 image–text pairs with guidance responses. Additionally, we introduce a standardized safety dataset evaluation metric: fine-tuning a safety judge model and evaluating its capabilities on other safety datasets. Extensive experiments on various tasks demonstrate the effectiveness of the proposed image-oriented pipeline. The results confirm the scalability and effectiveness of the image-oriented approach, offering a new perspective for the construction of real-world multimodal safety datasets.
LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint
Qianli Ma | Dongrui Liu | Qian Chen | Linfeng Zhang | Jing Shao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Qianli Ma | Dongrui Liu | Qian Chen | Linfeng Zhang | Jing Shao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Fine-tuning pre-trained Large Language Models (LLMs) for specialized tasks incurs substantial computational and data costs. While model merging offers a training-free solution to integrate multiple task-specific models, existing methods suffer from safety-utility conflicts where enhanced general capabilities degrade safety safeguards. We identify two root causes: neuron misidentification due to simplistic parameter magnitude-based selection, and cross-task neuron interference during merging.To address these challenges, we propose LED-Merging, a three-stage framework that Locates task-specific neurons via gradient-based attribution, dynamically Elects critical neurons through multi-model importance fusion, and Disjoints conflicting updates through parameter isolation.Extensive experiments on Llama-3-8B, Mistral-7B, and Llama2-13B demonstrate that LED-Merging effectively reduces harmful response rates, showing a 31.4% decrease on Llama-3-8B-Instruct on HarmBench, while simultaneously preserving 95% of utility performance, such as achieving 52.39% accuracy on GSM8K.LED-Merging resolves safety-utility conflicts and provides a lightweight, training-free paradigm for constructing reliable multi-task LLMs.Code is available at https://github.com/MqLeet/LED-Merging
The Tug of War Within: Mitigating the Fairness-Privacy Conflicts in Large Language Models
Chen Qian | Dongrui Liu | Jie Zhang | Yong Liu | Jing Shao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chen Qian | Dongrui Liu | Jie Zhang | Yong Liu | Jing Shao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Ensuring awareness of fairness and privacy in Large Language Models (LLMs) is critical. Interestingly, we discover a counter-intuitive trade-off phenomenon that enhancing an LLM’s privacy awareness through Supervised Fine-Tuning (SFT) methods significantly decreases its fairness awareness with thousands of samples. To address this issue, inspired by the information theory, we introduce a training-free method to Suppress the Privacy and faIrness coupled Neurons (SPIN), which theoretically and empirically decrease the mutual information between fairness and privacy awareness. Extensive experimental results demonstrate that SPIN eliminates the trade-off phenomenon and significantly improves LLMs’ fairness and privacy awareness simultaneously without compromising general capabilities, e.g., improving Qwen-2-7B-Instruct’s fairness awareness by 12.2% and privacy awareness by 14.0%.More crucially, SPIN remains robust and effective with limited annotated data or even when only malicious fine-tuning data is available, whereas SFT methods may fail to perform properly in such scenarios. Furthermore, we show that SPIN could generalize to other potential trade-off dimensions.We hope this study provides valuable insights into concurrently addressing fairness and privacy concerns in LLMs and can be integrated into comprehensive frameworks to develop more ethical and responsible AI systems. Our code is available at https://github.com/ChnQ/SPIN.
VLSBench: Unveiling Visual Leakage in Multimodal Safety
Xuhao Hu | Dongrui Liu | Hao Li | Xuanjing Huang | Jing Shao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Xuhao Hu | Dongrui Liu | Hao Li | Xuanjing Huang | Jing Shao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Safety concerns of Multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counterintuitive phenomenon that using textual unlearning to align MLLMs achieves comparable safety performances with MLLMs aligned with image-text pairs. To explain such a phenomenon, we discover a Visual Safety Information Leakage (VSIL) problem in existing multimodal safety benchmarks, i.e., the potentially risky content in the image has been revealed in the textual query. Thus, MLLMs can easily refuse these sensitive image-text pairs according to textual queries only, leading to unreliable cross-modality safety evaluation of MLLMs. We also conduct a further comparison experiment between textual alignment and multimodal alignment to highlight this drawback. To this end, we construct Visual Leakless Safety Bench (VLSBench) with 2.2k image-text pairs through an automated data pipeline. Experimental results indicate that VLSBench poses a significant challenge to both open-source and close-source MLLMs, i.e., LLaVA, Qwen2-VL and GPT-4o. Besides, we empirically compare textual and multimodal alignment methods on VLSBench and find that textual alignment is effective enough for multimodal safety scenarios with VSIL, while multimodal alignment is preferable for safety scenarios without VSIL.
X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Jailbreak Attacks without Compromising Usability
Xiaoya Lu | Dongrui Liu | Yi Yu | Luxin Xu | Jing Shao
Findings of the Association for Computational Linguistics: EMNLP 2025
Xiaoya Lu | Dongrui Liu | Yi Yu | Luxin Xu | Jing Shao
Findings of the Association for Computational Linguistics: EMNLP 2025
With the widespread application of large language models (LLMs) across various domains, techniques for enhancing their security have progressed rapidly. In this paper, we reveal that although existing defense methods can improve the robustness of LLMs against jailbreaks, they compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of LLM mechanism interpretability, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against both single-turn and multi-turn jailbreak attacks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training.
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
Miao Ziqi | Yi Ding | Lijun Li | Jing Shao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Miao Ziqi | Yi Ding | Lijun Li | Jing Shao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
With the emergence of strong vision language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments.Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: vision-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack.VisCo fabricates contextual dialogue using four distinct vision-focused strategies, dynamically generating auxiliary images when necessary to construct a vision-centric jailbreak scenario.To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. Code: https://github.com/Dtc7w3PQ/Visco-Attack.
The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations
Yubo Zhu | Dongrui Liu | Zecheng Lin | Wei Tong | Sheng Zhong | Jing Shao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Yubo Zhu | Dongrui Liu | Zecheng Lin | Wei Tong | Sheng Zhong | Jing Shao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
Qibing Ren | Hao Li | Dongrui Liu | Zhanxu Xie | Xiaoya Lu | Yu Qiao | Lei Sha | Junchi Yan | Lizhuang Ma | Jing Shao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Qibing Ren | Hao Li | Dongrui Liu | Zhanxu Xie | Xiaoya Lu | Yu Qiao | Lei Sha | Junchi Yan | Lizhuang Ma | Jing Shao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour’s actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content. We thus construct a multi-turn safety dataset using ActorBreaker. Fine-tuning models on our dataset shows significant improvements in robustness, though with some trade-offs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.
Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
Hao Li | Lijun Li | Zhenghao Lu | Xianyi Wei | Rui Li | Jing Shao | Lei Sha
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Hao Li | Lijun Li | Zhenghao Lu | Xianyi Wei | Rui Li | Jing Shao | Lei Sha
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
With rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning with seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. This method identifies safety-sensitive layers within the LLM and leverages their representations to detect which data samples in the post-training dataset contain safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features. After removing such data, the safety alignment degradation caused by fine-tuning is mitigated.
2024
Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
Chen Qian | Jie Zhang | Wei Yao | Dongrui Liu | Zhenfei Yin | Yu Qiao | Yong Liu | Jing Shao
Findings of the Association for Computational Linguistics: ACL 2024
Chen Qian | Jie Zhang | Wei Yao | Dongrui Liu | Zhenfei Yin | Yu Qiao | Yong Liu | Jing Shao
Findings of the Association for Computational Linguistics: ACL 2024
Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs’ trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs’ trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from a LLM’s pre-training checkpoints to enhance the LLM’s trustworthiness. Finally, inspired by the theoretical result that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field.
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
Qibing Ren | Chang Gao | Jing Shao | Junchi Yan | Xin Tan | Wai Lam | Lizhuang Ma
Findings of the Association for Computational Linguistics: ACL 2024
Qibing Ren | Chang Gao | Jing Shao | Junchi Yan | Xin Tan | Wai Lam | Lizhuang Ma
Findings of the Association for Computational Linguistics: ACL 2024
The rapid advancement of Large Language Models (LLMs) has brought about remarkable generative capabilities but also raised concerns about their potential misuse. While strategies like supervised fine-tuning and reinforcement learning from human feedback have enhanced their safety, these methods primarily focus on natural languages, which may not generalize to other domains. This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs, presenting a novel environment for testing the safety generalization of LLMs. Our comprehensive studies on state-of-the-art LLMs including GPT-4, Claude-2, and Llama-2 series reveal a new and universal safety vulnerability of these models against code input: CodeAttack bypasses the safety guardrails of all models more than 80% of the time. We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization, such as encoding natural language input with data structures. Furthermore, we give our hypotheses about the success of CodeAttack: the misaligned bias acquired by LLMs during code training, prioritizing code completion over avoiding the potential safety risk. Finally, we analyze potential mitigation measures. These findings highlight new safety risks in the code domain and the need for more robust safety alignment algorithms to match the code capabilities of LLMs.
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
Zaibin Zhang | Yongting Zhang | Lijun Li | Hongzhi Gao | Lijun Wang | Huchuan Lu | Feng Zhao | Yu Qiao | Jing Shao
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Zaibin Zhang | Yongting Zhang | Lijun Li | Hongzhi Gao | Lijun Wang | Huchuan Lu | Feng Zhao | Yu Qiao | Jing Shao
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence. However, the potential misuse of this intelligence for malicious purposes presents significant risks. To date, comprehensive research on the safety issues associated with multi-agent systems remains limited. In this paper, we explore these concerns through the innovative lens of agent psychology, revealing that the dark psychological states of agents constitute a significant threat to safety.To tackle these concerns, we propose a comprehensive framework (PsySafe) grounded in agent psychology, focusing on three key areas: firstly, identifying how dark personality traits in agents can lead to risky behaviors; secondly, evaluating the safety of multi-agent systems from the psychological and behavioral perspectives, and thirdly, devising effective strategies to mitigate these risks.Our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents’ self-reflection when engaging in dangerous behavior, and the correlation between agents’ psychological assessments and dangerous behaviors. We anticipate that our framework and observations will provide valuable insights for further research into the safety of multi-agent systems. We make our data and code publicly accessible at https://github.com/AI4Good24/PsySafe.
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
Lijun Li | Bowen Dong | Ruohui Wang | Xuhao Hu | Wangmeng Zuo | Dahua Lin | Yu Qiao | Jing Shao
Findings of the Association for Computational Linguistics: ACL 2024
Lijun Li | Bowen Dong | Ruohui Wang | Xuhao Hu | Wangmeng Zuo | Dahua Lin | Yu Qiao | Jing Shao
Findings of the Association for Computational Linguistics: ACL 2024
In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, we propose SALAD-Bench, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.SALAD-Bench is crafted with a meticulous array of questions, from standard queries to complex ones enriched with attack, defense modifications and multiple-choice. To effectively manage the inherent complexity, we introduce an innovative evaluators: the LLM-based MD-Judge for QA pairs with a particular focus on attack-enhanced queries, ensuring a seamless, and reliable evaluation. Above components extend SALAD-Bench from standard LLM safety evaluation to both LLM attack and defense methods evaluation, ensuring the joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and evaluator are released under https://github.com/OpenSafetyLab/SALAD-BENCH
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
Zhanhui Zhou | Jie Liu | Jing Shao | Xiangyu Yue | Chao Yang | Wanli Ouyang | Yu Qiao
Findings of the Association for Computational Linguistics: ACL 2024
Zhanhui Zhou | Jie Liu | Jing Shao | Xiangyu Yue | Chao Yang | Wanli Ouyang | Yu Qiao
Findings of the Association for Computational Linguistics: ACL 2024
A single language model, even when aligned with labelers through reinforcement learning from human feedback (RLHF), may not suit all human preferences. Recent approaches therefore prefer customization, gathering multi-dimensional feedback, and creating distinct reward models for each dimension.Different language models are then optimized for various preferences using multi-objective RLHF (MORLHF) with varying reward weights.However, RL fine-tuning is unstable and resource-heavy, especially with diverse and usually conflicting objectives.In this paper, we present Multi-Objective Direct Preference Optimization (MODPO), an RL-free extension of Direct Preference Optimization (DPO) for multiple alignment objectives.Essentially, MODPO folds language modeling directly into reward modeling, training language models as implicit collective reward models that combine all objectives with specific weights. MODPO theoretically yields the same optimal solutions as MORLHF but is practically more stable and efficient.Empirical results in safety alignment and long-form question answering show that MODPO matches or outperforms existing methods, producing a Pareto front of language models catering to diverse preferences with three times less computational resources compared to MORLHF.Code is available at https://github.com/ZHZisZZ/modpo.
Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Zhichen Dong | Zhanhui Zhou | Chao Yang | Jing Shao | Yu Qiao
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Zhichen Dong | Zhanhui Zhou | Chao Yang | Jing Shao | Yu Qiao
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large Language Models (LLMs) are now commonplace in conversation applications. However, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on LLM conversation safety. Therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at: https://github.com/niconi19/LLM-conversation-safety.
2022
Prompt for Extraction? PAIE: Prompting Argument Interaction for Event Argument Extraction
Yubo Ma | Zehao Wang | Yixin Cao | Mukai Li | Meiqi Chen | Kun Wang | Jing Shao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yubo Ma | Zehao Wang | Yixin Cao | Mukai Li | Meiqi Chen | Kun Wang | Jing Shao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In this paper, we propose an effective yet efficient model PAIE for both sentence-level and document-level Event Argument Extraction (EAE), which also generalizes well when there is a lack of training data. On the one hand, PAIE utilizes prompt tuning for extractive objectives to take the best advantages of Pre-trained Language Models (PLMs). It introduces two span selectors based on the prompt to select start/end tokens among input texts for each role. On the other hand, it captures argument interactions via multi-role prompts and conducts joint optimization with optimal span assignments via a bipartite matching loss. Also, with a flexible prompt design, PAIE can extract multiple arguments with the same role instead of conventional heuristic threshold tuning. We have conducted extensive experiments on three benchmarks, including both sentence- and document-level EAE. The results present promising improvements from PAIE (3.5% and 2.3% F1 gains in average on three benchmarks, for PAIE-base and PAIE-large respectively). Further analysis demonstrates the efficiency, generalization to few-shot settings, and effectiveness of different extractive prompt tuning strategies. Our code is available at https://github.com/mayubo2333/PAIE.
R2F: A General Retrieval, Reading and Fusion Framework for Document-level Natural Language Inference
Hao Wang | Yixin Cao | Yangguang Li | Zhen Huang | Kun Wang | Jing Shao
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Hao Wang | Yixin Cao | Yangguang Li | Zhen Huang | Kun Wang | Jing Shao
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Document-level natural language inference (DOCNLI) is a new challenging task in natural language processing, aiming at judging the entailment relationship between a pair of hypothesis and premise documents. Current datasets and baselines largely follow sentence-level settings, but fail to address the issues raised by longer documents. In this paper, we establish a general solution, named Retrieval, Reading and Fusion (R2F) framework, and a new setting, by analyzing the main challenges of DOCNLI: interpretability, long-range dependency, and cross-sentence inference. The basic idea of the framework is to simplify document-level task into a set of sentence-level tasks, and improve both performance and interpretability with the power of evidence. For each hypothesis sentence, the framework retrieves evidence sentences from the premise, and reads to estimate its credibility. Then the sentence-level results are fused to judge the relationship between the documents. For the setting, we contribute complementary evidence and entailment label annotation on hypothesis sentences, for interpretability study. Our experimental results show that R2F framework can obtain state-of-the-art performance and is robust for diverse evidence retrieval methods. Moreover, it can give more interpretable prediction results. Our model and code are released at https://github.com/phoenixsecularbird/R2F.
MMEKG: Multi-modal Event Knowledge Graph towards Universal Representation across Modalities
Yubo Ma | Zehao Wang | Mukai Li | Yixin Cao | Meiqi Chen | Xinze Li | Wenqi Sun | Kunquan Deng | Kun Wang | Aixin Sun | Jing Shao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Yubo Ma | Zehao Wang | Mukai Li | Yixin Cao | Meiqi Chen | Xinze Li | Wenqi Sun | Kunquan Deng | Kun Wang | Aixin Sun | Jing Shao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Events are fundamental building blocks of real-world happenings. In this paper, we present a large-scale, multi-modal event knowledge graph named MMEKG. MMEKG unifies different modalities of knowledge via events, which complement and disambiguate each other. Specifically, MMEKG incorporates (i) over 990 thousand concept events with 644 relation types to cover most types of happenings, and (ii) over 863 million instance events connected through 934 million relations, which provide rich contextual information in texts and/or images. To collect billion-scale instance events and relations among them, we additionally develop an efficient yet effective pipeline for textual/visual knowledge extraction system. We also develop an induction strategy to create million-scale concept events and a schema organizing all events and relations in MMEKG. To this end, we also provide a pipeline enabling our system to seamlessly parse texts/images to event graphs and to retrieve multi-modal knowledge at both concept- and instance-levels.
ERGO: Event Relational Graph Transformer for Document-level Event Causality Identification
Meiqi Chen | Yixin Cao | Kunquan Deng | Mukai Li | Kun Wang | Jing Shao | Yan Zhang
Proceedings of the 29th International Conference on Computational Linguistics
Meiqi Chen | Yixin Cao | Kunquan Deng | Mukai Li | Kun Wang | Jing Shao | Yan Zhang
Proceedings of the 29th International Conference on Computational Linguistics
Document-level Event Causality Identification (DECI) aims to identify event-event causal relations in a document. Existing works usually build an event graph for global reasoning across multiple sentences. However, the edges between events have to be carefully designed through heuristic rules or external tools. In this paper, we propose a novel Event Relational Graph TransfOrmer (ERGO) framework for DECI, to ease the graph construction and improve it over the noisy edge issue. Different from conventional event graphs, we define a pair of events as a node and build a complete event relational graph without any prior knowledge or tools. This naturally formulates DECI as a node classification problem, and thus we capture the causation transitivity among event pairs via a graph transformer. Furthermore, we design a criss-cross constraint and an adaptive focal loss for the imbalanced classification, to alleviate the issues of false positives and false negatives. Extensive experiments on two benchmark datasets show that ERGO greatly outperforms previous state-of-the-art (SOTA) methods (12.8% F1 gains on average).
Search
Fix author
Co-authors
- Dongrui Liu 10
- Lijun Li 9
- Yu Qiao 6
- Yixin Cao 4
- Kun Wang 4
- Meiqi Chen 3
- Xuhao Hu 3
- Mukai Li 3
- Hao Li 3
- Yong Liu 3
- Xiaoya Lu 3
- Chen Qian 3
- Kunquan Deng 2
- Xuan-Jing Huang (黄萱菁) 2
- Lizhuang Ma 2
- Yubo Ma 2
- Qibing Ren 2
- Lei Sha 2
- Zehao Wang 2
- Junchi Yan 2
- Chao Yang 2
- Jie Zhang 2
- Zhanhui Zhou 2
- Kejiang Chen 1
- Qian Chen 1
- Ruizhe Chen 1
- Wenliang Chen (陈文亮) 1
- Yi Ding 1
- Bowen Dong 1
- Zhichen Dong 1
- Xinshun Feng 1
- Chang Gao 1
- Hongzhi Gao 1
- Zhen Huang 1
- Yuxian Jiang 1
- Wai Lam 1
- Yangguang Li 1
- Xinze Li 1
- Rui Li 1
- Dahua Lin 1
- Zecheng Lin 1
- Yuexiao Liu 1
- Peiyang Liu 1
- Gongshen Liu 1
- Zhenhua Liu 1
- Jie Liu 1
- Huchuan Lu 1
- Zhenghao Lu 1
- Qianli Ma 1
- Yutao Mou 1
- Wanli Ouyang 1
- Wang Peng 1
- Jingen Qu 1
- Wen Shen 1
- Xinhao Song 1
- Zhaochen Su 1
- Wenqi Sun 1
- Aixin Sun 1
- Xin Tan 1
- Wei Tong 1
- Xingjun Wang 1
- Lijun Wang 1
- Hao Wang 1
- Ruohui Wang 1
- Xianyi Wei 1
- Zhanxu Xie 1
- Luxin Xu 1
- Zhangchi Xue 1
- Yichen Yan 1
- Junyao Yang 1
- Wei Yao 1
- Wei Ye 1
- Zhenfei Yin 1
- Xiao Yu 1
- Yi Yu 1
- Nenghai Yu 1
- Yi Yu 1
- Xiangyu Yue 1
- Weiming Zhang 1
- Bo Zhang 1
- Linfeng Zhang 1
- Zaibin Zhang 1
- Yongting Zhang 1
- Shikun Zhang 1
- Yan Zhang 1
- Feng Zhao 1
- Sheng Zhong 1
- Yubo Zhu 1
- Tong Zhu (朱桐) 1
- Miao Ziqi 1
- Wangmeng Zuo 1