Tianyu Du
2026
“I See What You Did There”: Can Large Vision-Language Models Understand Multimodal Puns?
Naen Xu | Jiayi Sheng | Changjiang Li | Chunyi Zhou | Yuyuan Li | Tianyu Du | Jun Wang | Zhihui Fu | Jinbao Li | Shouling Ji
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Naen Xu | Jiayi Sheng | Changjiang Li | Chunyi Zhou | Yuyuan Li | Tianyu Du | Jun Wang | Zhihui Fu | Jinbao Li | Shouling Ji
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.
ACIArena: Toward Unified Evaluation for Agent Cascading Injection
Hengyu An | Minxi Li | Jinghuai Zhang | Naen Xu | Chunyi Zhou | Changjiang Li | Xiaogang Xu | Tianyu Du | Shouling Ji
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Hengyu An | Minxi Li | Jinghuai Zhang | Naen Xu | Chunyi Zhou | Changjiang Li | Xiaogang Xu | Tianyu Du | Shouling Ji
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Collaboration and information sharing empower Multi-Agent Systems (MAS) but also introduce a critical security risk known as Agent Cascading Injection (ACI). In such attacks, a compromised agent exploits inter-agent trust to propagate malicious instructions, causing cascading failures across the system. However, existing studies consider only limited attack strategies and simplified MAS settings, limiting their generalizability and comprehensive evaluation. To bridge this gap, we introduce ACIArena, a unified framework for evaluating the robustness of MAS. ACIArena offers systematic evaluation suites spanning multiple attack surfaces (i.e., external inputs, agent profiles, inter-agent messages) and attack objectives (i.e., instruction hijacking, task disruption, information exfiltration). Specifically, ACIArena establishes a unified specification that jointly supports MAS construction and attack–defense modules. It covers six widely used MAS implementations and provides a benchmark of 1,356 test cases for systematically evaluating MAS robustness. Our benchmarking results show that evaluating MAS robustness solely through topology is insufficient; robust MAS require deliberate role design and controlled interaction patterns. Moreover, defenses developed in simplified environments often fail to transfer to real-world settings; narrowly scoped defenses may even introduce new vulnerabilities. ACIArena aims to provide a solid foundation for advancing deeper exploration of MAS design principles.
PerMemSafe: Benchmarking Implicit Personalized Safety of Long Horizon Self-Evolving Agents
Hengyu An | Minxi Li | Naen Xu | Chunyi Zhou | Xiaogang Xu | Tianyu Du | Jinbao Li | Shouling Ji
Findings of the Association for Computational Linguistics: ACL 2026
Hengyu An | Minxi Li | Naen Xu | Chunyi Zhou | Xiaogang Xu | Tianyu Du | Jinbao Li | Shouling Ji
Findings of the Association for Computational Linguistics: ACL 2026
Self-evolving agents achieve personalization by accumulating user-specific memories over long horizons. This capability, however, introduces novel safety risks, as responses that are generally safe may become harmful in user-specific contexts. Such safety-relevant contexts often emerge implicitly and evolve over time during long-horizon conversations, rendering traditional context-independent safety evaluations insufficient. To address this, we formally define Implicit Personalized Safety and present PerMemSafe, the first benchmark for evaluating implicit personalized safety of self-evolving agents in long-horizon interactions. Empirical results reveal significant limitations of existing self-evolving agents, with even the strongest achieving only around 50% safety rate, highlighting systematic failures in reasoning about personalized safety risks. To mitigate this, we propose SentinelMem, an active risk-aware memory framework that explicitly models personalized risk inference and memory evolution. Experiments show that SentinelMem improves implicit personalized safety by 23.8% over prior memory frameworks while maintaining helpfulness in long-horizon interactions.
Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors
Rui Yin | Tianxu Han | Naen Xu | Changjiang Li | Ping He | Chunyi Zhou | Jun Wang | Zhihui Fu | Tianyu Du | Jinbao Li | Shouling Ji
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Rui Yin | Tianxu Han | Naen Xu | Changjiang Li | Ping He | Chunyi Zhou | Jun Wang | Zhihui Fu | Tianyu Du | Jinbao Li | Shouling Ji
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Safety-aligned large language models (LLMs) are increasingly deployed in real-world pipelines, yet this deployment also enlarges the supply-chain attack surface: adversaries can distribute backdoored checkpoints that behave normally under standard evaluation but jailbreak when a hidden trigger is present. Recent post-hoc weight-editing methods offer an efficient approach to injecting such backdoors by directly modifying model weights to map a trigger to an attacker-specified response. However, existing methods typically optimize a token-level mapping that forces an affirmative prefix (e.g., “Sure”), which does not guarantee sustained harmful output—the model may begin with apparent agreement yet revert to safety-aligned refusal within a few decoding steps. We address this reliability gap by shifting the backdoor objective from surface tokens to internal representations. We extract a steering vector that captures the difference between compliant and refusal behaviors, and compile it into a persistent weight modification that activates only when the trigger is present. To preserve stealthiness and benign utility, we impose a null-space constraint so that the injected edit remains dormant on clean inputs. The method is efficient, requiring only a small set of examples and admitting a closed-form solution. Across multiple safety-aligned LLMs and jailbreak benchmarks, our method achieves high triggered attack success while maintaining non-triggered safety and general utility.
2025
IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents
Hengyu An | Jinghuai Zhang | Tianyu Du | Chunyi Zhou | Qingming Li | Tao Lin | Shouling Ji
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Hengyu An | Jinghuai Zhang | Tianyu Du | Chunyi Zhou | Qingming Li | Tao Lin | Shouling Ji
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language model (LLM) agents are widely deployed in real-world applications, where they leverage tools to retrieve and manipulate external data for complex tasks. However, when interacting with untrusted data sources (e.g., fetching information from public websites), tool responses may contain injected instructions that covertly influence agent behaviors and lead to malicious outcomes, a threat referred to as Indirect\ Prompt\ Injection (IPI). Existing defenses typically rely on advanced prompting strategies or auxiliary detection models. While these methods have demonstrated some effectiveness, they fundamentally rely on assumptions about the model’s inherent security, which lacks structural constraints on agent behaviors. As a result, agents still retain unrestricted access to tool invocations, leaving them vulnerable to stronger attack vectors that can bypass the security guardrails of the model. To\ prevent\ malicious\ tool\ invocations\ at\ the\ source, we propose a novel defensive task execution paradigm, called IPIGuard, which models the agents’ task execution process as a traversal over a planned Tool\ Dependency\ Graph (TDG). By explicitly decoupling action planning from interaction with external data, IPIGuard significantly reduces unintended tool invocations triggered by injected instructions, thereby enhancing robustness against IPI attacks. Experiments on the AgentDojo benchmark show that IPIGuard achieves a superior balance between effectiveness and robustness, paving the way for the development of safer agentic systems in dynamic environments.
DROWN: Towards Tighter LiRPA-based Robustness Certification
Yunruo Zhang | Tianyu Du | Shouling Ji | Shanqing Guo
Proceedings of the 31st International Conference on Computational Linguistics
Yunruo Zhang | Tianyu Du | Shouling Ji | Shanqing Guo
Proceedings of the 31st International Conference on Computational Linguistics
The susceptibility of deep neural networks to adversarial attacks is a well-established concern. To address this problem, robustness certification is proposed, which, unfortunately, suffers from precision or scalability issues. In this paper, we present DROWN (Dual CROWN), a novel method for certifying the robustness of DNNs. The advantage of DROWN is that it tightens classic LiRPA-based methods yet maintains similar scalability, which comes from refining pre-activation bounds of ReLU relaxations using two pairs of linear bounds derived from different relaxations of ReLU units in previous layers. The extensive evaluations show that DROWN achieves up to 83.39% higher certified robust accuracy than the baseline on CNNs and up to 4.68 times larger certified radii than the baseline on Transformers. Meanwhile, the running time of DROWN is about twice that of the baseline.
VideoEraser: Concept Erasure in Text-to-Video Diffusion Models
Naen Xu | Jinghuai Zhang | Changjiang Li | Zhi Chen | Chunyi Zhou | Qingming Li | Tianyu Du | Shouling Ji
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Naen Xu | Jinghuai Zhang | Changjiang Li | Zhi Chen | Chunyi Zhou | Qingming Li | Tianyu Du | Shouling Ji
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
The rapid growth of text-to-video (T2V) diffusion models has raised concerns about privacy, copyright, and safety due to their potential misuse in generating harmful or misleading content. These models are often trained on numerous datasets, including unauthorized personal identities, artistic creations, and harmful materials, which can lead to uncontrolled production and distribution of such content. To address this, we propose VideoEraser, a training-free framework that prevents T2V diffusion models from generating videos with undesirable concepts, even when explicitly prompted with those concepts. Designed as a plug-and-play module, VideoEraser can seamlessly integrate with representative T2V diffusion models via a two-stage process: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG). We conduct extensive evaluations across four tasks, including object erasure, artistic style erasure, celebrity erasure, and explicit content erasure. Experimental results show that VideoEraser consistently outperforms prior methods regarding efficacy, integrity, fidelity, robustness, and generalizability. Notably, VideoEraser achieves state-of-the-art performance in suppressing undesirable content during T2V generation, reducing it by 46% on average across four tasks compared to baselines.
CLMTracing: Black-box User-level Watermarking for Code Language Model Tracing
Boyu Zhang | Ping He | Tianyu Du | Xuhong Zhang | Lei Yun | Kingsum Chow | Jianwei Yin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Boyu Zhang | Ping He | Tianyu Du | Xuhong Zhang | Lei Yun | Kingsum Chow | Jianwei Yin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
With the widespread adoption of open-source code language models (code LMs), intellectual property (IP) protection has become an increasingly critical concern. While current watermarking techniques have the potential to identify the code LM to protect its IP, they have limitations when facing the more practical and complex demand, i.e., offering the individual user-level tracing in the black-box setting. This work presents CLMTracing, a black-box code LM watermarking framework employing the rule-based watermarks and utility-preserving injection method for user-level model tracing. CLMTracing further incorporates a parameter selection algorithm sensitive to the robust watermark and adversarial training to enhance the robustness against watermark removal attacks. Comprehensive evaluations demonstrate CLMTracing is effective across multiple state-of-the-art (SOTA) code LMs, showing significant harmless improvements compared to existing SOTA baselines and strong robustness against various removal attacks.
Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks
Yuntai Bao | Xuhong Zhang | Tianyu Du | Xinkui Zhao | Zhengwen Feng | Hao Peng | Jianwei Yin
Findings of the Association for Computational Linguistics: ACL 2025
Yuntai Bao | Xuhong Zhang | Tianyu Du | Xinkui Zhao | Zhengwen Feng | Hao Peng | Jianwei Yin
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the “truth direction”, which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts.Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation.Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources.Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs.These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs.
2024
RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback
Yanming Liu | Xinyue Peng | Xuhong Zhang | Weihao Liu | Jianwei Yin | Jiannan Cao | Tianyu Du
Findings of the Association for Computational Linguistics: ACL 2024
Yanming Liu | Xinyue Peng | Xuhong Zhang | Weihao Liu | Jianwei Yin | Jiannan Cao | Tianyu Du
Findings of the Association for Computational Linguistics: ACL 2024
Large language models (LLMs) demonstrate exceptional performance in numerous tasks but still heavily rely on knowledge stored in their parameters. Moreover, updating this knowledge incurs high training costs. Retrieval-augmented generation (RAG) methods address this issue by integrating external knowledge. The model can answer questions it couldn’t previously by retrieving knowledge relevant to the query. This approach improves performance in certain scenarios for specific tasks. However, if irrelevant texts are retrieved, it may impair model performance. In this paper, we propose Retrieval Augmented Iterative Self-Feedback (RA-ISF), a framework that iteratively decomposes tasks and processes them in three submodules to enhance the model’s problem-solving capabilities. Experiments show that our method outperforms existing benchmarks, performing well on models like GPT3.5, Llama2, significantly enhancing factual reasoning capabilities and reducing hallucinations.
ERA-CoT: Improving Chain-of-Thought through Entity Relationship Analysis
Yanming Liu | Xinyue Peng | Tianyu Du | Jianwei Yin | Weihao Liu | Xuhong Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yanming Liu | Xinyue Peng | Tianyu Du | Jianwei Yin | Weihao Liu | Xuhong Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have achieved commendable accomplishments in various natural language processing tasks. However, LLMs still encounter significant challenges when dealing with complex scenarios involving multiple entities. These challenges arise from the presence of implicit relationships that demand multi-step reasoning. In this paper, we propose a novel approach ERA-CoT, which aids LLMs in understanding context by capturing relationships between entities and supports the reasoning of diverse tasks through Chain-of-Thoughts (CoT).Experimental results show that ERA-CoT demonstrates the superior performance of our proposed method compared to current CoT prompting methods, achieving a significant improvement of an average of 5.1% on GPT3.5 compared to previous SOTA baselines. Our analysis indicates that ERA-CoT increases the LLM’s understanding of entity relationships, significantly improves the accuracy of question answering, and enhances the reasoning ability of LLMs.
SecCoder: Towards Generalizable and Robust Secure Code Generation
Boyu Zhang | Tianyu Du | Junkai Tong | Xuhong Zhang | Kingsum Chow | Sheng Cheng | Xun Wang | Jianwei Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Boyu Zhang | Tianyu Du | Junkai Tong | Xuhong Zhang | Kingsum Chow | Sheng Cheng | Xun Wang | Jianwei Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
After large models (LMs) have gained widespread acceptance in code-related tasks, their superior generative capacity has greatly promoted the application of the code LM. Nevertheless, the security of the generated code has raised attention to its potential damage. Existing secure code generation methods have limited generalizability to unseen test cases and poor robustness against the attacked model, leading to safety failures in code generation. In this paper, we propose a generalizable and robust secure code generation method SecCoder by using in-context learning (ICL) and the safe demonstration. The dense retriever is also used to select the most helpful demonstration to maximize the improvement of the generated code’s security. Experimental results show the superior generalizability of the proposed model SecCoder compared to the current secure code generation method, achieving a significant security improvement of an average of 7.20% on unseen test cases. The results also show the better robustness of SecCoder compared to the current attacked code LM, achieving a significant security improvement of an average of 7.74%. Our analysis indicates that SecCoder enhances the security of LMs in generating code, and it is more generalizable and robust.
Search
Fix author
Co-authors
- Shouling Ji 7
- Chunyi Zhou 6
- Naen Xu 5
- Jianwei Yin 5
- Xuhong Zhang 5
- Changjiang Li 4
- Hengyu An 3
- Jinbao Li 3
- Jinghuai Zhang 3
- Kingsum Chow 2
- Zhihui Fu 2
- Ping He 2
- Qingming Li 2
- Minxi Li 2
- Yanming Liu 2
- Xinyue Peng 2
- Jun Wang 2
- Xiaogang Xu 2
- Boyu Zhang 2
- Yuntai Bao 1
- Jiannan Cao 1
- Zhi Chen 1
- Sheng Cheng 1
- Zhengwen Feng 1
- Shanqing Guo 1
- Tianxu Han 1
- Yuyuan Li 1
- Tao Lin 1
- Weihao Liu 1
- Weihao Liu 1
- Hao Peng 1
- Jiayi Sheng 1
- Junkai Tong 1
- Xun Wang 1
- Rui Yin 1
- Lei Yun 1
- Yunruo Zhang 1
- Xinkui Zhao 1