2025
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
Haoyang Li | Huan Gao | Zhiyuan Zhao | Zhiyu Lin | Junyu Gao | Xuelong Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The widespread adoption of Large Language Models (LLMs) has heightened concerns about their security, particularly their vulnerability to jailbreak attacks that leverage crafted prompts to generate malicious outputs. While prior research has been conducted on general security capabilities of LLMs, their specific susceptibility to jailbreak attacks in code generation remains largely unexplored. To fill this gap, we propose MalwareBench, a benchmark dataset containing 3,520 jailbreaking prompts for malicious code-generation, designed to evaluate LLM robustness against such threats. MalwareBench is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories. Experiments show that mainstream LLMs exhibit limited ability to reject malicious code-generation requirements, and the combination of multiple jailbreak methods further reduces the model’s security capabilities: specifically, the average rejection rate for malicious content is 60.93%, dropping to 39.92% when combined with jailbreak attack algorithms. Our work highlights that the code security capabilities of LLMs still pose significant challenges.
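The counts in this abstract can be cross-checked with a little arithmetic. The sketch below assumes that each of the 320 requirements is paired once with each of the 11 jailbreak methods, which the abstract implies but does not state outright; the function names are illustrative and not taken from the paper.

```python
# Figures quoted in the abstract.
NUM_REQUIREMENTS = 320        # manually crafted malicious code-generation requirements
NUM_JAILBREAK_METHODS = 11    # jailbreak methods covered
REJECTION_PLAIN = 60.93       # average rejection rate (%), no jailbreak
REJECTION_JAILBREAK = 39.92   # average rejection rate (%), with jailbreak algorithms


def total_prompts(num_requirements: int, num_methods: int) -> int:
    """Prompt count if every requirement is paired once with every method (assumption)."""
    return num_requirements * num_methods


def relative_drop(before: float, after: float) -> float:
    """Fraction of the original rate that is lost."""
    return (before - after) / before


print(total_prompts(NUM_REQUIREMENTS, NUM_JAILBREAK_METHODS))  # 3520
print(round(REJECTION_PLAIN - REJECTION_JAILBREAK, 2))         # 21.01 percentage points
print(relative_drop(REJECTION_PLAIN, REJECTION_JAILBREAK))     # roughly one third
```

Under that pairing assumption the product matches the 3,520 prompts reported, and the drop from 60.93% to 39.92% is about 21 percentage points.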
Logic-Regularized Verifier Elicits Reasoning from LLMs
Xinyu Wang | Changzhi Sun | Lian Cheng | Yuanbin Wu | Dell Zhang | Xiaoling Wang | Xuelong Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Verifiers are crucial components for enhancing modern LLMs’ reasoning capability. Typical verifiers require resource-intensive supervised dataset construction, which is costly and faces limitations in data diversity. In this paper, we propose LOVER, an unsupervised verifier regularized by logical rules. LOVER treats the verifier as a binary latent variable, utilizing internal activations and enforcing three logical constraints on multiple reasoning paths: negation consistency, intra-group consistency, and inter-group consistency (grouped by the final answer). By incorporating logical rules as priors, LOVER can leverage unlabeled examples and is directly compatible with any off-the-shelf LLM. Experiments on 10 datasets demonstrate that LOVER significantly outperforms unsupervised baselines, achieving performance comparable to the supervised verifier (reaching 95% of its performance on average).
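To make the idea of logical constraints on verifier scores concrete, here is a minimal sketch of two penalties. It is one plausible reading of the abstract, not LOVER's actual objective: the squared-error form, the function names, and the interpretation of intra-group consistency as "paths with the same final answer should receive similar scores" are all assumptions. Inter-group consistency is left out because the abstract does not say what form it takes.

```python
from typing import Sequence


def negation_consistency_penalty(p_statement: float, p_negation: float) -> float:
    """Penalize scores for a statement and its negation that do not sum to 1.

    Assumed form: (p + p_neg - 1)^2, zero when the two scores are complementary.
    """
    return (p_statement + p_negation - 1.0) ** 2


def intra_group_penalty(scores: Sequence[float]) -> float:
    """Variance of verifier scores within one answer group.

    Assumed reading: reasoning paths that reach the same final answer
    should be scored similarly, so lower variance is better.
    """
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


# Hypothetical verifier scores, for illustration only.
print(negation_consistency_penalty(0.8, 0.2))   # complementary scores, no penalty
print(negation_consistency_penalty(0.9, 0.9))   # both "true", penalized
print(intra_group_penalty([0.7, 0.7, 0.7]))     # perfectly consistent group
```

Penalties of this kind need no labels, which is what lets a verifier be trained on unlabeled reasoning paths.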
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration
Yang Zhang | Shixin Yang | Chenjia Bai | Fei Wu | Xiu Li | Zhen Wang | Xuelong Li
Findings of the Association for Computational Linguistics: ACL 2025
Grounding the reasoning ability of large language models (LLMs) for embodied tasks is challenging due to the complexity of the physical world. In particular, LLM planning for multi-agent collaboration requires communication among agents or credit assignment as feedback to re-adjust the proposed plans and achieve effective coordination. However, existing methods that rely heavily on physical verification or self-reflection suffer from excessive and inefficient querying of LLMs. In this paper, we propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. Specifically, we perform critic regression to learn a sequential advantage function from LLM-planned data, and then treat the LLM planner as an optimizer to generate actions that maximize the advantage function. This endows the LLM with the foresight to discern whether an action contributes to accomplishing the final task. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate and also significantly decreases the interaction steps of agents and query rounds of LLMs, demonstrating its high efficiency for grounding LLMs. More results are given at https://read-llm.github.io/.
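The two ingredients named in this abstract, advantage-weighted regression and advantage-guided action selection, can be illustrated with a toy sketch. The exponential weighting exp(A / beta) is the standard single-agent advantage-weighted regression form; everything else here (the candidate actions, the advantage table, the acceptance threshold, and the function names) is hypothetical and is not taken from the ReAd implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple


def awr_weights(advantages: Sequence[float], beta: float = 1.0) -> List[float]:
    """Normalized exp(A / beta) weights, computed stably by subtracting the max."""
    top = max(advantages)
    exps = [math.exp((a - top) / beta) for a in advantages]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(
    candidates: Sequence[str],
    advantage_fn: Callable[[str], float],
    threshold: float = 0.0,
) -> Tuple[str, bool]:
    """Pick the highest-advantage candidate and report whether it clears the threshold.

    A False flag stands for "ask the planner to refine the plan" (assumed behavior).
    """
    best = max(candidates, key=advantage_fn)
    return best, advantage_fn(best) >= threshold


# Hypothetical advantage estimates for three candidate actions.
toy_advantages = {"pick_onion": 0.6, "wait": -0.2, "move_left": 0.1}
action, accepted = select_action(list(toy_advantages), toy_advantages.get)
print(action, accepted)
print(awr_weights(list(toy_advantages.values()), beta=0.5))
```

In this sketch, scoring candidates with a learned advantage estimate takes the place of executing them, which is one way to read the reported reduction in interaction steps and LLM query rounds.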
INT: Establishing Information Transfer for Multilingual Intent Detection and Slot Filling
Di Wu | Liting Jiang | Bohui Mao | Hongyan Xie | Haoxiang Su | Zhongjiang He | Ruiyu Fang | Shuangyong Song | Hao Huang | Xuelong Li
Findings of the Association for Computational Linguistics: ACL 2025
Multilingual spoken language understanding (SLU) involves intent detection (ID) and slot filling (SF) across multiple languages. The inherent linguistic diversity presents significant challenges in achieving performance comparable to traditional SLU. Recent studies have attempted to improve multilingual SLU performance by sharing multilingual encoders. However, these approaches have not directly established information flow between languages. To address this, we first demonstrate the feasibility of such information transfer and pinpoint the key challenges: prediction error mitigation and multilingual slot alignment. We then propose the INformation Transfer network (INT) to tackle these challenges. The gate unit in INT controls the information flow between languages, reducing the adverse impact of prediction errors on both ID and SF. Additionally, we reformulate SF as a span prediction problem and introduce a slot-matching attention mechanism to achieve slot alignment across languages. Experimental results on the MASSIVE and MASSIVE-UG datasets show that our model outperforms all baselines in overall accuracy across all languages, and demonstrates robust performance when different languages are used as the source.
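A gate that controls information flow between languages can be pictured as follows. This is a generic gating sketch under stated assumptions, not the INT architecture: the element-wise sigmoid gate, the additive combination, and all names and values are hypothetical, since the abstract does not specify how the gate unit is parameterized.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_transfer(
    target: Sequence[float],
    source: Sequence[float],
    gate_logits: Sequence[float],
) -> List[float]:
    """Add source-language features to target-language features, scaled by a gate.

    A gate near 0 blocks the source signal (useful when the source prediction
    is likely wrong); a gate near 1 lets it through almost unchanged.
    """
    return [t + sigmoid(g) * s for t, s, g in zip(target, source, gate_logits)]


# Hypothetical 3-dimensional hidden states.
target_state = [0.5, -1.0, 2.0]
source_state = [1.0, 1.0, 1.0]
print(gated_transfer(target_state, source_state, [-50.0, 0.0, 50.0]))
```

With these toy logits, the first dimension ignores the source, the second takes half of it, and the third takes nearly all of it.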
WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code
Zhiyu Lin | Zhengda Zhou | Zhiyuan Zhao | Tianrui Wan | Yilun Ma | Junyu Gao | Xuelong Li
Findings of the Association for Computational Linguistics: ACL 2025
With the rapid advancement of Generative AI technology, Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires a confluence of multidimensional sub-capabilities to address the challenges of various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding the enhancement of development efficiency. However, existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. In this work, we draw inspiration from the principles of software engineering and propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K high-quality question-answer pairs derived from over 0.7K real-world websites. An extensive evaluation of 29 mainstream MLLMs uncovers the skill characteristics and various weaknesses that models exhibit during the development process.
2024
Dual Prompt Tuning based Contrastive Learning for Hierarchical Text Classification
Sishi Xiong | Yu Zhao | Jie Zhang | Li Mengxiang | Zhongjiang He | Xuelong Li | Shuangyong Song
Findings of the Association for Computational Linguistics: ACL 2024
Hierarchical text classification aims at categorizing texts into a multi-tiered tree-structured hierarchy of labels. Existing methods focus on capturing hierarchy-aware text features by exploiting explicit parent-child relationships, while interactions between peer labels are rarely taken into account, resulting in severe label confusion within each layer. In this work, we propose a novel Dual Prompt Tuning (DPT) method, which emphasizes discriminating among peer labels by performing contrastive learning on each hierarchical layer. We design an innovative hand-crafted prompt containing slots for both positive and negative label predictions to cooperate with contrastive learning. In addition, we introduce a label hierarchy self-sensing auxiliary task to ensure cross-layer label consistency. Extensive experiments demonstrate that DPT achieves significant improvements and outperforms the current state-of-the-art methods on the BGC and RCV1-V2 benchmark datasets.
2022
Search to Pass Messages for Temporal Knowledge Graph Completion
Zhen Wang | Haotong Du | Quanming Yao | Xuelong Li
Findings of the Association for Computational Linguistics: EMNLP 2022
Completing missing facts is a fundamental task for temporal knowledge graphs (TKGs). Recently, graph neural network (GNN) based methods, which can simultaneously explore topological and temporal information, have become the state-of-the-art (SOTA) for completing TKGs. However, these studies are based on hand-designed architectures and fail to explore the diverse topological and temporal properties of TKGs. To address this issue, we propose to use neural architecture search (NAS) to design data-specific message passing architectures for TKG completion. In particular, we develop a generalized framework to explore topological and temporal information in TKGs. Based on this framework, we design an expressive search space to fully capture various properties of different TKGs. Meanwhile, we adopt a search algorithm that trains a supernet structure by sampling a single path, enabling efficient search at lower cost. We further conduct extensive experiments on three benchmark datasets. The results show that the architectures searched by our method achieve SOTA performance. Besides, the searched models can also implicitly reveal diverse properties in different TKGs. Our code is released at https://github.com/striderdu/SPA.
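Single-path sampling from a supernet can be sketched in a few lines: at each training step, one candidate operation is drawn per searchable slot, and only that path is trained. The search space below is a made-up placeholder (the slot and operation names are hypothetical and do not come from the SPA repository), and uniform sampling is an assumption about the sampling distribution.

```python
import random
from typing import Dict, List

# Hypothetical search space: each slot lists its candidate operations.
SEARCH_SPACE: Dict[str, List[str]] = {
    "layer_0_aggregator": ["sum", "mean", "max"],
    "layer_0_temporal": ["gru", "attention", "identity"],
    "layer_1_aggregator": ["sum", "mean", "max"],
    "layer_1_temporal": ["gru", "attention", "identity"],
}


def sample_single_path(
    search_space: Dict[str, List[str]], rng: random.Random
) -> Dict[str, str]:
    """Draw one operation per slot uniformly at random (assumed distribution)."""
    return {slot: rng.choice(ops) for slot, ops in search_space.items()}


rng = random.Random(0)  # seeded so that runs are repeatable
for step in range(3):
    path = sample_single_path(SEARCH_SPACE, rng)
    # In a real supernet, only the weights on this path would be updated here.
    print(step, path)
```

Because each step touches a single path rather than every candidate operation, the per-step cost stays close to that of training one ordinary model, which is the efficiency argument the abstract makes.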