Zhixiang Liang
2025
Fine-Grained Constraint Generation-Verification for Improved Instruction-Following
Zhixiang Liang
|
Zhenyu Hou
|
Xiao Wang
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
The ability of Large Language Models (LLMs) to follow natural language instructions is crucial. However, numerous studies have demonstrated that LLMs still struggle to follow instructions with complex constraints, limiting their application in other areas. Meanwhile, obtaining high-quality instruction-following data often requires substantial manual annotation, which is both time-consuming and labor-intensive. In this work, we present FiGV, a fine-grained constraint generation-verification strategy for synthesizing instruction-following data. FiGV employs LLM-driven processes to generate fine-grained constraints and check the legality of the synthetic instructions. Subsequently, LLMs are utilized to perform nuanced, constraint-level verification to determine whether the generated responses adhere to the synthetic instructions, with LLM-generated functions incorporated for auxiliary validation tailored to the types of constraints. Experiments on 7B to 70B models demonstrate that FiGV consistently achieves strong performance across various benchmarks designed to evaluate the instruction-following capabilities of LLMs.
2024
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
Qiusi Zhan
|
Zhixiang Liang
|
Zifan Ying
|
Daniel Kang
Findings of the Association for Computational Linguistics: ACL 2024
Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative.In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We conduct a comprehensive evaluation of 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates. Our findings raise questions about the widespread deployment of LLM Agents.