Haon Park

2026

As large language models are deployed in high-stakes enterprise applications, from healthcare to finance, ensuring adherence to organization-specific policies has become essential. Yet existing safety evaluations focus exclusively on universal harms. We present COMPASS (Company/Organization Policy Alignment Assessment), the first systematic framework for evaluating whether LLMs comply with organizational allowlist and denylist policies. We apply COMPASS to eight diverse industry scenarios, generating and validating 5,920 queries that test both routine compliance and adversarial robustness through strategically designed edge cases. Evaluating seven state-of-the-art models, we uncover a fundamental asymmetry: models reliably handle legitimate requests (>95% accuracy) but catastrophically fail at enforcing prohibitions, refusing only 13–40% of adversarial denylist violations. These results demonstrate that current LLMs lack the robustness required for policy-critical deployments, establishing COMPASS as an essential evaluation framework for organizational AI safety.

2025

pdf bib abs

Large Language Models (LLMs) are increasingly deployed as computer-use agents, autonomously performing tasks within real desktop or web environments. While this evolution greatly expands practical use cases for humans, it also creates serious security exposures. We present SUDO (Screen-based Universal Detox2tox Offense), a novel attack framework that systematically bypasses refusal-trained safeguards in commercial computer-use agents, such as Claude for Computer Use. The core mechanism, Detox2tox, transforms harmful requests (that agents initially reject) into seemingly benign requests via detoxification, secures detailed instructions from advanced vision language models (VLMs), and then reintroduces malicious content via toxification just before execution. Unlike conventional jailbreaks, SUDO iteratively refines its attacks based on a built-in refusal feedback, making it increasingly effective against robust policy filters. In extensive tests spanning 50 real-world tasks and multiple state-of-the-art VLMs, SUDO achieves a stark attack success rate of 24.41% (with no refinement), and up to 41.33% (by its iterative refinement) in Claude for Computer Use. By revealing these vulnerabilities and demonstrating the ease with which they can be exploited in real-world computing environments, this paper highlights an immediate need for robust, context-aware safeguards. WARNING: This paper includes harmful or offensive model outputs.

pdf bib abs

We introduce a novel framework for consolidating multi-turn adversarial “jailbreak” prompts into single-turn queries, significantly reducing the manual overhead required for adversarial testing of large language models (LLMs). While multi-turn human jailbreaks have been shown to yield high attack success rates (ASRs), they demand considerable human effort and time. Our proposed Multi-turn-to-Single-turn (M2S) methods—Hyphenize, Numberize, and Pythonize—systematically reformat multi-turn dialogues into structured single-turn prompts. Despite eliminating iterative back-and-forth interactions, these reformatted prompts preserve and often enhance adversarial potency: in extensive evaluations on the Multi-turn Human Jailbreak (MHJ) dataset, M2S methods yield ASRs ranging from 70.6 % to 95.9 % across various state-of-the-art LLMs. Remarkably, our single-turn prompts outperform the original multi-turn attacks by up to 17.5 % in absolute ASR, while reducing token usage by more than half on average. Further analyses reveal that embedding malicious requests in enumerated or code-like structures exploits “contextual blindness,” undermining both native guardrails and external input-output safeguards. By consolidating multi-turn conversations into efficient single-turn prompts, our M2S framework provides a powerful tool for large-scale red-teaming and exposes critical vulnerabilities in contemporary LLM defenses. All code, data, and conversion prompts are available for reproducibility and further investigations: https://github.com/Junuha/M2S_DATA

Co-authors

Joonwon Jang 1

Minsuk Kahng 1

Brigitta Jesica Kartono 1

Venues

ACL3

Fix author