Zihe Yan

2026

Autonomous GUI agents are inherently vulnerable to Environmental Injection Attacks (EIAs). However, existing red-teaming methods face a trade-off between requiring target-specific knowledge and incurring prohibitive computational costs. More fundamentally, a key question remains: what factors determine attack success? To answer this, we first analyze two dimensions: visual appearance (e.g., position, size, color) and semantic content. We find that semantic content dominates, while visual variations have negligible impact. Leveraging this insight, we introduce EVA, a framework that evolves payloads exclusively on the semantic dimension via a discovery-deployment pipeline. Experiments demonstrate that EVA significantly outperforms baselines, achieving 59% to 85% average Attack Success Rate (ASR) while evolving benign seeds into successful attacks within 1.18 to 1.71 iterations. This rapid convergence suggests a dense semantic attack space within the model’s latent space. Whenever an input falls into this space, the agent becomes inherently vulnerable, exposing a fundamental alignment flaw in current multimodal representations.

Co-authors

Xinbei Ma 1

Zhuosheng Zhang 1

Manman Zhao 1

Venues

Findings 1
