Junyuan Zhang
2026
JARVIS or Ultron? A Survey on the Safety and Security Threats of Computer-Using Agents
Ada Chen | Yongjiang Wu | Junyuan Zhang | Jingyu Xiao | Shu Yang | Jen-tse Huang | Kun Wang | Wenxuan Wang | Shuai Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, together with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA in a way that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for investigating unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.
2025
Stop Looking for “Important Tokens” in Multimodal Language Models: Duplication Matters More
Zichen Wen | Yifeng Gao | Shaobo Wang | Junyuan Zhang | Qintong Zhang | Weijia Li | Conghui He | Linfeng Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Vision tokens in multimodal large language models often incur huge computational overhead because of their excessive length compared with the linguistic modality. Many recent methods aim to solve this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator for deciding whether a token should be pruned. Surprisingly, it usually results in performance inferior to random token pruning and leads to incompatibility with efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance, leading to 1.99× and 2.99× speed-ups in total time and in the prefilling stage, respectively, with good compatibility with efficient attention operators.
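The pivot-then-retain idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how pivots are chosen or how duplication is measured, so the code below assumes pivots are the tokens with the largest L2 norm and that duplication is the maximum cosine similarity to any pivot. The function names `cosine` and `prune_by_duplication` and the parameters `num_pivots` and `num_keep` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def prune_by_duplication(tokens, num_pivots, num_keep):
    """Return the sorted indices of the tokens to retain.

    tokens: list of feature vectors (lists of floats), one per vision token.
    num_pivots: how many pivot tokens to select (always retained).
    num_keep: total number of tokens to retain, pivots included.
    """
    n = len(tokens)
    if not 0 < num_pivots <= num_keep <= n:
        raise ValueError("need 0 < num_pivots <= num_keep <= len(tokens)")

    # Assumption: pick pivots as the tokens with the largest L2 norm.
    norms = [math.sqrt(sum(a * a for a in t)) for t in tokens]
    by_norm = sorted(range(n), key=lambda i: -norms[i])
    pivots = by_norm[:num_pivots]
    pivot_set = set(pivots)

    # Assumption: duplication = highest cosine similarity to any pivot.
    rest = [i for i in range(n) if i not in pivot_set]
    duplication = {
        i: max(cosine(tokens[i], tokens[p]) for p in pivots) for i in rest
    }

    # Retain the non-pivot tokens that duplicate the pivots the least.
    rest.sort(key=lambda i: duplication[i])
    kept = pivots + rest[: num_keep - num_pivots]
    return sorted(kept)


if __name__ == "__main__":
    toy_tokens = [[3.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(prune_by_duplication(toy_tokens, num_pivots=1, num_keep=2))  # [0, 2]
```

In the toy input, token 0 has the largest norm and becomes the pivot; token 1 points in the same direction and is dropped as a duplicate, while the orthogonal token 2 is retained. Because the selection depends only on token features and not on attention scores, a scheme of this kind does not require access to the attention map, which is consistent with the compatibility with efficient attention operators that the abstract reports.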