CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation

Xinbei Ma, Zhuosheng Zhang, Hai Zhao


Abstract
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments, especially for graphical user interface (GUI) automation.However, those GUI agents require comprehensive cognition including exhaustive perception and reliable action response.We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP), to systematically improve the GUI automation performance. First, CEP facilitates the GUI perception through different aspects and granularity, including screenshots and complementary detailed layouts for the visual channel and historical actions for the textual channel.Second, CAP decomposes the action prediction into sub-problems: determining the action type and then identifying the action target conditioned on the action type.With our technical design, our agent achieves state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios. Code is available at https://github.com/xbmxb/CoCo-Agent.
Anthology ID:
2024.findings-acl.539
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
9097–9110
Language:
URL:
https://aclanthology.org/2024.findings-acl.539
DOI:
10.18653/v1/2024.findings-acl.539
Bibkey:
Cite (ACL):
Xinbei Ma, Zhuosheng Zhang, and Hai Zhao. 2024. CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9097–9110, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation (Ma et al., Findings 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2024.findings-acl.539.pdf