Danyang Zhang


2025

pdf bib
MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation
Zichen Zhu | Hao Tang | Yansi Li | Dingye Liu | Hongshen Xu | Kunyao Lan | Danyang Zhang | Yixuan Jiang | Hao Zhou | Chenrun Wang | Situo Zhang | Liangtai Sun | Yixiao Wang | Yuheng Sun | Lu Chen | Kai Yu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

Existing Multimodal Large Language Model (MLLM)-based agents face significant challenges in handling complex GUI (Graphical User Interface) interactions on devices. These challenges arise from the dynamic and structured nature of GUI environments, which integrate text, images, and spatial relationships, as well as the variability in action spaces across different pages and tasks. To address these limitations, we propose MobA, a novel MLLM-based mobile assistant system. MobA introduces an adaptive planning module that incorporates a reflection mechanism for error recovery and dynamically adjusts plans to align with the real environment contexts and action module’s execution capacity. Additionally, a multifaceted memory module provides comprehensive memory support to enhance adaptability and efficiency. We also present MobBench, a dataset designed for complex mobile interactions. Experimental results on MobBench and AndroidArena demonstrate MobA’s ability to handle dynamic GUI environments and perform complex mobile tasks.

2021

pdf bib
WebSRC: A Dataset for Web-Based Structural Reading Comprehension
Xingyu Chen | Zihan Zhao | Lu Chen | JiaBao Ji | Danyang Zhang | Ao Luo | Yuxuan Xiong | Kai Yu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Web search is an essential way for humans to obtain information, but it’s still a great challenge for machines to understand the contents of web pages. In this paper, we introduce the task of web-based structural reading comprehension. Given a web page and a question about it, the task is to find an answer from the web page. This task requires a system not only to understand the semantics of texts but also the structure of the web page. Moreover, we proposed WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 400K question-answer pairs, which are collected from 6.4K web pages with corresponding HTML source code, screenshots, and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no. We evaluate various strong baselines on our dataset to show the difficulty of our task. We also investigate the usefulness of structural information and visual features. Our dataset and baselines have been publicly available.