Shi Chen


2026

Automating Graphical User Interface (GUI) operations with Multimodal Large Language Models (MLLMs) is promising but remains bottlenecked in real-world long-horizon settings. Key challenges include ensuring precise grounding across diverse interfaces and handling irreversible errors in extended workflows. Current methods often struggle to distinguish targets in low Signal-to-Noise Ratio (SNR) environments and lack sufficient pre-execution verification to prevent error accumulation. To address this, we propose the Memory-augmented Debate System (MaDS). Specifically, MaDS combines: (1) a Dual-Layer Memory Module that integrates universal interaction priors with scenario-specific operational experience to mitigate grounding hallucinations; and (2) Multi-Round Debate that performs pre-execution verification, while transforming execution failures into retrievable Negative Warnings to reduce repeated errors. Additionally, we introduce MaDS-Benchmark, a benchmark for long-horizon mobile GUI tasks with process-oriented evaluation. Experiments show that MaDS achieves a 90.23% Task Success Rate on MaDS-Benchmark and strong performance on public benchmarks including AITW, AITZ, CAGUI, and GUIOdyssey.

2021

This paper describes our work on the task “Offensive Language Identification in Dravidian Languages-EACL 2021”. The task requires classifying Dravidian-language text collected from social media into categories such as Not-Offensive, Off-Untargeted, and Off-Target-Individual. The dataset consists of annotated code-mixed text posted by users on YouTube, rather than monolingual text from textbooks. Given the code-mixed nature of the dataset, we use multilingual BERT and TextCNN for semantic extraction and text classification. We present our experiments and an analysis of the results.
This paper describes our work on the task “Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI 2021-EACL 2021”. Datasets in three languages were provided, and we chose the English dataset. The objective is to classify the given speech as ‘Hope speech’, ‘Not Hope speech’, or ‘Not in intended language’. Our method uses a fine-tuned ALBERT model with K-fold cross-validation. Our final F1 score was 0.93, tying for first place in the task rankings. In future work, we will continue to refine our methods to improve on these results.