Junjie Fang
2025
GUICourse: From General Vision Language Model to Versatile GUI Agent
Wentong Chen | Junbo Cui | Jinyi Hu | Yujia Qin | Junjie Fang | Yue Zhao | Chongyi Wang | Jun Liu | Guirong Chen | Yupeng Huo | Yuan Yao | Yankai Lin | Zhiyuan Liu | Maosong Sun
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Utilizing Graphical User Interfaces (GUIs) for human-computer interaction is essential for accessing various digital tools. Recent advancements in Vision Language Models (VLMs) reveal significant potential for developing versatile agents that assist humans in navigating GUIs. However, current VLMs face challenges with fundamental abilities such as OCR and grounding, as well as a lack of knowledge about the functionalities and control methods of GUI elements. These limitations hinder their effectiveness as practical GUI agents. To address these challenges, we introduce GUICourse, a series of datasets for training visual-based GUI agents from general VLMs. First, we enhance the OCR and grounding capabilities of VLMs using the GUIEnv dataset. Next, we enrich the GUI knowledge of VLMs using the GUIAct and GUIChat datasets. Our experiments demonstrate that even a small GUI agent (with 3.1 billion parameters) performs effectively on both single-step and multi-step GUI tasks. We further finetune our GUI agents on other GUI tasks with different action spaces (AITW and Mind2Web), and the results show that our agents outperform their baseline VLMs. Additionally, we analyze the impact of OCR and grounding capabilities through an ablation study, revealing a positive correlation with GUI navigation ability.
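The abstract describes grounding-style supervision (mapping a textual reference to an on-screen element's position) as one of the capabilities trained with GUIEnv. As a hedged illustration only, the sketch below shows what a single grounding record and its conversion to an instruction-tuning pair might look like; the field names, prompt wording, and normalized-coordinate convention are assumptions for illustration, not the actual GUIEnv schema.

```python
# Hypothetical sketch: turning a grounding-style record into an
# instruction-tuning pair for a vision-language model.
# Field names and the normalized-coordinate convention are assumptions,
# not the actual GUIEnv/GUIAct format.

from dataclasses import dataclass

@dataclass
class GroundingSample:
    image_path: str   # path to a GUI screenshot
    query: str        # natural-language reference to an element
    bbox: tuple       # (x0, y0, x1, y1), normalized to [0, 1]

def to_instruction_pair(sample: GroundingSample) -> dict:
    """Format one sample as a prompt/target pair for supervised fine-tuning."""
    prompt = (
        "<image>\n"
        f'Locate the element described by: "{sample.query}". '
        "Answer with a bounding box (x0, y0, x1, y1) normalized to [0, 1]."
    )
    target = "({:.3f}, {:.3f}, {:.3f}, {:.3f})".format(*sample.bbox)
    return {"image": sample.image_path, "prompt": prompt, "target": target}

if __name__ == "__main__":
    example = GroundingSample(
        image_path="screenshot_0001.png",
        query="the blue 'Sign in' button in the top-right corner",
        bbox=(0.84, 0.02, 0.97, 0.07),
    )
    print(to_instruction_pair(example))
```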
FocusLLM: Precise Understanding of Long Context by Dynamic Condensing
Zhenyu Li | Yike Zhang | Tengyu Pan | Yutao Sun | Zhichao Duan | Junjie Fang | Rong Han | Zixuan Wang | Jianyong Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Empowering LLMs with the ability to precisely understand long contexts is crucial for many downstream applications. However, handling long contexts with conventional transformer architectures requires substantial training and inference resources. Existing context condensing methods cannot accurately understand the full context, as a considerable amount of information is lost during condensing. To address these issues, we present **FocusLLM**, a framework designed to extend the fixed context length of any decoder-only LLM, allowing the model to focus on relevant information from very long sequences. FocusLLM first divides long text input into chunks based on the model’s original context length. It then employs a **_dynamic condensing_** process to distill crucial information from each chunk. Finally, through a novel **_parallel decoding_** mechanism, FocusLLM integrates the extracted information into its local context. FocusLLM stands out for its training efficiency and versatility: trained with an 8K input length and at much lower training cost than previous methods, it exhibits superior performance across downstream tasks and maintains strong language modeling ability on extensive long texts, even up to 400K tokens. Our code is available at https://github.com/leezythu/FocusLLM.
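The abstract outlines a three-step flow: split the long input into chunks of the model's native context length, condense each chunk, and integrate the condensed information into the local context. The sketch below is a minimal, hypothetical illustration of that control flow only; the use of mean-pooling as a stand-in for the learned dynamic condensing, the function names, and the chunk/slot sizes are assumptions, not the actual FocusLLM implementation.

```python
# Minimal, hypothetical sketch of the chunk -> condense -> integrate flow
# described above. Mean-pooling stands in for the paper's learned
# "dynamic condensing"; this is not the actual FocusLLM code.

import torch

def condense(chunk_embeddings: torch.Tensor, num_slots: int) -> torch.Tensor:
    """Placeholder condensing step: pool one chunk down to num_slots vectors."""
    # chunk_embeddings: (chunk_len, hidden); split into num_slots groups, mean-pool each.
    groups = torch.chunk(chunk_embeddings, num_slots, dim=0)
    return torch.stack([g.mean(dim=0) for g in groups])    # (num_slots, hidden)

def focus_forward(long_embeddings: torch.Tensor, local_embeddings: torch.Tensor,
                  chunk_len: int = 8192, num_slots: int = 16) -> torch.Tensor:
    """Integrate condensed long-context information into the local context."""
    # Each chunk can be condensed independently (the paper's "parallel decoding");
    # here the chunks are simply processed in a loop and concatenated.
    condensed = [condense(c, num_slots)
                 for c in torch.split(long_embeddings, chunk_len, dim=0)]
    memory = torch.cat(condensed, dim=0)                    # (num_chunks * num_slots, hidden)
    return torch.cat([memory, local_embeddings], dim=0)     # sequence fed to the decoder

if __name__ == "__main__":
    hidden = 64
    long_part = torch.randn(40_000, hidden)   # embeddings of a very long prefix
    local_part = torch.randn(512, hidden)     # embeddings of the local context
    out = focus_forward(long_part, local_part, chunk_len=8192, num_slots=16)
    print(out.shape)                          # (num_chunks * 16 + 512, 64)
```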