Minglong Tang


Fixing paper assignments

  1. Please select all papers that belong to the same person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2024

pdf bib
VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-Tuning
Meng Ziyang | Yu Dai | Zezheng Gong | Shaoxiong Guo | Minglong Tang | Tongquan Wei
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Vision-Language Models (VLMs) have already been applied to the understanding of Graphical User Interfaces (GUIs) and have achieved notable results. However, existing VLMs often overly rely on internal text-based knowledge while neglecting visual inputs. This imbalance may lead models to produce answers that do not align with the visual content in GUI comprehension tasks. Such inaccuracies are termed as ‘hallucinations’ where models generate incorrect or illogical responses upon visual verification against GUI elements. These errors result in misinterpretations and diminish the model’s practical utility in applied settings. To address these issues, we introduce VGA, a fine-tuned model designed for comprehensive GUI understanding. Our model aims to balance attention image and text to enhance interpretation and reduce hallucinations. We construct a Vision Question Answering (VQA) dataset of 63.8k high-quality examples with our propose *Referent Method*, focusing on response with visual content of images. We then design a two-stage fine-tuning method to enhance both the model’s accuracy to extract information from image content and alignment with human intent. Experiments show that our approach enhances the model’s ability to extract information from images and achieves state-of-the-art results in GUI understanding tasks. https://github.com/Linziyang1999/VGA-visual-GUI-assistant