This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
AnkitMishra
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
Accurately grounding visual and textual elements within mobile user interfaces (UIs) remains a significant challenge for Vision-Language Models (VLMs). Visual grounding, a critical task in this domain, involves identifying the most relevant UI element or region based on a natural language query—a process that requires both precise perception and context-aware reasoning. In this work, we present - **MoUI**, a light-weight mobile UI understanding model trained on **MoIT**, an instruction-tuning dataset specifically tailored for mobile screen understanding and grounding, designed to bridge the gap between user intent and visual semantics. Complementing this dataset, we also present a human-annotated reasoning benchmark **MoIQ** that rigorously evaluates complex inference capabilities over mobile UIs. To harness these resources effectively, we propose a two-stage training approach that separately addresses perception and reasoning tasks, leading to stronger perception capabilities and improvement in reasoning abilities. Through extensive experiments, we demonstrate that our MoUI models achieve significant gains in accuracy across all perception tasks and _state-of-the-art_ results on public reasoning benchmark **ComplexQA (78%) and our MoIQ (49%)**. We will be open-sourcing our dataset, code, and models to foster further research and innovation in the field.
In this paper, we propose a method to improve the reasoning capabilities of Visual Question Answering (VQA) systems by integrating Dense Passage Retrievers (DPRs) with Vision Language Models (VLMs). While recent works focus on the application of knowledge graphs and chain-of-thought reasoning, we recognize that the complexity of graph neural networks and end-to-end training remain significant challenges. To address these issues, we introduce **R**elevance **G**uided **VQA** (**RG-VQA**), a retriever-generator pipeline that uses DPRs to efficiently extract relevant information from structured knowledge bases. Our approach ensures scalability to large graphs without significant computational overhead. Experiments on the ScienceQA dataset show that RG-VQA achieves state-of-the-art performance, surpassing human accuracy and outperforming GPT-4 by more than . This demonstrates the effectiveness of RG-VQA in boosting the reasoning capabilities of VQA systems and its potential for practical applications.
Hope Speech are positive terms that help to promote or criticise a point of view without hurting the user’s or community’s feelings. Non-Hope Speech, on the other side, includes expressions that are harsh, ridiculing, or demotivating. The goal of this article is to find the hope speech comments in a YouTube dataset. The datasets were created as part of the “LT-EDI-ACL 2022: Hope Speech Detection for Equality, Diversity, and Inclusion” shared task. The shared task dataset was proposed in Malayalam, Tamil, English, Spanish, and Kannada languages. In this paper, we worked at English-language YouTube comments. We employed several deep learning based models such as DNN (dense or fully connected neural network), CNN (Convolutional Neural Network), Bi-LSTM (Bidirectional Long Short Term Memory Network), and GRU(Gated Recurrent Unit) to identify the hopeful comments. We also used Stacked LSTM-CNN and Stacked LSTM-LSTM network to train the model. The best macro average F1-score 0.67 for development dataset was obtained using the DNN model. The macro average F1-score of 0.67 was achieved for the classification done on the test data as well.