2025
Understand the Implication: Learning to Think for Pragmatic Understanding
Settaluri Lakshmi Sravanthi | Kishan Maharaj | Sravani Gunnu | Abhijit Mishra | Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: ACL 2025
Pragmatics, the ability to infer meaning beyond literal interpretation, is crucial for social cognition and communication. While LLMs have been benchmarked for their pragmatic understanding, improving their performance remains underexplored. Existing methods rely on annotated labels but overlook the reasoning process humans naturally use to interpret implicit meaning. To bridge this gap, we introduce a novel pragmatic dataset, ImpliedMeaningPreference, that includes explicit reasoning (‘thoughts’) for both correct and incorrect interpretations. Through preference-tuning and supervised fine-tuning, we demonstrate that thought-based learning significantly enhances LLMs’ pragmatic understanding, improving accuracy by 11.12% across model families. We further present a transfer-learning study in which we evaluate thought-based training on other pragmatics tasks (presupposition, deixis) not seen during training and observe an improvement of 16.10% over label-trained models.
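The thought-based preference-tuning setup described in the abstract can be illustrated with a small data-construction sketch. This is a minimal illustration, not the authors' released code: the field names, the example utterance, and the use of (prompt, chosen, rejected) triples in the style expected by DPO-like trainers are assumptions.

```python
# Minimal sketch: turning records with correct/incorrect thoughts and interpretations
# into preference pairs for thought-based preference tuning.
# Field names and the example are hypothetical, not from the ImpliedMeaningPreference release.

from dataclasses import dataclass


@dataclass
class PragmaticExample:
    context: str                  # dialogue or utterance with an implied meaning
    correct_thought: str          # explicit reasoning for the intended interpretation
    correct_interpretation: str
    incorrect_thought: str        # plausible but wrong reasoning
    incorrect_interpretation: str


def to_preference_pair(ex: PragmaticExample) -> dict:
    """Build a (prompt, chosen, rejected) triple usable by DPO-style preference trainers."""
    prompt = (
        "Utterance: " + ex.context + "\n"
        "Think step by step about what the speaker implies, then state the implied meaning."
    )
    chosen = f"Thought: {ex.correct_thought}\nImplied meaning: {ex.correct_interpretation}"
    rejected = f"Thought: {ex.incorrect_thought}\nImplied meaning: {ex.incorrect_interpretation}"
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


if __name__ == "__main__":
    ex = PragmaticExample(
        context="A: Are you coming to the party? B: I have an exam tomorrow.",
        correct_thought="B gives a reason that conflicts with attending, so B is declining.",
        correct_interpretation="B will not come to the party.",
        incorrect_thought="B mentions the exam as small talk, unrelated to the invitation.",
        incorrect_interpretation="B will come to the party after the exam.",
    )
    print(to_preference_pair(ex))
```

For supervised fine-tuning, the same records can instead be serialized as prompt-plus-chosen-completion pairs; the preference format above is only one of the two training regimes the abstract mentions.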
From Perception to Reasoning: Enhancing Vision-Language Models for Mobile UI Understanding
Settaluri Lakshmi Sravanthi | Ankit Mishra | Debjyoti Mondal | Subhadarshi Panda | Rituraj Singh | Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: ACL 2025
Accurately grounding visual and textual elements within mobile user interfaces (UIs) remains a significant challenge for Vision-Language Models (VLMs). Visual grounding, a critical task in this domain, involves identifying the most relevant UI element or region based on a natural language query, a process that requires both precise perception and context-aware reasoning. In this work, we present **MoUI**, a lightweight mobile UI understanding model trained on **MoIT**, an instruction-tuning dataset specifically tailored for mobile screen understanding and grounding, designed to bridge the gap between user intent and visual semantics. Complementing this dataset, we also present **MoIQ**, a human-annotated reasoning benchmark that rigorously evaluates complex inference capabilities over mobile UIs. To harness these resources effectively, we propose a two-stage training approach that addresses perception and reasoning tasks separately, leading to stronger perception capabilities and improved reasoning abilities. Through extensive experiments, we demonstrate that our MoUI models achieve significant gains in accuracy across all perception tasks and _state-of-the-art_ results on the public reasoning benchmark **ComplexQA (78%)** and our **MoIQ (49%)**. We will open-source our dataset, code, and models to foster further research and innovation in the field.
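The two-stage recipe (perception first, then reasoning) can be sketched as a stage-wise split of the instruction-tuning data followed by sequential training. This is an illustrative outline only: the task names, data fields, and the `train_stage` callback are hypothetical placeholders, not the MoUI training code.

```python
# Sketch of a two-stage instruction-tuning schedule for a mobile-UI VLM:
# stage 1 covers perception-style tasks (grounding, element identification),
# stage 2 covers reasoning-style QA. Task names and train_stage() are hypothetical.

from typing import Callable, Iterable

PERCEPTION_TASKS = {"element_grounding", "region_grounding", "ui_captioning"}
REASONING_TASKS = {"complex_qa", "multi_step_reasoning"}


def split_by_stage(examples: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Route instruction-tuning examples to the perception or reasoning stage."""
    perception, reasoning = [], []
    for ex in examples:
        (perception if ex["task"] in PERCEPTION_TASKS else reasoning).append(ex)
    return perception, reasoning


def two_stage_train(examples: list[dict], train_stage: Callable[[list[dict], str], None]) -> None:
    """Train on perception data first, then continue on reasoning data."""
    perception, reasoning = split_by_stage(examples)
    train_stage(perception, "stage1_perception")   # learn to localize UI elements
    train_stage(reasoning, "stage2_reasoning")     # then learn to reason over them


if __name__ == "__main__":
    demo = [
        {"task": "element_grounding", "query": "Tap the settings icon", "target": "bbox[412, 88, 472, 148]"},
        {"task": "complex_qa", "query": "Which tab shows unread messages?", "target": "Inbox"},
    ]
    two_stage_train(demo, lambda batch, name: print(name, len(batch), "examples"))
```

The point of the split is that grounding supervision shapes the visual encoder and localization behavior before the model is asked to chain those perceptions into answers, mirroring the stronger-perception-then-reasoning progression the abstract reports.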
RG-VQA: Leveraging Retriever-Generator Pipelines for Knowledge Intensive Visual Question Answering
Settaluri Lakshmi Sravanthi | Pulkit Agarwal | Debjyoti Mondal | Rituraj Singh | Subhadarshi Panda | Ankit Mishra | Kiran Pradeep | Srihari K B | Godawari Sudhakar Rao | Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: EMNLP 2025
In this paper, we propose a method to improve the reasoning capabilities of Visual Question Answering (VQA) systems by integrating Dense Passage Retrievers (DPRs) with Vision Language Models (VLMs). While recent works focus on the application of knowledge graphs and chain-of-thought reasoning, we recognize that the complexity of graph neural networks and of end-to-end training remains a significant challenge. To address these issues, we introduce **R**elevance **G**uided **VQA** (**RG-VQA**), a retriever-generator pipeline that uses DPRs to efficiently extract relevant information from structured knowledge bases. Our approach ensures scalability to large graphs without significant computational overhead. Experiments on the ScienceQA dataset show that RG-VQA achieves state-of-the-art performance, surpassing human accuracy and outperforming GPT-4 by more than . This demonstrates the effectiveness of RG-VQA in boosting the reasoning capabilities of VQA systems and its potential for practical applications.
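A retriever-generator step of this kind can be sketched with off-the-shelf DPR encoders from Hugging Face Transformers: score knowledge passages against the question, keep the top-k, and prepend them to the generator's prompt. The checkpoints, the value of k, the toy knowledge base, and the prompt template below are assumptions for illustration, not the RG-VQA configuration.

```python
# Sketch of a retriever-generator step: score knowledge passages with DPR,
# keep the top-k, and build an augmented prompt for a vision-language generator.
# Checkpoints, k, and the prompt template are illustrative assumptions.

import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages with the highest DPR dot-product score for the question."""
    with torch.no_grad():
        q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output                 # (1, d)
        p_emb = c_enc(**c_tok(passages, return_tensors="pt",
                              padding=True, truncation=True)).pooler_output                 # (n, d)
    scores = (q_emb @ p_emb.T).squeeze(0)
    top = torch.topk(scores, k=min(k, len(passages))).indices.tolist()
    return [passages[i] for i in top]


def build_vlm_prompt(question: str, retrieved: list[str]) -> str:
    """Prepend retrieved knowledge to the question before handing it to the VLM."""
    knowledge = "\n".join(f"- {p}" for p in retrieved)
    return f"Relevant knowledge:\n{knowledge}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    kb = [
        "Photosynthesis converts light energy into chemical energy in plants.",
        "The mitochondrion is the site of cellular respiration.",
        "Igneous rocks form from cooled magma or lava.",
    ]
    q = "Which organelle produces most of the cell's ATP?"
    print(build_vlm_prompt(q, retrieve(q, kb)))
```

Because retrieval is a dense dot-product lookup rather than message passing over a graph, the knowledge base can grow without the training-time cost of a graph neural network, which is the scalability argument the abstract makes.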