Kunwar Yashraj Singh
2025
R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding
Joonhyung Park | Peng Tang | Sagnik Das | Srikar Appalaraju | Kunwar Yashraj Singh | R. Manmatha | Shabnam Ghadar
Findings of the Association for Computational Linguistics: ACL 2025
Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring them to process substantial irrelevant information that compromises their accuracy. In addition, these approaches typically employ basic cross-entropy loss for learning grounding objectives, which fails to effectively capture grounding quality compared to established object detection metrics like Intersection-over-Union (IoU). To address these issues, we introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization. We also propose an IoU-aware objective function that facilitates model convergence toward high IoU predictions. Our approach bridges the gap between VLMs and conventional object detection techniques, improving the state-of-the-art grounding accuracy by 13% across diverse GUI platforms on the GUI grounding benchmarks ScreenSpot and AgentStudio. In addition, our R-VLM approach shows 3.2-9.7% absolute accuracy improvements in GUI navigation tasks on the AITW and Mind2Web benchmarks.
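The abstract does not spell out how the IoU-aware objective is formulated, so the following is only a minimal illustrative sketch of the general idea: combine the usual token cross-entropy with a term that rewards high IoU between the predicted and ground-truth boxes. The function names, the `(x1, y1, x2, y2)` box convention, and the additive weighting `lam` are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch only; not R-VLM's actual loss definition.
import torch

def box_iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU for batches of boxes given as (x1, y1, x2, y2)."""
    x1 = torch.max(pred[:, 0], gt[:, 0])
    y1 = torch.max(pred[:, 1], gt[:, 1])
    x2 = torch.min(pred[:, 2], gt[:, 2])
    y2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_pred + area_gt - inter + 1e-6)

def iou_aware_loss(ce_loss: torch.Tensor,
                   pred_boxes: torch.Tensor,
                   gt_boxes: torch.Tensor,
                   lam: float = 1.0) -> torch.Tensor:
    """Hypothetical combination: cross-entropy plus a (1 - IoU) penalty,
    so training is pushed toward high-IoU predictions rather than only
    token-level agreement."""
    return ce_loss + lam * (1.0 - box_iou(pred_boxes, gt_boxes)).mean()
```

In this reading, the cross-entropy term still supervises the grounding tokens, while the IoU term directly encodes the detection-style quality measure the abstract says plain cross-entropy fails to capture.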
On the Analysis and Distillation of Emergent Outlier Properties in Pre-trained Language Models
Tianyang Zhao | Kunwar Yashraj Singh | Srikar Appalaraju | Peng Tang | Ying Nian Wu | Li Erran Li
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
A small subset of dimensions within language Transformers’ representation spaces emerges as “outliers” during pretraining, encoding critical knowledge sparsely. We extend previous findings on emergent outliers to Encoder-Decoder Transformers and instruction-finetuned models, and tackle the problem of distilling a student Transformer from a larger teacher Transformer. Knowledge distillation reduces model size and cost by transferring knowledge from a larger teacher to a smaller student, necessitating a trade-off among representation dimensions. We show that emergent outlier dimensions contribute significantly more to zero-shot performance than non-outlier dimensions. Based on this, we propose the Emergent Outlier Focused Distillation (EOFD) method, which prioritizes critical outlier dimensions in distillation using a weighted MSE loss. We empirically demonstrate that EOFD outperforms state-of-the-art distillation methods and generalizes well across Encoder-only BERT, Decoder-only GPT-2, and Encoder-Decoder T5 architectures.
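The abstract describes EOFD only as a weighted MSE loss that prioritizes outlier dimensions; the sketch below shows one plausible instantiation. Both the magnitude-based rule for flagging outlier dimensions and the `outlier_weight` factor are assumptions for illustration (the paper's actual selection criterion and weights may differ), and matched teacher/student hidden sizes (or an already-applied projection) are assumed.

```python
# Illustrative sketch only; not the paper's exact EOFD implementation.
import torch

def outlier_mask(teacher_hidden: torch.Tensor, k: float = 6.0) -> torch.Tensor:
    """Flag dimensions whose mean absolute activation is more than k standard
    deviations above the average dimension magnitude (assumed heuristic)."""
    mag = teacher_hidden.abs().mean(dim=(0, 1))   # per-dimension magnitude, shape (dim,)
    return mag > mag.mean() + k * mag.std()

def eofd_loss(student_hidden: torch.Tensor,
              teacher_hidden: torch.Tensor,
              outlier_weight: float = 10.0) -> torch.Tensor:
    """Weighted MSE over hidden states of shape (batch, seq, dim):
    outlier dimensions contribute more to the distillation objective."""
    mask = outlier_mask(teacher_hidden)
    weights = torch.where(mask,
                          torch.tensor(outlier_weight),
                          torch.tensor(1.0))
    return (weights * (student_hidden - teacher_hidden) ** 2).mean()
```

The design intent matches the abstract's claim: since outlier dimensions carry a disproportionate share of zero-shot performance, up-weighting their reconstruction error steers the student toward preserving exactly those dimensions.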