Vishnu Prabhakaran
2026
MTIVE: Multi-Task Image Verification Engine Using Vision-Language Models for E-commerce
Yu-Tong Cao | Vishnu Prabhakaran | Arunita Das | Purav Aggarwal | Anoop Saladi
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Vision-language models show promise for e-commerce automation but struggle with noisy real-world images and multi-task requirements. We introduce MTIVE, a curriculum learning framework that progressively adapts base models through three stages: continued pre-training on large-scale e-commerce datasets with contrastive learning and diverse dialogue templates, instruction tuning on synthetic data, and modular task-specific expert training. Our architecture uses frozen base weights with stacked LoRA adapters—shared modules for domain knowledge and lightweight task-specific experts—enabling continual learning without catastrophic forgetting. MTIVE outperforms open-source and proprietary baselines in both standard and continual learning settings.
2025
VIT-Pro: Visual Instruction Tuning for Product Images
Vishnu Prabhakaran | Purav Aggarwal | Vishruit Kulshreshtha | Arunita Das | Sahini Venkata Sitaram Sruti | Anoop Saladi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
General vision-language models (VLMs) trained on web data struggle to understand and converse about real-world e-commerce product images. We propose a cost-efficient approach for collecting training data to train a generative VLM for e-commerce product images. The key idea is to leverage large-scale, loosely-coupled image-text pairs from e-commerce stores, use a pretrained LLM to generate multimodal instruction-following data, and fine-tune a general vision-language model using LoRA. Our instruction-finetuned model, VIT-Pro, can understand and respond to queries about product images, covering diverse concepts and tasks. VIT-Pro outperforms several general-purpose VLMs on multiple vision tasks in the e-commerce domain.
VADE: Visual Attention Guided Hallucination Detection and Elimination
Vishnu Prabhakaran | Purav Aggarwal | Vinay Kumar Verma | Gokul Swamy | Anoop Saladi
Findings of the Association for Computational Linguistics: ACL 2025
Vision Language Models (VLMs) have achieved significant advancements in complex visual understanding tasks. However, VLMs are prone to hallucinations, generating outputs that lack alignment with visual content. This paper addresses hallucination detection in VLMs by leveraging the visual grounding information encoded in transformer attention maps. We identify three primary challenges in this approach: the elective nature of visual grounding for certain tokens, the high-dimensional and noisy nature of attention maps, and the dynamic sequence length of attention on previous tokens. To address these, we propose VADE, a novel sequence modelling approach to effectively learn complex sequential patterns from high-dimensional and noisy attention maps for fine-grained hallucination detection and mitigation. VADE achieves an average PR-AUC of 80% in hallucination detection on M-HalDetect across four different model architectures and a 5% improvement in hallucination mitigation on MSCOCO.