Rabiul Awal


2025

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Rabiul Awal | Mahsa Massoud | Aarash Feizi | Zichao Li | Suyuchen Wang | Christopher Pal | Aishwarya Agrawal | David Vazquez | Siva Reddy | Juan A. Rodriguez | Perouz Taslakian | Spandana Gella | Sai Rajeswar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models’ abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, with editing code while preserving functionality, and with generating code from design mockups that preserves visual hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.

ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval
Ahmed Masry | Megh Thakkar | Patrice Bechard | Sathwik Tejaswi Madhusudhan | Rabiul Awal | Shambhavi Mishra | Akshay Kalkunte Suresh | Srivatsava Daruru | Enamul Hoque | Spandana Gella | Torsten Scholak | Sai Rajeswar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism more relevant to multimodal document structures and visual characteristics. ColMate achieves a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.
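The abstract does not spell out the late interaction scoring mechanism; as a rough illustration only, the minimal sketch below shows a generic ColBERT-style MaxSim score between per-token query embeddings and per-patch document embeddings. The function name, tensor shapes, and normalization choice are assumptions for illustration, not ColMate's actual implementation.

```python
# Hypothetical sketch of a generic late-interaction (MaxSim) score,
# in the spirit of the "late interaction scoring mechanism" named above;
# NOT ColMate's actual implementation.
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb: torch.Tensor,
                           doc_emb: torch.Tensor) -> torch.Tensor:
    """Compute a MaxSim-style relevance score.

    query_emb: (num_query_tokens, dim) -- one embedding per query token
    doc_emb:   (num_doc_patches, dim)  -- one embedding per document patch/token
    """
    # Normalize so dot products are cosine similarities.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    # Similarity of every query token to every document patch.
    sim = q @ d.T  # (num_query_tokens, num_doc_patches)
    # Each query token keeps only its best-matching patch; sum over tokens.
    return sim.max(dim=-1).values.sum()
```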

2024

Benchmarking Vision Language Models for Cultural Understanding
Shravan Nayak | Kanishk Jain | Rabiul Awal | Siva Reddy | Sjoerd Van Steenkiste | Lisa Anne Hendricks | Karolina Stanczak | Aishwarya Agrawal
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has been typically assessed on general scene understanding (recognizing objects, attributes, and actions) rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLMs’ geo-diverse cultural understanding. We curate a diverse collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparities in their level of cultural understanding across regions, with strong cultural understanding for North America but significantly weaker capabilities for Africa. We also observe disparities in performance across cultural facets, with clothing, rituals, and traditions seeing higher performance than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.