Sai Rajeswar
2025
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Rabiul Awal | Mahsa Massoud | Aarash Feizi | Zichao Li | Suyuchen Wang | Christopher Pal | Aishwarya Agrawal | David Vazquez | Siva Reddy | Juan A. Rodriguez | Perouz Taslakian | Spandana Gella | Sai Rajeswar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models’ abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.
ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval
Ahmed Masry | Megh Thakkar | Patrice Bechard | Sathwik Tejaswi Madhusudhan | Rabiul Awal | Shambhavi Mishra | Akshay Kalkunte Suresh | Srivatsava Daruru | Enamul Hoque | Spandana Gella | Torsten Scholak | Sai Rajeswar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism better suited to multimodal document structures and visual characteristics. ColMate achieves a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.
2017
Adversarial Generation of Natural Language
Sandeep Subramanian | Sai Rajeswar | Francis Dutil | Chris Pal | Aaron Courville
Proceedings of the 2nd Workshop on Representation Learning for NLP
Generative Adversarial Networks (GANs) have attracted considerable attention from the computer vision community, yielding impressive results for image generation. Advances in the adversarial generation of natural language from noise, however, are not commensurate with the progress made in generating images, and still lag far behind likelihood-based methods. In this paper, we take a step towards generating natural language with a GAN objective alone. We introduce a simple baseline that addresses the discrete output space problem without relying on gradient estimators and show that it is able to achieve state-of-the-art results on a Chinese poem generation dataset. We present quantitative results on generating sentences from context-free and probabilistic context-free grammars, and qualitative language modeling results. A conditional version is also described that can generate sequences conditioned on sentence characteristics.
Co-authors
- Rabiul Awal 2
- Spandana Gella 2
- Christopher Pal 2
- Aishwarya Agrawal 1
- Patrice Bechard 1
- Aaron Courville 1
- Srivatsava Daruru 1
- Francis Dutil 1
- Aarash Feizi 1
- Enamul Hoque 1
- Zichao Li 1
- Sathwik Tejaswi Madhusudhan 1
- Ahmed Masry 1
- Mahsa Massoud 1
- Shambhavi Mishra 1
- Siva Reddy 1
- Juan A. Rodriguez 1
- Torsten Scholak 1
- Sandeep Subramanian 1
- Akshay Kalkunte Suresh 1
- Perouz Taslakian 1
- Megh Thakkar 1
- David Vazquez 1
- Suyuchen Wang 1