Srivatsava Daruru
2025
ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval
Ahmed Masry | Megh Thakkar | Patrice Bechard | Sathwik Tejaswi Madhusudhan | Rabiul Awal | Shambhavi Mishra | Akshay Kalkunte Suresh | Srivatsava Daruru | Enamul Hoque | Spandana Gella | Torsten Scholak | Sai Rajeswar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate uses a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism better suited to multimodal document structures and visual characteristics. ColMate achieves a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.
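For readers unfamiliar with late interaction scoring, the following is a minimal sketch of the generic ColBERT-style MaxSim score, in which each query token embedding is matched to its most similar document token (or image patch) embedding and the per-token maxima are summed. This is an illustration of the baseline idea only: the abstract does not specify ColMate's scoring variant, and the function names (`l2_normalize`, `maxsim_score`, `rank`) and the toy 2-dimensional embeddings are assumptions made for this example.

```python
from math import sqrt


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Generic late interaction (MaxSim) score.

    For every query token embedding, take the highest cosine similarity
    against any document token embedding, then sum those maxima.
    Assumes both inputs are non-empty lists of equal-dimension vectors.
    """
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def rank(query_tokens, docs):
    """Return document names sorted by descending MaxSim score."""
    scores = {name: maxsim_score(query_tokens, toks) for name, toks in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # covers both query tokens
    "page_b": [[1.0, 0.0], [-1.0, 0.0]],             # covers only the first
}
ranking = rank(query, docs)
```

Because the score is a sum of per-token maxima, a page that matches every query token outranks one that matches a single token strongly, which is the property that distinguishes late interaction from comparing one pooled vector per query and document.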
Co-authors
- Rabiul Awal 1
- Patrice Bechard 1
- Spandana Gella 1
- Enamul Hoque 1
- Sathwik Tejaswi Madhusudhan 1