Israfel Salazar
2026
Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs
Israfel Salazar | Desmond Elliott | Yova Kementchedjhieva
Findings of the Association for Computational Linguistics: ACL 2026
Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While these capabilities are often assumed to be closely related, the conditions under which they reinforce each other remain unclear. In this paper, we empirically analyze when compositional reasoning and long-caption understanding transfer across tasks, and when this relationship fails. Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between the two capabilities. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. We further show that architectural choices aimed at preserving general alignment, such as frozen positional embeddings, can inadvertently limit compositional learning. Our analysis provides actionable guidelines for data selection and model design to improve VLM generalization.
2025
SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
Xiaofu Chen | Israfel Salazar | Yova Kementchedjhieva
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs; today, despite advances in hardware, they remain unpopular due to low correlation with human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation with human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development. Our code can be found at https://github.com/mbzuai-nlp/SPECS.
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation
Emilio Villa-Cueva | Sholpan Bolatzhanova | Diana Turmakhan | Kareem Elzeky | Henok Biadglign Ademtew | Alham Fikri Aji | Vladimir Araujo | Israel Abebe Azime | Jinheon Baek | Frederico Belcavello | Fermin Cristobal | Jan Christian Blaise Cruz | Mary Dabre | Raj Dabre | Toqeer Ehsan | Naome A Etori | Fauzan Farooqui | Jiahui Geng | Guido Ivetta | Thanmay Jayakumar | Soyeong Jeong | Zheng Wei Lim | Aishik Mandal | Sofía Martinelli | Mihail Minkov Mihaylov | Daniil Orel | Aniket Pramanick | Sukannya Purkayastha | Israfel Salazar | Haiyue Song | Tiago Timponi Torrent | Debela Desalegn Yadeta | Injy Hamed | Atnafu Lambebo Tonja | Thamar Solorio
Findings of the Association for Computational Linguistics: EMNLP 2025
Translating cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender marking. By releasing CaMMT, our objective is to support broader efforts to build and evaluate multimodal translation systems that are better aligned with cultural nuance and regional variations.
Co-authors
- Yova Kementchedjhieva 2
- Henok Biadglign Ademtew 1
- Alham Fikri Aji 1
- Vladimir Araujo 1
- Israel Abebe Azime 1
- Jinheon Baek 1
- Frederico Belcavello 1
- Sholpan Bolatzhanova 1
- Xiaofu Chen 1
- Fermin Cristobal 1
- Jan Christian Blaise Cruz 1
- Mary Dabre 1
- Raj Dabre 1
- Toqeer Ehsan 1
- Desmond Elliott 1
- Kareem Elzeky 1
- Naome A. Etori 1
- Fauzan Farooqui 1
- Jiahui Geng 1
- Injy Hamed 1
- Guido Ivetta 1
- Thanmay Jayakumar 1
- Soyeong Jeong 1
- Zheng Wei Lim 1
- Aishik Mandal 1
- Sofía Martinelli 1
- Mihail Minkov Mihaylov 1
- Daniil Orel 1
- Aniket Pramanick 1
- Sukannya Purkayastha 1
- Thamar Solorio 1
- Haiyue Song 1
- Atnafu Lambebo Tonja 1
- Tiago Timponi Torrent 1
- Diana Turmakhan 1
- Emilio Villa-Cueva 1
- Debela Desalegn Yadeta 1