WikiScenes with Descriptions: Aligning Paragraphs and Sentences with Images in Wikipedia Articles

Özge Alaçam, Ronja Utescher, Hannes Grönner, Judith Sieker, Sina Zarrieß


Abstract
Research in Language & Vision rarely uses naturally occurring multimodal documents such as Wikipedia articles, since they feature complex image-text relations and implicit image-text alignments. In this paper, we present one of the first datasets that provides ground-truth annotations of image-text alignments in multi-paragraph, multi-image articles. The dataset can be used to study phenomena of visual language grounding in longer documents and to assess the retrieval capabilities of language models trained on, e.g., captioning data. Our analyses show that there are systematic linguistic differences between the image captions and descriptive sentences from the article's text, and that intra-document retrieval is a challenging task for state-of-the-art models in L&V (CLIP, ViLT, MCSE).
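To make the intra-document retrieval task concrete, the sketch below scores each sentence of an article against that article's images with CLIP and retrieves the best-matching image per sentence. This is a minimal illustration under assumptions, not the authors' evaluation code: the checkpoint, example sentences, and image paths are placeholders.

# Minimal sketch of intra-document sentence-to-image retrieval with CLIP.
# Hypothetical inputs; not the paper's evaluation pipeline.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Sentences from one article and its images (placeholder file paths).
sentences = [
    "The west facade is flanked by two towers.",
    "The nave was rebuilt after the fire of 1194.",
]
images = [Image.open(p) for p in ["facade.jpg", "nave.jpg", "crypt.jpg"]]

inputs = processor(text=sentences, images=images,
                   return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text: one row per sentence, one column per image;
# the argmax column is the retrieved image for that sentence.
best = outputs.logits_per_text.argmax(dim=-1)
for sent, idx in zip(sentences, best.tolist()):
    print(f"{sent!r} -> image {idx}")

Because all candidate images come from the same document, the distractors are topically related, which is what makes this setting harder than standard caption-retrieval benchmarks.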
Anthology ID:
2024.starsem-1.8
Volume:
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Danushka Bollegala, Vered Shwartz
Venue:
*SEM
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
93–105
URL:
https://aclanthology.org/2024.starsem-1.8
Cite (ACL):
Özge Alaçam, Ronja Utescher, Hannes Grönner, Judith Sieker, and Sina Zarrieß. 2024. WikiScenes with Descriptions: Aligning Paragraphs and Sentences with Images in Wikipedia Articles. In Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024), pages 93–105, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
WikiScenes with Descriptions: Aligning Paragraphs and Sentences with Images in Wikipedia Articles (Alaçam et al., *SEM 2024)
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.starsem-1.8.pdf