WikiScenes with Descriptions: Aligning Paragraphs and Sentences with Images in Wikipedia Articles
Özge Alaçam, Ronja Utescher, Hannes Grönner, Judith Sieker, Sina Zarrieß
Abstract
Research in Language & Vision rarely uses naturally occurring multimodal documents such as Wikipedia articles, since they feature complex image-text relations and implicit image-text alignments. In this paper, we present one of the first datasets with ground-truth annotations of image-text alignments in multi-paragraph, multi-image articles. The dataset can be used to study phenomena of visual language grounding in longer documents and to assess the retrieval capabilities of language models trained on, e.g., captioning data. Our analyses show that there are systematic linguistic differences between the image captions and descriptive sentences from the article’s text, and that intra-document retrieval is a challenging task for state-of-the-art models in L&V (CLIP, VILT, MCSE).
- Anthology ID:
- 2024.starsem-1.8
- Volume:
- Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)
- Month:
- June
- Year:
- 2024
- Address:
- Mexico City, Mexico
- Editors:
- Danushka Bollegala, Vered Shwartz
- Venue:
- *SEM
- SIG:
- SIGLEX
- Publisher:
- Association for Computational Linguistics
- Pages:
- 93–105
- URL:
- https://preview.aclanthology.org/ingest_wac_2008/2024.starsem-1.8/
- DOI:
- 10.18653/v1/2024.starsem-1.8
- Cite (ACL):
- Özge Alaçam, Ronja Utescher, Hannes Grönner, Judith Sieker, and Sina Zarrieß. 2024. WikiScenes with Descriptions: Aligning Paragraphs and Sentences with Images in Wikipedia Articles. In Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024), pages 93–105, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal):
- WikiScenes with Descriptions: Aligning Paragraphs and Sentences with Images in Wikipedia Articles (Alaçam et al., *SEM 2024)
- PDF:
- https://preview.aclanthology.org/ingest_wac_2008/2024.starsem-1.8.pdf