Abstract
Multi-modal texts are abundant and diverse in structure, yet Language & Vision (L&V) research on such naturally occurring texts has mostly focused on genres that are comparatively light on text, such as tweets. In this paper, we discuss the challenges and potential benefits of an L&V framework that explicitly models referential relations, taking Wikipedia articles about buildings as an example. We briefly survey existing related tasks in L&V and propose multi-modal information extraction as a general direction for future research.
- Anthology ID: 2021.lantern-1.5
- Volume: Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
- Month: April
- Year: 2021
- Address: Kyiv, Ukraine
- Editors: Marius Mosbach, Michael A. Hedderich, Sandro Pezzelle, Aditya Mogadala, Dietrich Klakow, Marie-Francine Moens, Zeynep Akata
- Venue: LANTERN
- Publisher: Association for Computational Linguistics
- Pages: 53–60
- URL: https://aclanthology.org/2021.lantern-1.5
- Cite (ACL): Ronja Utescher and Sina Zarrieß. 2021. What Did This Castle Look like before? Exploring Referential Relations in Naturally Occurring Multimodal Texts. In Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN), pages 53–60, Kyiv, Ukraine. Association for Computational Linguistics.
- Cite (Informal): What Did This Castle Look like before? Exploring Referential Relations in Naturally Occurring Multimodal Texts (Utescher & Zarrieß, LANTERN 2021)
- PDF: https://preview.aclanthology.org/naacl24-info/2021.lantern-1.5.pdf