Abstract
In this paper, we study multimodal coreference resolution, specifically where a longer descriptive text, i.e., a narration, is paired with an image. This poses significant challenges due to the need for fine-grained image-text alignment, the inherent ambiguity of narrative language, and the unavailability of large annotated training sets. To tackle these challenges, we present a data-efficient semi-supervised approach that utilizes image-narration pairs for coreference resolution and narrative grounding in a multimodal context. Our approach incorporates losses for both labeled and unlabeled data within a cross-modal framework. Our evaluation shows that the proposed approach outperforms strong baselines both quantitatively and qualitatively on the tasks of coreference resolution and narrative grounding.
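The abstract mentions combining losses on labeled and unlabeled image-narration pairs, but does not specify the loss terms. As a rough, hypothetical sketch only, a semi-supervised objective of that general shape could look like the following PyTorch snippet; `model`, the pseudo-labeling scheme, the 0.9 confidence threshold, and the weight `lam` are all illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, labeled_batch, unlabeled_batch, lam=0.5):
    """Hypothetical combined objective: a supervised term on annotated
    image-narration pairs plus an unsupervised term on unannotated pairs.
    All names and design choices here are illustrative assumptions."""
    # Supervised term: cross-entropy over annotated coreference links.
    images, texts, labels = labeled_batch
    sup_loss = F.cross_entropy(model(images, texts), labels)

    # Unsupervised term: confidence-thresholded pseudo-labels, a common
    # semi-supervised recipe (assumed here purely for illustration).
    u_images, u_texts = unlabeled_batch
    with torch.no_grad():
        probs = F.softmax(model(u_images, u_texts), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf > 0.9).float()  # keep only confident predictions
    per_example = F.cross_entropy(model(u_images, u_texts), pseudo,
                                  reduction="none")
    unsup_loss = (mask * per_example).mean()

    return sup_loss + lam * unsup_loss
```

Here `model` stands in for any cross-modal scorer that maps an image-text pair to logits; the paper's framework may weight or formulate the two terms quite differently.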
- Anthology ID: 2023.emnlp-main.682
- Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 11067–11081
- URL: https://aclanthology.org/2023.emnlp-main.682
- DOI: 10.18653/v1/2023.emnlp-main.682
- Cite (ACL): Arushi Goel, Basura Fernando, Frank Keller, and Hakan Bilen. 2023. Semi-supervised multimodal coreference resolution in image narrations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11067–11081, Singapore. Association for Computational Linguistics.
- Cite (Informal): Semi-supervised multimodal coreference resolution in image narrations (Goel et al., EMNLP 2023)
- PDF: https://preview.aclanthology.org/proper-vol2-ingestion/2023.emnlp-main.682.pdf