Abstract
The task of news image captioning aims to generate a detailed caption that describes the specific information of an image in a news article. However, we find that recent state-of-the-art models can achieve competitive performance even without vision features. To investigate the impact of vision features in the news image captioning task, we conduct extensive experiments with mainstream models based on the encoder-decoder framework. From our exploration, we find that 1) vision features do contribute to the generation of news image captions; 2) vision features help models better generate entities in captions when entity information is sufficient in the input textual context of the given article; and 3) regions of specific objects in images contribute to the generation of related entities in captions.
- Anthology ID:
- 2023.findings-acl.818
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2023
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 12923–12936
- URL:
- https://aclanthology.org/2023.findings-acl.818
- DOI:
- 10.18653/v1/2023.findings-acl.818
- Cite (ACL):
- Junzhe Zhang and Xiaojun Wan. 2023. Exploring the Impact of Vision Features in News Image Captioning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12923–12936, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- Exploring the Impact of Vision Features in News Image Captioning (Zhang & Wan, Findings 2023)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2023.findings-acl.818.pdf