Abstract
Image captioning aims to generate a short description of an image. Existing research typically employs a CNN-RNN architecture that views generation as a sequential decision-making process and uses the entire dataset vocabulary as the decoding space. Such models tend to generate high-frequency n-grams containing irrelevant words. To tackle this problem, we propose constructing an image-grounded vocabulary that limits and guides caption generation. Specifically, a novel hierarchical structure is proposed to construct the vocabulary, incorporating both visual information and relations among words. For generation, we propose a word-aware RNN cell that incorporates vocabulary information directly into the decoding process. The REINFORCE algorithm is employed to train the generator, using the constrained vocabulary as the action space. Experimental results on MS COCO and Flickr30k show the effectiveness of our framework compared with state-of-the-art models.
- Anthology ID:
- P19-1652
- Volume:
- Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
- Month:
- July
- Year:
- 2019
- Address:
- Florence, Italy
- Editors:
- Anna Korhonen, David Traum, Lluís Màrquez
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 6514–6524
- URL:
- https://aclanthology.org/P19-1652
- DOI:
- 10.18653/v1/P19-1652
- Cite (ACL):
- Zhihao Fan, Zhongyu Wei, Siyuan Wang, and Xuanjing Huang. 2019. Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6514–6524, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning (Fan et al., ACL 2019)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-1/P19-1652.pdf
- Code
- LibertFan/ImageCaption
- Data
- MS COCO, VQG, Visual Question Answering
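The core idea of the abstract, restricting the decoding space to an image-grounded vocabulary, can be illustrated with a minimal sketch. This is not the paper's implementation (see the linked LibertFan/ImageCaption repository for that); it only shows, under assumed toy logits and a hypothetical `allowed` id set, how masking a decoder's output distribution confines generation to a constrained vocabulary that then serves as the action space.

```python
import numpy as np

def constrained_softmax(logits, allowed_ids):
    """Softmax over the full vocabulary with all probability mass
    confined to `allowed_ids` (the image-grounded vocabulary).
    Words outside the constrained set get exactly zero probability."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    # subtract the max over the allowed set for numerical stability
    exp = np.exp(masked - masked[allowed_ids].max())
    return exp / exp.sum()

# Hypothetical decoder logits over a 6-word vocabulary.
logits = np.array([2.0, 0.5, 1.0, 3.0, -1.0, 0.0])
allowed = [0, 2, 3]  # word ids in the image-grounded vocabulary

p = constrained_softmax(logits, allowed)
# Sampling (the REINFORCE "action") then only ever picks allowed words.
action = np.random.choice(len(p), p=p)
```

At each decoding step, the sampled `action` is guaranteed to lie in the constrained vocabulary, since every excluded word carries zero probability.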