Abstract
Image captioning models generally lack the capability to take into account user interest, and usually default to global descriptions that try to balance readability, informativeness, and information overload. We present a Transformer-based model with the ability to produce captions focused on specific objects, concepts, or actions in an image by providing them as guiding text to the model. Further, we evaluate the quality of these guided captions when trained on Conceptual Captions, which contains 3.3M image-level captions, compared to Visual Genome, which contains 3.6M object-level captions. Counter-intuitively, we find that guided captions produced by the model trained on Conceptual Captions generalize better on out-of-domain data. Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets, and that increased style diversity (even without increasing the number of unique tokens) is a key factor for improved performance.
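As a rough, hypothetical illustration of the guiding-text idea described in the abstract, the minimal PyTorch sketch below feeds image region features and a guiding-text phrase jointly to a Transformer encoder and decodes a caption conditioned on both. The concatenation scheme, the `GuidedCaptioner` name, and all dimensions are assumptions made for exposition, not the paper's architecture.

```python
# Hypothetical sketch of guided captioning: the encoder sees both image region
# features and a guiding-text phrase, and the decoder produces a caption
# conditioned on that joint memory. Names, dimensions, and the concatenation
# scheme are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn

class GuidedCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, img_feat_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(img_feat_dim, d_model)  # project region features to d_model
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats, guide_ids, caption_ids):
        # Encoder input: image region features followed by guiding-text embeddings
        # (positional encodings omitted for brevity).
        guide = self.token_emb(guide_ids)                           # (B, Tg, d)
        src = torch.cat([self.img_proj(img_feats), guide], dim=1)   # (B, R+Tg, d)
        tgt = self.token_emb(caption_ids)                           # (B, Tc, d)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal)        # (B, Tc, d)
        return self.out(hidden)                                     # caption logits

# Toy usage: 36 region features, a 3-token guiding phrase, a 12-token caption.
model = GuidedCaptioner()
logits = model(torch.randn(2, 36, 2048),
               torch.randint(0, 10000, (2, 3)),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```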
- Anthology ID: 2021.conll-1.14
- Volume: Proceedings of the 25th Conference on Computational Natural Language Learning
- Month: November
- Year: 2021
- Address: Online
- Venue: CoNLL
- SIG: SIGNLL
- Publisher: Association for Computational Linguistics
- Pages: 183–193
- URL: https://aclanthology.org/2021.conll-1.14
- DOI: 10.18653/v1/2021.conll-1.14
- Cite (ACL): Edwin G. Ng, Bo Pang, Piyush Sharma, and Radu Soricut. 2021. Understanding Guided Image Captioning Performance across Domains. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 183–193, Online. Association for Computational Linguistics.
- Cite (Informal): Understanding Guided Image Captioning Performance across Domains (Ng et al., CoNLL 2021)
- PDF: https://preview.aclanthology.org/ingestion-script-update/2021.conll-1.14.pdf
- Code: google-research-datasets/T2-Guiding
- Data: T2 Guiding, Conceptual Captions, Localized Narratives, Visual Genome