Abstract
Much recent work in Language & Vision has addressed generating descriptions or referring expressions for objects in real-world image scenes, focusing mostly on relatively simple language such as object names and color or location attributes (e.g., brown chair on the left). This paper presents Draw-and-Tell, a dataset of detailed descriptions for common objects in images, in which annotators produced fine-grained, attribute-centric expressions that distinguish a target object from a range of similar objects. The dataset additionally includes a hand-drawn sketch for each object. As Draw-and-Tell is medium-sized and has a rich vocabulary, it constitutes an interesting challenge for the CNN-LSTM architectures used in state-of-the-art image captioning models. We explore whether the additional modality provided by sketches can help such a model learn to accurately ground detailed referring expressions to object shapes. Our results are encouraging.
- Anthology ID: W19-8618
- Volume: Proceedings of the 12th International Conference on Natural Language Generation
- Month: October–November
- Year: 2019
- Address: Tokyo, Japan
- Venue: INLG
- SIG: SIGGEN
- Publisher: Association for Computational Linguistics
- Pages: 136–140
- URL: https://aclanthology.org/W19-8618
- DOI: 10.18653/v1/W19-8618
- Cite (ACL): Ting Han and Sina Zarrieß. 2019. Sketch Me if You Can: Towards Generating Detailed Descriptions of Object Shape by Grounding in Images and Drawings. In Proceedings of the 12th International Conference on Natural Language Generation, pages 136–140, Tokyo, Japan. Association for Computational Linguistics.
- Cite (Informal): Sketch Me if You Can: Towards Generating Detailed Descriptions of Object Shape by Grounding in Images and Drawings (Han & Zarrieß, INLG 2019)
- PDF: https://preview.aclanthology.org/auto-file-uploads/W19-8618.pdf