Abstract
Systems that can associate images with their spoken audio captions are an important step towards visually grounded language learning. We describe a scalable method to automatically generate diverse audio for image captioning datasets. This supports pretraining deep networks for encoding both audio and images, which we do via a dual encoder that learns to align latent representations from both modalities. We show that a masked margin softmax loss for such models is superior to the standard triplet loss. We fine-tune these models on the Flickr8k Audio Captions Corpus and obtain state-of-the-art results—improving recall in the top 10 from 29.6% to 49.5%. We also obtain human ratings on retrieval outputs to better assess the impact of incidentally matching image-caption pairs that were not associated in the data, finding that automatic evaluation substantially underestimates the quality of the retrieved results.
- Anthology ID: K19-1006
- Volume: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
- Month: November
- Year: 2019
- Address: Hong Kong, China
- Editors: Mohit Bansal, Aline Villavicencio
- Venue: CoNLL
- SIG: SIGNLL
- Publisher: Association for Computational Linguistics
- Pages: 55–65
- URL: https://aclanthology.org/K19-1006
- DOI: 10.18653/v1/K19-1006
- Cite (ACL): Gabriel Ilharco, Yuan Zhang, and Jason Baldridge. 2019. Large-Scale Representation Learning from Visually Grounded Untranscribed Speech. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 55–65, Hong Kong, China. Association for Computational Linguistics.
- Cite (Informal): Large-Scale Representation Learning from Visually Grounded Untranscribed Speech (Ilharco et al., CoNLL 2019)
- PDF: https://preview.aclanthology.org/fix-dup-bibkey/K19-1006.pdf
- Data: Conceptual Captions, MS COCO
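
The abstract describes a dual encoder trained with a masked margin softmax loss rather than the standard triplet loss. Below is a minimal, hypothetical PyTorch sketch of such a loss; it is not the authors' implementation, and the `margin` value, the optional `mask` of incidental in-batch positives, and the symmetric two-direction formulation are illustrative assumptions.

```python
# Sketch of a masked margin softmax (MMS) loss for an image/audio dual encoder.
# Assumes both encoders produce L2-normalized embeddings of shape [B, D].
import torch
import torch.nn.functional as F

def masked_margin_softmax_loss(image_emb, audio_emb, margin=0.2, mask=None):
    """image_emb, audio_emb: [B, D] paired embeddings (row i of each matches).
    mask: optional [B, B] bool tensor marking off-diagonal pairs that are
    incidental positives, so they are excluded from the negatives."""
    sim = image_emb @ audio_emb.t()                      # [B, B] similarity matrix
    B = sim.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=sim.device)

    # Subtract a margin from the true (diagonal) pair scores only.
    logits = sim - margin * eye.float()

    # Remove incidental positives from the softmax denominator.
    if mask is not None:
        logits = logits.masked_fill(mask & ~eye, float('-inf'))

    targets = torch.arange(B, device=sim.device)
    # Symmetric loss: image-to-audio and audio-to-image retrieval directions.
    loss_i2a = F.cross_entropy(logits, targets)
    loss_a2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2a + loss_a2i)
```

Compared with a triplet loss, which compares each positive pair against a single sampled negative, this in-batch softmax treats every other item in the batch as a negative, which is one plausible reading of why the paper reports it scaling better.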