Expressing Visual Relationships via Language

Hao Tan, Franck Dernoncourt, Zhe Lin, Trung Bui, Mohit Bansal


Abstract
Describing images with text is a fundamental problem in vision-language research. Current studies in this domain mostly focus on single-image captioning. However, in various real applications (e.g., image editing, difference interpretation, and retrieval), generating relational captions for two images can also be very useful. This important problem has not been explored, mostly due to the lack of datasets and effective models. To push forward research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions. We then propose a new relational speaker model based on an encoder-decoder architecture with static relational attention and sequential multi-head attention. We also extend the model with dynamic relational attention, which calculates visual alignment while decoding. Our models are evaluated on our newly collected dataset and two public datasets consisting of image pairs annotated with relationship sentences. Experimental results, based on both automatic and human evaluation, demonstrate that our model outperforms all baselines and existing methods on all the datasets.
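The dynamic relational attention mentioned in the abstract can be illustrated with a minimal sketch: at each decoding step, every pair of regions from the two images is scored against the current decoder state, and the resulting soft alignment is pooled into a relational context vector. The module below is only an assumed, simplified rendition in PyTorch; the class name, additive scoring form, and feature dimensions are illustrative assumptions, not the authors' released implementation (see airsplay/VisualRelationships for that).

# Minimal, illustrative sketch of dynamic relational attention between two images.
# NOT the authors' exact model; scoring form and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicRelationalAttention(nn.Module):
    """At each decoding step, score every (source-region, target-region) pair
    against the current decoder state and pool a relational context vector."""
    def __init__(self, feat_dim, hid_dim):
        super().__init__()
        self.src_proj = nn.Linear(feat_dim, hid_dim)
        self.tgt_proj = nn.Linear(feat_dim, hid_dim)
        self.query_proj = nn.Linear(hid_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, src_feats, tgt_feats, dec_state):
        # src_feats: (B, Ns, D) region features of the first image
        # tgt_feats: (B, Nt, D) region features of the second image
        # dec_state: (B, H)     decoder hidden state at the current step
        B, Ns, _ = src_feats.shape
        Nt = tgt_feats.size(1)
        # Additive scores over all Ns x Nt region pairs, conditioned on the decoder state.
        pair = (self.src_proj(src_feats).unsqueeze(2)          # (B, Ns, 1, H)
                + self.tgt_proj(tgt_feats).unsqueeze(1)        # (B, 1, Nt, H)
                + self.query_proj(dec_state)[:, None, None])   # (B, 1, 1, H)
        logits = self.score(torch.tanh(pair)).view(B, Ns * Nt)
        attn = F.softmax(logits, dim=-1).view(B, Ns, Nt, 1)
        # Context: attention-weighted sum of concatenated pair features.
        pair_feats = torch.cat(
            [src_feats.unsqueeze(2).expand(B, Ns, Nt, -1),
             tgt_feats.unsqueeze(1).expand(B, Ns, Nt, -1)], dim=-1)
        context = (attn * pair_feats).sum(dim=(1, 2))          # (B, 2*D)
        return context, attn.view(B, Ns, Nt)

Because the alignment is recomputed from the decoder state at every step, the attention over region pairs is "dynamic" rather than fixed once before decoding, which is the distinction the abstract draws between static and dynamic relational attention.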
Anthology ID:
P19-1182
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1873–1883
URL:
https://aclanthology.org/P19-1182
DOI:
10.18653/v1/P19-1182
Cite (ACL):
Hao Tan, Franck Dernoncourt, Zhe Lin, Trung Bui, and Mohit Bansal. 2019. Expressing Visual Relationships via Language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1873–1883, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Expressing Visual Relationships via Language (Tan et al., ACL 2019)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/P19-1182.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-2/P19-1182.mp4
Code
airsplay/VisualRelationships
Data
Image Editing Request Dataset, MS COCO, Spot-the-diff