Learning to Describe Differences Between Pairs of Similar Images

Harsh Jhamtani, Taylor Berg-Kirkpatrick


Abstract
In this paper, we introduce the task of automatically generating text to describe the differences between two similar images. We collect a new dataset by crowd-sourcing difference descriptions for pairs of image frames extracted from video-surveillance footage. Annotators were asked to succinctly describe all the differences in a short paragraph. As a result, our novel dataset provides an opportunity to explore models that align language and vision, and capture visual salience. The dataset may also be a useful benchmark for coherent multi-sentence generation. We perform a first-pass visual analysis that exposes clusters of differing pixels as a proxy for object-level differences. We propose a model that captures visual salience by using a latent variable to align clusters of differing pixels with output sentences. We find that, for both single-sentence and multi-sentence generation, the proposed model outperforms models that use attention alone.
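The idea of exposing clusters of differing pixels can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not specify the thresholding or clustering method, so the fixed grayscale threshold, the 4-connected grouping, and the helper names `diff_mask` and `cluster_pixels` are all assumptions made for illustration.

```python
from collections import deque


def diff_mask(img_a, img_b, threshold=30):
    """Mark pixels whose absolute grayscale difference exceeds `threshold`.

    The threshold value is an illustrative assumption, not taken from the paper.
    """
    return [
        [abs(a - b) > threshold for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


def cluster_pixels(mask):
    """Group marked pixels into 4-connected clusters using breadth-first search."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    clusters = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Start a new cluster at the first unvisited differing pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            cluster = []
            while queue:
                y, x = queue.popleft()
                cluster.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            clusters.append(cluster)
    return clusters


# Toy 4x4 grayscale frames: two adjacent pixels change, plus one isolated pixel.
before = [[0] * 4 for _ in range(4)]
after = [row[:] for row in before]
after[0][0] = 200
after[0][1] = 200
after[3][3] = 255

clusters = cluster_pixels(diff_mask(before, after))
```

In this toy case the two adjacent changed pixels fall into one cluster and the isolated pixel into another, which is the sense in which a cluster can stand in for an object-level difference.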
Anthology ID:
D18-1436
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
4024–4034
URL:
https://aclanthology.org/D18-1436
DOI:
10.18653/v1/D18-1436
Cite (ACL):
Harsh Jhamtani and Taylor Berg-Kirkpatrick. 2018. Learning to Describe Differences Between Pairs of Similar Images. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4024–4034, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Learning to Describe Differences Between Pairs of Similar Images (Jhamtani & Berg-Kirkpatrick, EMNLP 2018)
PDF:
https://preview.aclanthology.org/ingestion-script-update/D18-1436.pdf
Code
harsh19/spot-the-diff
Data
Spot-the-diff