@inproceedings{chen-etal-2019-weakly,
title = "Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video",
author = "Chen, Zhenfang and
Ma, Lin and
Luo, Wenhan and
Wong, Kwan-Yee Kenneth",
editor = "Korhonen, Anna and
Traum, David and
M{\`a}rquez, Llu{\'i}s",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/fix-sig-urls/P19-1183/",
doi = "10.18653/v1/P19-1183",
pages = "1884--1894",
abstract = "In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. First, a set of spatio-temporal tubes, referred to as instances, are extracted from the video. We then encode these instances and the sentence using our newly proposed attentive interactor which can exploit their fine-grained relationships to characterize their matching behaviors. Besides a ranking loss, a novel diversity loss is introduced to train our attentive interactor to strengthen the matching behaviors of reliable instance-sentence pairs and penalize the unreliable ones. We also contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for our task. Results from extensive experiments demonstrate the superiority of our model over the baseline approaches."
}