Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network

Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou


Abstract
Temporal sentence localization in videos aims to ground the best-matched segment in an untrimmed video according to a given sentence query. Previous works in this field mainly rely on attentional frameworks that align the temporal boundaries through soft selection. Although these frameworks focus on the visual content relevant to the query, such single-step attention mechanisms are insufficient to model complex video content and limit the higher-level reasoning that this task demands. In this paper, we propose a novel deep rectification-modulation network (RMN), which transforms this task into a multi-step reasoning process by repeating rectification and modulation. In each rectification-modulation layer, unlike existing methods that perform cross-modal interaction directly, we first devise a rectification module to correct implicit attention misalignment, in which attention focuses on the wrong position during cross-modal interaction. Then, a modulation module is developed to capture frame-to-frame relations with the help of sentence information, so that video contents are better correlated and composed over time. With multiple such layers cascaded in depth, our RMN progressively refines the interactions between video and query, thus enabling more precise localization. Experimental evaluations on three public datasets show that the proposed method achieves state-of-the-art performance. Extensive ablation studies provide a comprehensive analysis of the proposed method.
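The cascading idea in the abstract (rectify the cross-modal attention, then modulate frame-to-frame relations with sentence information, and repeat over several layers) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the internals of either module, so the sigmoid gate used for rectification, the sentence-conditioned scaling used for modulation, and all names (`rectify`, `modulate`, `refine_stepwise`) are hypothetical stand-ins chosen only to show the layer-by-layer control flow. The localization head that predicts segment boundaries is omitted.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys):
    # Scaled dot-product attention: one weight distribution over keys per query.
    scale = math.sqrt(len(keys[0]))
    return [softmax([dot(q, k) / scale for k in keys]) for q in queries]

def weighted_sum(weights, values):
    # Combine value vectors using each row of attention weights.
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in weights]

def rectify(frames, words):
    # ASSUMPTION: stand-in for the rectification module. Each frame attends to
    # the query words, and a sigmoid gate damps the attended context when it
    # agrees poorly with the frame itself.
    context = weighted_sum(attend(frames, words), words)
    rectified = []
    for frame, ctx in zip(frames, context):
        gate = 0.5 * (1.0 + math.tanh(0.5 * dot(frame, ctx)))  # sigmoid
        rectified.append([f + gate * c for f, c in zip(frame, ctx)])
    return rectified

def modulate(frames, words):
    # ASSUMPTION: stand-in for the modulation module. A mean-pooled sentence
    # vector rescales every frame feature, then frames attend to each other.
    dim = len(words[0])
    sentence = [sum(w[d] for w in words) / len(words) for d in range(dim)]
    scaled = [[f * (1.0 + math.tanh(s)) for f, s in zip(frame, sentence)]
              for frame in frames]
    mixed = weighted_sum(attend(scaled, scaled), scaled)
    # Residual connection keeps the original frame signal.
    return [[f + m for f, m in zip(frame, row)]
            for frame, row in zip(frames, mixed)]

def refine_stepwise(frames, words, num_layers=3):
    # Cascade the two steps in depth, as the abstract describes.
    for _ in range(num_layers):
        frames = modulate(rectify(frames, words), words)
    return frames

random.seed(0)
frames = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(6)]  # 6 frames
words = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]   # 3 words
refined = refine_stepwise(frames, words, num_layers=3)
print(len(refined), len(refined[0]))
```

The toy features here are random vectors; in practice the frame and word features would come from pretrained video and text encoders, and every layer would carry learned parameters.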
Anthology ID:
2020.coling-main.167
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
1841–1851
URL:
https://aclanthology.org/2020.coling-main.167
DOI:
10.18653/v1/2020.coling-main.167
Cite (ACL):
Daizong Liu, Xiaoye Qu, Jianfeng Dong, and Pan Zhou. 2020. Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1841–1851, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network (Liu et al., COLING 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.coling-main.167.pdf
Data
Charades, Charades-STA