Natural Language Video Localization with Learnable Moment Proposals

Shaoning Xiao, Long Chen, Jian Shao, Yueting Zhuang, Jun Xiao


Abstract
Given an untrimmed video and a natural language query, Natural Language Video Localization (NLVL) aims to identify the video moment described by the query. Existing methods fall roughly into two groups: 1) propose-and-rank models first define a set of hand-designed moment candidates and then select the best-matching one; 2) proposal-free models directly predict the two temporal boundaries of the referred moment from frames. Currently, almost all propose-and-rank methods perform worse than their proposal-free counterparts. In this paper, we argue that the performance of propose-and-rank models is underestimated because of the predefined manner in which candidates are generated: 1) hand-designed rules can hardly guarantee complete coverage of the target segments; 2) densely sampled candidate moments cause redundant computation and degrade the ranking process. To this end, we propose a novel model termed LPNet (Learnable Proposal Network for NLVL) with a fixed set of learnable moment proposals. The positions and lengths of these proposals are dynamically adjusted during training. Moreover, a boundary-aware loss is proposed to leverage frame-level information and further improve performance. Extensive ablations on two challenging NLVL benchmarks demonstrate the effectiveness of LPNet over existing state-of-the-art methods.
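To make the idea of proposals described by a position and a length more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each proposal is stored as a normalized (center, length) pair and is scored against a ground-truth moment with temporal IoU, the overlap measure commonly used in NLVL evaluation. The names `to_span`, `temporal_iou`, and `best_proposal` are hypothetical and do not come from the paper or its code release.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start, end), normalized to [0, 1]


def to_span(center: float, length: float) -> Moment:
    """Convert an assumed (center, length) parameterization to (start, end),
    clipped to the video extent [0, 1]."""
    start = max(0.0, center - length / 2.0)
    end = min(1.0, center + length / 2.0)
    return (start, end)


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def best_proposal(proposals: List[Tuple[float, float]],
                  target: Moment) -> Tuple[int, float]:
    """Return the index and IoU of the proposal that best overlaps the target.

    `proposals` holds (center, length) pairs; in a learnable-proposal model
    these values would be trainable parameters rather than fixed constants.
    """
    ious = [temporal_iou(to_span(c, l), target) for c, l in proposals]
    idx = max(range(len(ious)), key=ious.__getitem__)
    return idx, ious[idx]


if __name__ == "__main__":
    # Three illustrative proposals and one ground-truth moment.
    proposals = [(0.2, 0.2), (0.5, 0.5), (0.8, 0.2)]
    print(best_proposal(proposals, (0.25, 0.75)))
```

In a trainable setting the (center, length) values would be updated by gradient descent so that a small fixed set of proposals covers the target moments, in contrast to a dense hand-designed grid of candidates.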
Anthology ID:
2021.emnlp-main.327
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4008–4017
URL:
https://aclanthology.org/2021.emnlp-main.327
DOI:
10.18653/v1/2021.emnlp-main.327
Cite (ACL):
Shaoning Xiao, Long Chen, Jian Shao, Yueting Zhuang, and Jun Xiao. 2021. Natural Language Video Localization with Learnable Moment Proposals. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4008–4017, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Natural Language Video Localization with Learnable Moment Proposals (Xiao et al., EMNLP 2021)
PDF:
https://preview.aclanthology.org/naacl24-info/2021.emnlp-main.327.pdf
Video:
https://preview.aclanthology.org/naacl24-info/2021.emnlp-main.327.mp4
Code:
xiaoneil/lpnet
Data:
ActivityNet Captions, Charades-STA