Abstract
Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the expensive cost of manual annotations for temporal boundary labels, we focus on the weakly supervised setting, where only video-level descriptions are provided for training. Most existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a multiple instance learning (MIL) framework. However, the temporal structure of the video as well as the complicated semantics in the sentence are lost during learning. In this work, we propose a novel candidate-free framework, the Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG. Instead of viewing the sentence and candidate moments as a whole, FSAN learns token-by-clip cross-modal semantic alignment through an iterative cross-modal interaction module, generates a fine-grained cross-modal semantic alignment map, and performs grounding directly on top of the map. Extensive experiments are conducted on two widely used benchmarks, ActivityNet-Captions and DiDeMo, where our FSAN achieves state-of-the-art performance.
- Anthology ID:
- 2021.findings-emnlp.9
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2021
- Month:
- November
- Year:
- 2021
- Address:
- Punta Cana, Dominican Republic
- Editors:
- Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue:
- Findings
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 89–99
- URL:
- https://aclanthology.org/2021.findings-emnlp.9
- DOI:
- 10.18653/v1/2021.findings-emnlp.9
- Cite (ACL):
- Yuechen Wang, Wengang Zhou, and Houqiang Li. 2021. Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 89–99, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding (Wang et al., Findings 2021)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2021.findings-emnlp.9.pdf
- Data:
- ActivityNet, ActivityNet Captions, DiDeMo