Abstract
A key solution to temporal sentence grounding (TSG) exists in how to learn effective alignment between vision and language features extracted from an untrimmed video and a sentence description. Existing methods mainly leverage vanilla soft attention to perform the alignment in a single-step process. However, such single-step attention is insufficient in practice, since complicated relations between inter- and intra-modality are usually obtained through multi-step reasoning. In this paper, we propose an Iterative Alignment Network (IA-Net) for TSG task, which iteratively interacts inter- and intra-modal features within multiple steps for more accurate grounding. Specifically, during the iterative reasoning process, we pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs, and enhance the basic co-attention mechanism in a parallel manner. To further calibrate the misaligned attention caused by each reasoning step, we also devise a calibration module following each attention module to refine the alignment knowledge. With such iterative alignment scheme, our IA-Net can robustly capture the fine-grained relations between vision and language domains step-by-step for progressively reasoning the temporal boundaries. Extensive experiments conducted on three challenging benchmarks demonstrate that our proposed model performs better than the state-of-the-arts.- Anthology ID:
- 2021.emnlp-main.733
- Volume:
- Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2021
- Address:
- Online and Punta Cana, Dominican Republic
- Venue:
- EMNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 9302–9311
- Language:
- URL:
- https://aclanthology.org/2021.emnlp-main.733
- DOI:
- 10.18653/v1/2021.emnlp-main.733
- Cite (ACL):
- Daizong Liu, Xiaoye Qu, and Pan Zhou. 2021. Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9302–9311, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding (Liu et al., EMNLP 2021)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2021.emnlp-main.733.pdf
- Data
- ActivityNet Captions, Charades, Charades-STA