Abstract
Existing multilingual video corpus moment retrieval (mVCMR) methods are mainly based on a two-stream structure. The visual stream uses the visual content of the video to estimate the query-visual similarity, and the subtitle stream exploits the query-subtitle similarity. The final query-video similarity ensembles the similarities from the two streams. In our work, we propose a simple and effective strategy termed Cross-lingual Cross-modal Consolidation (C3) to improve mVCMR accuracy. We adopt the ensemble similarity as the teacher to guide the training of each stream, leading to a more powerful ensemble similarity. Meanwhile, we use the teacher for a specific language to guide the student for another language to exploit the complementary knowledge across languages. Extensive experiments on the mTVR dataset demonstrate the effectiveness of our C3 method.
- Anthology ID:
- 2022.findings-naacl.142
- Volume:
- Findings of the Association for Computational Linguistics: NAACL 2022
- Month:
- July
- Year:
- 2022
- Address:
- Seattle, United States
- Editors:
- Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1854–1862
- URL:
- https://aclanthology.org/2022.findings-naacl.142
- DOI:
- 10.18653/v1/2022.findings-naacl.142
- Cite (ACL):
- Jiaheng Liu, Tan Yu, Hanyu Peng, Mingming Sun, and Ping Li. 2022. Cross-Lingual Cross-Modal Consolidation for Effective Multilingual Video Corpus Moment Retrieval. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1854–1862, Seattle, United States. Association for Computational Linguistics.
- Cite (Informal):
- Cross-Lingual Cross-Modal Consolidation for Effective Multilingual Video Corpus Moment Retrieval (Liu et al., Findings 2022)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2022.findings-naacl.142.pdf
- Data
- TVR, mTVR
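
The consolidation idea described in the abstract (the ensemble similarity acting as a teacher for each stream, within and across languages) can be illustrated with a minimal sketch. This is not the authors' implementation. The averaging ensemble, the softmax temperature, the KL-divergence loss, the language keys `en` and `zh`, and every function name below are assumptions made for illustration only; the abstract does not specify any of them.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def ensemble(visual, subtitle):
    """Assumed ensemble: plain average of the two streams' similarities."""
    return [(v + s) / 2.0 for v, s in zip(visual, subtitle)]

def consolidation_loss(streams_by_lang, temperature=1.0):
    """Sum of KL terms from every language's ensemble teacher to every stream.

    streams_by_lang maps a language key to a (visual, subtitle) pair of
    similarity lists over the same candidates. A teacher paired with streams
    of its own language gives the cross-modal terms; a teacher paired with
    streams of another language gives the cross-lingual terms. In real
    training the teacher would be treated as a constant (no gradient).
    """
    teachers = {
        lang: softmax(ensemble(visual, subtitle), temperature)
        for lang, (visual, subtitle) in streams_by_lang.items()
    }
    loss = 0.0
    for teacher in teachers.values():
        for visual, subtitle in streams_by_lang.values():
            for student_scores in (visual, subtitle):
                student = softmax(student_scores, temperature)
                loss += kl_divergence(teacher, student)
    return loss

# Hypothetical similarity scores of one query against three candidates.
scores = {
    "en": ([2.0, 0.5, 0.1], [1.0, 1.5, 0.2]),
    "zh": ([1.8, 0.7, 0.0], [0.9, 1.2, 0.4]),
}
print(consolidation_loss(scores))
```

The loss shrinks toward zero as every stream's distribution approaches the ensemble teachers, which is the sense in which the streams are "consolidated" in this sketch.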