Retrieval-augmented Video Encoding for Instructional Captioning
Yeonjoon Jung, Minsoo Kim, Seungtaek Choi, Jihyuk Kim, Minji Seo, Seung-won Hwang
Abstract
Instructional videos make learning more efficient by providing a detailed multimodal context for each procedure in an instruction. A unique challenge posed by instructional videos is key-object degeneracy, where any single modality fails to sufficiently capture the key objects referred to in the procedure. For machine systems, such degeneracy can degrade performance on downstream tasks such as dense video captioning, leading to incorrect captions that omit key objects. To repair this degeneracy, we propose a retrieval-based framework that augments the model representations in the presence of key-object degeneracy. We validate the effectiveness and generalizability of our proposed framework over baselines using modalities with key-object degeneracy.
- Anthology ID:
- 2023.findings-acl.543
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2023
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8554–8568
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2023.findings-acl.543/
- DOI:
- 10.18653/v1/2023.findings-acl.543
- Cite (ACL):
- Yeonjoon Jung, Minsoo Kim, Seungtaek Choi, Jihyuk Kim, Minji Seo, and Seung-won Hwang. 2023. Retrieval-augmented Video Encoding for Instructional Captioning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8554–8568, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- Retrieval-augmented Video Encoding for Instructional Captioning (Jung et al., Findings 2023)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2023.findings-acl.543.pdf