Task-Specific Information Decomposition for End-to-End Dense Video Captioning

Zhiyue Liu; Xinru Zhang; Jinyuan Liu

Abstract

Dense video captioning aims to localize events in an input video and generate a concise description of each event. Advanced end-to-end methods have both tasks share the same intermediate features, which serve as event queries, so that the two tasks can reinforce each other. However, relying on shared queries limits the model’s ability to extract task-specific information, because event semantic perception and localization demand distinct perspectives on video understanding. To address this, we propose a decomposed dense video captioning framework that derives localization and captioning queries from event queries, enabling task-specific representations while maintaining inter-task collaboration. Considering the roles of the different queries, we design a contrastive semantic optimization strategy that guides localization queries to focus on event-level visual features and captioning queries to align with textual semantics. In addition, existing methods consider only localization information for label assignment, which fails to ensure that the selected queries are relevant to the descriptions. We instead jointly consider localization and captioning losses to achieve a semantically balanced assignment process. Extensive experiments on the YouCook2 and ActivityNet Captions datasets demonstrate that our framework achieves state-of-the-art performance.
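
The joint label assignment described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the loss terms, their weights, or the matching algorithm. The sketch assumes a one-to-one matching between queries and ground-truth events, a localization cost of 1 minus temporal IoU, a precomputed per-pair captioning loss, and hypothetical weights `alpha` and `beta`. An exhaustive search stands in for an efficient assignment solver, so it is only suitable for toy inputs.

```python
from itertools import permutations


def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def joint_assignment(pred_segments, gt_segments, caption_loss, alpha=1.0, beta=1.0):
    """Return, for each ground-truth event, the index of its matched query.

    Assumed cost (hypothetical, not taken from the paper):
        cost[q][g] = alpha * (1 - IoU(pred_q, gt_g)) + beta * caption_loss[q][g]
    Requires at least as many queries as ground-truth events.
    """
    n_q, n_g = len(pred_segments), len(gt_segments)
    cost = [
        [
            alpha * (1.0 - temporal_iou(pred_segments[q], gt_segments[g]))
            + beta * caption_loss[q][g]
            for g in range(n_g)
        ]
        for q in range(n_q)
    ]
    best, best_cost = None, float("inf")
    # Exhaustive search over one-to-one matchings; fine for a toy example only.
    for queries in permutations(range(n_q), n_g):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_cost:
            best, best_cost = queries, total
    return list(best)


# Toy data: three queries, two ground-truth events (all values are made up).
pred = [(0, 10), (1, 11), (20, 30)]
gt = [(0, 10), (20, 30)]
cap = [[2.0, 3.0], [0.5, 3.0], [3.0, 0.4]]  # cap[q][g]: captioning loss of query q on event g

loc_only = joint_assignment(pred, gt, cap, alpha=1.0, beta=0.0)
joint = joint_assignment(pred, gt, cap, alpha=1.0, beta=1.0)
print(loc_only, joint)
```

In this toy case, localization-only matching assigns the first event to query 0, whose segment fits perfectly but whose caption loss is high, while the joint cost prefers query 1, which localizes slightly worse but describes the event much better.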

Anthology ID: 2025.acl-long.807
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 16524–16536
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.807/
Cite (ACL): Zhiyue Liu, Xinru Zhang, and Jinyuan Liu. 2025. Task-Specific Information Decomposition for End-to-End Dense Video Captioning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16524–16536, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Task-Specific Information Decomposition for End-to-End Dense Video Captioning (Liu et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.807.pdf
