Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches

Tsutomu Hirao; Naoki Kobayashi; Hidetaka Kamigaito; Manabu Okumura; Akisato Kimura

doi:10.18653/v1/2024.findings-emnlp.581

Abstract

This paper tackles a new task: discourse parsing for videos, inspired by text discourse parsing based on Rhetorical Structure Theory (RST). The task aims to construct an RST tree for a video to represent its storyline and illustrate the event relationships. We first construct a benchmark dataset by identifying events with their time spans, providing corresponding captions, and constructing RST trees with events as leaves. We then evaluate baseline approaches to video RST parsing: the ‘parsing after captioning’ framework and parsing via visual features. The results show that a parser using gold captions performed the best, while parsers relying on generated captions performed the worst; a parser using visual features provided intermediate performance. However, we observed that parsing via visual features could be improved by pre-training it with video captioning designed to produce a coherent video story. Furthermore, we demonstrated that RST trees obtained from videos contribute to multimodal summarization consisting of keyframes with texts.

Anthology ID: 2024.findings-emnlp.581
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 9943–9958
URL: https://preview.aclanthology.org/fix-sig-urls/2024.findings-emnlp.581/
DOI: 10.18653/v1/2024.findings-emnlp.581
Cite (ACL): Tsutomu Hirao, Naoki Kobayashi, Hidetaka Kamigaito, Manabu Okumura, and Akisato Kimura. 2024. Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9943–9958, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches (Hirao et al., Findings 2024)
PDF: https://preview.aclanthology.org/fix-sig-urls/2024.findings-emnlp.581.pdf
