Abstract
It is challenging for models to understand complex, multimodal content such as television clips, and this is in part because video-language models often rely on single-modality reasoning and lack interpretability. To combat these issues we propose TV-TREES, the first multimodal entailment tree generator. TV-TREES serves as an approach to video understanding that promotes interpretable joint-modality reasoning by searching for trees of entailment relationships between simple text-video evidence and higher-level conclusions that prove question-answer pairs. We also introduce the task of multimodal entailment tree generation to evaluate reasoning quality. Our method’s performance on the challenging TVQA benchmark demonstrates interpretable, state-of-the-art zero-shot performance on full clips, illustrating that multimodal entailment tree generation can be a best-of-both-worlds alternative to black-box systems.
- Anthology ID:
- 2024.emnlp-main.1059
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 19009–19028
- URL:
- https://aclanthology.org/2024.emnlp-main.1059
- DOI:
- 10.18653/v1/2024.emnlp-main.1059
- Cite (ACL):
- Kate Sanders, Nathaniel Weir, and Benjamin Van Durme. 2024. TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19009–19028, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning (Sanders et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.emnlp-main.1059.pdf
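To make the abstract's core idea concrete: a question-answer pair is treated as a hypothesis, and the system searches for a tree of entailment relationships that grounds that hypothesis in simple text-video evidence. The following is a minimal illustrative sketch of that recursive search structure, not the authors' implementation; `retrieve_evidence`, `decompose`, and `entails` are hypothetical placeholders for the paper's evidence-retrieval, claim-decomposition, and entailment-verification components.

```python
# Hedged sketch of a recursive multimodal entailment-tree search.
# All module names below are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One claim in the tree; leaves are grounded in text or video evidence."""
    claim: str
    evidence: Optional[str] = None        # e.g. a dialogue line or frame description
    children: List["Node"] = field(default_factory=list)

def entails(premises: List[str], hypothesis: str) -> bool:
    """Placeholder for an entailment (NLI) check over premises and hypothesis."""
    raise NotImplementedError

def retrieve_evidence(claim: str, clip) -> Optional[str]:
    """Placeholder: retrieve one piece of text or video evidence for a claim."""
    raise NotImplementedError

def decompose(claim: str) -> List[str]:
    """Placeholder: split a claim into simpler sub-claims."""
    raise NotImplementedError

def prove(claim: str, clip, depth: int = 0, max_depth: int = 3) -> Optional[Node]:
    """Try to build an entailment tree whose root proves `claim` from the clip."""
    # Base case: a single piece of evidence directly entails the claim.
    evidence = retrieve_evidence(claim, clip)
    if evidence is not None and entails([evidence], claim):
        return Node(claim, evidence=evidence)
    if depth >= max_depth:
        return None
    # Recursive case: decompose the claim and prove each sub-claim,
    # then check that the sub-claims jointly entail the parent claim.
    sub_claims = decompose(claim)
    children = [prove(sc, clip, depth + 1, max_depth) for sc in sub_claims]
    if sub_claims and all(children) and entails([c.claim for c in children], claim):
        return Node(claim, children=children)
    return None  # no interpretable proof found for this claim
```

The returned tree itself serves as the interpretable artifact: each leaf names the piece of dialogue or visual evidence it rests on, and each internal node records which sub-claims were judged to entail it.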