M3AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang


Abstract
Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information, including speech, the facial and body movements of the speakers, and the text and pictures in the slides and possibly even the papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, partly because high-quality human annotations are lacking. In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M3AV), which has almost 367 hours of videos from five sources covering computer science, mathematics, and medical and biological topics. With high-quality human annotations of the slide text and spoken words, in particular high-value named entities, the dataset can be used for multiple audio-visual recognition and understanding tasks. Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of M3AV makes it a challenging dataset.
Anthology ID:
2024.acl-long.489
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
9041–9060
URL:
https://aclanthology.org/2024.acl-long.489
DOI:
10.18653/v1/2024.acl-long.489
Cite (ACL):
Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, and Yanfeng Wang. 2024. M3AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9041–9060, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
M3AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset (Chen et al., ACL 2024)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2024.acl-long.489.pdf
Video:
https://preview.aclanthology.org/add_acl24_videos/2024.acl-long.489.mp4