Xiaojun Ye


2024

pdf
MMAD:Multi-modal Movie Audio Description
Xiaojun Ye | Junhao Chen | Xiang Li | Haidong Xin | Chao Li | Sheng Zhou | Jiajun Bu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Audio Description (AD) aims to generate narrations of information that is not accessible through unimodal hearing in movies to aid the visually impaired in following film narratives. Current solutions rely heavily on manual work, resulting in high costs and limited scalability. While automatic methods have been introduced, they often yield descriptions that are sparse and omit key details. ddressing these challenges, we propose a novel automated pipeline, the Multi-modal Movie Audio Description (MMAD). MMAD harnesses the capabilities of three key modules as well as the power of Llama2 to augment the depth and breadth of the generated descriptions. Specifically, first, we propose an Audio-aware Feature Enhancing Module to provide the model with multi-modal perception capabilities, enriching the background descriptions with a more comprehensive understanding of the environmental features. Second, we propose an Actor-tracking-aware Story Linking Module to aid in the generation of contextual and character-centric descriptions, thereby enhancing the richness of character depictions. Third, we incorporate a Subtitled Movie Clip Contextual Alignment Module, supplying semantic information about various time periods throughout the movie, which facilitates the consideration of the full movie narrative context when describing silent segments, thereby enhancing the richness of the descriptions. Experiments on widely used datasets convincingly demonstrates that MMAD significantly surpasses existing strong baselines in performance, establishing a new state-of-the-art in the field. Our code will be released at https://github.com/Daria8976/MMAD.