What You See is What You Ask: Evaluating Audio Descriptions

Divy Kala, Eshika Khandelwal, Makarand Tapaswi


Abstract
Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand narratives and appreciate visual details. Existing work on automatic AD generation mostly focuses on trimmed clips a few seconds long and evaluates generated ADs by comparing them against a single ground-truth reference AD. However, writing ADs is inherently subjective. Through alignment and analysis of two independent AD tracks for the same movies, we quantify the subjectivity in when and whether to describe, and what and how to highlight. Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute-long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.
Anthology ID: 2025.emnlp-main.1199
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 23507–23529
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-main.1199/
DOI: 10.18653/v1/2025.emnlp-main.1199
Cite (ACL): Divy Kala, Eshika Khandelwal, and Makarand Tapaswi. 2025. What You See is What You Ask: Evaluating Audio Descriptions. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23507–23529, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): What You See is What You Ask: Evaluating Audio Descriptions (Kala et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-main.1199.pdf
Checklist: 2025.emnlp-main.1199.checklist.pdf