Stochastic Parrots or True Virtuosos? Digging Deeper Into the Audio-Video Understanding of AVQA Models

Sara Pernille Jensen; Hallvard Innset Hurum; Anna-Maria Christodoulou

Abstract

Audio-video question answering (AVQA) systems for music show signs of multimodal "understanding", but it is unclear which inputs they rely on or whether their behavior reflects genuine audio-video reasoning. Existing evaluations focus on overall accuracy and rarely examine modality dependence. We address this gap by proposing a counterfactual evaluation method for analysing the audio-video understanding of such models, illustrated with a case study on the audio-video spatial-temporal (AVST) architecture. The method applies interventions that zero out or swap audio, video, or both, and benchmarks the results against a baseline based on linguistic patterns alone. Results show stronger reliance on audio than video, yet performance persists when either modality is removed, indicating learned cross-modal representations. The AVQA system studied thus exhibits non-trivial multimodal integration, though its "understanding" remains uneven.
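
The interventions named in the abstract (zeroing out or swapping audio, video, or both) can be illustrated with a minimal sketch. This is not the authors' code: the sample layout, the names `zero_out`, `intervene`, and `accuracy`, and the choice of swapping by a shifted index are all assumptions made for illustration, and the model is treated as an opaque callable.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical data layout: one feature vector per time step for each modality.
Features = List[List[float]]
Sample = Tuple[Features, Features, str, str]  # (audio, video, question, answer)


def zero_out(feats: Features) -> Features:
    """Return an all-zero copy with the same shape as the input features."""
    return [[0.0] * len(row) for row in feats]


def intervene(samples: List[Sample], mode: str, target: str, seed: int = 0) -> List[Sample]:
    """Build a counterfactual copy of the evaluation set.

    mode:   "zero" replaces the targeted modality with zeros,
            "swap" replaces it with the same modality from another sample.
    target: "audio", "video", or "both".
    """
    if mode not in ("zero", "swap") or target not in ("audio", "video", "both"):
        raise ValueError("unsupported mode or target")
    rng = random.Random(seed)
    n = len(samples)
    # Assumption: a nonzero index shift guarantees no sample receives its own input.
    offset = rng.randrange(1, n) if n > 1 else 0
    out = []
    for i, (audio, video, question, answer) in enumerate(samples):
        donor_audio, donor_video, _, _ = samples[(i + offset) % n]
        if target in ("audio", "both"):
            audio = zero_out(audio) if mode == "zero" else donor_audio
        if target in ("video", "both"):
            video = zero_out(video) if mode == "zero" else donor_video
        # Question and gold answer are left untouched by every intervention.
        out.append((audio, video, question, answer))
    return out


def accuracy(model: Callable[[Features, Features, str], str], samples: List[Sample]) -> float:
    """Fraction of samples for which the model's answer matches the gold answer."""
    correct = sum(model(a, v, q) == ans for a, v, q, ans in samples)
    return correct / len(samples)
```

Under these assumptions, comparing `accuracy(model, samples)` with `accuracy(model, intervene(samples, "zero", "audio"))`, and likewise for the other mode and target combinations, gives one accuracy figure per intervention that can be set against a question-only baseline.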

Anthology ID: 2026.nlp4musa-1.3
Volume: Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Elena V. Epure, Sergio Oramas, SeungHeon Doh, Pedro Ramoneda, Anna Kruspe, Mohamed Sordo
Venues: NLP4MusA | WS
Publisher: Association for Computational Linguistics
Pages: 13–19
URL: https://preview.aclanthology.org/manual-author-scripts/2026.nlp4musa-1.3/
Cite (ACL): Sara Pernille Jensen, Hallvard Innset Hurum, and Anna-Maria Christodoulou. 2026. Stochastic Parrots or True Virtuosos? Digging Deeper Into the Audio-Video Understanding of AVQA Models. In Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026), pages 13–19, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Stochastic Parrots or True Virtuosos? Digging Deeper Into the Audio-Video Understanding of AVQA Models (Jensen et al., NLP4MusA 2026)
PDF: https://preview.aclanthology.org/manual-author-scripts/2026.nlp4musa-1.3.pdf
