@inproceedings{biswas-etal-2025-raven,
title = "{RAVEN}: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language",
author = "Biswas, Subrata and
Khan, Mohammad Nur Hossain and
Islam, Bashima",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.96/",
doi = "10.18653/v1/2025.emnlp-main.96",
pages = "1868--1894",
ISBN = "979-8-89176-332-6",
abstract = "Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning - each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks - including egocentric and exocentric tasks - show that RAVEN achieves up to 14.5{\%} and 8.0{\%} gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4{\%} boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23{\%}. Our code and dataset are available at https://github.com/BASHLab/RAVEN."
}
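
The abstract describes QuART as a query-conditioned cross-modal gating module that assigns a scalar relevance score to every audio, video, and sensor token and rescales tokens before fusion. The sketch below is a minimal illustration of that idea only; it is not the authors' implementation. The module name `QueryConditionedGate`, the linear projections, and the sigmoid dot-product scoring are assumptions made for this example.

```python
# Illustrative sketch (not the paper's code) of query-conditioned cross-modal
# gating in the spirit of QuART: each token from every modality receives a
# scalar relevance score conditioned on the question, and tokens are scaled
# by that score before the streams are concatenated for fusion.
import torch
import torch.nn as nn


class QueryConditionedGate(nn.Module):
    """Hypothetical gating module; names and scoring rule are assumptions."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)  # project pooled question
        self.token_proj = nn.Linear(d_model, d_model)  # project modality tokens

    def forward(self, query_emb: torch.Tensor, modality_tokens: dict):
        """
        query_emb:       (B, d_model) pooled question embedding
        modality_tokens: name -> (B, T_m, d_model) token sequences per modality
        Returns the gated tokens concatenated along the sequence axis and the
        per-token relevance scores for inspection.
        """
        q = self.query_proj(query_emb).unsqueeze(1)  # (B, 1, d)
        gated, scores = [], {}
        for name, tokens in modality_tokens.items():
            k = self.token_proj(tokens)  # (B, T_m, d)
            # Scalar relevance per token: scaled dot product with the query.
            rel = torch.sigmoid((k * q).sum(-1) / k.shape[-1] ** 0.5)  # (B, T_m)
            scores[name] = rel
            gated.append(tokens * rel.unsqueeze(-1))  # amplify or suppress tokens
        return torch.cat(gated, dim=1), scores


# Toy usage with random features standing in for audio/video/sensor encoders.
if __name__ == "__main__":
    B, d = 2, 256
    gate = QueryConditionedGate(d)
    streams = {
        "audio": torch.randn(B, 50, d),
        "video": torch.randn(B, 196, d),
        "sensor": torch.randn(B, 32, d),
    }
    fused_tokens, relevance = gate(torch.randn(B, d), streams)
    print(fused_tokens.shape)  # torch.Size([2, 278, 256])
```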