SilVar: Speech-Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization

Tan-Hanh Pham, Le Hoang Nam, Phu-Vinh Nguyen, Chris Ngo, Truong-Son Hy


Abstract
Visual Language Models (VLMs) have demonstrated remarkable capabilities across various tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, which limits their effectiveness in natural human-machine interaction. Moreover, the quality of language models depends heavily on reasoning and prompting techniques, such as chain-of-thought, which remain underexplored when instructions are given as speech. To address these challenges, we propose SilVar, an end-to-end multimodal model that leverages speech instructions for reasoning-based visual question answering. We also investigate reasoning techniques at different levels, including conversational, simple, and complex speech instructions. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling more intuitive interaction by allowing users to provide verbal or text-based instructions. To support this, we introduce a new dataset designed to challenge models with speech-based reasoning tasks for object localization. The dataset strengthens the model’s ability to process and explain visual scenes from spoken input, moving beyond simple object recognition to reasoning-based interaction. To our knowledge, SilVar is the first open-source, speech-driven VLM. We believe SilVar will inspire the next generation of multimodal reasoning models, advancing toward expert artificial general intelligence.
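As a rough illustration of the architecture the abstract describes (speech and vision encoders feeding a language model), the following is a minimal structural sketch in PyTorch. The module layout, projection layers, and hidden dimensions (768 for a Whisper-class speech encoder, 1024 for a CLIP ViT-L vision tower, 4096 for LLaMA 3.1-8B) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a SilVar-style speech-driven VLM pipeline: features from a
# speech encoder (e.g., Whisper) and a vision encoder (e.g., CLIP) are projected
# into the language model's embedding space and prepended to the text tokens.
# All dimensions and module choices below are assumptions for illustration.
import torch
import torch.nn as nn

class SpeechDrivenVLMConnector(nn.Module):
    def __init__(self, d_speech=768, d_vision=1024, d_llm=4096):
        super().__init__()
        # Linear projectors mapping encoder features into the LLM token space.
        self.speech_proj = nn.Linear(d_speech, d_llm)  # Whisper features -> LLM space
        self.vision_proj = nn.Linear(d_vision, d_llm)  # CLIP features -> LLM space

    def forward(self, speech_feats, vision_feats, text_embeds):
        # speech_feats: (B, T_s, d_speech), vision_feats: (B, T_v, d_vision),
        # text_embeds: (B, T_t, d_llm). Returns one multimodal input sequence
        # for the language model (the LLM itself is omitted here).
        speech_tokens = self.speech_proj(speech_feats)
        vision_tokens = self.vision_proj(vision_feats)
        return torch.cat([speech_tokens, vision_tokens, text_embeds], dim=1)

# Usage with dummy tensors standing in for encoder outputs:
connector = SpeechDrivenVLMConnector()
seq = connector(torch.randn(1, 50, 768), torch.randn(1, 256, 1024), torch.randn(1, 32, 4096))
print(seq.shape)  # torch.Size([1, 338, 4096])
```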
Anthology ID:
2025.emnlp-main.589
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
11674–11685
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.589/
Cite (ACL):
Tan-Hanh Pham, Le Hoang Nam, Phu-Vinh Nguyen, Chris Ngo, and Truong-Son Hy. 2025. SilVar: Speech-Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11674–11685, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
SilVar: Speech-Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization (Pham et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.589.pdf
Checklist:
2025.emnlp-main.589.checklist.pdf