Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma


Abstract
We present an Audio-Visual Language Model (AVLM) for expressive speech generation, built by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
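To make the fusion idea concrete, below is a minimal sketch of one common way to inject frame-level visual features into a speech language model: cross-attention with a residual connection. The paper compares several encoders and fusion strategies, and this is not the authors' exact method; the module name, dimensions, and tensor shapes here are illustrative assumptions.

```python
# Hypothetical sketch (PyTorch): cross-attention fusion of visual features
# into speech hidden states. One of several plausible fusion strategies;
# not the specific architecture used in the paper.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, speech_h: torch.Tensor, visual_h: torch.Tensor) -> torch.Tensor:
        # speech_h: (B, T_s, d) hidden states from the speech LM
        # visual_h: (B, T_v, d) full-face features from a visual encoder,
        #           projected to the speech model's hidden size
        fused, _ = self.attn(query=speech_h, key=visual_h, value=visual_h)
        return self.norm(speech_h + fused)  # residual keeps the speech pathway intact

# Toy usage with random tensors
fusion = CrossAttentionFusion()
speech = torch.randn(2, 100, 768)  # 2 utterances, 100 speech frames
visual = torch.randn(2, 50, 768)   # 50 video frames per utterance
out = fusion(speech, visual)       # (2, 100, 768)
```

The residual design lets the pre-trained speech model fall back on audio-only behavior when visual cues are uninformative, which is one motivation for fusing a frozen or lightly adapted speech backbone rather than training from scratch.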
Anthology ID:
2025.findings-emnlp.140
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2600–2617
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.140/
DOI:
10.18653/v1/2025.findings-emnlp.140
Cite (ACL):
Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, and Xutai Ma. 2025. Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2600–2617, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation (Tan et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.140.pdf
Checklist:
2025.findings-emnlp.140.checklist.pdf