Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma


Abstract
We present an Audio-Visual Language Model (AVLM) for expressive speech generation, built by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
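To make the fusion idea concrete, below is a minimal sketch of one common way to inject frame-level visual features into a speech language model: cross-attention with a residual connection. The paper compares several encoders and fusion strategies, and this is not the authors' exact method; the module name, dimensions, and tensor shapes here are illustrative assumptions.

```python
# Hypothetical sketch (PyTorch): cross-attention fusion of visual features
# into speech hidden states. One of several plausible fusion strategies;
# not the specific architecture used in the paper.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, speech_h: torch.Tensor, visual_h: torch.Tensor) -> torch.Tensor:
        # speech_h: (B, T_s, d) hidden states from the speech LM
        # visual_h: (B, T_v, d) full-face features from a visual encoder,
        #           projected to the speech model's hidden size
        fused, _ = self.attn(query=speech_h, key=visual_h, value=visual_h)
        return self.norm(speech_h + fused)  # residual keeps the speech pathway intact

# Toy usage with random tensors
fusion = CrossAttentionFusion()
speech = torch.randn(2, 100, 768)  # 2 utterances, 100 speech frames
visual = torch.randn(2, 50, 768)   # 50 video frames per utterance
out = fusion(speech, visual)       # (2, 100, 768)
```

The residual design lets the pre-trained speech model fall back on audio-only behavior when visual cues are uninformative, which is one motivation for fusing a frozen or lightly adapted speech backbone rather than training from scratch.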
Anthology ID:
2025.findings-emnlp.140
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2600–2617
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.140/
DOI:
10.18653/v1/2025.findings-emnlp.140
Cite (ACL):
Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, and Xutai Ma. 2025. Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2600–2617, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation (Tan et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.140.pdf
Checklist:
2025.findings-emnlp.140.checklist.pdf