UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation
Zhengyan Sheng, Zhihao Du, Heng Lu, ShiLiang Zhang, Zhen-Hua Ling
Abstract
While recent advances in reference-based speaker cloning have significantly improved the authenticity of synthetic speech, speaker generation driven by multimodal cues such as visual appearance, textual descriptions, and other biometric signals remains in its early stages. To pioneer truly multimodal-controllable speaker generation, we propose UniSpeaker, the first framework supporting unified voice synthesis from arbitrary modality combinations. Specifically, self-distillation is first applied to a large-scale speech generation model for speaker disentanglement. To overcome data sparsity and one-to-many mapping challenges, a novel KV-Former-based unified voice aggregator is introduced, where multiple modalities are projected into a shared latent space through soft contrastive learning to ensure accurate alignment with user-specified vocal characteristics. Additionally, to advance the field, the first Multimodal Voice Control (MVC) benchmark is established to evaluate voice suitability, diversity, and quality. When tested across five MVC tasks, UniSpeaker is shown to surpass existing modality-specific models. Speech samples and the MVC benchmark are available at https://UniSpeaker.github.io.
- Anthology ID: 2025.findings-emnlp.1381
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 25331–25346
- URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1381/
- DOI: 10.18653/v1/2025.findings-emnlp.1381
- Cite (ACL): Zhengyan Sheng, Zhihao Du, Heng Lu, ShiLiang Zhang, and Zhen-Hua Ling. 2025. UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 25331–25346, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation (Sheng et al., Findings 2025)
- PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1381.pdf
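The abstract describes projecting multiple modalities into a shared latent space via soft contrastive learning so that similar voices remain acceptable targets for a given cue. The snippet below is a minimal, hypothetical sketch of that general idea, not the authors' released code: the function name, the bidirectional soft cross-entropy, and the choice of deriving soft targets from voice-voice similarity are all assumptions for illustration.

```python
# Hypothetical sketch of soft contrastive alignment between a modality
# embedding (e.g., a face or text-description encoder output) and a voice
# embedding. NOT the paper's implementation; the target-softening scheme
# (targets taken from voice-voice similarity) is an assumption.
import torch
import torch.nn.functional as F

def soft_contrastive_loss(modality_emb, voice_emb,
                          temperature=0.07, target_temperature=0.1):
    """modality_emb, voice_emb: (batch, dim) embeddings from paired samples."""
    m = F.normalize(modality_emb, dim=-1)
    v = F.normalize(voice_emb, dim=-1)

    # Cross-modal similarity logits (modality -> voice).
    logits = m @ v.t() / temperature  # (batch, batch)

    # Soft targets: voices that sound alike keep non-zero probability,
    # relaxing the one-to-many mapping instead of forcing one-hot pairs.
    with torch.no_grad():
        targets = F.softmax(v @ v.t() / target_temperature, dim=-1)

    # Soft cross-entropy in both retrieval directions.
    loss_m2v = (-targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_v2m = (-targets.t() * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_m2v + loss_v2m)
```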