UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation
Zhengyan Sheng, Zhihao Du, Heng Lu, ShiLiang Zhang, Zhen-Hua Ling
Abstract
While recent advances in reference-based speaker cloning have significantly improved the authenticity of synthetic speech, speaker generation driven by multimodal cues such as visual appearance, textual descriptions, and other biometric signals remains in its early stages. To pioneer truly multimodal-controllable speaker generation, we propose UniSpeaker, the first framework supporting unified voice synthesis from arbitrary modality combinations. Specifically, self-distillation is first applied to a large-scale speech generation model for speaker disentanglement. To overcome data sparsity and one-to-many mapping challenges, a novel KV-Former-based unified voice aggregator is introduced, where multiple modalities are projected into a shared latent space through soft contrastive learning to ensure accurate alignment with user-specified vocal characteristics. Additionally, to advance the field, the first Multimodal Voice Control (MVC) benchmark is established to evaluate voice suitability, diversity, and quality. When tested across five MVC tasks, UniSpeaker is shown to surpass existing modality-specific models. Speech samples and the MVC benchmark are available at https://UniSpeaker.github.io.
- Anthology ID: 2025.findings-emnlp.1381
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 25331–25346
- URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1381/
- DOI: 10.18653/v1/2025.findings-emnlp.1381
- Cite (ACL): Zhengyan Sheng, Zhihao Du, Heng Lu, ShiLiang Zhang, and Zhen-Hua Ling. 2025. UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 25331–25346, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation (Sheng et al., Findings 2025)
- PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1381.pdf
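The abstract describes projecting multiple modalities into a shared latent space via soft contrastive learning so that similar voices remain acceptable targets for a given cue. The snippet below is a minimal, hypothetical sketch of that general idea, not the authors' released code: the function name, the bidirectional soft cross-entropy, and the choice of deriving soft targets from voice-voice similarity are all assumptions for illustration.

```python
# Hypothetical sketch of soft contrastive alignment between a modality
# embedding (e.g., a face or text-description encoder output) and a voice
# embedding. NOT the paper's implementation; the target-softening scheme
# (targets taken from voice-voice similarity) is an assumption.
import torch
import torch.nn.functional as F

def soft_contrastive_loss(modality_emb, voice_emb,
                          temperature=0.07, target_temperature=0.1):
    """modality_emb, voice_emb: (batch, dim) embeddings from paired samples."""
    m = F.normalize(modality_emb, dim=-1)
    v = F.normalize(voice_emb, dim=-1)

    # Cross-modal similarity logits (modality -> voice).
    logits = m @ v.t() / temperature  # (batch, batch)

    # Soft targets: voices that sound alike keep non-zero probability,
    # relaxing the one-to-many mapping instead of forcing one-hot pairs.
    with torch.no_grad():
        targets = F.softmax(v @ v.t() / target_temperature, dim=-1)

    # Soft cross-entropy in both retrieval directions.
    loss_m2v = (-targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_v2m = (-targets.t() * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_m2v + loss_v2m)
```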