HFSD-V2C: Zero-Shot Visual Voice Cloning Via Hierarchical Face-Styled Diffusion Model

Yaping Liu; Linqin Wang; Shengxiang Gao; Zhengtao Yu (余正涛); Ling Dong

Abstract

The goal of this work is zero-shot visual voice cloning (ZS-V2C), which aims to generate speech for an unseen speaker, with identity and prosody derived from a video clip and an acoustic reference. ZS-V2C poses two main challenges: 1) modeling unseen speakers; and 2) modeling unseen prosody. Unlike previous works, we propose a novel ZS-V2C framework that incorporates a hierarchical face-styled diffusion model (HFSD-V2C). First, we leverage cross-modal biometrics to predict unseen speaker embeddings from facial features. Then, we jointly model the unseen prosodic features at the text, speech, and video levels. Finally, we construct a diffusion model conditioned on the unseen speaker embeddings and prosodic features, enabling the generation of expressive and diverse speech. Extensive experiments on the LRS2 and GRID benchmark datasets demonstrate the superior performance of our proposed method.

Anthology ID: 2025.ccl-1.77
Volume: Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
Month: August
Year: 2025
Address: Jinan, China
Editors: Maosong Sun, Peiyong Duan, Zhiyuan Liu, Ruifeng Xu, Weiwei Sun
Venue: CCL
Publisher: Chinese Information Processing Society of China
Pages: 1020–1030
URL: https://preview.aclanthology.org/ingest-ccl/2025.ccl-1.77/
Cite (ACL): Yaping Liu, Linqin Wang, Shengxiang Gao, Zhengtao Yu, and Ling Dong. 2025. HFSD-V2C: Zero-Shot Visual Voice Cloning Via Hierarchical Face-Styled Diffusion Model. In Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025), pages 1020–1030, Jinan, China. Chinese Information Processing Society of China.
Cite (Informal): HFSD-V2C: Zero-Shot Visual Voice Cloning Via Hierarchical Face-Styled Diffusion Model (Liu et al., CCL 2025)
PDF: https://preview.aclanthology.org/ingest-ccl/2025.ccl-1.77.pdf