FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations

Yoonhyung Lee, Hyunsin Park, Jinhwan Park, Jinkyu Lee


Abstract
Recent advances in zero-shot text-to-speech (TTS) have enabled accurate imitation of reference speech in terms of both speaking style and speaker timbre. However, achieving disentangled control over these aspects from separate references remains a challenging task. Several studies have proposed disentangled speech representations that decompose speech into interpretable attributes (e.g., timbre, prosody, and content), providing a promising foundation for TTS with attribute control from separate references. Yet, how to effectively integrate such representations into TTS systems to achieve independent and precise control remains underexplored. In this paper, we present FC-TTS, a zero-shot TTS framework that enables disentangled control of style and timbre by conditioning on two distinct reference utterances. Unlike existing systems that inherit limitations from those pre-trained disentangled representations, FC-TTS introduces key design strategies, including architectural choices, training framework, and auxiliary training objectives, which improve the reliability of attribute separation and dual-reference control. Experiments show that FC-TTS achieves high-fidelity synthesis and competitive zero-shot naturalness, while uniquely supporting consistent and independent manipulation of style and timbre. Audio samples are available at https://qualcomm-ai-research.github.io/fc-tts
Anthology ID:
2026.acl-long.173
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
3772–3791
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.173/
DOI:
Bibkey:
Cite (ACL):
Yoonhyung Lee, Hyunsin Park, Jinhwan Park, and Jinkyu Lee. 2026. FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3772–3791, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations (Lee et al., ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.173.pdf
Checklist:
 2026.acl-long.173.checklist.pdf