ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations
Neil Shah | Saiteja Kosgi | Vishal Tambrahalli | Neha S | Anil Nelakanti | Vineet Gandhi
Findings of the Association for Computational Linguistics: EACL 2024
We present ParrotTTS, a modularized text-to-speech synthesis model leveraging disentangled self-supervised speech representations. It can train a multi-speaker variant effectively using transcripts from a single speaker. ParrotTTS adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker’s voice and accent. We present extensive results in monolingual and multilingual scenarios. ParrotTTS outperforms state-of-the-art multilingual text-to-speech (TTS) models using only a fraction of the paired data required by the latter. Speech samples from ParrotTTS and code can be found at https://parrot-tts.github.io/tts/