Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework

Mingbo Ma, Baigong Zheng, Kaibo Liu, Renjie Zheng, Hairong Liu, Kainan Peng, Kenneth Church, Liang Huang


Abstract
Text-to-speech synthesis (TTS) has progressed rapidly in recent years, and neural methods can now produce highly natural audio. However, these methods still suffer from two types of latency: (a) computational latency (synthesis time), which grows linearly with sentence length, and (b) input latency in scenarios where the input text becomes available incrementally (such as simultaneous translation, dialog generation, and assistive technologies). To reduce both, we propose a neural incremental TTS approach using the prefix-to-prefix framework from simultaneous translation. We synthesize speech in an online fashion, playing one segment of audio while generating the next, which results in O(1) rather than O(n) latency. Experiments on English and Chinese TTS show that our approach achieves speech naturalness similar to full-sentence TTS with only a constant latency of 1-2 words.
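The latency claim in the abstract can be illustrated with a toy sketch. It assumes a fixed lookahead of k words, so the audio for word t is generated once the first t + k words of text are visible. The parameter k and the function names `lookahead_prefixes` and `words_before_first_audio` are hypothetical and chosen for illustration only; the abstract does not specify the policy, and this is not the authors' implementation.

```python
from typing import Iterator, List, Tuple


def lookahead_prefixes(words: List[str], k: int = 1) -> Iterator[Tuple[str, List[str]]]:
    """Yield (word to synthesize, visible text prefix) pairs.

    Assumption: a fixed lookahead of k words, so the segment for the word at
    index t is produced from the prefix words[: t + 1 + k], not the full sentence.
    """
    for t, word in enumerate(words):
        # The visible prefix is capped at the sentence length near the end.
        visible = words[: min(t + 1 + k, len(words))]
        yield word, visible


def words_before_first_audio(n_words: int, k: int = 1, incremental: bool = True) -> int:
    """Number of input words required before the first audio segment can play."""
    if incremental:
        # Constant in sentence length: first word plus k lookahead words.
        return min(n_words, k + 1)
    # Full-sentence TTS must wait for (and process) all n words.
    return n_words


if __name__ == "__main__":
    sentence = "playing a segment of audio while generating the next".split()
    for word, visible in lookahead_prefixes(sentence, k=1):
        print(f"synthesize {word!r} given {len(visible)} visible words")
    print(words_before_first_audio(len(sentence), k=1))
    print(words_before_first_audio(len(sentence), incremental=False))
```

Under these assumptions, the wait before playback depends only on k, whereas the full-sentence baseline waits for every word; with k = 1 the wait is two words, in line with the 1-2 word latency the abstract reports.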
Anthology ID:
2020.findings-emnlp.346
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Editors:
Trevor Cohn, Yulan He, Yang Liu
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3886–3896
URL:
https://aclanthology.org/2020.findings-emnlp.346
DOI:
10.18653/v1/2020.findings-emnlp.346
Cite (ACL):
Mingbo Ma, Baigong Zheng, Kaibo Liu, Renjie Zheng, Hairong Liu, Kainan Peng, Kenneth Church, and Liang Huang. 2020. Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3886–3896, Online. Association for Computational Linguistics.
Cite (Informal):
Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework (Ma et al., Findings 2020)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2020.findings-emnlp.346.pdf