Evaluating Sampling-based Filler Insertion with Spontaneous TTS

Siyang Wang, Joakim Gustafson, Éva Székely


Abstract
Inserting fillers (such as “um”, “like”) into clean speech text has a rich history of study. One major application is to make dialogue systems sound more spontaneous. The ambiguity of filler occurrence and inter-speaker differences make both modeling and evaluation difficult. In this paper, we study sampling-based filler insertion, a simple yet unexplored approach to inserting fillers. We propose an objective score called Filler Perplexity (FPP). We build three models trained on two single-speaker spontaneous corpora, and evaluate them with FPP and perceptual tests. We implement two innovations in the perceptual tests: (1) evaluating filler insertion on dialogue system output, and (2) synthesizing speech with neural spontaneous TTS engines. FPP proves to be useful in analysis but does not correlate well with perceptual MOS. Perceptual results show little difference between the compared filler insertion models, including the ground truth, which may be due to the ambiguity of what constitutes good filler insertion and to a strong neural spontaneous TTS that produces natural speech irrespective of input. Results also show a preference for filler-inserted speech synthesized with spontaneous TTS. The same test using TTS based on read speech obtains the opposite result, which shows the importance of using spontaneous TTS in evaluating filler insertion. Audio samples: www.speech.kth.se/tts-demos/LREC22
Anthology ID:
2022.lrec-1.210
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1960–1969
URL:
https://aclanthology.org/2022.lrec-1.210
Cite (ACL):
Siyang Wang, Joakim Gustafson, and Éva Székely. 2022. Evaluating Sampling-based Filler Insertion with Spontaneous TTS. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1960–1969, Marseille, France. European Language Resources Association.
Cite (Informal):
Evaluating Sampling-based Filler Insertion with Spontaneous TTS (Wang et al., LREC 2022)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2022.lrec-1.210.pdf
Data
LJSpeech