Laure Charonnat


2012

Towards Fully Automatic Annotation of Audio Books for TTS
Olivier Boeffard | Laure Charonnat | Sébastien Le Maguer | Damien Lolive
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Building speech corpora is a first and crucial step for every text-to-speech synthesis system. Nowadays, statistical models require very large corpora that must be recorded, transcribed, annotated and segmented before they can be used. The variety of corpora needed for recent applications (content, style, etc.) makes existing digital audio resources very attractive. Among the available resources, audiobooks are of particular interest because of their quality. Within this framework, we propose a complete acquisition, segmentation and annotation chain for audiobooks that aims to be fully automatic. The process relies on a data structure, Roots, that establishes the relations between the different annotation levels, each represented as a sequence of items. This methodology has been successfully applied to 11 hours of speech extracted from an audiobook. A manual check on part of the corpus shows the efficiency of the process.
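The abstract describes annotation levels stored as sequences of items, with relations linking items across levels. The sketch below is a minimal illustration of that idea, assuming simple index-pair relations; the class names `Sequence`, `Relation` and `Utterance` and the `project` method are hypothetical and are not the actual Roots API. It shows how a word can be mapped to its time segments by composing a word-to-phoneme relation with a phoneme-to-segment relation.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """One annotation level: an ordered list of items (words, phonemes, ...)."""
    name: str
    items: list


@dataclass
class Relation:
    """Many-to-many links between item indices of two sequences (hypothetical)."""
    source: str
    target: str
    links: list = field(default_factory=list)  # (source_index, target_index) pairs

    def targets_of(self, i):
        return [t for s, t in self.links if s == i]


class Utterance:
    """Holds several annotation levels and the relations between them."""

    def __init__(self):
        self.sequences = {}
        self.relations = {}

    def add_sequence(self, seq):
        self.sequences[seq.name] = seq

    def add_relation(self, rel):
        self.relations[(rel.source, rel.target)] = rel

    def linked_indices(self, source, target, i):
        """Indices in `target` linked to item i of `source`, following a direct
        relation or composing two relations through one intermediate level."""
        direct = self.relations.get((source, target))
        if direct is not None:
            return direct.targets_of(i)
        for (start, middle), first in self.relations.items():
            second = self.relations.get((middle, target))
            if start != source or second is None:
                continue
            result = []
            for m in first.targets_of(i):
                for t in second.targets_of(m):
                    if t not in result:
                        result.append(t)
            return result
        raise KeyError(f"no relation path from {source!r} to {target!r}")

    def project(self, source, target, i):
        items = self.sequences[target].items
        return [items[j] for j in self.linked_indices(source, target, i)]


# Hypothetical example: the French phrase "le chat" with phonemes and time segments.
u = Utterance()
u.add_sequence(Sequence("word", ["le", "chat"]))
u.add_sequence(Sequence("phoneme", ["l", "@", "S", "a"]))
u.add_sequence(Sequence("segment", [(0.00, 0.05), (0.05, 0.12), (0.12, 0.20), (0.20, 0.31)]))
u.add_relation(Relation("word", "phoneme", [(0, 0), (0, 1), (1, 2), (1, 3)]))
u.add_relation(Relation("phoneme", "segment", [(k, k) for k in range(4)]))

print(u.project("word", "segment", 1))  # [(0.12, 0.2), (0.2, 0.31)]
```

Keeping each level as a plain sequence and storing links separately means a new annotation level can be added without touching the existing ones; only a new relation is required.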

Vers une annotation automatique de corpus audio pour la synthèse de parole (Towards Fully Automatic Annotation of Audio Books for Text-To-Speech (TTS) Synthesis) [in French]
Olivier Boëffard | Laure Charonnat | Sébastien Le Maguer | Damien Lolive | Gaëlle Vidal
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP

2008

Automatic Phone Segmentation of Expressive Speech
Laure Charonnat | Gaëlle Vidal | Olivier Boeffard
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

To improve the flexibility and precision of an automatic phone segmentation system for one type of expressive speech, the French dubbing of fiction films, we developed both the phonetic labelling process and the alignment process. The automatic labelling system relies on HMM modelling and on an automatic grapheme-to-phoneme conversion that includes all the variants of the phonetic chain. In this article, we distinguish three sets of phone models: a set of context-independent models, a set of left- and right-context-dependent models, and finally a mixture of the two that combines phone and triphone models according to the alignment precision obtained for each broad phonetic class. The three model sets are evaluated on a test corpus. On the one hand, we observe a slight decrease in the phonetic labelling score, mainly due to pause insertions; on the other hand, the mixed model set gives the best alignment precision.
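The mixed model set chooses, for each broad phonetic class, between context-independent (monophone) and context-dependent (triphone) models according to alignment precision. The abstract does not give the exact criterion, so the sketch below assumes one plausible version: precision is the fraction of boundaries placed within a 20 ms tolerance of a manual reference (a common convention, assumed here), and each class keeps whichever model set scores higher, falling back to monophones on a tie. The function names, broad classes and boundary times are all hypothetical.

```python
from collections import defaultdict


def boundary_precision(reference, hypothesis, tolerance=0.020):
    """Fraction of automatic boundaries within `tolerance` seconds of the
    manual reference; both lists are paired one-to-one."""
    if len(reference) != len(hypothesis):
        raise ValueError("boundary lists must have the same length")
    if not reference:
        return 0.0
    hits = sum(1 for r, h in zip(reference, hypothesis) if abs(r - h) <= tolerance)
    return hits / len(reference)


def choose_models(boundaries, tolerance=0.020):
    """boundaries: iterable of (broad_class, ref, mono, tri) times in seconds.
    Returns {broad_class: "monophone" or "triphone"} (hypothetical rule)."""
    grouped = defaultdict(lambda: ([], [], []))
    for cls, ref, mono, tri in boundaries:
        refs, monos, tris = grouped[cls]
        refs.append(ref)
        monos.append(mono)
        tris.append(tri)
    choice = {}
    for cls, (refs, monos, tris) in grouped.items():
        p_mono = boundary_precision(refs, monos, tolerance)
        p_tri = boundary_precision(refs, tris, tolerance)
        # Prefer the simpler monophone models unless triphones are strictly better.
        choice[cls] = "triphone" if p_tri > p_mono else "monophone"
    return choice


# Invented development-set boundaries: (class, manual, monophone, triphone).
data = [
    ("vowel", 0.500, 0.530, 0.505),
    ("vowel", 1.000, 1.004, 1.002),
    ("plosive", 0.800, 0.805, 0.850),
    ("plosive", 1.200, 1.210, 1.190),
]
print(choose_models(data))  # {'vowel': 'triphone', 'plosive': 'monophone'}
```

At alignment time, each phone would then be aligned with the model set selected for its broad class, which is one way a mixed set can outperform either pure set on boundary precision.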