Gregor Hofer


2012

Building a synchronous corpus of acoustic and 3D facial marker data for adaptive audio-visual speech synthesis
Dietmar Schabus | Michael Pucher | Gregor Hofer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We have created a synchronous corpus of acoustic and 3D facial marker data from multiple speakers for adaptive audio-visual text-to-speech synthesis. The corpus contains data from one female and two male speakers, with 223 Austrian German sentences per speaker. In this paper, we first describe the recording process, which used professional audio equipment and a marker-based 3D facial motion capture system for the audio-visual recordings. We then turn to post-processing, which comprises forced alignment, principal component analysis (PCA) on the visual data, and some manual checking and correction. Finally, we describe the resulting corpus, which will be released under a research license at the end of our project. We show that the standard PCA-based feature extraction approach also works on a multi-speaker database in the adaptation scenario, where no data from the target speaker is available in the PCA step.
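The adaptation scenario described in the abstract can be illustrated with a minimal sketch: the PCA basis is estimated only from the training speakers, and the held-out target speaker's marker frames are then projected onto that basis. Everything below is an assumption made for illustration and is not taken from the paper: the data is synthetic, and the marker count, frame count, number of components, and function names (fit_pca, project, reconstruct) are hypothetical.

```python
import numpy as np

def fit_pca(frames, k):
    """Estimate a PCA basis (mean and top-k components) via SVD."""
    mean = frames.mean(axis=0)
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:k]

def project(frames, mean, components):
    """Map marker frames to k-dimensional PCA features."""
    return (frames - mean) @ components.T

def reconstruct(features, mean, components):
    """Map PCA features back to marker coordinates."""
    return features @ components + mean

# Synthetic stand-in for marker data (assumption): each frame holds the
# x, y, z coordinates of n_markers markers, flattened to one row.
rng = np.random.default_rng(0)
n_frames, n_markers, k = 200, 10, 5
basis = rng.normal(size=(k, 3 * n_markers))

def fake_speaker():
    latent = rng.normal(size=(n_frames, k))
    noise = 0.01 * rng.normal(size=(n_frames, 3 * n_markers))
    return latent @ basis + noise

# PCA is fitted on two training speakers only; the target speaker is held out.
train = np.vstack([fake_speaker(), fake_speaker()])
target = fake_speaker()

mean, components = fit_pca(train, k)
target_features = project(target, mean, components)

# Root-mean-square reconstruction error on the unseen target speaker.
residual = reconstruct(target_features, mean, components) - target
rmse = float(np.sqrt(np.mean(residual ** 2)))
print(target_features.shape, rmse)
```

In this toy setting the reconstruction error on the held-out speaker stays small because all synthetic speakers share the same underlying subspace; whether real facial marker data behaves this way is exactly what the paper evaluates.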

2010

Resources for Speech Synthesis of Viennese Varieties
Michael Pucher | Friedrich Neubarth | Volker Strom | Sylvia Moosmüller | Gregor Hofer | Christian Kranzler | Gudrun Schuchmann | Dietmar Schabus
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes our work on developing corpora of three varieties of Viennese for unit selection speech synthesis. The synthetic voices for the Viennese varieties, implemented with Multisyn, the open-domain unit selection speech synthesis engine of Festival, will also be released within Festival. The paper focuses on two questions in particular: how we selected appropriate speakers and how we obtained the text sources needed for recording these non-standard varieties. Regarding the first question, working with a ‘prototypical’ professional speaker turned out to be preferable to striving for authenticity. In addition, we briefly outline the differences between the Austrian standard and its dialectal varieties and how we solved certain technical problems related to these differences. In particular, the specific set of phones applicable to each variety had to be determined by applying various constraints. Since such a set serves no descriptive purpose but instead influences the quality of speech synthesis, designing it carefully was an important task.