Tiago Freitas


2008

CORP-ORAL: Spontaneous Speech Corpus for European Portuguese
Fabíola Santos | Tiago Freitas
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Research on speech synthesis and recognition for the Portuguese language has suffered from a considerable lack of human and material resources, which has hindered the development of speech technology and speech interface platforms. One of the most significant obstacles is the lack of spontaneous speech corpora for the creation, training and further improvement of speech synthesis and recognition programs. The CORP-ORAL project was planned to fill this gap. The aim of the project is to build a corpus of spontaneous European Portuguese (EP) available for the training of speech synthesis and recognition systems as well as for phonetic, phonological, lexical, morphological and syntactic studies. The corpus design also allows for further lines of enquiry, such as sociolinguistic and pragmatic research. The data consist of unscripted and unprompted face-to-face dialogues between family members, friends, colleagues and unacquainted participants. All recordings are orthographically transcribed and prosodically annotated. CORP-ORAL is built from scratch with the explicit goal of becoming entirely available on the internet to the scientific community and the general public.

Spock - a Spoken Corpus Client
Maarten Janssen | Tiago Freitas
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Spock is an open source tool for the easy deployment of time-aligned corpora. It is fully web-based and has very limited server-side requirements. It allows the end user to search the corpus in a text-driven manner, obtaining both the transcription and the corresponding sound fragment on the result page. Spock has an administration environment that helps manage the sound files and their respective transcription files, and also provides statistical data about the files at hand. Spock uses a proprietary file format for storing the alignment data, but the integrated admin environment allows files to be imported from a number of common file formats. Spock is not intended as a transcription program: it is not meant as an alternative to programs such as ELAN, Wavesurfer, or Transcriber, but rather as a way to make corpora created with these tools easily available online. For the end user, Spock provides a very easy way of accessing spoken corpora without the need to install any special software, which might make time-aligned corpora accessible to a large group of users who might otherwise never look at them.