Sérgio Paulo
2026
FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions
Francisco Teixeira | Carlos Carvalho | Mariana Julião | Catarina Botelho | Rubén Solera-Ureña | Sérgio Paulo | Thomas Rolland | Ben Peters | Isabel Trancoso | Alberto Abad
Proceedings of the Fifteenth Language Resources and Evaluation Conference
State-of-the-art performance in Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. With considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP, around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data. Of these, 4,850 hours carry speaker identity annotations, covering a total of 1,180 speakers with associated metadata including age, gender, political affiliation, and parliamentary role. The corpus was built using a state-of-the-art EP CAMÕES ASR model for transcription-reference alignment. In this paper, we describe the data collection process, together with the main characteristics of the FalAR corpus. Furthermore, we evaluate the trade-off between data quantity and alignment accuracy on ASR performance, with our experiments demonstrating that incorporating FalAR as pre-training data yields up to 14% relative WER improvement over baseline models.
2014
SAVAS: Collecting, Annotating and Sharing Audiovisual Language Resources for Automatic Subtitling
Arantza del Pozo | Carlo Aliprandi | Aitor Álvarez | Carlos Mendes | Joao P. Neto | Sérgio Paulo | Nicola Piccinini | Matteo Raffaelli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper describes the data collection, annotation, and sharing activities carried out within the FP7 EU-funded SAVAS project. The project aims to collect, share, and reuse audiovisual language resources from broadcasters and subtitling companies in order to develop large-vocabulary continuous speech recognisers for specific domains and new languages, with the purpose of addressing the automated subtitling needs of the media industry.
2008
Methodologies for Designing and Recording Speech Databases for Corpus Based Synthesis
Luís Oliveira | Sérgio Paulo | Luís Figueira | Carlos Mendes | Ana Nunes | Joaquim Godinho
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper we share our experience and describe the methodologies we have used in designing and recording large speech databases for applications requiring speech synthesis. Given the growing demand for customized and domain-specific voices for use in corpus-based synthesis systems, we believe that good practices should be established for the creation of these databases, which are a key factor in the quality of the resulting speech synthesizer. We focus on the design of the recording prompts, the speaker selection procedure, the recording setup, and the quality control of the resulting database. One of the major challenges was to ensure the uniformity of the recordings across the 20 two-hour recording sessions that each speaker had to perform, producing a total of 13 hours of recorded speech for each of the four speakers. This work was conducted within the scope of the Tecnovoz project, which brought together 4 speech research centers and 9 companies with the goal of integrating speech technologies into a wide range of applications.