2024
FalAI: A Dataset for End-to-end Spoken Language Understanding in a Low-Resource Scenario
Andres Pineiro-Martin | Carmen Garcia-Mateo | Laura Docio-Fernandez | Maria del Carmen Lopez-Perez | Jose Gandarela-Rodriguez
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
End-to-end (E2E) Spoken Language Understanding (SLU) systems infer structured information directly from the speech signal using a single model. Due to the success of virtual assistants and the increasing demand for speech interfaces, these architectures are being actively researched for their potential to improve system performance by exploiting acoustic information and avoiding the cascading errors of traditional architectures. However, these systems require large amounts of specific, well-labelled speech data for training, which is expensive to obtain even in English, where the number of public audio datasets for SLU is limited. In this paper, we release the FalAI dataset, the largest public SLU dataset in terms of hours (250 hours), recordings (260,000) and participants (over 10,000), which is also the first SLU dataset in Galician and the first to be obtained in a low-resource scenario. Furthermore, we present new measures of complexity for the text corpora, the strategies followed for the design, collection and validation of the dataset, and we define splits for noisy audio, hesitant audio and audio where the sentence has changed but the structured information is preserved. These novel splits provide a unique resource for testing SLU systems in challenging, real-world scenarios.
GiDi: A Virtual Assistant for Screening Protocols at Home
Andrés Piñeiro-Martín | Carmen García-Mateo | Laura Docío-Fernández | María del Carmen López-Pérez | Ignacio Novo-Veleiro
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2
2020
LSE_UVIGO: A Multi-source Database for Spanish Sign Language Recognition
Laura Docío-Fernández | José Luis Alba-Castro | Soledad Torres-Guijarro | Eduardo Rodríguez-Banga | Manuel Rey-Area | Ania Pérez-Pérez | Sonia Rico-Alonso | Carmen García-Mateo
Proceedings of the LREC2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives
This paper presents LSE_UVIGO, a multi-source database designed to foster research on Sign Language Recognition. It is being recorded and compiled for Spanish Sign Language (LSE, its acronym in Spanish) and also contains spoken Galician, so it is well suited to research on these languages and also useful for fundamental research on any other sign language. LSE_UVIGO is composed of two datasets. The first, LSE_Lex40_UVIGO, is a multi-sensor and multi-signer dataset acquired from scratch and designed as an incremental dataset, both in the complexity of the visual content and in the variety of signers. It contains static and co-articulated sign recordings, fingerspelled and gloss-based isolated words, and sentences. It is acquired in a controlled lab environment in order to obtain good-quality videos with sharp video frames and RGB and depth information, making them suitable for trying different approaches to automatic recognition. The second, LSE_TVGWeather_UVIGO, is being populated from regional television weather forecasts interpreted into LSE, as a faster way to acquire high-quality, continuous LSE recordings with a domain-restricted vocabulary and a correspondence to spoken sentences.
2016
CORILSE: a Spanish Sign Language Repository for Linguistic Analysis
María del Carmen Cabeza-Pereiro | José Mª Garcia-Miguel | Carmen García Mateo | José Luis Alba Castro
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
CORILSE is a computerized corpus of Spanish Sign Language (Lengua de Signos Española, LSE). It consists of a set of recordings from different discourse genres by Galician signers living in the city of Vigo. In this paper we describe its annotation system, developed on the basis of pre-existing ones (mostly the model of the Auslan corpus). This includes primary annotation of id-glosses for manual signs, annotation of non-manual components, and secondary annotation of grammatical categories and relations, because this corpus is being built for grammatical analysis, in particular of argument structures in LSE. Until now the annotation has been done essentially by hand, which is a slow and time-consuming task. The need to facilitate this process leads us to engage in the development of automatic or semi-automatic tools for manual and facial recognition. Finally, we also present the web repository that will make the corpus available to different types of users and will allow its exploitation for research purposes and other applications (e.g. teaching of LSE or design of tasks for signed language assessment).
Introducing the SEA_AP: an Enhanced Tool for Automatic Prosodic Analysis
Marta Martínez | Rocío Varela | Carmen García Mateo | Elisa Fernández Rei | Adela Martínez Calvo
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The SEA_AP (Segmentador e Etiquetador Automático para Análise Prosódica, Automatic Segmentation and Labelling for Prosodic Analysis) toolkit is an application that performs audio segmentation and labelling to create a TextGrid file, which is then used to launch a prosodic analysis in Praat. In this paper, we describe the improved functionality of the tool achieved by adding a dialectometric analysis module using R scripts. The dialectometric analysis includes computing correlations among F0 curves, and it obtains prosodic distances among the different variables of interest (location, speaker, structure, etc.). The dialectometric analysis requires large databases in order to be adequately computed, and automatic segmentation and labelling can create them through a procedure less costly than the manual alternative. Thus, the integration of these tools into SEA_AP makes it possible to propose a distribution of geoprosodic areas by means of a quantitative method, which completes the traditional dialectological point of view. The current version of the SEA_AP toolkit is capable of analysing Galician, Spanish and Brazilian Portuguese data, and hence the distances between several prosodic linguistic varieties can be measured at present.
Enhanced CORILGA: Introducing the Automatic Phonetic Alignment Tool for Continuous Speech
Roberto Seara | Marta Martinez | Rocío Varela | Carmen García Mateo | Elisa Fernandez Rei | Xosé Luis Regueira
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The “Corpus Oral Informatizado da Lingua Galega (CORILGA)” project aims at building a corpus of oral language for Galician, primarily designed to study linguistic variation and change. This project is currently under development and is periodically enriched with new contributions. The long-term goal is for all the speech recordings to be enriched with phonetic, syllabic, morphosyntactic, lexical and sentence ELAN-compliant annotations. One way to speed up the annotation process is to use automatic speech-recognition-based tools tailored to the application. Therefore, the CORILGA repository has been enhanced with an automatic alignment tool, available to the administrator of the repository, that aligns speech with an orthographic transcription. If no transcription, or only a partial one, is available, a speech recognizer for Galician is used to generate word and phonetic segmentations. These recognized outputs may contain errors that will have to be manually corrected by the administrator. To assist with this task, the tool also provides an ELAN tier with the confidence measure of each recognized word. In this paper, after describing the main facts of the CORILGA corpus, we describe the speech alignment and recognition tools. Both have been developed using the Kaldi toolkit.
2014
Introducing a Framework for the Evaluation of Music Detection Tools
Paula Lopez-Otero | Laura Docio-Fernandez | Carmen Garcia-Mateo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The huge amount of multimedia information available nowadays makes its manual processing prohibitive and calls for tools that label this content automatically. This paper describes a framework for assessing a music detection tool; the framework consists of a database, composed of several hours of radio recordings that include different types of radio programmes, and a set of measures for evaluating the performance of a music detection tool in detail. A tool for automatically detecting music in audio streams, with application to music information retrieval tasks, is presented as well. The aim of this tool is to discard the audio excerpts that do not contain music in order to avoid processing them unnecessarily. The tool applies fingerprinting to different acoustic features extracted from the audio signal in order to remove perceptual irrelevancies, and a support vector machine is trained to classify these fingerprints into the classes music and no-music. The validity of this tool is assessed in the proposed evaluation framework.
The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm | Hans Uszkoreit | Sophia Ananiadou | Núria Bel | Audronė Bielevičienė | Lars Borin | António Branco | Gerhard Budin | Nicoletta Calzolari | Walter Daelemans | Radovan Garabík | Marko Grobelnik | Carmen García-Mateo | Josef van Genabith | Jan Hajič | Inma Hernáez | John Judge | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Joseph Mariani | John McNaught | Maite Melero | Monica Monachini | Asunción Moreno | Jan Odijk | Maciej Ogrodniczuk | Piotr Pęzik | Stelios Piperidis | Adam Przepiórkowski | Eiríkur Rögnvaldsson | Michael Rosner | Bolette Pedersen | Inguna Skadiņa | Koenraad De Smedt | Marko Tadić | Paul Thompson | Dan Tufiş | Tamás Váradi | Andrejs Vasiļjevs | Kadri Vider | Jolanta Zabarskaite
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the funding situation for LT topics. This paper documents the initiative's work throughout Europe to boost progress and innovation in our field.
CORILGA: a Galician Multilevel Annotated Speech Corpus for Linguistic Analysis
Carmen García-Mateo | Antonio Cardenal | Xosé Luis Regueira | Elisa Fernández Rei | Marta Martinez | Roberto Seara | Rocío Varela | Noemí Basanta
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper describes CORILGA (Corpus Oral Informatizado da Lingua Galega). CORILGA is a large, high-quality corpus of spoken Galician from the 1960s up to the present day, including both formal and informal spoken language from both standard and non-standard varieties, and across different generations and social levels. The corpus will be available to the research community upon completion. Galician is one of the EU languages that needs further research before highly effective language technology solutions can be implemented. A software repository for speech resources in Galician is also described. The repository includes a structured database, a graphical interface and processing tools. The database enables simple and fast searches based on a number of different criteria. The web-based user interface gives users easy access to the different materials. Last but not least, a set of transcription-based modules for automatic speech recognition has been developed, thus facilitating the orthographic labelling of the recordings.
2010
Building High Quality Databases for Minority Languages such as Galician
Francisco Campillo | Daniela Braga | Ana Belén Mourín | Carmen García-Mateo | Pedro Silva | Miguel Sales Dias | Francisco Méndez
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper describes the result of a joint R&D project between Microsoft Portugal and the Signal Theory Group of the University of Vigo (Spain), in which a set of language resources was developed with application to Text-to-Speech synthesis. First, a large corpus of 10,000 Galician sentences was designed and recorded by a professional female speaker. Second, a lexicon of over 90,000 entries with phonetic and grammatical information was collected and reviewed manually by an expert linguist. Finally, these resources were used for a MOS (Mean Opinion Score) perceptual test to compare the state-of-the-art speech synthesizers of the two groups: the one from Microsoft, based on HMM, and the one from the University of Vigo, based on unit selection.
2004
The COST278 Pan-European Broadcast News Database
An Vandecatseye | Jean-Pierre Martens | Joao Neto | Hugo Meinedo | Carmen Garcia-Mateo | Javier Dieguez | France Mihelic | Janez Zibert | Jan Nouza | Petr David | Matus Pleva | Anton Cizmar | Harris Papageorgiou | Christina Alexandris
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Transcrigal: A Bilingual System for Automatic Indexing of Broadcast News
Carmen Garcia-Mateo | Javier Dieguez-Tirado | Laura Docio-Fernandez | Antonio Cardenal-Lopez
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
2002
Acoustic Modeling and Training of a Bilingual ASR System when a Minority Language is Involved
Laura Docío-Fernández | Carmen García-Mateo
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)