Carlos-D. Martínez-Hinarejos

Also published as: Carlos D. Martínez, Carlos D. Martínez Hinarejos, Carlos D. Martínez-Hinarejos

2024

pdf abs
AnnoTheia: A Semi-Automatic Annotation Toolkit for Audio-Visual Speech Technologies
José-M. Acosta-Triana | David Gimeno-Gómez | Carlos-D. Martínez-Hinarejos
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

More than 7,000 known languages are spoken around the world. However, due to the lack of annotated resources, only a small fraction of them are currently covered by speech technologies. Albeit self-supervised speech representations, recent massive speech corpora collections, as well as the organization of challenges, have alleviated this inequality, most studies are mainly benchmarked on English. This situation is aggravated when tasks involving both acoustic and visual speech modalities are addressed. In order to promote research on low-resource languages for audio-visual speech technologies, we present AnnoTheia, a semi-automatic annotation toolkit that detects when a person speaks on the scene and the corresponding transcription. In addition, to show the complete process of preparing AnnoTheia for a language of interest, we also describe the adaptation of a pre-trained model for active speaker detection to Spanish, using a database not initially conceived for this type of task. Prior evaluations show that the toolkit is able to speed up to four times the annotation process. The AnnoTheia toolkit, tutorials, and pre-trained models are available at https://github.com/joactr/AnnoTheia/.

pdf abs
Comparison of Conventional Hybrid and CTC/Attention Decoders for Continuous Visual Speech Recognition
David Gimeno-Gómez | Carlos-D. Martínez-Hinarejos
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Thanks to the rise of deep learning and the availability of large-scale audio-visual databases, recent advances have been achieved in Visual Speech Recognition (VSR). Similar to other speech processing tasks, these end-to-end VSR systems are usually based on encoder-decoder architectures. While encoders are somewhat general, multiple decoding approaches have been explored, such as the conventional hybrid model based on Deep Neural Networks combined with Hidden Markov Models (DNN-HMM) or the Connectionist Temporal Classification (CTC) paradigm. However, there are languages and tasks in which data is scarce, and in this situation, there is not a clear comparison between different types of decoders. Therefore, we focused our study on how the conventional DNN-HMM decoder and its state-of-the-art CTC/Attention counterpart behave depending on the amount of data used for their estimation. We also analyzed to what extent our visual speech features were able to adapt to scenarios for which they were not explicitly trained, either considering a similar dataset or another collected for a different language. Results showed that the conventional paradigm reached recognition rates that improve the CTC/Attention model in data-scarcity scenarios along with a reduced training time and fewer parameters.

2022

pdf abs
LIP-RTVE: An Audiovisual Database for Continuous Spanish in the Wild
David Gimeno-Gómez | Carlos-D. Martínez-Hinarejos
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Speech is considered as a multi-modal process where hearing and vision are two fundamentals pillars. In fact, several studies have demonstrated that the robustness of Automatic Speech Recognition systems can be improved when audio and visual cues are combined to represent the nature of speech. In addition, Visual Speech Recognition, an open research problem whose purpose is to interpret speech by reading the lips of the speaker, has been a focus of interest in the last decades. Nevertheless, in order to estimate these systems in the currently Deep Learning era, large-scale databases are required. On the other hand, while most of these databases are dedicated to English, other languages lack sufficient resources. Thus, this paper presents a semi-automatically annotated audiovisual database to deal with unconstrained natural Spanish, providing 13 hours of data extracted from Spanish television. Furthermore, baseline results for both speaker-dependent and speaker-independent scenarios are reported using Hidden Markov Models, a traditional paradigm that has been widely used in the field of Speech Technologies.

2016

pdf abs
Impact of Automatic Segmentation on the Quality, Productivity and Self-reported Post-editing Effort of Intralingual Subtitles
Aitor Álvarez | Marina Balenciaga | Arantza del Pozo | Haritz Arzelus | Anna Matamala | Carlos-D. Martínez-Hinarejos
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes the evaluation methodology followed to measure the impact of using a machine learning algorithm to automatically segment intralingual subtitles. The segmentation quality, productivity and self-reported post-editing effort achieved with such approach are shown to improve those obtained by the technique based in counting characters, mainly employed for automatic subtitle segmentation currently. The corpus used to train and test the proposed automated segmentation method is also described and shared with the community, in order to foster further research in this area.

2010

pdf abs
Evaluation of HMM-based Models for the Annotation of Unsegmented Dialogue Turns
Carlos-D. Martínez-Hinarejos | Vicent Tamarit | José-M. Benedí
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Corpus-based dialogue systems rely on statistical models, whose parameters are inferred from annotated dialogues. The dialogues are usually annotated in terms of Dialogue Acts (DA), and the manual annotation is difficult (as annotation rule are hard to define), error-prone and time-consuming. Therefore, several semi-automatic annotation processes have been proposed to speed-up the process and consequently obtain a dialogue system in less total time. These processes are usually based on statistical models. The standard statistical annotation model is based on Hidden Markov Models (HMM). In this work, we explore the impact of different types of HMM, with different number of states, on annotation accuracy. We performed experiments using these models on two dialogue corpora (Dihana and SwitchBoard) of dissimilar features. The results show that some types of models improve standard HMM in a human-computer task-oriented dialogue corpus (Dihana corpus), but their impact is lower in a human-human non-task-oriented dialogue corpus (SwitchBoard corpus).

2009

pdf
Improving Unsegmented Statistical Dialogue Act Labelling
Vicent Tamarit | Carlos-D. Martínez-Hinarejos | José Miguel Benedí Ruíz
Proceedings of the International Conference RANLP-2009

pdf
A Study of a Segmentation Technique for Dialogue Act Assignation (short paper)
Carlos-D. Martínez-Hinarejos
Proceedings of the Eight International Conference on Computational Semantics

pdf
Simultaneous Dialogue Act Segmentation and Labelling using Lexical and Syntactic Features
Ramon Granell | Stephen Pulman | Carlos-D. Martínez-Hinarejos
Proceedings of the SIGDIAL 2009 Conference

pdf
Improving Unsegmented Dialogue Turns Annotation with N-gram Transducers
Carlos-D. Martínez-Hinarejos | Vicent Tamarit | José-Miguel Benedí
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

2008

pdf abs
Evaluation of several Maximum Likelihood Linear Regression Variants for Language Adaptation
Míriam Luján | Carlos D. Martínez | Vicent Alabau
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Multilingual Automatic Speech Recognition (ASR) systems are of great interest in multilingual environments. We studied the case of the Comunitat Valenciana where the two official languages are Spanish and Valencian. These two languages share most of their phonemes, and their syntax and vocabulary are also quite similar since they have influenced each other for many years. We constructed a system, and trained its acoustic models with a small corpus of Spanish and Valencian, which has produced poor results due to the lack of data. Adaptation techniques can be used to adapt acoustic models that are trained with a large corpus of a language inr order to obtain acoustic models for a phonetically similar language. This process is known as language adaptation. The Maximum Likelihood Linear Regression (MLLR) technique has commonly been used in speaker adaptation; however we have used MLLR in language adaptation. We compared several MLLR variants (mean square, diagonal matrix and full matrix) for language adaptation in order to choose the best alternative for our system.

pdf abs
Evaluation of Different Segmentation Techniques for Dialogue Turns
Carlos D. Martínez-Hinarejos | Vicent Tamarit
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In dialogue systems, it is necessary to decode the user input into semantically meaningful units. These semantical units, usually Dialogue Acts (DA), are used by the system to produce the most appropriate response. The user turns can be segmented into utterances, which are meaningful segments from the dialogue viewpoint. In this case, a single DA is associated to each utterance. Many previous works have used DA assignation models on segmented dialogue corpora, but only a few have tried to perform the segmentation and assignation at the same time. The knowledge of the segmentation of turns into utterances is not common in dialogue corpora, and knowing the quality of the segmentations provided by the models that simultaneously perform segmentation and assignation would be interesting. In this work, we evaluate the accuracy of the segmentation offered by this type of model. The evaluation is done on a Spanish dialogue system on a railway information task. The results reveal that one of these techniques provides a high quality segmentation for this corpus.

2007

pdf
On the Training Data Requirements for an Automatic Dialogue Annotation Technique
Carlos D. Martínez-Hinarejos
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

2006

pdf abs
Bilingual speech corpus in two phonetically similar languages
Vicente Alabau | Carlos D. Martínez
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

As Speech Recognition Systems improve, they become suitable for facingnew problems. Multilingual speech recognition is one such problems. In the present work, the case of the Comunitat Valenciana multilingual environment is studied. The official languages in the Comunitat Valenciana (Spanish and Valencian) share most of their acoustic units, and their vocabularies and syntax are quite similar. They have influenced each other for many years.A small corpus on an Information System task was developed for experimentationpurposes.This choice will make it possible to develop a working prototype in the future,and it is simple enough to build semi-automatic language models. The design of the acoustic corpus is discussed, showing that all combinations of accents have been studied (native, non-native speakers, male, female, etc.).

pdf
Segmented and Unsegmented Dialogue-Act Annotation with Statistical Dialogue Models
Carlos D. Martínez Hinarejos | Ramón Granell | José Miguel Benedí
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions