2024
ALADAN at IWSLT24 Low-resource Arabic Dialectal Speech Translation Task
Waad Ben Kheder | Josef Jon | André Beyer | Abdel Messaoudi | Rabea Affan | Claude Barras | Maxim Tychonov | Jean-Luc Gauvain
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
This paper presents ALADAN’s approach to the IWSLT 2024 Dialectal and Low-resource shared task, focusing on Levantine Arabic (apc) and Tunisian Arabic (aeb) to English speech translation (ST). To address challenges such as the lack of a standardized orthography and limited training data, we propose a data normalization solution for Dialectal Arabic that employs a modified Levenshtein distance and Word2vec models to find orthographic variants of the same word. Our submission is a cascade ST system integrating two ASR systems (TDNN-F and Zipformer) and two NMT modules derived from pre-trained models (the NLLB-200 1.3B distilled model and CohereAI’s Command-R). Additionally, we explore the integration of unsupervised textual and audio data, highlighting the importance of multi-dialectal datasets for both ASR and NMT. Our system achieves a BLEU score of 31.5 for Levantine Arabic on the official validation set.
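The abstract does not spell out how the modified Levenshtein distance and the Word2vec model are combined, but the idea lends itself to a simple sketch: treat two vocabulary words as orthographic variants when they are both a small edit distance apart and close in embedding space. The thresholds and the plain (unmodified) edit distance below are illustrative assumptions, not the authors' exact method.

from gensim.models import Word2Vec

def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance; the paper uses a modified variant."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def orthographic_variants(sentences, max_dist=2, min_sim=0.5):
    """Pair vocabulary words that are close both in spelling and in meaning.
    sentences: iterable of token lists; thresholds are illustrative."""
    model = Word2Vec(sentences, vector_size=100, min_count=2)
    vocab = list(model.wv.key_to_index)
    pairs = []
    for i, w1 in enumerate(vocab):
        for w2 in vocab[i + 1:]:                        # O(V^2): fine for a sketch
            if (levenshtein(w1, w2) <= max_dist
                    and model.wv.similarity(w1, w2) >= min_sim):
                pairs.append((w1, w2))
    return pairs

A real implementation would replace the quadratic vocabulary scan with candidate blocking (e.g. a character n-gram index) and use dialect-aware substitution costs rather than the uniform cost shown here.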
2022
Bazinga! A Dataset for Multi-Party Dialogues Structuring
Paul Lerner | Juliette Bergoënd | Camille Guinaudeau | Hervé Bredin | Benjamin Maurice | Sharleyne Lefevre | Martin Bouteiller | Aman Berhe | Léo Galmant | Ruiqing Yin | Claude Barras
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We introduce a dataset built around a large collection of TV (and movie) series, which are filled with challenging multi-party dialogues. Moreover, TV series come with a very active fan base that allows the collection of metadata and accelerates annotation. With 16 TV and movie series, Bazinga! amounts to 400+ hours of speech and 8M+ tokens, including 500K+ tokens annotated with speaker, addressee, and entity linking information. Along with the dataset, we also provide baselines for speaker diarization, punctuation restoration, and person entity recognition. The results demonstrate the difficulty of these tasks and of transfer learning from models trained on mono-speaker audio or written text, which is more widely available. This work is a step towards better multi-party dialogue structuring and understanding. Bazinga! is available at hf.co/bazinga. Because a large part of Bazinga! is only partially annotated, we also expect this dataset to foster research on self- or weakly-supervised learning methods.
Survey on Narrative Structure: from Linguistic Theories to Automatic Extraction Approaches
Aman Berhe | Camille Guinaudeau | Claude Barras
Traitement Automatique des Langues, Volume 63, Numéro 1 : Varia
2016
Benchmarking multimedia technologies with the CAMOMILE platform: the case of Multimodal Person Discovery at MediaEval 2015
Johann Poignant | Hervé Bredin | Claude Barras | Mickael Stefas | Pierrick Bruneau | Thomas Tamisier
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper, we claim that the CAMOMILE collaborative annotation platform (developed in the framework of the eponymous CHIST-ERA project) eases the organization of multimedia technology benchmarks, automating most of a campaign’s technical workflow and enabling collaborative (hence faster and cheaper) annotation of the evaluation data. This is demonstrated through the successful organization of a new multimedia task at MediaEval 2015, Multimodal Person Discovery in Broadcast TV.
The CAMOMILE Collaborative Annotation Platform for Multi-modal, Multi-lingual and Multi-media Documents
Johann Poignant | Mateusz Budnik | Hervé Bredin | Claude Barras | Mickael Stefas | Pierrick Bruneau | Gilles Adda | Laurent Besacier | Hazim Ekenel | Gil Francopoulo | Javier Hernando | Joseph Mariani | Ramon Morros | Georges Quénot | Sophie Rosset | Thomas Tamisier
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper, we describe the organization and the implementation of the CAMOMILE collaborative annotation framework for multimodal, multimedia, multilingual (3M) data. Given the versatile nature of the analyses that can be performed on 3M data, the structure of the server was kept intentionally simple in order to preserve its genericity, relying on standard Web technologies. Layers of annotations, defined as data associated with a media fragment from the corpus, are stored in a database and can be managed through standard interfaces with authentication. Interfaces tailored to the task at hand can then be developed in an agile way, relying on simple but reliable services for the management of the centralized annotations. We then present our implementation of an active learning scenario for person annotation in video, relying on the CAMOMILE server: during a dry-run experiment, the manual annotation of 716 speech segments was propagated to 3504 labeled tracks. The code of the CAMOMILE framework is distributed as open source.
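The annotation model described above (a layer of data items attached to media fragments, served over standard Web technologies with authentication) can be illustrated with a minimal REST client. The server URL, endpoint paths and field names below are hypothetical placeholders, not the actual CAMOMILE API.

import requests

BASE = "https://annotation.example.org/api"   # placeholder server URL
session = requests.Session()
session.auth = ("alice", "secret")            # standard HTTP authentication

def add_annotation(layer_id, medium_id, start, end, data):
    """Attach arbitrary data to a time fragment of a medium, within a layer."""
    fragment = {"medium": medium_id, "start": start, "end": end}
    resp = session.post(f"{BASE}/layers/{layer_id}/annotations",
                        json={"fragment": fragment, "data": data})
    resp.raise_for_status()
    return resp.json()

# e.g. label a speech turn with a speaker identity:
# add_annotation("speakers", "episode_01", 12.3, 15.8, {"speaker": "alice"})

Keeping the fragment/data split in the payload is what preserves genericity: the server never needs to know whether the attached data is a speaker label, a transcription or a bounding box.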
2014
TVD: A Reproducible and Multiply Aligned TV Series Dataset
Anindya Roy | Camille Guinaudeau | Hervé Bredin | Claude Barras
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We introduce a new dataset built around two TV series from different genres: The Big Bang Theory, a situation comedy, and Game of Thrones, a fantasy drama. The dataset has multiple tracks extracted from diverse sources, including dialogue (manual and automatic transcripts, multilingual subtitles), crowd-sourced textual descriptions (brief episode summaries, longer episode outlines) and various metadata (speakers, shots, scenes). The paper describes the dataset and provides tools to reproduce it for research purposes, provided one has legally acquired the DVD set of the series. Tools are also provided to temporally align a major subset of the dialogue and description tracks, in order to combine the complementary information present in these tracks for enhanced accessibility. For alignment, we treat the tracks as comparable corpora and first apply an existing algorithm for aligning such corpora based on dynamic time warping and TF-IDF-based similarity scores. We improve this baseline algorithm using contextual information, WordNet-based word similarity and scene location information. We report the performance of these algorithms on a manually aligned subset of the data. To highlight the interest of the dataset, we report a use case involving rich speech retrieval and propose other uses.
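As a rough illustration of the baseline alignment step, the sketch below scores every pair of text units from two tracks with TF-IDF cosine similarity and then recovers a monotonic alignment path with dynamic time warping. The unit granularity and the absence of the contextual, WordNet and scene-location refinements are simplifying assumptions.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def align_tracks(track_a, track_b):
    """track_a, track_b: lists of text units (e.g. subtitles vs. outline sentences)."""
    tfidf = TfidfVectorizer().fit(track_a + track_b)
    sim = cosine_similarity(tfidf.transform(track_a), tfidf.transform(track_b))
    cost = 1.0 - sim                                    # DTW minimises cost
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],       # skip in A
                                                 acc[i, j - 1],       # skip in B
                                                 acc[i - 1, j - 1])   # match
    path, i, j = [], n, m                               # backtrack the best path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda p: acc[p])
    return path[::-1]                                   # aligned (a, b) index pairs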
2008
Annotation and analysis of overlapping speech in political interviews
Martine Adda-Decker | Claude Barras | Gilles Adda | Patrick Paroubek | Philippe Boula de Mareüil | Benoit Habert
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Looking for a better understanding of spontaneous speech-related phenomena and aiming to improve automatic speech recognition (ASR), we present a study on the relationship between the occurrence of overlapping speech segments and disfluencies (filled pauses, repetitions, revisions) in political interviews. First we present our data and our overlap annotation scheme, detailing our choice of overlapping tags and our definition of disfluencies. The observed ratios of the different overlapping tags are examined, as well as their correlation with the speaker role, and we propose two measures to characterise a speaker’s interaction attitude: the attack/resist ratio and the attack density. We then study the relationship between the overlapping speech segments and the disfluencies in our corpus, before concluding on the perspectives that our experiments offer.
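The abstract does not define the two measures precisely, so the sketch below rests on a loudly stated assumption: an "attack" is an overlap the speaker initiates over someone else's turn, a "resist" is an overlap initiated against that speaker, and density is normalized by the speaker's own speech time.

from dataclasses import dataclass

@dataclass
class Overlap:
    attacker: str   # speaker who starts talking over the other (assumed reading)
    victim: str     # speaker holding the floor (assumed reading)

def attack_resist_ratio(overlaps, speaker):
    """Overlaps initiated by the speaker vs. overlaps suffered by them."""
    attacks = sum(o.attacker == speaker for o in overlaps)
    resists = sum(o.victim == speaker for o in overlaps)
    return attacks / resists if resists else float("inf")

def attack_density(overlaps, speaker, speech_time_s):
    """Attacks per minute of the speaker's own speech time (assumed unit)."""
    attacks = sum(o.attacker == speaker for o in overlaps)
    return 60.0 * attacks / speech_time_s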
2004
Automatic Audio and Manual Transcripts Alignment, Time-code Transfer and Selection of Exact Transcripts
C. Barras | G. Adda | M. Adda-Decker | B. Habert | P. Boula de Mareüil | P. Paroubek
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
The present study focuses on the automatic processing of sibling resources of audio and written documents, such as those available in audio archives or for parliamentary debates: the written texts are close but not exact audio transcripts. Such resources deserve attention for several reasons: they represent an interesting testbed for studying differences between written and spoken material, and they yield low-cost resources for acoustic model training. When automatically transcribing the audio data, regions of agreement between the automatic transcripts and the written sources allow time-codes to be transferred to the written documents, which may be helpful in an audio archive or audio information retrieval environment. Regions of disagreement can be automatically selected for further correction by human transcribers. This study makes use of 10 hours of French radio interview archives with corresponding press-oriented transcripts. The audio corpus was transcribed using the LIMSI speech recognizer, resulting in automatic transcripts with an average word error rate of 12%. 80% of the text corpus (in word chunks of at least five words) can be exactly aligned with the automatic transcripts of the audio data, and the residual word error rate on this 80% is less than 1%.
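The time-code transfer described here can be sketched with a matching-blocks approach: find exactly matching word chunks of at least five words (the threshold used in the paper) between the ASR hypothesis and the written text, and copy the ASR word timestamps onto the matched text words. The data layout (one start time per ASR token) is an assumption for illustration.

from difflib import SequenceMatcher

def transfer_time_codes(asr_words, asr_times, text_words, min_chunk=5):
    """asr_words: ASR tokens; asr_times: one start time per ASR token;
    text_words: tokens of the written document. Returns {text index: time}."""
    matcher = SequenceMatcher(a=asr_words, b=text_words, autojunk=False)
    timecodes = {}
    for block in matcher.get_matching_blocks():
        if block.size >= min_chunk:             # keep only trusted agreement regions
            for k in range(block.size):
                timecodes[block.b + k] = asr_times[block.a + k]
    return timecodes

Text regions left without time-codes are exactly the disagreement regions that the paper proposes to route to human transcribers.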
2001
Invited Talk: Processing Broadcast Audio for Information Access
Jean-Luc Gauvain | Lori Lamel | Gilles Adda | Martine Adda-Decker | Claude Barras | Langzhou Chen | Yannick de Kercadio
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics
2000
Transcribing with Annotation Graphs
Edouard Geoffrois | Claude Barras | Steven Bird | Zhibiao Wu
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)