Emmanuel Schang

2025

pdf bib abs
Speech Technologies with Fieldwork Recordings: the Case of Haitian Creole
William N. Havard | Renauld Govain | Benjamin Lecouteux | Emmanuel Schang
Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages

We use 40-year-old digitalised tape-recorded fieldwork data in Haitian Creole to train a native self-supervised learning (SSL) model of speech representation (WAV2VEC2). We also use a continued pre-training approach on pre-trained SSL models of two foreign languages the lexifier language – French – and an unrelated language – English. We compare the performances of these three SSL models, and of two other foreign SSL models directly finetuned, on an ASR task, where all five models are fine-tuned on transcribed fieldwork recordings in Haitian Creole. Our results show the best-performing model is the one trained using a continued pre-training approach on the lexifier language, followed by the native model. We conclude that the ‘mobilising the archive’-approach advocated by (Bird, 2020) is a promising way forward to design speech technologies for new languages.

2024

pdf bib abs
Technologies de la parole et données de terrain : le cas du créole haïtien
William N. Havard | Renauld Govain | Daphne Gonçalves Teixeira | Benjamin Lecouteux | Emmanuel Schang
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Nous utilisons des données de terrain en créole haïtien, récoltées il y a $40$ ans sur cassettes puis numérisées, pour entraîner un modèle natif d’apprentissage auto-supervisé (SSL) de la parole (Wav2Vec2) en haïtien. Nous utilisons une approche de pré-entraînement continu (CPT) sur des modèles SSL pré-entraînés de deux langues étrangères : la langue lexificatrice – le français – et une langue non apparentée – l’anglais. Nous comparons les performances de ces trois modèles SSL, et de deux autres modèles SSL étrangers directement affinés, sur une tâche de reconnaissance de la parole. Nos résultats montrent que le modèle le plus performant est celui qui a été entraîné en utilisant une approche CPT sur la langue lexificatrice, suivi par le modèle natif. Nous concluons que l’approche de ”mobilisation des archives” préconisée par (Bird, 2020) est une voie prometteuse pour concevoir des technologies vocales pour de nouvelles langues.

2023

pdf bib abs
Application of Speech Processes for the Documentation of Kréyòl Gwadloupéyen
Éric Le Ferrand | Fabiola Henri | Benjamin Lecouteux | Emmanuel Schang
Proceedings of the Second Workshop on NLP Applications to Field Linguistics

In recent times, there has been a growing number of research studies focused on addressing the challenges posed by low-resource languages and the transcription bottleneck phenomenon. This phenomenon has driven the development of speech recognition methods to transcribe regional and Indigenous languages automatically. Although there is much talk about bridging the gap between speech technologies and field linguistics, there is a lack of documented efficient communication between NLP experts and documentary linguists. The models created for low-resource languages often remain within the confines of computer science departments, while documentary linguistics remain attached to traditional transcription workflows. This paper presents the early stage of a collaboration between NLP experts and field linguists, resulting in the successful transcription of Kréyòl Gwadloupéyen using speech recognition technology.

2022

pdf bib abs
Automatic Speech Recognition and Query By Example for Creole Languages Documentation
Cécile Macaire | Didier Schwab | Benjamin Lecouteux | Emmanuel Schang
Findings of the Association for Computational Linguistics: ACL 2022

We investigate the exploitation of self-supervised models for two Creole languages with few resources: Gwadloupéyen and Morisien. Automatic language processing tools are almost non-existent for these two languages. We propose to use about one hour of annotated data to design an automatic speech recognition system for each language. We evaluate how much data is needed to obtain a query-by-example system that is usable by linguists. Moreover, our experiments show that multilingual self-supervised models are not necessarily the most efficient for Creole languages.

2017

pdf bib
Temporal@ODIL project: Adapting ISO-TimeML to syntactic treebanks for the temporal annotation of spoken speech
Jean-Yves Antoine | Jakub Wasczuk | Anaïs Lefeuvre-Haftermeyer | Lotfi Abouda | Emmanuel Schang | Agata Savary
Proceedings of the 13th Joint ISO-ACL Workshop on Interoperable Semantic Annotation (ISA-13)

2016

pdf bib abs
Covering various Needs in Temporal Annotation: a Proposal of Extension of ISO TimeML that Preserves Upward Compatibility
Anaïs Lefeuvre-Halftermeyer | Jean-Yves Antoine | Alain Couillault | Emmanuel Schang | Lotfi Abouda | Agata Savary | Denis Maurel | Iris Eshkol | Delphine Battistelli
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper reports a critical analysis of the ISO TimeML standard, in the light of several experiences of temporal annotation that were conducted on spoken French. It shows that the norm suffers from weaknesses that should be corrected to fit a larger variety of needs inNLP and in corpus linguistics. We present our proposition of some improvements of the norm before it will be revised by the ISO Committee in 2017. These modifications concern mainly (1) Enrichments of well identified features of the norm: temporal function of TIMEX time expressions, additional types for TLINK temporal relations; (2) Deeper modifications concerning the units or features annotated: clarification between time and tense for EVENT units, coherence of representation between temporal signals (the SIGNAL unit) and TIMEX modifiers (the MOD feature); (3) A recommendation to perform temporal annotation on top of a syntactic (rather than lexical) layer (temporal annotation on a treebank).

2014

pdf bib
Tense and Time Annotations : a Contribution to TimeML Improvement (Annotation de la temporalité en corpus : contribution à l’amélioration de la norme TimeML) [in French]
Anaïs Lefeuvre | Jean-Yves Antoine | Agata Savary | Emmanuel Schang | Lotfi Abouda | Denis Maurel | Iris Eshkol
Proceedings of TALN 2014 (Volume 2: Short Papers)

pdf bib abs
ANCOR_Centre, a large free spoken French coreference corpus: description of the resource and reliability measures
Judith Muzerelle | Anaïs Lefeuvre | Emmanuel Schang | Jean-Yves Antoine | Aurore Pelletier | Denis Maurel | Iris Eshkol | Jeanne Villaneau
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article presents ANCOR_Centre, a French coreference corpus, available under the Creative Commons Licence. With a size of around 500,000 words, the corpus is large enough to serve the needs of data-driven approaches in NLP and represents one of the largest coreference resources currently available. The corpus focuses exclusively on spoken language, it aims at representing a certain variety of spoken genders. ANCOR_Centre includes anaphora as well as coreference relations which involve nominal and pronominal mentions. The paper describes into details the annotation scheme and the reliability measures computed on the resource.

Emmanuel Schang

2025

2024

2023

2022

2017

2016

2014

2013

2012

Co-authors

Venues