Emmanuel Schang


2024

Technologies de la parole et données de terrain : le cas du créole haïtien (Speech Technologies and Field Data: The Case of Haitian Creole) [in French]
William N. Havard | Renauld Govain | Daphne Gonçalves Teixeira | Benjamin Lecouteux | Emmanuel Schang
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

We use field data in Haitian Creole, collected 40 years ago on cassette tapes and later digitized, to train a native self-supervised learning (SSL) speech model (Wav2Vec2) for Haitian. We also apply continued pre-training (CPT) to SSL models pre-trained on two foreign languages: the lexifier language, French, and an unrelated language, English. We compare the performance of these three SSL models, and of two other foreign SSL models that are fine-tuned directly, on a speech recognition task. Our results show that the best-performing model is the one trained with CPT on the lexifier language, followed by the native model. We conclude that the "mobilizing the archive" approach advocated by Bird (2020) is a promising way to design speech technologies for new languages.

2023

Application of Speech Processes for the Documentation of Kréyòl Gwadloupéyen
Éric Le Ferrand | Fabiola Henri | Benjamin Lecouteux | Emmanuel Schang
Proceedings of the Second Workshop on NLP Applications to Field Linguistics

A growing number of studies address the challenges posed by low-resource languages and the transcription bottleneck. This bottleneck has driven the development of speech recognition methods to transcribe regional and Indigenous languages automatically. Although there is much talk about bridging the gap between speech technologies and field linguistics, there is little documented evidence of efficient communication between NLP experts and documentary linguists. The models created for low-resource languages often remain within the confines of computer science departments, while documentary linguists remain attached to traditional transcription workflows. This paper presents the early stage of a collaboration between NLP experts and field linguists, resulting in the successful transcription of Kréyòl Gwadloupéyen using speech recognition technology.

2022

Automatic Speech Recognition and Query By Example for Creole Languages Documentation
Cécile Macaire | Didier Schwab | Benjamin Lecouteux | Emmanuel Schang
Findings of the Association for Computational Linguistics: ACL 2022

We investigate the exploitation of self-supervised models for two Creole languages with few resources: Gwadloupéyen and Morisien. Automatic language processing tools are almost non-existent for these two languages. We propose to use about one hour of annotated data to design an automatic speech recognition system for each language. We evaluate how much data is needed to obtain a query-by-example system that is usable by linguists. Moreover, our experiments show that multilingual self-supervised models are not necessarily the most efficient for Creole languages.

2017

Temporal@ODIL project: Adapting ISO-TimeML to syntactic treebanks for the temporal annotation of spoken speech
Jean-Yves Antoine | Jakub Wasczuk | Anaïs Lefeuvre-Haftermeyer | Lotfi Abouda | Emmanuel Schang | Agata Savary
Proceedings of the 13th Joint ISO-ACL Workshop on Interoperable Semantic Annotation (ISA-13)

2016

Covering various Needs in Temporal Annotation: a Proposal of Extension of ISO TimeML that Preserves Upward Compatibility
Anaïs Lefeuvre-Halftermeyer | Jean-Yves Antoine | Alain Couillault | Emmanuel Schang | Lotfi Abouda | Agata Savary | Denis Maurel | Iris Eshkol | Delphine Battistelli
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper reports a critical analysis of the ISO TimeML standard, in the light of several temporal annotation experiments conducted on spoken French. It shows that the norm suffers from weaknesses that should be corrected to fit a larger variety of needs in NLP and in corpus linguistics. We present our proposal for some improvements to the norm before it is revised by the ISO Committee in 2017. These modifications mainly concern (1) enrichments of well-identified features of the norm: temporal function of TIMEX time expressions, additional types for TLINK temporal relations; (2) deeper modifications concerning the units or features annotated: clarification between time and tense for EVENT units, coherence of representation between temporal signals (the SIGNAL unit) and TIMEX modifiers (the MOD feature); (3) a recommendation to perform temporal annotation on top of a syntactic (rather than lexical) layer (temporal annotation on a treebank).

2014

ANCOR_Centre, a large free spoken French coreference corpus: description of the resource and reliability measures
Judith Muzerelle | Anaïs Lefeuvre | Emmanuel Schang | Jean-Yves Antoine | Aurore Pelletier | Denis Maurel | Iris Eshkol | Jeanne Villaneau
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article presents ANCOR_Centre, a French coreference corpus, available under the Creative Commons Licence. With a size of around 500,000 words, the corpus is large enough to serve the needs of data-driven approaches in NLP and represents one of the largest coreference resources currently available. The corpus focuses exclusively on spoken language and aims to represent a certain variety of spoken genres. ANCOR_Centre includes anaphora as well as coreference relations which involve nominal and pronominal mentions. The paper describes in detail the annotation scheme and the reliability measures computed on the resource.

Tense and Time Annotations : a Contribution to TimeML Improvement (Annotation de la temporalité en corpus : contribution à l’amélioration de la norme TimeML) [in French]
Anaïs Lefeuvre | Jean-Yves Antoine | Agata Savary | Emmanuel Schang | Lotfi Abouda | Denis Maurel | Iris Eshkol
Proceedings of TALN 2014 (Volume 2: Short Papers)

2013

ANCOR, the first large French speaking corpus of conversational speech annotated in coreference to be freely available (ANCOR, premier corpus de français parlé d’envergure annoté en coréférence et distribué librement) [in French]
Judith Muzerelle | Anaïs Lefeuvre | Jean-Yves Antoine | Emmanuel Schang | Denis Maurel | Jeanne Villaneau | Iris Eshkol
Proceedings of TALN 2013 (Volume 2: Short Papers)

2012

Décrire la morphologie des verbes en ikota au moyen d’une métagrammaire (Describing the Morphology of Verbs in Ikota using a Metagrammar) [in French]
Denys Duchier | Brunelle Magnana Ekoukou | Yannick Parmentier | Simon Petitjean | Emmanuel Schang
JEP-TALN-RECITAL 2012, Workshop TALAf 2012: Traitement Automatique des Langues Africaines (TALAf 2012: African Language Processing)

Describing São Tomense Using a Tree-Adjoining Meta-Grammar
Emmanuel Schang | Denys Duchier | Brunelle Magnana Ekoukou | Yannick Parmentier | Simon Petitjean
Proceedings of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11)