Rayan Ziane
2026
Radio Haiti-Inter: A Large-Scale Annotated Corpus of Spoken Haitian Creole
William N. Havard | Rayan Ziane | Mélissa Menclé | Maximin Coavoux | Benjamin Lecouteux | Emmanuel Schang
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present the first large-scale corpus of spoken Haitian Creole (Kreyòl), namely Radio Haiti-Inter. The corpus was constructed using automatic speech recognition (ASR) with a state-of-the-art model specifically dedicated to Kreyòl. In addition to transcriptions, we provide part-of-speech (POS) tags, as well as time-aligned transcripts and confidence scores, enabling users to select the most reliable segments for their research. We conduct a manual evaluation of both the transcription quality and the POS tagging accuracy to assess the reliability of the resource. To enable high-quality research with this resource, we are releasing 50 hours of data, comprising both the audio and the accompanying annotations, drawn from the highest-quality segments. This corpus represents an invaluable resource for advancing the study of Kreyòl, with potential applications in phonetics, phonology, morphology, and syntax, as well as the study of code-switching and code-mixing. As the recordings span many years, the corpus is also suited to micro-diachronic studies of Kreyòl.
2025
Explicit Edge Length Coding to Improve Long Sentence Parsing Performance
Khensa Daoudi | Mathieu Dehouck | Rayan Ziane | Natasha Romanova
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
The performance of syntactic parsers is reduced for longer sentences. While some of this reduction can be explained by the tendency of longer sentences to be more syntactically complex, as well as by the larger number of candidate governors, some of it is due to longer sentences being more challenging to encode. This is especially relevant for low-resource scenarios such as the parsing of written sources in historical languages (e.g. medieval and early-modern European languages), in particular legal texts, where sentences can be very long while the amount of training material remains limited. In this paper, we present a new method for explicitly using arc length information to bias the scores produced by a graph-based parser. With a series of experiments on Norman and Gascon data, in which we divide the test data according to sentence length, we show that explicit length coding does help retain parsing performance for longer sentences.